I recently read Davis Vaughan’s blog post Semi-automation of 200 pull requests with Claude Code and it really appealed to me, as I’ve been using LLMs for tedious tasks like this for a while. Davis’ key insight: structure = success. When you can tightly define a task and provide clear context, LLMs become really useful tools.
If you follow my work, you know that reproducible pipelines have been my main focus for a while. It’s the reason I wrote {rix} for reproducible R environments, {rixpress} for declarative pipelines, and even a Python port called ryxpress. I truly believe these tools make data science better: more reproducible, more debuggable, and more shareable.
But I also know that it is difficult to get people to adopt new tools. Learning a new way to structure your code takes time and effort, and most people are busy enough as it is. This is where LLMs come into the picture: they can help you translate your existing scripts into this more structured format. You provide your monolithic script, explain what you want, and the LLM does the heavy lifting of restructuring it.
The typical way we write analysis scripts (long chains of %>% calls in R, or method chaining in Python) works fine for interactive exploration, but quickly turns into spaghetti that is difficult to modify, test, or debug. Take my old analysis of Luxembourg airport data as an example: it works, but converting that kind of script into a proper pipeline with caching, explicit dependencies, and testability is tedious work.
But here we are in 2026, and LLMs now make this trivial.
From implicit to explicit: translating a script into a rixpress pipeline
Let me show you what I mean by translating that old Luxembourg airport data cleaning code into a {rixpress} pipeline. The original script is one continuous %>% chain, standard tidyverse style. The {rixpress} version makes each intermediate step explicit.
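To make the contrast concrete, here is a condensed sketch of the monolithic style (abridged from the original script; the factor recoding steps are omitted):

avia_clean <- avia %>%
  select("unit,tra_meas,airp_pr\\time", contains("20")) %>%
  gather(date, passengers, -`unit,tra_meas,airp_pr\\time`) %>%
  separate(
    col = `unit,tra_meas,airp_pr\\time`,
    into = c("unit", "tra_meas", "air_pr\\time"),
    sep = ","
  ) %>%
  mutate(passengers = as.numeric(passengers)) %>%
  select(unit, tra_meas, date, passengers)

Every intermediate state is implicit, so there is nothing to cache, inspect, or test in isolation. The {rixpress} version below turns each of those steps into its own named derivation.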
But the thing is, LLMs don’t know anything about {rixpress} out of the box. It’s a relatively new package and it’s not in their training data (or if it is, it’s outdated). So I wrote another tool called pkgctx that extracts structured, LLM-ready API specifications from R or Python packages (and guess what, I didn’t write a single line of code of that tool, LLMs did).
The idea is simple: run pkgctx r rixpress > rixpress.ctx.yaml and you get a compact YAML file describing each function, its arguments, and its purpose. Pass that to an LLM, and suddenly it ‘knows’ the package. All my packages now generate such a file on every push thanks to GitHub Actions.
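To give you an idea of what the LLM sees, an entry in such a file might look roughly like this (a purely illustrative sketch, not pkgctx’s exact output schema):

# Illustrative only: the real pkgctx output may differ
functions:
  rxp_r:
    purpose: Declare an R derivation (one pipeline step) in the DAG
    args:
      name: unquoted symbol naming the derivation
      expr: R expression that computes the step’s output

This is how my conversation went: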
Me: I want to use a package called rixpress, but I’m not familiar with it. You’re helping me, and here’s a file with a detailed description of the API, for context.
[uploaded rixpress.ctx.yaml]
LLM: Got it. I’ll use this API specification as the authoritative reference for rixpress and help you write correct, idiomatic pipelines.
A few landmarks to set the frame (no action needed yet):

- rixpress is declarative: you describe derivations (rxp_r(), rxp_py(), rxp_qmd()…), then materialize them with rxp_populate() and rxp_make().
- Data flows through the DAG via serialized artifacts (RDS/pickle/Julia serialization), not in-memory objects.
- Language boundaries are explicit (rxp_r2py(), rxp_py2r()).
- Nix is the execution engine; default.nix (or equivalents) defines the runtime environment.

When you’re ready, tell me what you want to do […]
Then I simply asked:
Me: Help me translate this R script into a rixpress pipeline: [pasted the old script]
And so I got a working {rixpress} pipeline. The LLM did the tedious restructuring; I reviewed the output, made minor adjustments, and that was it. The combination of pkgctx for context and a clear task (“translate this script”) made the LLM genuinely useful.
Now let’s see what the translated pipeline looks like. Let’s first assume:

- The dataset avia_par_lu.tsv is in the project folder
- The required R packages are made available via default.nix (we’ll also use an LLM for this)
- The project is initialized with rxp_init() (this creates two skeleton files to get you started quickly; see the snippet below)
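For completeness, that initialization step is a one-liner (assuming {rixpress} is installed):

library(rixpress)
# Scaffolds the two skeleton files mentioned above in the current project
rxp_init()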
Click to expand the full rixpress pipeline
library(rixpress)
# Step 0: Load the data
avia <- rxp_r_file(
name = avia,
path = "avia_par_lu.tsv",
read_function = readr::read_tsv
)
# Step 1: Select and reshape (wide → long)
avia_long <- rxp_r(
name = avia_long,
expr =
avia %>%
select("unit,tra_meas,airp_pr\\time", contains("20")) %>%
gather(date, passengers, -`unit,tra_meas,airp_pr\\time`)
)
# Step 2: Split composite key column
avia_split <- rxp_r(
name = avia_split,
expr =
avia_long %>%
separate(
col = `unit,tra_meas,airp_pr\\time`,
into = c("unit", "tra_meas", "air_pr\\time"),
sep = ","
)
)
# Step 3: Recode transport measure
avia_recode_tra_meas <- rxp_r(
name = avia_recode_tra_meas,
expr =
avia_split %>%
mutate(
tra_meas = fct_recode(
tra_meas,
`Passengers on board` = "PAS_BRD",
`Passengers on board (arrivals)` = "PAS_BRD_ARR",
`Passengers on board (departures)` = "PAS_BRD_DEP",
`Passengers carried` = "PAS_CRD",
`Passengers carried (arrival)` = "PAS_CRD_ARR",
`Passengers carried (departures)` = "PAS_CRD_DEP",
`Passengers seats available` = "ST_PAS",
`Passengers seats available (arrivals)` = "ST_PAS_ARR",
`Passengers seats available (departures)` = "ST_PAS_DEP",
`Commercial passenger air flights` = "CAF_PAS",
`Commercial passenger air flights (arrivals)` = "CAF_PAS_ARR",
`Commercial passenger air flights (departures)` = "CAF_PAS_DEP"
)
)
)
# Step 4: Recode unit
avia_recode_unit <- rxp_r(
name = avia_recode_unit,
expr =
avia_recode_tra_meas %>%
mutate(
unit = fct_recode(
unit,
Passenger = "PAS",
Flight = "FLIGHT",
`Seats and berths` = "SEAT"
)
)
)
# Step 5: Recode destination
avia_recode_destination <- rxp_r(
name = avia_recode_destination,
expr =
avia_recode_unit %>%
mutate(
destination = fct_recode(
`air_pr\\time`,
`WIEN-SCHWECHAT` = "LU_ELLX_AT_LOWW",
`BRUSSELS` = "LU_ELLX_BE_EBBR",
`GENEVA` = "LU_ELLX_CH_LSGG",
`ZURICH` = "LU_ELLX_CH_LSZH",
`FRANKFURT/MAIN` = "LU_ELLX_DE_EDDF",
`HAMBURG` = "LU_ELLX_DE_EDDH",
`BERLIN-TEMPELHOF` = "LU_ELLX_DE_EDDI",
`MUENCHEN` = "LU_ELLX_DE_EDDM",
`SAARBRUECKEN` = "LU_ELLX_DE_EDDR",
`BERLIN-TEGEL` = "LU_ELLX_DE_EDDT",
`KOBENHAVN/KASTRUP` = "LU_ELLX_DK_EKCH",
`HURGHADA / INTL` = "LU_ELLX_EG_HEGN",
`IRAKLION/NIKOS KAZANTZAKIS` = "LU_ELLX_EL_LGIR",
`FUERTEVENTURA` = "LU_ELLX_ES_GCFV",
`GRAN CANARIA` = "LU_ELLX_ES_GCLP",
`LANZAROTE` = "LU_ELLX_ES_GCRR",
`TENERIFE SUR/REINA SOFIA` = "LU_ELLX_ES_GCTS",
`BARCELONA/EL PRAT` = "LU_ELLX_ES_LEBL",
`ADOLFO SUAREZ MADRID-BARAJAS` = "LU_ELLX_ES_LEMD",
`MALAGA/COSTA DEL SOL` = "LU_ELLX_ES_LEMG",
`PALMA DE MALLORCA` = "LU_ELLX_ES_LEPA",
`SYSTEM - PARIS` = "LU_ELLX_FR_LF90",
`NICE-COTE D'AZUR` = "LU_ELLX_FR_LFMN",
`PARIS-CHARLES DE GAULLE` = "LU_ELLX_FR_LFPG",
`STRASBOURG-ENTZHEIM` = "LU_ELLX_FR_LFST",
`KEFLAVIK` = "LU_ELLX_IS_BIKF",
`MILANO/MALPENSA` = "LU_ELLX_IT_LIMC",
`BERGAMO/ORIO AL SERIO` = "LU_ELLX_IT_LIME",
`ROMA/FIUMICINO` = "LU_ELLX_IT_LIRF",
`AGADIR/AL MASSIRA` = "LU_ELLX_MA_GMAD",
`AMSTERDAM/SCHIPHOL` = "LU_ELLX_NL_EHAM",
`WARSZAWA/CHOPINA` = "LU_ELLX_PL_EPWA",
`PORTO` = "LU_ELLX_PT_LPPR",
`LISBOA` = "LU_ELLX_PT_LPPT",
`STOCKHOLM/ARLANDA` = "LU_ELLX_SE_ESSA",
`MONASTIR/HABIB BOURGUIBA` = "LU_ELLX_TN_DTMB",
`ENFIDHA-HAMMAMET INTERNATIONAL` = "LU_ELLX_TN_DTNH",
`ENFIDHA ZINE EL ABIDINE BEN ALI` = "LU_ELLX_TN_DTNZ",
`DJERBA/ZARZIS` = "LU_ELLX_TN_DTTJ",
`ANTALYA (MIL-CIV)` = "LU_ELLX_TR_LTAI",
`ISTANBUL/ATATURK` = "LU_ELLX_TR_LTBA",
`SYSTEM - LONDON` = "LU_ELLX_UK_EG90",
`MANCHESTER` = "LU_ELLX_UK_EGCC",
`LONDON GATWICK` = "LU_ELLX_UK_EGKK",
`LONDON/CITY` = "LU_ELLX_UK_EGLC",
`LONDON HEATHROW` = "LU_ELLX_UK_EGLL",
`LONDON STANSTED` = "LU_ELLX_UK_EGSS",
`NEWARK LIBERTY INTERNATIONAL, NJ.` = "LU_ELLX_US_KEWR",
`O.R TAMBO INTERNATIONAL` = "LU_ELLX_ZA_FAJS"
)
)
)
# Step 6: Final cleaned dataset
avia_clean <- rxp_r(
name = avia_clean,
expr =
avia_recode_destination %>%
mutate(passengers = as.numeric(passengers)) %>%
select(unit, tra_meas, destination, date, passengers)
)
# Step 7: Quarterly arrivals
avia_clean_quarterly <- rxp_r(
name = avia_clean_quarterly,
expr =
avia_clean %>%
filter(
tra_meas == "Passengers on board (arrivals)",
!is.na(passengers),
str_detect(date, "Q")
) %>%
mutate(date = yq(date))
)
# Step 8: Monthly arrivals
avia_clean_monthly <- rxp_r(
name = avia_clean_monthly,
expr =
avia_clean %>%
filter(
tra_meas == "Passengers on board (arrivals)",
!is.na(passengers),
str_detect(date, "M")
) %>%
mutate(date = ymd(paste0(date, "01"))) %>%
select(destination, date, passengers)
)
# Populate and build the pipeline
rxp_populate(
list(
avia,
avia_long,
avia_split,
avia_recode_tra_meas,
avia_recode_unit,
avia_recode_destination,
avia_clean,
avia_clean_quarterly,
avia_clean_monthly
)
)
rxp_make()

This is a faithful “translation” of the script into a single {rixpress} pipeline. However, the original data is no longer available and recent datasets have changed slightly, so this script needs further adaptation to the current data source. Otherwise, this would be it! You can view the updated script here. (I also removed all the factor recoding, as there seems to be something wrong with how {rixpress} handles them, so writing this blog post really helped me find something to fix!)
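Once the pipeline builds, any intermediate artifact can be pulled back into your R session by its derivation name. A minimal sketch using rxp_read(), assuming the build above succeeded:

# Read a built artifact back into the session by derivation name
avia_monthly <- rxp_read("avia_clean_monthly")
head(avia_monthly)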
Generating the environment
I also used an LLM to write the {rix} script that sets up the reproducible environment for this pipeline. I gave it the rix.pkgctx.yaml context file (generated with pkgctx r rix > rix.pkgctx.yaml, and also available on the rix GitHub repository) and asked: “Use this knowledge and write me an R script that uses rix to set up the correct default.nix for this pipeline.”
The LLM correctly identified the necessary packages from the pipeline code:
- readr (for read_tsv)
- dplyr (for select, filter, mutate, %>%)
- tidyr (for gather, separate)
- forcats (for fct_recode)
- lubridate (for yq, ymd)
- stringr (for str_detect)
- rixpress (for the pipeline itself)
And produced this script:
library(rix)
rix(
date = "2026-01-10",
r_pkgs = c(
"readr",
"dplyr",
"tidyr",
"forcats",
"lubridate",
"stringr",
"rixpress"
),
ide = "none",
project_path = ".",
overwrite = TRUE
)

There’s just one problem with that script: the selected date is not valid; it should be January 12. But that’s actually my fault: the LLM couldn’t have known that. The only way it could have known is if I had told it to look at the CSV file in {rix}’s repository that lists all the valid dates. After changing the date, you can run this script, then nix-build to build the environment and nix-shell to drop into it. From there, you run your pipeline.
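Incidentally, you can also query the valid dates from R itself. A quick sketch, assuming a recent version of {rix} that ships the available_dates() helper:

library(rix)
# List the snapshot dates that rix accepts for the `date` argument
available_dates()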
What we’ve done here is use LLMs at every step:
- Gave context about rixpress (via pkgctx) and asked the LLM to translate my old script into a pipeline
- Gave context about rix (via pkgctx) and asked the LLM to generate the environment setup
The pattern is always the same: context + scoped task = useful output.
Structure + context = outsourced work
The point I’m making here isn’t really about {rixpress} pipelines specifically. It’s about a broader principle that both Davis Vaughan and I have noticed: LLMs are really useful if you give them enough structure and context.
Davis pre-cloned repositories, pre-generated .Rprofile files, and pre-made to-do lists so Claude could focus on the actual fixes instead of git management. I used pkgctx to give the LLM a full API specification, and provided a clear starting point (my old script). In both cases the formula is the same:
Structure + Context → Scoped Task → LLM can actually help
I have written before about how you can outsource grunt work to an LLM, but not expertise. The same applies here. I still needed to know which data transformations were required. I still had to review the output and make adjustments. But the tedious refactoring (turning a monolithic script into a declarative pipeline) is exactly the kind of work LLMs can handle if you set them up properly.
If you want LLMs to help you with your data science work:
- Give them context. Use tools like pkgctx to hand them API specifications. Paste your existing code. Show them examples.
- Scope the task tightly. “Translate this script into a rixpress pipeline” is a well-defined task. “Make my code better” is not.
- Review the output. LLMs do the grunt work; you provide the expertise.
If you’re not familiar with {rixpress}, check out my announcement post or the CRAN release post. And if you want to give LLMs context about R or Python packages, pkgctx is there to help. For those who want to dive deeper into Nix, {rix}, and {rixpress}, I recently submitted an article to the Journal of Statistical Software, which you can read here. For more examples of {rixpress} pipelines, check out the rixpress_demos repository.
LLMs aren’t going anywhere: the genie is out of the bottle. I still see a lot of people online claiming that LLMs aren’t useful, but I really believe it comes down to one of two things:
- They don’t provide enough context, or they don’t scope their tasks tightly enough.
- They have a principled objection to LLMs, AI, and automation in general, which, okay, fine, but that’s not a technical argument about usefulness.
Some people might even say this to feel good about themselves: what I program is far too complex and important for mere LLMs to help with. Okay, maybe, but we don’t all work at NASA. I will continue to outsource the tedious work to LLMs.