Formula 1 is one of the most attractive areas for data analysis in R, as it combines structured results, lap-by-lap timing, pit strategy and driver performance in one of the richest data sets in the sport. For anyone building authority in technical R content, this is an excellent niche: it’s specific enough to stand out, but broad enough to support tutorials, visualizations, predictive models, and long-form analytical writing.
One of the biggest benefits of working in this space is that f1dataR gives R users access to both historical Formula 1 data and richer session-level workflows, linked to the broader Ergast/Jolpica and FastF1 ecosystem. That makes it possible to go from simple race results to much more interesting questions: who had the strongest race pace? Which driver managed tire wear best? Did a pit stop strategy really work? Can we build a basic model to estimate race outcomes?
This is where Formula 1 becomes much more than a sporting topic. It becomes a practical case study in data processing, time series thinking, feature engineering, visualization and prediction. And because the R blog space has relatively little in-depth Formula 1 content compared to more general analysis topics, a strong tutorial here can help position your site as a serious source of expertise.
Why Formula 1 analysis in R is such a strong niche
Most R tutorials on the web focus on standard examples: sales dashboards, house prices, or generic machine learning datasets. Formula 1 is different. The data has context, drama, and a built-in audience. Every race gives you new material to analyze, and every session contains multiple layers of information: qualifying pace, stint length, tire compounds, safety car timing, sector performance, overtaking moves and pit strategy.
That’s part of what makes this topic attractive for long-form content. You don’t just teach code; you show how code helps explain real competitive decisions. A lap time is not just a number. It is evidence of tire wear, traffic, fuel load, track evolution and driver performance.
For readers who want to delve deeper into these types of workflows, resources such as Racing with data: Formula 1 and NASCAR Analytics with R are useful because they reinforce the idea that race analysis in R can go far beyond basic graphs and deliver serious, code-driven analysis.
Installing the packages
The first step is to set up a workflow that is both reproducible and flexible. For most Formula 1 analysis projects in R you will want f1dataR plus a small set of packages for data cleaning, plotting, reporting and modeling.
install.packages(c(
  "f1dataR", "tidyverse", "lubridate", "janitor", "scales",
  "slider", "broom", "tidymodels", "gt", "patchwork"
))
library(f1dataR)
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
library(slider)
library(broom)
library(tidymodels)
library(gt)
library(patchwork)
If you want to work with official timing data at the session level, it is also a good idea to configure FastF1 support and define a local cache.
setup_fastf1()
options(f1dataR.cache = "f1_cache")
dir.create("f1_cache", showWarnings = FALSE)
That may seem like a minor detail, but caching matters if you’re building serious analytical content. It makes your workflow faster, cleaner, and much easier to reproduce when you update notebooks, reports, or blog posts later.
Start with the race results
Before you dive into laps and strategy, start with historical race results. They form the backbone for season overviews, driver comparisons, constructor trends and predictive features.
results_2024 <- load_results(season = 2024)
results_2024 %>%
  clean_names() %>%
  select(round, race_name, driver, constructor, grid, position, points, status) %>%
  glimpse()
Once the results are loaded, you can create a season summary table that gives readers an instant overview of the competitive picture.
season_table <- results_2024 %>%
clean_names() %>%
group_by(driver, constructor) %>%
summarise(
races = n(),
wins = sum(position == 1, na.rm = TRUE),
podiums = sum(position <= 3, na.rm = TRUE),
avg_finish = mean(position, na.rm = TRUE),
avg_grid = mean(grid, na.rm = TRUE),
points = sum(points, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(points), avg_finish)
season_table
You can also convert that summary into a more organized publishing table for a blog or report.
season_table %>%
mutate(
avg_finish = round(avg_finish, 2),
avg_grid = round(avg_grid, 2)
) %>%
gt() %>%
tab_header(
title = "2024 Driver Season Summary",
subtitle = "Wins, podiums, average finish, and points"
)
These types of summaries are useful, but on their own they do not explain much about how results were achieved. That’s why the next step is important.
Looking beyond the final position
One of the easiest ways to improve an F1 analysis is to go beyond the final standings. A driver who finished sixth may have put in an excellent performance in a midfield car, while a podium finish in a dominant car can tell a much simpler story. A stronger framework compares results against starting position, teammates’ performance and race pace.
A good starting point is position gain.
position_gain_table <- results_2024 %>%
clean_names() %>%
mutate(
position_gain = grid - position
) %>%
group_by(driver, constructor) %>%
summarise(
mean_gain = mean(position_gain, na.rm = TRUE),
median_gain = median(position_gain, na.rm = TRUE),
total_gain = sum(position_gain, na.rm = TRUE),
races = n(),
.groups = "drop"
) %>%
arrange(desc(mean_gain))
position_gain_table
This metric is simple, but still valuable because it provides an initial signal of race execution. Of course it has limits. Front runners have less room to gain places, and midfield races are often influenced by strategy differences, incidents and reliability. Yet it is precisely that nuance that makes the discussion interesting.
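One way to make the front-runner limit visible is to express the gain as a share of the places a driver could possibly have gained. The following is a minimal sketch on invented data; the `places_available` and `gain_share` columns are hypothetical names introduced for illustration, not part of f1dataR.

```r
library(dplyr)

# Toy results, invented for illustration only
toy_results <- tibble(
  driver   = c("AAA", "BBB", "CCC"),
  grid     = c(2, 10, 1),
  position = c(1, 5, 1)
)

toy_gain <- toy_results %>%
  mutate(
    position_gain    = grid - position,
    places_available = grid - 1,  # places a driver could gain from that grid slot
    # Share of the available places actually gained; undefined when starting from pole
    gain_share = if_else(
      places_available > 0,
      position_gain / places_available,
      NA_real_
    )
  )

toy_gain
```

In this toy table the driver starting second and winning has gained every place available, while a pole-sitter who wins gets a missing value rather than a misleading zero. Whether that normalization suits a real season is a judgment call.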
Add race and track context
Performance in Formula 1 is always track dependent. Some cars are stronger on high-speed tracks, some drivers thrive on street circuits, and some teams handle tire-sensitive circuits better than others. Merging race results with schedule data can help you frame these questions more clearly.
schedule_2024 <- load_schedule(season = 2024) %>%
clean_names()
results_with_schedule <- results_2024 %>%
clean_names() %>%
left_join(
schedule_2024 %>%
select(round, race_name, circuit_name, locality, country, race_date),
by = c("round", "race_name")
)
results_with_schedule %>%
select(round, race_name, circuit_name, country, driver, constructor, grid, position) %>%
slice_head(n = 10)
Even at this stage, you already have enough structure to write multiple types of posts: top-performing drivers by track type, constructor consistency across the season, teammate gaps by circuit, or overperformance relative to starting position.
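A left join never drops result rows, so a mismatch in the join keys (a differently spelled race name, or a key stored as text in one table and as a number in the other) shows up only as missing circuit fields. A minimal sketch of a pre-join check with `anti_join()`, using invented tables that mimic the two key columns used in the join:

```r
library(dplyr)

# Toy tables, invented for illustration only
toy_results <- tibble(
  round     = c(1, 1, 2),
  race_name = c("Race A", "Race A", "Race B"),
  driver    = c("AAA", "BBB", "AAA")
)

toy_schedule <- tibble(
  round        = 1,
  race_name    = "Race A",
  circuit_name = "Circuit A"
)

# Result rows that have no matching schedule entry
unmatched <- anti_join(toy_results, toy_schedule, by = c("round", "race_name"))

unmatched
```

An empty `unmatched` table is a reasonable sign that the join keys line up; any rows it returns are the ones that would end up with missing circuit details.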
Lap times: where the analysis gets serious
Race results tell you what happened. Lap times tell you how it happened. This is where Formula 1 analysis becomes much more valuable, as you can start evaluating race pace, traffic impacts, tire wear and the shape of a driver’s performance over the entire event.
It’s usually best to focus on one racing session first, especially if your goal is to explain the process clearly.
# Session-level laps (compound, tyre life, stint, pit times) come from the FastF1-backed loader
session_laps <- load_session_laps(season = 2024, round = 10, session = "R") %>%
  clean_names()
session_laps %>%
  select(driver, lap_number, lap_time, compound, tyre_life, stint, pit_out_time, pit_in_time) %>%
  glimpse()
Lap time fields often need to be cleaned before they are suitable for visualization or modeling. Converting them into seconds is usually the most practical approach.
laps_clean <- session_laps %>%
mutate(
lap_time_seconds = as.numeric(lap_time),
sector1_seconds = as.numeric(sector_1_time),
sector2_seconds = as.numeric(sector_2_time),
sector3_seconds = as.numeric(sector_3_time)
) %>%
filter(!is.na(lap_time_seconds)) %>%
filter(lap_time_seconds > 50, lap_time_seconds < 200)
summary(laps_clean$lap_time_seconds)
Comparison of race pace per driver
Once the lap data has been cleaned, you can compare selected drivers and visualize how their pace evolves throughout the race.
selected_drivers <- c("VER", "NOR", "LEC", "HAM")
laps_clean %>%
filter(driver %in% selected_drivers) %>%
ggplot(aes(x = lap_number, y = lap_time_seconds, color = driver)) +
geom_line(alpha = 0.8, linewidth = 0.8) +
geom_point(size = 1.2, alpha = 0.7) +
scale_y_continuous(labels = label_number(accuracy = 0.1)) +
labs(
title = "Race pace by lap",
subtitle = "Raw lap times across the Grand Prix",
x = "Lap",
y = "Lap time (seconds)",
color = "Driver"
) +
theme_minimal(base_size = 13)
Raw lap time graphs are useful, but they are often noisy because pit laps, out-laps and unusual traffic can distort the pattern. A stronger analysis filters out some of that noise and focuses on the green flag pace.
green_flag_laps <- laps_clean %>%
filter(driver %in% selected_drivers) %>%
filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
group_by(driver) %>%
mutate(
median_lap = median(lap_time_seconds, na.rm = TRUE),
lap_delta = lap_time_seconds - median_lap
) %>%
ungroup() %>%
filter(abs(lap_delta) < 5)
green_flag_laps %>%
ggplot(aes(lap_number, lap_time_seconds, color = driver)) +
geom_line(linewidth = 0.9) +
geom_smooth(se = FALSE, method = "loess", span = 0.25, linewidth = 1.1) +
labs(
title = "Green-flag race pace",
subtitle = "Smoothed lap-time profile after removing pit laps and large outliers",
x = "Lap",
y = "Lap time (seconds)"
) +
theme_minimal(base_size = 13)
This type of graph is one of the most useful in F1 analysis, as it shows whether a driver was really fast, was just taking advantage of track position, or faded late in the race.
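The outlier filter in this section measures each lap against the driver's median rather than the mean. A small sketch on invented lap times suggests why that choice matters: one very slow lap drags the mean upward, which lets a moderately slow lap slip through a 5-second threshold.

```r
library(dplyr)

# Toy lap times, invented for illustration: one traffic lap and one very slow lap
toy_laps <- tibble(
  lap_number       = 1:7,
  lap_time_seconds = c(90.1, 90.3, 97.8, 90.2, 90.0, 112.5, 90.4)
)

# Reference pace = median: both slow laps fall outside the 5-second band
kept_median <- toy_laps %>%
  filter(abs(lap_time_seconds - median(lap_time_seconds)) < 5)

# Reference pace = mean: the 112.5 s lap inflates the mean, so the 97.8 s lap survives
kept_mean <- toy_laps %>%
  filter(abs(lap_time_seconds - mean(lap_time_seconds)) < 5)

nrow(kept_median)
nrow(kept_mean)
```

The 5-second band itself is an arbitrary cut-off; a tighter or circuit-specific threshold may suit some races better.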
Tire degradation and stint analysis
One of the best ways to add real authority to an F1 post is to quantify degradation. Rather than simply saying that a driver “managed the tires well,” you can estimate how lap time changed as tire life increased during a stint.
stint_degradation <- laps_clean %>%
filter(driver %in% selected_drivers) %>%
filter(!is.na(stint), !is.na(tyre_life), !is.na(compound)) %>%
filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
group_by(driver, stint, compound) %>%
filter(n() >= 8) %>%
nest() %>%
mutate(
model = map(data, ~ lm(lap_time_seconds ~ tyre_life, data = .x)),
tidied = map(model, broom::tidy)
) %>%
unnest(tidied) %>%
filter(term == "tyre_life") %>%
transmute(
driver,
stint,
compound,
degradation_per_lap = estimate,
p_value = p.value
) %>%
arrange(degradation_per_lap)
stint_degradation
A positive slope generally means that the pace drops as the stint ages. A smaller slope indicates better tire retention or a more stable pace. The interpretation is not always easy, because the racing context matters (fuel burn-off, for example, makes the car lighter and faster over a stint and can mask tire wear), but the method is very effective at turning racing discussions into evidence.
laps_clean %>%
filter(driver %in% selected_drivers, !is.na(stint), !is.na(tyre_life)) %>%
filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
ggplot(aes(tyre_life, lap_time_seconds, color = driver)) +
geom_point(alpha = 0.5, size = 1.6) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
facet_wrap(~ compound, scales = "free_x") +
labs(
title = "Tyre degradation by compound",
subtitle = "Linear approximation of pace loss as the stint ages",
x = "Tyre life (laps)",
y = "Lap time (seconds)"
) +
theme_minimal(base_size = 13)
This is exactly the kind of analysis that makes a technical article memorable, because it goes from “who won?” to “why did the performance pattern look like this?”
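To build intuition for what the slope means, it can help to fit the same kind of linear model to a stint where the answer is known in advance. The sketch below uses a synthetic stint (invented numbers, not real timing data) with 0.08 seconds of pace loss per lap built in, and checks that `lm()` recovers roughly that value.

```r
# Synthetic stint, invented for illustration: 90 s base pace,
# 0.08 s lost per lap of tire age, plus a small alternating wobble
toy_stint <- data.frame(tyre_life = 1:10)
toy_stint$lap_time_seconds <- 90 + 0.08 * toy_stint$tyre_life +
  rep(c(0.05, -0.05), times = 5)

fit <- lm(lap_time_seconds ~ tyre_life, data = toy_stint)

# Estimated degradation in seconds per lap of tire age
deg_per_lap <- unname(coef(fit)["tyre_life"])

# Projected pace loss over a 20-lap stint, assuming the trend stays linear
loss_20_laps <- deg_per_lap * 20

round(deg_per_lap, 3)
round(loss_20_laps, 2)
```

Reading the slope as "seconds lost per lap of tire age" and multiplying by stint length gives a rough figure that is easy to discuss, as long as the linearity assumption is stated.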
Pit stops and strategy
Pit strategy is one of the clearest examples of how Formula 1 combines data and decision-making. A stop is not just an event; it is a trade-off between track position, tire life, race pace and the behavior of nearby competitors.
pit_summary <- session_laps %>%
clean_names() %>%
mutate(
had_pit_event = !is.na(pit_out_time) | !is.na(pit_in_time)
) %>%
group_by(driver) %>%
summarise(
total_laps = n(),
pit_events = sum(had_pit_event, na.rm = TRUE),
stints = n_distinct(stint, na.rm = TRUE),
first_compound = first(na.omit(compound)),
last_compound = last(na.omit(compound)),
.groups = "drop"
) %>%
arrange(desc(pit_events))
pit_summary
A better way to explain strategy is to reconstruct the stints directly.
strategy_table <- session_laps %>%
clean_names() %>%
arrange(driver, lap_number) %>%
group_by(driver, stint) %>%
summarise(
start_lap = min(lap_number, na.rm = TRUE),
end_lap = max(lap_number, na.rm = TRUE),
laps_in_stint = n(),
compound = first(na.omit(compound)),
avg_lap = mean(as.numeric(lap_time), na.rm = TRUE),
median_lap = median(as.numeric(lap_time), na.rm = TRUE),
.groups = "drop"
) %>%
arrange(driver, stint)
strategy_table
strategy_table %>%
ggplot(aes(x = start_lap, xend = end_lap, y = driver, yend = driver, color = compound)) +
geom_segment(linewidth = 6, lineend = "round") +
labs(
title = "Race strategy by driver",
subtitle = "Stint map reconstructed from lap-level data",
x = "Lap window",
y = "Driver",
color = "Compound"
) +
theme_minimal(base_size = 13)
Once you have stint maps, your analysis immediately becomes more strategic. You can discuss undercuts, overcuts, long first stints, aggressive early stops and whether a team actually turned tire freshness into meaningful gains.
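If a lap table has no ready-made stint column, stints can usually be rebuilt from a pit-out flag with a running count. This is a minimal sketch on invented laps; the `pit_out` flag is a hypothetical column standing in for whatever pit marker the data provides.

```r
library(dplyr)

# Toy laps, invented for illustration: one stop, rejoining on lap 5
toy_laps <- tibble(
  driver     = "AAA",
  lap_number = 1:8,
  pit_out    = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  compound   = c(rep("MEDIUM", 4), rep("HARD", 4))
)

toy_stints <- toy_laps %>%
  arrange(driver, lap_number) %>%
  group_by(driver) %>%
  # Each pit-out lap starts a new stint, so a running count of the flags numbers the stints
  mutate(stint = 1L + cumsum(pit_out)) %>%
  group_by(driver, stint) %>%
  summarise(
    start_lap = min(lap_number),
    end_lap   = max(lap_number),
    compound  = first(compound),
    .groups   = "drop"
  )

toy_stints
```

The result has one row per stint with its lap window and compound, which is the same shape the stint map plot needs.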
Measuring the post-stop pace
A useful extension is to investigate whether a driver actually benefits from fresh tires after a stop. That’s one of the easiest ways to move from descriptive pit analysis to strategic interpretation.
post_stop_pace <- session_laps %>%
clean_names() %>%
arrange(driver, lap_number) %>%
group_by(driver) %>%
mutate(
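pit_out_lap = !is.na(pit_out_time),
# Running count that steps up on the lap after each out-lap, so each group is one run of laps on a given set of tires
stop_number = cumsum(lag(pit_out_lap, default = FALSE))
) %>%
ungroup() %>%
filter(!is.na(lap_time)) %>%
group_by(driver, stop_number) %>%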
summarise(
first_laps_avg = mean(as.numeric(lap_time)[1:min(3, n())], na.rm = TRUE),
stint_avg = mean(as.numeric(lap_time), na.rm = TRUE),
.groups = "drop"
)
post_stop_pace
Tables like this help answer a much better question than “when did they pit?” They ask: “Did the stop create a useful pace advantage, and was that advantage large enough to affect the race?”
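A simple way to put a number on that question is to compare the last few laps before a stop with the first few flying laps after it. The sketch below does this on invented lap times; the `phase` labels and the three-lap windows are assumptions chosen for illustration.

```r
library(dplyr)

# Toy laps around a single stop, invented for illustration
toy_stop <- tibble(
  lap_number = 1:9,
  lap_time   = c(92.0, 92.2, 92.4, 92.6, 92.8, 110.0, 91.0, 91.1, 91.2),
  phase      = c(rep("before", 5), "out_lap", rep("after", 3))
)

# Average of the last three laps on the old tires
pace_before <- toy_stop %>%
  filter(phase == "before") %>%
  slice_tail(n = 3) %>%
  pull(lap_time) %>%
  mean()

# Average of the first three flying laps on the new tires (out-lap excluded)
pace_after <- toy_stop %>%
  filter(phase == "after") %>%
  slice_head(n = 3) %>%
  pull(lap_time) %>%
  mean()

# Positive values mean the driver was quicker after the stop
pace_gain <- pace_before - pace_after
pace_gain
```

A gain like this still has to be weighed against the time lost in the pit lane, which this sketch deliberately leaves out.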
Teammate comparison as the best benchmark
In Formula 1, comparing teammates is often more informative than comparing the entire grid, because sharing the same car is the closest thing to a controlled environment. If one driver consistently beats another in terms of grid position, race finish or pace consistency, that tells you something much more precise than the overall championship table.
teammate_table <- results_2024 %>%
clean_names() %>%
group_by(constructor, round, race_name) %>%
mutate(
teammate_finish_rank = min_rank(position),
teammate_grid_rank = min_rank(grid)
) %>%
ungroup() %>%
group_by(driver, constructor) %>%
summarise(
avg_finish = mean(position, na.rm = TRUE),
avg_grid = mean(grid, na.rm = TRUE),
teammate_beating_rate_finish = mean(teammate_finish_rank == 1, na.rm = TRUE),
teammate_beating_rate_grid = mean(teammate_grid_rank == 1, na.rm = TRUE),
points = sum(points, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(teammate_beating_rate_finish), desc(points))
teammate_table
That kind of comparison is especially powerful in a technical post, because it gives readers a benchmark they already understand intuitively, while still grounding the discussion in data.
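One design choice worth making explicit is how to treat races where a teammate has no classified finish. The sketch below, on invented results, counts a head-to-head only when both drivers of a team have a finishing position; treating a missing `position` as a non-finish is an assumption made for illustration.

```r
library(dplyr)

# Toy results, invented for illustration: BBB has no classified finish in round 3
toy_results <- tibble(
  round       = rep(1:3, each = 2),
  constructor = "Team X",
  driver      = rep(c("AAA", "BBB"), times = 3),
  position    = c(3, 5, 8, 6, 2, NA)
)

head_to_head <- toy_results %>%
  group_by(constructor, round) %>%
  # Keep only races where both teammates have a finishing position
  filter(n() == 2, all(!is.na(position))) %>%
  mutate(beat_teammate = position == min(position)) %>%
  group_by(driver) %>%
  summarise(
    races_compared = n(),
    h2h_rate       = mean(beat_teammate),
    .groups        = "drop"
  )

head_to_head
```

Reporting `races_compared` next to the rate keeps the comparison honest, since a rate built on two races means much less than one built on twenty.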
Sector analysis
If lap times tell you the overall pace story, sectors can help reveal where that pace is gained or lost. Even without diving into full telemetry, sector splits can show whether a driver is strong in traction zones, high-speed sections, or parts of the track that require heavy braking.
sector_summary <- laps_clean %>%
filter(driver %in% selected_drivers) %>%
group_by(driver) %>%
summarise(
s1 = mean(sector1_seconds, na.rm = TRUE),
s2 = mean(sector2_seconds, na.rm = TRUE),
s3 = mean(sector3_seconds, na.rm = TRUE),
total = mean(lap_time_seconds, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_longer(cols = c(s1, s2, s3), names_to = "sector", values_to = "seconds")
sector_summary %>%
ggplot(aes(sector, seconds, fill = driver)) +
geom_col(position = "dodge") +
labs(
title = "Average sector times by driver",
subtitle = "A simple way to localize pace differences",
x = "Sector",
y = "Average time (seconds)",
fill = "Driver"
) +
theme_minimal(base_size = 13)
This type of breakdown is useful because it localizes the analysis. Instead of saying a driver was faster overall, you can show where the time came from.
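Average sector times are dominated by the length of each sector, so small differences between drivers can be hard to see on a bar chart. One option is to express each value as a gap to the fastest driver in that sector. A minimal sketch on invented sector averages:

```r
library(dplyr)

# Toy sector averages in long format, invented for illustration
toy_sectors <- tibble(
  driver  = rep(c("AAA", "BBB"), each = 3),
  sector  = rep(c("s1", "s2", "s3"), times = 2),
  seconds = c(28.1, 35.4, 24.0, 28.3, 35.1, 24.0)
)

sector_deltas <- toy_sectors %>%
  group_by(sector) %>%
  # Gap to the quickest driver in each sector; the benchmark driver gets 0
  mutate(delta_to_best = seconds - min(seconds)) %>%
  ungroup()

sector_deltas
```

Plotting `delta_to_best` instead of raw seconds puts every sector on the same scale, which makes it easier to see where a driver gains or loses time.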
From description to prediction
One of the strongest editorial angles for an article like this is to end with a section on predictive modeling. A title like “Formula 1 Data Science in R: Predicting Race Results” works well because it combines a clear purpose, technical interest and a topic with built-in audience appeal.
The key is to be realistic. The goal is not to promise perfect predictions. It is to show how descriptive Formula 1 data can be converted into features for a basic model.
model_data <- results_2024 %>%
clean_names() %>%
arrange(driver, round) %>%
group_by(driver) %>%
mutate(
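# Lag the finish and points windows by one race so they only use races before the one being predicted
rolling_avg_finish_3 = lag(slide_dbl(position, mean, .before = 2, .complete = FALSE, na.rm = TRUE)),
# Grid position is known before the race starts, so the current race can stay in this window
rolling_avg_grid_3 = slide_dbl(grid, mean, .before = 2, .complete = FALSE, na.rm = TRUE),
rolling_points_3 = lag(slide_dbl(points, mean, .before = 2, .complete = FALSE, na.rm = TRUE)),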
prev_finish = lag(position),
prev_grid = lag(grid)
) %>%
ungroup() %>%
mutate(
target_top10 = if_else(position <= 10, 1, 0),
target_podium = if_else(position <= 3, 1, 0)
) %>%
select(
round, race_name, driver, constructor, grid, points, position,
rolling_avg_finish_3, rolling_avg_grid_3, rolling_points_3,
prev_finish, prev_grid, target_top10, target_podium
) %>%
drop_na()
glimpse(model_data)
This dataset has been deliberately kept simple, and in a tutorial that is a strength. It makes the logic visible and gives readers something they can actually reproduce and extend.
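A rolling feature is only usable for prediction if it is built from races before the one being predicted. The sketch below, on an invented sequence of finishing positions, contrasts a three-race window that includes the current race with the same window shifted back by one race using `lag()`.

```r
library(dplyr)
library(slider)

# Toy finishing positions for one driver, invented for illustration
toy_form <- tibble(
  round    = 1:4,
  position = c(1, 4, 7, 10)
) %>%
  mutate(
    # Window covers the current race and the two before it
    rolling_incl_current = slide_dbl(position, mean, .before = 2, .complete = FALSE),
    # Same window shifted back one race: only information available before the start
    rolling_prior_only = lag(rolling_incl_current)
  )

toy_form
```

In the first column the value for round 4 already contains the round 4 result, which is the thing a model would be trying to predict. The shifted column avoids that at the cost of a missing value in the first round.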
Predict a top-10 finish
set.seed(42)
# logistic_reg() expects a factor outcome; listing 1 first makes "top 10" the event level for the metrics
model_data <- model_data %>%
  mutate(target_top10 = factor(target_top10, levels = c(1, 0)))
split_obj <- initial_split(model_data, prop = 0.8, strata = target_top10)
train_data <- training(split_obj)
test_data <- testing(split_obj)
log_recipe <- recipe(
  target_top10 ~ grid + rolling_avg_finish_3 + rolling_avg_grid_3 +
    rolling_points_3 + prev_finish + prev_grid,
  data = train_data
) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())
log_spec <- logistic_reg() %>%
  set_engine("glm")
log_workflow <- workflow() %>%
  add_recipe(log_recipe) %>%
  add_model(log_spec)
log_fit <- fit(log_workflow, data = train_data)
top10_predictions <- predict(log_fit, new_data = test_data, type = "prob") %>%
  bind_cols(predict(log_fit, new_data = test_data)) %>%
  bind_cols(test_data %>% select(target_top10))
top10_predictions
# truth must be a factor column; .pred_1 is the predicted probability of a top-10 finish
top10_predictions %>%
  roc_auc(truth = target_top10, .pred_1)
top10_predictions %>%
  accuracy(truth = target_top10, estimate = .pred_class)
Predict final position
finish_recipe <- recipe(
position ~ grid + rolling_avg_finish_3 + rolling_avg_grid_3 +
rolling_points_3 + prev_finish + prev_grid,
data = train_data
) %>%
step_impute_median(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
lm_spec <- linear_reg() %>%
set_engine("lm")
lm_workflow <- workflow() %>%
add_recipe(finish_recipe) %>%
add_model(lm_spec)
lm_fit <- fit(lm_workflow, data = train_data)
finish_predictions <- predict(lm_fit, new_data = test_data) %>%
bind_cols(test_data %>% select(position, driver, constructor, race_name, grid))
metrics(finish_predictions, truth = position, estimate = .pred)
finish_predictions %>%
ggplot(aes(position, .pred)) +
geom_point(alpha = 0.7, size = 2) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "Predicted vs actual finishing position",
subtitle = "Baseline linear model",
x = "Actual finish",
y = "Predicted finish"
) +
theme_minimal(base_size = 13)
A basic model like this is not intended to be a perfect prediction system. Its real value is educational. It shows how to go from results tables to feature engineering, and then from features to a reproducible predictive workflow.
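An error metric is easier to judge next to a naive reference. One plausible reference for Formula 1 is "every driver finishes where they started". The sketch below compares mean absolute error for that rule and for a set of model predictions; all numbers are invented, and the `mae()` helper is defined here for illustration rather than taken from a package.

```r
# Toy predictions, invented for illustration only
toy_preds <- data.frame(
  position = c(1, 3, 2, 6, 5),           # actual finish
  grid     = c(1, 2, 3, 4, 8),           # naive baseline: finish = grid slot
  .pred    = c(1.5, 2.5, 2.5, 5.0, 6.0)  # hypothetical model output
)

# Mean absolute error, in finishing places
mae <- function(truth, estimate) mean(abs(truth - estimate))

mae_model <- mae(toy_preds$position, toy_preds$.pred)
mae_grid  <- mae(toy_preds$position, toy_preds$grid)

mae_model
mae_grid
```

If a real model cannot beat the grid-order baseline on held-out races, the extra features are probably not adding much beyond what qualifying already tells you.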
A simple custom driver rating
If you want the article to feel more original, a good option is to create a custom driver score. Composite metrics work well in Formula 1 writing because they combine multiple performance dimensions into one interpretable ranking.
driver_rating <- results_2024 %>%
clean_names() %>%
group_by(driver, constructor) %>%
summarise(
avg_finish = mean(position, na.rm = TRUE),
avg_grid = mean(grid, na.rm = TRUE),
points = sum(points, na.rm = TRUE),
wins = sum(position == 1, na.rm = TRUE),
podiums = sum(position <= 3, na.rm = TRUE),
gain = mean(grid - position, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
finish_score = rescale(-avg_finish, to = c(0, 100)),
grid_score = rescale(-avg_grid, to = c(0, 100)),
points_score = rescale(points, to = c(0, 100)),
gain_score = rescale(gain, to = c(0, 100)),
win_score = rescale(wins, to = c(0, 100)),
rating = 0.30 * finish_score +
0.20 * grid_score +
0.25 * points_score +
0.15 * gain_score +
0.10 * win_score
) %>%
arrange(desc(rating))
driver_rating
The most important thing here is transparency. Readers do not have to agree with every weight in the formula. What matters is that the method is explicit, interpretable and easy to criticize or improve.
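Part of that transparency is showing what the scaling step does to the inputs. The sketch below writes out min-max scaling by hand on invented averages and checks that the weights add up to one. The `rescale_0_100()` helper is a hypothetical stand-in written for illustration, intended to mirror what a min-max rescale to the 0 to 100 range does.

```r
# Hand-written min-max scaling to 0-100 (illustrative stand-in, not a package function)
rescale_0_100 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * 100
}

# Toy average finishing positions, invented for illustration
avg_finish <- c(AAA = 2, BBB = 5, CCC = 8)

# Negating first means a lower (better) average finish earns a higher score
finish_score <- rescale_0_100(-avg_finish)
finish_score

# Weights copied from the rating formula; they should add up to 1
weights <- c(finish = 0.30, grid = 0.20, points = 0.25, gain = 0.15, win = 0.10)
weights_ok <- isTRUE(all.equal(sum(weights), 1))
weights_ok
```

Because every component is scaled relative to the best and worst driver in the sample, the rating is a within-season ranking and should not be compared across seasons without care.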
Final thoughts
Formula 1 analytics in R is an unusually strong content niche because it combines technical accuracy with a naturally engaged audience. With f1dataR you can start with historical race results, move to lap time and stint analysis, explore pit strategy and driver benchmarking, and then build basic predictive models that complete the workflow.
That range is exactly what makes this such a good topic for an authority-building article. It’s practical, reproducible and opens the door to a whole series of follow-up posts on telemetry, qualifying, tire degradation, teammate comparisons and race predictions.
If your goal is to publish technical content that demonstrates real expertise, rather than just covering surface-level examples, Formula 1 data science in R is one of the best domains to choose.
The post Formula 1 analysis in R with f1dataR: lap times, pit stops and driver performance appeared first on R Programming Books.


