Pledge My Time VII | R bloggers

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers.]


Here we go again! I ran the Mainova Frankfurt Marathon 2025 and wanted to look at the race results. How can we do this using R?

I couldn’t see an easy way to download the data, so I used R to scrape it. Note that these times are currently provisional, but they give us a good idea of what happened.

The results are available through a search form for looking up an individual’s result. If we leave every field blank and set the number of results per page to the maximum, we get the first of 16 pages that together hold all the results. The rule is: if we can see it, we can scrape it!

We can use {rvest} to scrape this data. The steps are: figure out the format of the items to be extracted (in this case each runner is a list item and the data fields are divs within each item), write a function to extract all the runners on a page, write a function to process a page, and call this function for each page! It might be easier to see the code:

library(rvest)
library(dplyr)
library(purrr)
library(stringr)
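library(ggplot2) # plotting; also attached by ggforce, loaded explicitly for clarity
library(ggforce) # provides geom_sina()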


## Functions ----
# reads one results page and returns its runners as a data frame
scrape_results_page <- function(url) {
  webpage <- read_html(url)
  df <- scrape_startlist(webpage)
  # drop the first list item (presumably the list's header row, not a runner)
  df <- df[-1, ]
  return(df)
}

# scrapes the data
scrape_startlist <- function(page) {
  rows <- page %>% html_nodes("li.list-group-item.row")
  map_df(rows, function(row) {
    # helper to get text from a selector, remove small labels and trim
    get_text <- function(sel) {
      node <- row %>% html_node(sel)
      if (is.na(node) || length(node) == 0) return(NA_character_)
      # remove the mobile label nodes inside if present
      node %>% html_nodes(".visible-xs-block, .visible-sm-block, .list-label") %>% xml2::xml_remove()
      text <- node %>% html_text(trim = TRUE)
      if (length(text) == 0) return(NA_character_) else return(text)
    }
    
    # place primary/secondary
    place_primary <- get_text(".type-place.place-primary")
    place_secondary <- get_text(".type-place.place-secondary")
    
    # fullname and link
    fullname_a <- row %>% html_node("h4.type-fullname a")
    fullname <- if (length(fullname_a) == 0) NA_character_ else fullname_a %>% html_text(trim = TRUE)
    link <- if (length(fullname_a) == 0) NA_character_ else fullname_a %>% html_attr("href")
    
    # bib, club/city, age class (these are under second column)
    bib <- get_text(".type-field")
    club_city <- get_text(".type-priority")
    age_class <- get_text(".type-age_class")
    
    # finish and gun time: there are multiple .type-time entries; take them in order
    times <- row %>% html_nodes(".type-time") %>% html_text(trim = TRUE)
    times <- times[times != ""] # drop blanks
    finish <- if (length(times) >= 1) times[1] else NA_character_
    gun_time <- if (length(times) >= 2) times[2] else NA_character_
    
    # make data frame. We don't need gun time or link
    data.frame(
      place_primary = place_primary,
      place_secondary = place_secondary,
      fullname = fullname,
      bib = bib,
      club_city = club_city,
      age_class = age_class,
      finish = finish
    )
  })
}

# Specifying the base url for website to be scraped
url <- "https://live.frankfurt-marathon.com/2025/?page="

# the pages are like this:
# "https://live.frankfurt-marathon.com/2025/?page=2&event=L_HCH3BKLB3B8&num_results=1000&pid=startlist_list&pidp=startlist&search%5Bage_class%5D=%25&search%5Bsex%5D=%25&search%5Bnation%5D=%25&search_sort=name"
# we have 1000 results on a page and the first page shows there are 16 pages total
n_pages <- 16
# make a list of all urls to be scraped
urls <- paste0(url, seq(n_pages), "&event=L_HCH3BKLB3B8&num_results=1000&pid=startlist_list&pidp=startlist&search%5Bage_class%5D=%25&search%5Bsex%5D=%25&search%5Bnation%5D=%25&search_sort=name")
# scrape each page one by one and rbind into large df
result  <- do.call(rbind, lapply(urls, scrape_results_page))

The hardest part here is finding out the names of the HTML nodes that contain the data. I just looked at the source of the page in my browser and noted which div classes were needed.
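To illustrate how class-based selectors pick out fields, here is a minimal sketch on an inline HTML snippet. The markup is hypothetical: it borrows the class names used in the code above, but the real page's structure and the runner shown are assumptions made up for this example.

```r
library(rvest)

# Hypothetical markup, loosely modelled on the selectors used above
snippet <- '
<ul>
  <li class="list-group-item row">
    <div class="type-place place-primary">1</div>
    <h4 class="type-fullname"><a href="?idp=abc">Doe, Jane (GBR)</a></h4>
    <div class="type-time"><div class="list-label">Finish</div>02:30:00</div>
  </li>
</ul>'

page <- read_html(snippet)

# each runner is one list item
rows <- html_nodes(page, "li.list-group-item.row")

# fields are found by class within the item
name <- html_text(html_node(rows[[1]], "h4.type-fullname a"), trim = TRUE)
time <- html_text(html_node(rows[[1]], ".type-time"), trim = TRUE)

print(name)
print(time) # the label text is glued onto the time, so it needs cleaning
```

If the real markup nests a label inside the time field like this, it would explain why the scraped finish times need a clean-up step afterwards.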

So now we have a data frame called result which contains all the data. We need to clean things up a bit first:

ages <- c("U18", "JU20", "U23", "H",
          "30", "35", "40", "45", "50", "55", "60", "65", "70", "75", "80", "85", "–")
# order the age_class factor levels
result$age_class <- factor(result$age_class, levels = ages)
# if the bib number starts with F add "Female" to new "gender" column, otherwise assume "Male"
result$gender <- ifelse(startsWith(result$bib, "F"), "Female", "Male")
# remove "Finish" text from finish times
result$finish <- str_replace(result$finish, "Finish", "")
# convert string times to hh:mm:ss POSIXct
result$finish_time <- as.POSIXct(result$finish, format = "%H:%M:%S", tz = "UTC")

Here we simply put the age classes in the correct order. There are two genders for this event and we can parse them from the bib numbers. Finally, the scraped finish time came through slightly mangled, with the text "Finish" attached, so we had to fix that. We left the gun time and the link to each runner’s details out of the data frame in the first function because we don’t need them.
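As a minimal sketch of what this clean-up does, here it is applied to a few made-up values in base R. The raw strings and bib numbers below are invented for illustration and only assume the same "Finish" prefix and "F" bib convention described above.

```r
# Made-up raw values; NA stands for a runner with no finish time
finish_raw <- c("Finish02:06:16", "Finish03:59:58", NA)
bib <- c("12", "F7", "F1234")

# strip the label, then parse hh:mm:ss (the date part defaults to today)
finish_clean <- sub("Finish", "", finish_raw, fixed = TRUE)
finish_time <- as.POSIXct(finish_clean, format = "%H:%M:%S", tz = "UTC")

# bibs starting with "F" are treated as female, everything else as male
gender <- ifelse(startsWith(bib, "F"), "Female", "Male")

print(format(finish_time, "%H:%M:%S"))
print(gender)
```

Runners without a finish time stay `NA` after parsing, which is what the finisher counts later rely on.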

Let’s look at some facts and figures!

## Some facts and figures ----
# total runners
total_runners <- nrow(result)
cat("Total runners:", total_runners, "\n")
# total finishers (those without NA as finish_time)
total_finishers <- sum(!is.na(result$finish_time))
cat("Total finishers:", total_finishers, "\n")
# average finish time
avg_finish_time <- mean(result$finish_time, na.rm = TRUE)
cat("Average finish time:", format(avg_finish_time, "%H:%M:%S"), "\n")
# fastest finish time
fastest_finish_time <- min(result$finish_time, na.rm = TRUE)
cat("Fastest finish time:", format(fastest_finish_time, "%H:%M:%S"), "\n")
# slowest finish time
slowest_finish_time <- max(result$finish_time, na.rm = TRUE)
cat("Slowest finish time:", format(slowest_finish_time, "%H:%M:%S"), "\n")

# break down the same stats by gender
for (g in unique(result$gender)) {
  cat("Gender:", g, "\n")
  res_g <- result[result$gender == g, ]
  total_runners_g <- nrow(res_g)
  cat("  Total runners:", total_runners_g, "\n")
  total_finishers_g <- sum(!is.na(res_g$finish_time))
  cat("  Total finishers:", total_finishers_g, "\n")
  avg_finish_time_g <- mean(res_g$finish_time, na.rm = TRUE)
  cat("  Average finish time:", format(avg_finish_time_g, "%H:%M:%S"), "\n")
  fastest_finish_time_g <- min(res_g$finish_time, na.rm = TRUE)
  cat("  Fastest finish time:", format(fastest_finish_time_g, "%H:%M:%S"), "\n")
  slowest_finish_time_g <- max(res_g$finish_time, na.rm = TRUE)
  cat("  Slowest finish time:", format(slowest_finish_time_g, "%H:%M:%S"), "\n")
}

This gives us:

Total runners: 15456 
Total finishers: 12323 
Average finish time: 03:53:51 
Fastest finish time: 02:06:16 
Slowest finish time: 07:13:07

Gender: Male 
  Total runners: 11913 
  Total finishers: 9497 
  Average finish time: 03:48:09 
  Fastest finish time: 02:06:16 
  Slowest finish time: 07:13:07 
Gender: Female 
  Total runners: 3543 
  Total finishers: 2826 
  Average finish time: 04:12:59 
  Fastest finish time: 02:19:34 
  Slowest finish time: 06:47:55 
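The per-gender loop could also be written as a grouped summary with {dplyr}. This is only a sketch of an alternative, not what was run for the output above, and the toy data frame below is invented to keep the example self-contained.

```r
library(dplyr)

# Toy data standing in for `result` (values invented for illustration)
toy <- data.frame(
  gender = c("Male", "Male", "Female", "Female"),
  finish_time = as.POSIXct(c("03:00:00", NA, "03:30:00", "04:30:00"),
                           format = "%H:%M:%S", tz = "UTC")
)

# one row per gender: runners, finishers and mean finish time
summary_by_gender <- toy %>%
  group_by(gender) %>%
  summarise(
    runners   = n(),
    finishers = sum(!is.na(finish_time)),
    mean_time = format(mean(finish_time, na.rm = TRUE), "%H:%M:%S"),
    .groups = "drop"
  )

print(summary_by_gender)
```

The result is a small data frame rather than printed text, which is handier if the numbers are needed again later.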

Let’s assume that every runner listed in the results started the event and that a missing finish time indicates a DNF (did not finish). This means that the completion rate was 80%, and it was the same for men and women. I’m surprised that 20% of the runners didn’t finish. The course is very flat and, although it was quite windy, it was not challenging as marathons go. This 20% may include people who did not start (DNS).
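The completion rate follows directly from the counts printed above. A quick check, using only those numbers:

```r
# counts copied from the output above
runners   <- c(All = 15456, Male = 11913, Female = 3543)
finishers <- c(All = 12323, Male = 9497,  Female = 2826)

# completion rate in percent, to one decimal place
completion <- round(100 * finishers / runners, 1)
print(completion)
```

All three values come out within half a percentage point of 80%.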

Let’s look at the finish times and how they are distributed.

## Plots ----

# filter out DNFs and "-" for age class
result <- result %>%
  filter(!is.na(finish_time)) %>%
  filter(age_class != "–")

mycolors <- c(rgb(218,63,65, maxColorValue = 255),
              rgb(11,46,114, maxColorValue = 255))

ggplot(result, aes(x = finish_time)) +
  geom_histogram(binwidth = 60, fill = mycolors[1]) +
  labs(x = "Finish Time",
       y = "Count") +
  # 20 minute ticks on x axis
  scale_x_datetime(date_breaks = "20 min", date_labels = "%H:%M") +
  theme_minimal()
# save plot
ggsave("Output/Plots/frankfurt_marathon_2025_finish_time_histogram.png", width = 10, height = 6, bg = "white")

# plot finish times by age class facet by gender
ggplot(result, aes(x = age_class, y = finish_time, colour = gender)) +
  geom_sina(alpha = 0.2, stroke = 0) +
  scale_y_datetime(date_breaks = "20 min", date_labels = "%H:%M") +
  stat_summary(fun = mean, geom = "point", size = 2, colour = "black", alpha = 0.8) +
  scale_colour_manual(values = mycolors) +
  facet_wrap(~ gender) +
  labs(x = "",
       y = "Finish Time") +
  theme_minimal() +
  theme(legend.position = "none")

# save plot
ggsave("Output/Plots/frankfurt_marathon_2025_finish_times_by_age_class.png", width = 10, height = 6)

This gives us two plots. First of all, the finish times per age class and gender:

The mean (average) time per category is shown as a black circle; each runner is a red or blue dot. The average time for men appears to peak in the 35-40 age categories, although the fastest individual times are in the under-35 categories. For women, there is a similar slowdown in average finish time in the older age groups, but less of a peak effect. However, there are fewer female participants, so the effect may simply be harder to see.

This plot is quite nice because you can see the density of runners in each category to get an idea of participation. There is also a striping effect that is most evident in the data for men.

The second plot is a histogram of the finish times of all participants. There are peaks just under 3 hours and 4 hours. There is also a build-up of runners finishing around 3:30 and 5:00. These round numbers are of course target times for many runners.
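One way to put a number on this pile-up is to count the finishers in a short window just before each target time. The sketch below was not part of the original analysis: the finish times (in seconds) are toy values, and the 5-minute window is an arbitrary choice.

```r
# toy finish times in seconds (invented for illustration)
secs <- c(10790, 10795, 10810, 12590, 14395, 14399, 14500)

# target times in seconds, and the width of the window before each one
targets <- c("3:00" = 3, "3:30" = 3.5, "4:00" = 4, "5:00" = 5) * 3600
window <- 5 * 60

# number of runners finishing in the 5 minutes before each target
just_under <- sapply(targets, function(t) sum(secs >= t - window & secs < t))
print(just_under)
```

With the real data, `secs` could be derived from `finish_time`, and comparing each count with the window just after the target would show how sharp each peak is.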

Congratulations to everyone who took part and especially to those who achieved the goals they set for themselves.

The post title is taken from ‘Pledging My Time’, a track from Bob Dylan’s Blonde on Blonde.

