Fighting Data Science in R: Proven Boxing Statistics and Models | R bloggers

Boxing analysis is no longer just about punch totals or ‘who looked busier’. Modern fight analysis is data science: repeatable pipelines, validated data, explainable models, and performance indicators that translate into strategy. This post shows you how to build a professional fight data science workflow in R, from raw data to statistics, modeling, and tactical insights, using code that you can adapt to your own data sets.

You get: a production-style project structure, data contracts, validation checks, feature engineering patterns, round-by-round models, fatigue and momentum signals, and clear visualizations for coaches and analysts. The goal is to help you go from “interesting graphs” to decision-grade analysis.


1) Professional design and project structure

A ‘professional’ analysis workflow starts with discipline: consistent folders, reproducible environments, and a clear separation between raw → clean → features → models. Even if you work solo, this structure makes it easier to iterate and publish your work.

# Core libraries for fight data science
pkgs <- c(
  "tidyverse", "janitor", "lubridate", "glue", "cli",
  "arrow", "here", "fs",
  "duckdb", "DBI",
  "slider",
  "rsample", "recipes", "parsnip", "workflows", "tune", "dials",
  "yardstick", "broom", "glmnet",
  "ggrepel", "patchwork"
)

to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
invisible(lapply(pkgs, library, character.only = TRUE))

# Create a clean project layout (idempotent)
dirs <- c(
  "data/raw",
  "data/clean",
  "data/features",
  "data/models",
  "plots",
  "reports",
  "R"
)

walk(dirs, ~ fs::dir_create(here::here(.x)))

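# cli interpolates {expressions} itself; .envir makes it look in the caller's
# environment, so messages can reference the calling function's local variables
log_info <- function(msg, .envir = parent.frame()) cli::cli_alert_info(msg, .envir = .envir)
log_ok   <- function(msg, .envir = parent.frame()) cli::cli_alert_success(msg, .envir = .envir)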

log_ok("Project folders ready at: {here::here()}")

Tip: Keep raw files read-only and always write standardized output (e.g. Parquet) to data/clean/. This speeds up your workflow right away and cuts down on ‘mystery bugs’.


2) A fight data contract (schemas that prevent chaos)

The quickest way to break a fight analysis project is to let columns drift (“fighter” vs. “boxer”, mixed date formats, different naming conventions for the same actions). A data contract prevents that. Below are two useful contracts:

  • Round totals (works with CompuBox-style sources)
  • Event-level metadata (for joins and reporting)

round_schema <- tibble::tribble(
  ~column,              ~type,       ~notes,
  "fight_id",           "character",  "Unique fight identifier",
  "event_id",           "character",  "Unique event identifier",
  "event_date",         "date",       "ISO date",
  "weight_class",       "character",  "e.g., Welterweight",
  "fighter",            "character",  "This row's fighter",
  "opponent",           "character",  "Opponent fighter",
  "corner",             "character",  "Red/Blue or A/B",
  "round",              "integer",    "Round number",
  "jabs_landed",        "integer",    "Jabs landed",
  "jabs_attempted",     "integer",    "Jabs attempted",
  "power_landed",       "integer",    "Power shots landed",
  "power_attempted",    "integer",    "Power shots attempted",
  "knockdowns",         "integer",    "Knockdowns in round",
  "stance",             "character",  "orthodox/southpaw/other",
  "result_round",       "integer",    "1 if fighter won the round, 0 if lost (or NA if unknown)"
)

event_schema <- tibble::tribble(
  ~column,         ~type,       ~notes,
  "event_id",      "character", "Unique event identifier",
  "event_name",    "character", "Event name",
  "event_date",    "date",      "ISO date",
  "location",      "character", "City/Country (optional)",
  "promotion",     "character", "Promotion/org (optional)"
)

round_schema

If you don’t have result_round, you can still do useful analysis: predict round results, infer momentum, and quantify ‘who was in control’ using validated scoring proxies.
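
A contract only helps if something checks it. The sketch below is one possible way to compare a data frame against a contract table; `check_contract()` is a hypothetical helper, and the four-column schema is a trimmed stand-in for `round_schema`, not the full contract. It uses base R only so it runs without extra packages.

```r
# Hypothetical helper: compare a data frame against a contract table that has
# `column` and `type` fields (same shape as round_schema above).
check_contract <- function(df, schema) {
  actual <- vapply(schema$column, function(col) {
    if (!col %in% names(df)) return(NA_character_)   # column is missing
    x <- df[[col]]
    if (inherits(x, "Date")) "date" else class(x)[1]
  }, character(1))

  data.frame(
    column   = schema$column,
    expected = schema$type,
    actual   = unname(actual),
    ok       = !is.na(actual) & actual == schema$type,
    stringsAsFactors = FALSE
  )
}

# Trimmed stand-in for round_schema (assumption: same column/type vocabulary)
schema <- data.frame(
  column = c("fight_id", "event_date", "round", "jabs_landed"),
  type   = c("character", "date", "integer", "integer"),
  stringsAsFactors = FALSE
)

# Toy row: jabs_landed is stored as a double, which violates the contract
rounds <- data.frame(
  fight_id    = "F001",
  event_date  = as.Date("2024-03-09"),
  round       = 1L,
  jabs_landed = 12,
  stringsAsFactors = FALSE
)

report <- check_contract(rounds, schema)
report[!report$ok, ]   # rows that break the contract
```

Returning a report instead of stopping on the first problem makes it easier to fix several drifted columns in one pass; a stricter pipeline might call `stop()` whenever any `ok` is FALSE.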


3) Intake and standardization

Here’s a robust ingestion pattern: read raw CSVs, normalize names, enforce types, standardize fighter naming, and write Parquet for speed. Adjust the paths to your own sources.

read_round_totals <- function(path) {
  log_info("Reading raw round totals: {path}")
  readr::read_csv(path, show_col_types = FALSE) %>%
    janitor::clean_names()
}

standardize_round_totals <- function(df) {
  df %>%
    mutate(
      event_date = as.Date(event_date),
      round = as.integer(round),
      across(
        c(jabs_landed, jabs_attempted, power_landed, power_attempted, knockdowns),
        ~ as.integer(replace_na(.x, 0))
      ),
      across(c(fight_id, event_id, fighter, opponent, weight_class, stance, corner), as.character),
      stance = tolower(stance),
      corner = toupper(corner)
    ) %>%
    # Basic name normalization
    mutate(
      fighter  = str_squish(str_replace_all(fighter, "\\s+", " ")),
      opponent = str_squish(str_replace_all(opponent, "\\s+", " ")),
      weight_class = str_squish(weight_class)
    )
}

write_clean_parquet <- function(df, out_path) {
  fs::dir_create(fs::path_dir(out_path))
  arrow::write_parquet(df, out_path)
  log_ok("Wrote Parquet: {out_path}")
}

# Example:
# raw_path  <- here::here("data/raw/round_totals.csv")
# clean_out <- here::here("data/clean/round_totals.parquet")
# rounds_clean <- read_round_totals(raw_path) %>% standardize_round_totals()
# write_clean_parquet(rounds_clean, clean_out)

Parquet is a gamechanger for analytics work: fast I/O, consistent typing, and easy integration with DuckDB for SQL-style queries.
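
One caveat for the standardization step: `as.Date(event_date)` expects ISO-style dates, so a source with mixed formats needs extra handling. The sketch below tries a list of formats element by element; `parse_mixed_dates()` is a hypothetical helper, and the format list (including the day-first reading of slash dates) is an assumption you would adapt to your own sources.

```r
# Hypothetical helper: parse a character vector whose elements may use
# different date formats. Formats are tried in order; the first one that
# parses an element wins, and anything unparseable stays NA.
parse_mixed_dates <- function(x,
                              formats = c("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out) & !is.na(x)      # elements still unparsed
    if (!any(todo)) break
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

raw_dates <- c("2023-11-03", "03/11/2023", "20231103", "not a date", NA)
parsed <- parse_mixed_dates(raw_dates)
parsed

# Rows that had a value but could not be parsed deserve a manual look
raw_dates[is.na(parsed) & !is.na(raw_dates)]
```

Slash dates are ambiguous (day-first vs month-first), so it is safer to fix the format per source than to guess across sources.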


4) Validation, QA and anomaly detection

Fight data is full of subtle errors: shots attempted < shots landed, duplicate rounds, swapped fighter/opponent rows, or “impossible” knockdown counts. Validation must happen automatically.

validate_round_totals <- function(df) {
  # Required columns check
  required <- round_schema$column
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(glue::glue("Missing required columns: {paste(missing_cols, collapse = ', ')}"))
  }

  # Logical checks
  bad_landed <- df %>%
    filter(jabs_landed > jabs_attempted | power_landed > power_attempted)

  if (nrow(bad_landed) > 0) {
    log_info("Found {nrow(bad_landed)} rows where landed > attempted (check source or parsing).")
  }

  # Duplicate round rows (same fight_id, fighter, round)
  dupes <- df %>%
    count(fight_id, fighter, round) %>%
    filter(n > 1)

  if (nrow(dupes) > 0) {
    log_info("Found duplicates: {nrow(dupes)} fight/fighter/round combinations.")
  }

  # Suspicious extremes (simple heuristic)
  suspicious <- df %>%
    mutate(total_attempted = jabs_attempted + power_attempted) %>%
    filter(total_attempted > 120 | knockdowns > 3)

  if (nrow(suspicious) > 0) {
    log_info("Found {nrow(suspicious)} suspicious rows (very high volume or knockdowns).")
  }

  df
}

# Example:
# rounds_clean <- rounds_clean %>% validate_round_totals()

Validation builds trust, and trust is what makes analytics useful, especially when you present results to coaches, fighters, or bettors who will challenge your assumptions.
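
To see what these checks catch, here is a minimal base R sketch on toy data that flags each row with the reason it looks wrong. The rows are invented for illustration, and the 120-attempt threshold is the same simple heuristic used in `validate_round_totals()`.

```r
# Toy round totals (invented values): rows 1-2 are duplicates, row 3 has
# more jabs landed than attempted, row 4 has implausibly high volume.
rounds <- data.frame(
  fight_id        = c("F001", "F001", "F001", "F002", "F002"),
  fighter         = c("A", "A", "B", "C", "D"),
  round           = c(1L, 1L, 1L, 1L, 1L),
  jabs_landed     = c(10L, 10L, 25L, 8L, 9L),
  jabs_attempted  = c(30L, 30L, 20L, 25L, 28L),
  power_landed    = c(5L, 5L, 6L, 4L, 3L),
  power_attempted = c(15L, 15L, 18L, 110L, 12L),
  stringsAsFactors = FALSE
)

# One key per fight/fighter/round; any repeated key is a duplicate
key <- paste(rounds$fight_id, rounds$fighter, rounds$round)

flags <- data.frame(
  rounds[, c("fight_id", "fighter", "round")],
  landed_gt_attempted = rounds$jabs_landed > rounds$jabs_attempted |
    rounds$power_landed > rounds$power_attempted,
  duplicate_row = duplicated(key) | duplicated(key, fromLast = TRUE),
  high_volume   = rounds$jabs_attempted + rounds$power_attempted > 120
)

# Keep only rows with at least one flag
flagged <- flags[rowSums(flags[, 4:6]) > 0, ]
flagged
```

A flag table like this is easy to save next to the cleaned data, so every report can state how many rows were questionable and why.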


5) Feature engineering: pace, accuracy, intent, damage proxies

Combat performance is multi-dimensional. A clean feature set usually includes:

  • Pace: attempts per round, pace change between rounds
  • Accuracy: landed/attempted (jabs, power, total)
  • Intent/style: jab share vs power share
  • Damage proxies: power shots landed, knockdowns, power accuracy
  • Relative dominance: fighter stats minus opponent stats

engineer_round_features <- function(df) {
  df %>%
    mutate(
      total_landed    = jabs_landed + power_landed,
      total_attempted = jabs_attempted + power_attempted,
      acc_jab   = if_else(jabs_attempted > 0, jabs_landed / jabs_attempted, NA_real_),
      acc_power = if_else(power_attempted > 0, power_landed / power_attempted, NA_real_),
      acc_total = if_else(total_attempted > 0, total_landed / total_attempted, NA_real_),
      jab_share_attempts = if_else(total_attempted > 0, jabs_attempted / total_attempted, NA_real_),
      power_share_attempts = if_else(total_attempted > 0, power_attempted / total_attempted, NA_real_),
      # Simple damage proxy: power landed + weighted knockdowns
      damage_proxy = power_landed + 8 * knockdowns
    )
}

# Opponent-relative features (requires pairing fighter vs opponent within the same fight_id and round)
add_relative_features <- function(df) {
  df2 <- df %>%
    select(fight_id, round, fighter, opponent,
           total_landed, total_attempted, acc_total,
           power_landed, power_attempted, acc_power,
           damage_proxy, knockdowns) %>%
    rename_with(~ paste0("opp_", .x), -c(fight_id, round, fighter, opponent)) %>%
    rename(fighter_join = opponent, opponent_join = fighter)

  df %>%
    left_join(
      df2,
      by = c("fight_id" = "fight_id", "round" = "round", "fighter" = "fighter_join", "opponent" = "opponent_join")
    ) %>%
    mutate(
      rel_total_landed = total_landed - opp_total_landed,
      rel_acc_total    = acc_total - opp_acc_total,
      rel_power_landed = power_landed - opp_power_landed,
      rel_damage       = damage_proxy - opp_damage_proxy,
      rel_knockdowns   = knockdowns - opp_knockdowns
    )
}

# Example:
# rounds_feat <- rounds_clean %>% engineer_round_features() %>% add_relative_features()

Relative features are where fight analysis becomes tactical: a fighter’s pace means little without context. Dominance is ‘what you did’ minus ‘what you absorbed’.
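
A small worked example makes the arithmetic concrete. The numbers below are invented; the damage proxy and its knockdown weight of 8 follow `engineer_round_features()`, and `match()` stands in for the self-join in `add_relative_features()`.

```r
# One toy round, one row per fighter (invented values)
round1 <- data.frame(
  fighter      = c("A", "B"),
  opponent     = c("B", "A"),
  power_landed = c(9, 6),
  knockdowns   = c(0, 1),
  stringsAsFactors = FALSE
)

# Same proxy as above: power shots landed plus 8 per knockdown
round1$damage_proxy <- round1$power_landed + 8 * round1$knockdowns

# For each row, find the row that belongs to the opponent
opp <- match(round1$opponent, round1$fighter)
round1$rel_damage <- round1$damage_proxy - round1$damage_proxy[opp]

round1
# A landed more power shots (9 vs 6), but B's knockdown lifts B's proxy to 14,
# so A's relative damage is -5 and B's is +5.
```

The two relative values always mirror each other within a round, which is a handy sanity check on the join: if they do not sum to zero, a fighter/opponent pair was mismatched.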


6) Round-by-round modeling (probability of winning a round)

If you have labeled rounds (result_round = 1/0), you can model round outcomes using interpretable classifiers. Even if you don’t, you can take labels from trusted sources or use proxy labels (with care).

Below is an end-to-end workflow using tidymodels: split, recipe, logistic regression with regularization, tuning, and calibration-friendly evaluation.

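# Assume rounds_feat has result_round (1/0) for at least some fights.
# Classification models need a factor outcome; listing "1" first makes it the
# event level that yardstick scores by default (matching .pred_1 below).
rounds_labeled <- rounds_feat %>%
  filter(!is.na(result_round)) %>%
  mutate(result_round = factor(result_round, levels = c(1, 0)))

set.seed(123)
spl <- rsample::initial_split(rounds_labeled, prop = 0.8, strata = result_round)
train <- rsample::training(spl)
test  <- rsample::testing(spl)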

rec <- recipes::recipe(result_round ~ rel_total_landed + rel_acc_total + rel_power_landed + rel_damage +
                        total_attempted + acc_total + jab_share_attempts + damage_proxy +
                        stance,
                      data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

mod <- parsnip::logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wf <- workflows::workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

grid <- dials::grid_regular(dials::penalty(range = c(-6, 0)), levels = 30)

set.seed(123)
folds <- rsample::vfold_cv(train, v = 5, strata = result_round)

metrics <- yardstick::metric_set(yardstick::roc_auc, yardstick::pr_auc, yardstick::accuracy, yardstick::mn_log_loss)

tuned <- tune::tune_grid(
  wf,
  resamples = folds,
  grid = grid,
  metrics = metrics
)

best <- tune::select_best(tuned, metric = "mn_log_loss")
final_wf <- tune::finalize_workflow(wf, best)

final_fit <- final_wf %>% fit(train)

# Evaluate on holdout test set
test_pred <- predict(final_fit, test, type = "prob") %>%
  bind_cols(test %>% select(result_round))

yardstick::roc_auc(test_pred, truth = result_round, .pred_1)
yardstick::mn_log_loss(test_pred, truth = result_round, .pred_1)
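
One caveat about the row-level split above: every round appears twice (one row per fighter), so a random split can put a round's mirror rows on both sides and flatter the test metrics. A minimal base R sketch of splitting by fight instead, on invented toy data, looks like this; it illustrates the idea and is not part of the original workflow.

```r
set.seed(123)

# Toy data: 10 fights, 2 fighters, 4 rounds each (invented)
rounds <- data.frame(
  fight_id = rep(paste0("F", 1:10), each = 8),
  fighter  = rep(rep(c("A", "B"), each = 4), times = 10),
  round    = rep(1:4, times = 20),
  stringsAsFactors = FALSE
)

# Sample whole fights, not rows, into the training set
fight_ids <- unique(rounds$fight_id)
n_train   <- floor(0.8 * length(fight_ids))
train_ids <- sample(fight_ids, size = n_train)

train <- rounds[rounds$fight_id %in% train_ids, ]
test  <- rounds[!rounds$fight_id %in% train_ids, ]

# No fight contributes rows to both sides
length(intersect(train$fight_id, test$fight_id))
```

Within tidymodels, `rsample::group_vfold_cv()` applies the same grouping idea to cross-validation folds.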

Why log loss? Because in combat analysis calibrated probabilities matter. A model that says “0.55” should be right about 55% of the time, not just classify correctly.
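
A quick calculation shows the difference. In the sketch below (invented labels and probabilities), two sets of predictions make the same single mistake and therefore have the same accuracy, but the overconfident one is punished much harder by log loss.

```r
# Binary log loss: mean negative log-likelihood of the observed outcomes
log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

y <- c(1, 1, 0, 1, 0)   # 1 = fighter won the round (toy labels)

# Both models get round 4 wrong; they differ only in how sure they were
p_hedged        <- c(0.65, 0.60, 0.35, 0.45, 0.40)
p_overconfident <- c(0.99, 0.99, 0.01, 0.01, 0.01)

acc_hedged        <- mean((p_hedged > 0.5) == y)
acc_overconfident <- mean((p_overconfident > 0.5) == y)
c(acc_hedged, acc_overconfident)   # identical accuracy: 0.8 and 0.8

ll_hedged        <- log_loss(y, p_hedged)
ll_overconfident <- log_loss(y, p_overconfident)
c(ll_hedged, ll_overconfident)     # roughly 0.54 vs 0.93
```

The single confident miss (0.01 on a round the fighter won) dominates the second score, which is the behavior you want when probabilities feed strategy or pricing decisions.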


7) Fight outcome modeling (interpretable + calibrated)

Fight outcomes can be modeled based on aggregated round characteristics: average dominance, variance (consistency), late round fade and knockdown impact. First summarize per fight and fighter.

summarize_fight_features <- function(df) {
  df %>%
    group_by(fight_id, event_id, event_date, weight_class, fighter, opponent) %>%
    summarise(
      rounds = n(),
      avg_rel_damage = mean(rel_damage, na.rm = TRUE),
      avg_rel_power_landed = mean(rel_power_landed, na.rm = TRUE),
      avg_rel_total_landed = mean(rel_total_landed, na.rm = TRUE),
      avg_rel_acc_total = mean(rel_acc_total, na.rm = TRUE),
      # Volatility/consistency
      sd_rel_damage = sd(rel_damage, na.rm = TRUE),
      # Pace markers
      avg_total_attempted = mean(total_attempted, na.rm = TRUE),
      # Knockdown signal
      total_knockdowns = sum(knockdowns, na.rm = TRUE),
      .groups = "drop"
    )
}

fight_level <- summarize_fight_features(rounds_feat)

# If you have fight outcome label for fighter perspective (win=1/lose=0):
# fight_level <- fight_level %>% left_join(outcomes, by = c("fight_id","fighter"))

Then model fight wins with an interpretable learner. Logistic regression is often a strong baseline; boosted trees can improve performance if you preserve explainability via feature importance and partial dependence (where applicable).

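# Example: fight_level has a win label (1/0); convert it to a factor for classification
fights_labeled <- fight_level %>%
  filter(!is.na(win)) %>%
  mutate(win = factor(win, levels = c(1, 0)))

set.seed(42)
spl2 <- rsample::initial_split(fights_labeled, prop = 0.8, strata = win)
tr2 <- training(spl2)
te2 <- testing(spl2)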

rec2 <- recipe(win ~ avg_rel_damage + avg_rel_power_landed + avg_rel_total_landed +
                avg_rel_acc_total + sd_rel_damage + avg_total_attempted + total_knockdowns,
              data = tr2) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

mod2 <- logistic_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet")

wf2 <- workflow() %>% add_recipe(rec2) %>% add_model(mod2)

grid2 <- grid_regular(penalty(range = c(-7, 0)), levels = 40)

set.seed(42)
folds2 <- vfold_cv(tr2, v = 5, strata = win)

tuned2 <- tune_grid(wf2, resamples = folds2, grid = grid2, metrics = metrics)

best2 <- select_best(tuned2, metric = "mn_log_loss")
final2 <- finalize_workflow(wf2, best2) %>% fit(tr2)

pred2 <- predict(final2, te2, type = "prob") %>% bind_cols(te2 %>% select(win))
roc_auc(pred2, truth = win, .pred_1)
mn_log_loss(pred2, truth = win, .pred_1)

# Inspect coefficients (interpretability)
final2 %>%
  extract_fit_parsnip() %>%
  broom::tidy() %>%
  arrange(desc(abs(estimate))) %>%
  head(20)

A practical coaching readout might be: “Your average relative damage was +4 per round, but volatility was high. You won the highs and lost the lows – work to maintain output in the middle rounds.”
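
To turn the coefficient table into a readout like that, it helps to convert log-odds into odds ratios. Because the recipe normalizes predictors, each coefficient is per one standard deviation of that feature. The coefficients below are hypothetical values chosen for illustration, not fitted results, and lasso shrinkage means real estimates should be read as approximate.

```r
# Hypothetical coefficients on standardized predictors (NOT fitted results)
coefs <- c(avg_rel_damage = 0.90, total_knockdowns = 0.40, sd_rel_damage = -0.25)

# exp(coefficient) = multiplicative change in win odds per +1 SD of the feature
odds_ratio <- round(exp(coefs), 2)
odds_ratio

# Example: start from an even fight (odds 1, probability 0.5) and add
# one standard deviation of average relative damage
base_odds <- 1
new_odds  <- base_odds * exp(coefs[["avg_rel_damage"]])
new_prob  <- new_odds / (1 + new_odds)
round(new_prob, 2)
```

Under these assumed values, one standard deviation more average relative damage moves an even fight to roughly a 71% win probability, while higher volatility (`sd_rel_damage`) pulls the odds down.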


8) Fatigue, momentum and tactical shifts

Fatigue often shows up as fewer attempts, lower power accuracy, or a shift toward safer output (more jabs, fewer exchanges). Momentum often appears as multi-round streaks in relative dominance.

Below are two useful constructs:

  • Fatigue index: compare late rounds vs. early rounds on pace and accuracy
  • Momentum signal: moving average of relative damage/dominance

fatigue_index <- function(df) {
  df %>%
    group_by(fight_id, fighter) %>%
    mutate(
      early = round <= 3,
      late  = round >= max(round, na.rm = TRUE) - 2
    ) %>%
    summarise(
      early_pace = mean(total_attempted[early], na.rm = TRUE),
      late_pace  = mean(total_attempted[late], na.rm = TRUE),
      early_acc  = mean(acc_total[early], na.rm = TRUE),
      late_acc   = mean(acc_total[late], na.rm = TRUE),
      fatigue_pace_drop = (late_pace - early_pace) / pmax(early_pace, 1),
      fatigue_acc_drop  = (late_acc - early_acc) / pmax(early_acc, 1e-6),
      .groups = "drop"
    ) %>%
    mutate(
      fatigue_score = 0.7 * fatigue_pace_drop + 0.3 * fatigue_acc_drop
    )
}

momentum_signal <- function(df, window = 3) {
  df %>%
    arrange(fight_id, fighter, round) %>%
    group_by(fight_id, fighter) %>%
    mutate(
      rel_damage_roll = slider::slide_dbl(rel_damage, mean, .before = window - 1, .complete = FALSE, na.rm = TRUE),
      rel_landed_roll = slider::slide_dbl(rel_total_landed, mean, .before = window - 1, .complete = FALSE, na.rm = TRUE)
    ) %>%
    ungroup()
}

fatigue_tbl <- fatigue_index(rounds_feat)
rounds_mom <- momentum_signal(rounds_feat, window = 3)

Interpretation tips:

  • Negative fatigue score → late output/accuracy decreased (common)
  • Fatigue score near zero → stable performance (valuable at elite level)
  • Rolling dominance crossing zero → tactical turning point (corner adjustments matter most here)
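
The tips above can be checked on a single toy fight. The sketch below uses invented per-round numbers and base R only: it computes the pace component of the fatigue index the same way `fatigue_index()` does, then finds the round where a trailing three-round mean of relative damage changes sign.

```r
# Toy eight-round fight for one fighter (invented values)
attempted  <- c(62, 58, 60, 55, 50, 47, 44, 41)   # punches attempted per round
rel_damage <- c(4, 3, 5, 1, -1, -3, -2, -4)       # relative damage per round
rounds     <- seq_along(attempted)

# Pace component: first three rounds vs last three rounds
early <- rounds <= 3
late  <- rounds >= max(rounds) - 2
pace_drop <- (mean(attempted[late]) - mean(attempted[early])) /
  mean(attempted[early])
pace_drop   # negative: late pace is lower than early pace

# Trailing 3-round mean (shorter windows for the first two rounds)
roll <- sapply(rounds, function(i) mean(rel_damage[max(1, i - 2):i]))

# A turning point is a round where the rolling mean changes sign
turning_rounds <- rounds[-1][diff(sign(roll)) != 0]
turning_rounds
```

Here early pace averages 60 attempts and late pace 44, a drop of about 27%, and the rolling dominance flips negative in round 6. A rolling mean of exactly zero would also register as a change, so treat flagged rounds as prompts to review, not as verdicts.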

9) Visual analytics for strategy

The “best” plots are the ones that change decisions. Two strategic visuals:

  • Dominance timeline (rolling average of relative damage)
  • Style map (jab share vs power accuracy)

plot_dominance_timeline <- function(df, fight_id_pick, fighter_pick) {
  d <- df %>%
    filter(fight_id == fight_id_pick, fighter == fighter_pick) %>%
    arrange(round)

  ggplot(d, aes(x = round, y = rel_damage_roll)) +
    geom_hline(yintercept = 0, linewidth = 0.6) +
    geom_line(linewidth = 1) +
    geom_point(size = 2) +
    labs(
      x = "Round",
      y = "Rolling Relative Damage (windowed mean)",
      title = "Dominance Timeline",
      subtitle = glue::glue("Fight {fight_id_pick} — {fighter_pick}")
    ) +
    theme_minimal(base_size = 12)
}

plot_style_map <- function(df, fight_id_pick) {
  d <- df %>%
    filter(fight_id == fight_id_pick) %>%
    group_by(fighter) %>%
    summarise(
      jab_share = mean(jab_share_attempts, na.rm = TRUE),
      power_acc = mean(acc_power, na.rm = TRUE),
      pace = mean(total_attempted, na.rm = TRUE),
      .groups = "drop"
    )

  ggplot(d, aes(x = jab_share, y = power_acc, label = fighter, size = pace)) +
    geom_point(alpha = 0.7) +
    ggrepel::geom_text_repel(max.overlaps = 50) +
    labs(
      x = "Jab Share (Attempts)",
      y = "Power Accuracy",
      title = "Style Map (per fight)",
      subtitle = "Higher pace = larger point"
    ) +
    theme_minimal(base_size = 12)
}

# Example:
# p1 <- plot_dominance_timeline(rounds_mom, fight_id_pick = "F123", fighter_pick = "Fighter A")
# p2 <- plot_style_map(rounds_feat, fight_id_pick = "F123")
# p1 + p2

How coaches use these:

  • If dominance wanes after round 4, check the opponent’s conditioning or defensive adjustments.
  • If jab share is high but power accuracy is low, the jab may be ‘busy’ without creating openings.
  • When the pace is fast and the accuracy steady, that’s often a winning profile, especially during long fights.

10) Scalable Pipelines: Parquet, DuckDB, Reproducibility

Once your data grows (multiple events, seasons, amateur + professional, different sources), SQL-style analysis becomes extremely useful. DuckDB lets you query Parquet files directly, without running a separate database server.

# Connect to DuckDB (in-memory or file-backed)
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = here::here("data/fight_analytics.duckdb"))

# Point DuckDB at a Parquet file (or a folder of Parquet files)
parquet_path <- here::here("data/clean/round_totals.parquet")

DBI::dbExecute(con, glue::glue("
  CREATE OR REPLACE VIEW rounds AS
  SELECT * FROM read_parquet('{parquet_path}')
"))

# Example: top rounds by volume (attempted punches)
top_volume <- DBI::dbGetQuery(con, "
  SELECT fighter, fight_id, round,
         (jabs_attempted + power_attempted) AS total_attempted
  FROM rounds
  ORDER BY total_attempted DESC
  LIMIT 25
")

top_volume %>% as_tibble()

# Close when done
DBI::dbDisconnect(con, shutdown = TRUE)

This makes it easy to build reliable reporting: “Highest-paced fights,” “Biggest late-round fades,” “Most consistent dominance,” and “Knockdown-driven wins.”
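
As an illustration, a “Highest-paced fights” report could be sketched as below. It assumes the DBI and duckdb packages are installed, and it writes an invented toy table into an in-memory database so the example is self-contained; in the real pipeline the `rounds` view over Parquet would take its place.

```r
library(DBI)

# In-memory DuckDB database (nothing is written to disk)
con <- dbConnect(duckdb::duckdb())

# Toy round totals (invented values)
rounds <- data.frame(
  fight_id        = c("F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2"),
  fighter         = c("A", "A", "B", "B", "C", "C", "D", "D"),
  round           = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  jabs_attempted  = c(30L, 28L, 20L, 22L, 40L, 38L, 25L, 27L),
  power_attempted = c(20L, 18L, 25L, 21L, 30L, 32L, 15L, 13L),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "rounds", rounds)

# Average attempts per round for each fighter in each fight, busiest first
pace <- dbGetQuery(con, "
  SELECT fight_id, fighter,
         AVG(jabs_attempted + power_attempted) AS avg_attempted
  FROM rounds
  GROUP BY fight_id, fighter
  ORDER BY avg_attempted DESC
")

dbDisconnect(con, shutdown = TRUE)
pace
```

Keeping each report as a single SQL statement makes it easy to version, review, and rerun when new events are added.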


11) Wrapping up and next steps

A fight data science workflow in R becomes powerful when you combine the following:

  • Clean contracts so your data doesn’t drift
  • Validation so that your results are reliable
  • Relative features so that statistics become tactical
  • Probability models so the conclusions are calibrated
  • Fatigue/momentum signals so the strategy reflects real turning points

If you want a more structured, end-to-end path with deeper modeling, richer case studies, and a complete workflow designed specifically for boxing, you might like this resource:

a complete hands-on book focused on boxing data science and fight performance strategy in R.

