- Installation
- Retrieve competition data
- Feature engineering
- Model 1: Poisson targets (baseline)
- Model 2: Dixon-Coles adjustment (improves low scores)
- From scorelines to 1X2 probabilities
- Odds, Implied Probabilities, and Value
- Backtest: Flat Bet vs. Kelly
- Calibration diagnostics
- Production: weekly pipeline
- Frequently asked questions
Installation
This is a fully reproducible, code-heavy walkthrough for building a football betting model in R: data → features → model → probabilities → value bets → bankroll rules → backtest. If you are new to worldfootballR, start here:
Install and use worldfootballR
and keep the
worldfootballR guide
open for reference.
# Core packages
install.packages(c(
"dplyr","tidyr","purrr","stringr","lubridate",
"readr","ggplot2","tibble","janitor","glue"
))
# Modeling + evaluation
install.packages(c("broom","rsample","yardstick","scoringRules","pROC"))
# Optional (for speed / nicer tables)
install.packages(c("data.table","DT"))
# Football data
# worldfootballR is usually installed from GitHub
# See: https://rprogrammingbooks.com/install-use-worldfootballr/
# If needed:
# install.packages("remotes")
# remotes::install_github("JaseZiv/worldfootballR")
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(lubridate)
library(readr)
library(ggplot2)
library(janitor)
library(glue)
# worldfootballR (uncomment after install)
# library(worldfootballR)
set.seed(2026)
The approach below is the classic Poisson goal model (team attack/defense + home advantage). It's simple, easy to explain, and a great foundation. You can later extend it to xG models, Bayesian hierarchical models, or time-varying strengths. If you like Bayesian approaches, see:
Bayesian sports analysis (book/product)
.
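To make the Poisson idea concrete, here is a small hand check (the expected-goals values are illustrative, not from any fitted model): if the home side's goal rate is 1.5 and the away side's is 1.0, the independent-Poisson probability of a 2–1 home win is the product of two Poisson densities.

```r
# Hand check of the independent-Poisson scoreline idea:
# P(home scores 2) * P(away scores 1), with expected goals 1.5 and 1.0.
p_2_1 <- dpois(2, lambda = 1.5) * dpois(1, lambda = 1.0)
round(p_2_1, 4)   # about 0.0923
```

Summing such products over all scorelines is exactly what the score-matrix code later in this post does.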
Retrieve competition data
The easiest way to get started is with historical match results from public sources. With worldfootballR you can often pull whole competition seasons from sources such as FBref. The specific function names may vary by package version/source, so below you will see:
- Option A: Use worldfootballR directly (recommended).
- Option B: Use your own CSV export if you already have data.
Option A – Get league season data with worldfootballR
If you need help with installation and authentication issues, see:
Install and use worldfootballR.
# --- Option A (worldfootballR) ---
# The exact workflow depends on the data source (FBref / other).
# Typical pattern:
# 1) Get competition URLs
# 2) Pull match results for seasons
#
# PSEUDOCODE (adjust per your worldfootballR version):
#
# comp_urls <- fb_league_urls(country = "ENG", gender = "M", season_end_year = 2026)
# epl_url <- comp_urls %>%
#   filter(str_detect(Competition_Name, "Premier League")) %>%
#   pull(Seasons_Urls) %>%
#   .[1]
#
# matches <- fb_match_results(season_url = epl_url) %>%
#   janitor::clean_names()
#
# head(matches)
Option B — Use a CSV export (works everywhere)
Your data needs (minimum): date, home_team, away_team, home_goals, away_goals. Save it as matches.csv.
# --- Option B (CSV) ---
matches <- readr::read_csv("matches.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
filter(!is.na(home_goals), !is.na(away_goals)) %>%
arrange(date)
dplyr::glimpse(matches)
Standardize columns
# Make sure we have standardized column names
matches <- matches %>%
transmute(
date = as.Date(date),
season = if_else(month(date) >= 7, year(date) + 1L, year(date)), # football season heuristic
home = as.character(home_team),
away = as.character(away_team),
hg = as.integer(home_goals),
ag = as.integer(away_goals)
) %>%
filter(!is.na(date), !is.na(home), !is.na(away), !is.na(hg), !is.na(ag)) %>%
arrange(date)
# Basic sanity checks
stopifnot(all(matches$hg >= 0), all(matches$ag >= 0))
matches %>% count(season) %>% arrange(desc(season)) %>% print(n = 50)
Feature engineering
For the basic Poisson model, we estimate team strength with attack and defense parameters, plus a home-advantage term. We will also build rolling form features as optional enhancements.
Long format for modeling
long <- matches %>%
mutate(match_id = row_number()) %>%
tidyr::pivot_longer(
cols = c(home, away),
names_to = "side",
values_to = "team"
) %>%
mutate(
opp = if_else(side == "home", away, home),
goals = if_else(side == "home", hg, ag),
conceded = if_else(side == "home", ag, hg),
is_home = as.integer(side == "home")
) %>%
select(match_id, date, season, team, opp, is_home, goals, conceded)
head(long)
Optional: rolling form features
These may help a little, but don’t overdo it. Keep them simple and always validate out of sample.
# Rolling averages for goals scored/conceded over last N matches (per team)
add_form_features <- function(df, n = 5) {
df %>%
arrange(team, date, match_id) %>%
group_by(team) %>%
mutate(
gf_roll = zoo::rollapplyr(goals, width = n, FUN = mean, fill = NA, partial = TRUE),
ga_roll = zoo::rollapplyr(conceded, width = n, FUN = mean, fill = NA, partial = TRUE)
) %>%
ungroup()
}
# install.packages("zoo") if needed
# library(zoo)
# long <- add_form_features(long, n = 5)
If you cover multiple sports, create a "methods hub" page and link to each sport's analysis guide:
Sports analysis with R,
NFL,
Tennis,
Boxing. This strengthens topical authority and internal PageRank flow.
Model 1: Poisson targets (baseline)
We apply two Poisson regressions: one for home goals and one for away goals, with team attack/defense effects and a home advantage term. A standard approach is:
home_goals ~ home_adv + attack(home) + defense(away)
away_goals ~ attack(away) + defense(home)
To avoid identifiability problems, we set a base team as a reference via factor levels.
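As a tiny illustration of the reference-team idea (team names here are purely hypothetical), `relevel()` pins a chosen baseline level, so the remaining coefficients are interpreted relative to that team:

```r
# Pin a reference team so attack/defense effects are identified
# relative to a baseline (team names are illustrative only).
teams <- c("Arsenal", "Brentford", "Chelsea")
f <- factor(c("Chelsea", "Arsenal"), levels = teams)
f <- relevel(f, ref = "Brentford")   # "Brentford" becomes the baseline level
levels(f)
```

R's `glm()` then drops the first level of each factor automatically, which is what resolves the identifiability problem.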
Time-based train/test split (realistic for betting)
# Time-based split (e.g., last 20% of matches as test)
n_total <- nrow(matches)
cut_idx <- floor(n_total * 0.80)
train <- matches %>% slice(1:cut_idx)
test  <- matches %>% slice((cut_idx + 1):n_total)

# Ensure consistent factor levels
teams <- sort(unique(c(matches$home, matches$away)))
train <- train %>% mutate(home = factor(home, levels = teams), away = factor(away, levels = teams))
test  <- test  %>% mutate(home = factor(home, levels = teams), away = factor(away, levels = teams))

# Fit models
home_mod <- glm(hg ~ 1 + home + away, data = train, family = poisson())
away_mod <- glm(ag ~ 1 + away + home, data = train, family = poisson())
summary(home_mod)
summary(away_mod)
Interpretation: the simplest version above uses team fixed effects as factors. It works, but it conflates attack and defense. Next we move to the more interpretable attack/defense parameterization.
Attack/defense parameterization (more interpretable)
# Build a modeling frame in the classic attack/defense form
# We model home goals:
# log(lambda_home) = home_adv + attack_home - defense_away
# And away goals:
# log(lambda_away) = attack_away - defense_home
# Create team factors
train2 <- train %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams)
)
# We'll encode attack and defense as separate factors by prefixing labels
mk_attack <- function(team) factor(paste0("att_", team), levels = paste0("att_", teams))
mk_def <- function(team) factor(paste0("def_", team), levels = paste0("def_", teams))
train_home <- train2 %>%
transmute(
goals = hg,
is_home = 1L,
att = mk_attack(home),
def = mk_def(away)
)
train_away <- train2 %>%
transmute(
goals = ag,
is_home = 0L,
att = mk_attack(away),
def = mk_def(home)
)
train_long <- bind_rows(train_home, train_away)
# Fit a single Poisson model with:
# goals ~ is_home + att + def
# Note: to reflect "- defense" we can include def and allow coefficients to learn direction;
# For stricter structure you can re-code defense sign, but this works well in practice.
ad_mod <- glm(goals ~ is_home + att + def, data = train_long, family = poisson())
summary(ad_mod)
Predict expected goals (lambda) for each match
predict_lambdas <- function(df, model, teams) {
df2 <- df %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams),
att_home = factor(paste0("att_", home), levels = paste0("att_", teams)),
def_away = factor(paste0("def_", away), levels = paste0("def_", teams)),
att_away = factor(paste0("att_", away), levels = paste0("att_", teams)),
def_home = factor(paste0("def_", home), levels = paste0("def_", teams))
)
# home lambda
new_home <- df2 %>%
transmute(is_home = 1L, att = att_home, def = def_away)
# away lambda
new_away <- df2 %>%
transmute(is_home = 0L, att = att_away, def = def_home)
lam_home <- predict(model, newdata = new_home, type = "response")
lam_away <- predict(model, newdata = new_away, type = "response")
df2 %>%
mutate(lambda_home = lam_home, lambda_away = lam_away)
}
test_pred <- predict_lambdas(test, ad_mod, teams)
head(test_pred)
Model 2: Dixon-Coles adjustment (improves low scores)
The independent Poisson assumption can underestimate/overestimate the probabilities of low scores (0–0, 1–0, 0–1, 1–1). Dixon-Coles introduces a small correction factor. Below is a clean implementation.
# Dixon-Coles tau adjustment for low-score dependence
tau_dc <- function(x, y, lam_x, lam_y, rho) {
# x = home goals, y = away goals
# rho is the dependence parameter
if (x == 0 && y == 0) return(1 - (lam_x * lam_y * rho))
if (x == 0 && y == 1) return(1 + (lam_x * rho))
if (x == 1 && y == 0) return(1 + (lam_y * rho))
if (x == 1 && y == 1) return(1 - rho)
return(1)
}
# Scoreline probability matrix up to max_goals
score_matrix <- function(lam_h, lam_a, rho = 0, max_goals = 10) {
xs <- 0:max_goals
ys <- 0:max_goals
ph <- dpois(xs, lam_h)
pa <- dpois(ys, lam_a)
# outer product for independent probabilities
P <- outer(ph, pa)
# apply DC tau correction
for (i in seq_along(xs)) {
for (j in seq_along(ys)) {
P[i, j] <- P[i, j] * tau_dc(xs[i], ys[j], lam_h, lam_a, rho)
}
}
# renormalize
P / sum(P)
}
# Example
P_ex <- score_matrix(lam_h = 1.4, lam_a = 1.1, rho = 0.05, max_goals = 8)
round(P_ex[1:5, 1:5], 4)
How do we choose rho? Estimate it by maximizing the likelihood on the training data. Here is a lightweight optimizer:
# Estimate rho by maximizing log-likelihood on train set given lambdas
train_pred <- predict_lambdas(train, ad_mod, teams)
dc_loglik <- function(rho, df, max_goals = 10) {
# clamp rho to a reasonable range to avoid numerical issues
rho <- max(min(rho, 0.3), -0.3)
ll <- 0
for (k in seq_len(nrow(df))) {
lam_h <- df$lambda_home[k]
lam_a <- df$lambda_away[k]
hg <- df$hg[k]
ag <- df$ag[k]
P <- score_matrix(lam_h, lam_a, rho = rho, max_goals = max_goals)
# if score exceeds max_goals, treat as tiny prob (or increase max_goals)
if (hg > max_goals || ag > max_goals) {
ll <- ll + log(1e-12)
} else {
ll <- ll + log(P[hg + 1, ag + 1] + 1e-15)
}
}
ll
}
opt <- optimize(
f = function(r) -dc_loglik(r, train_pred, max_goals = 10),
interval = c(-0.2, 0.2)
)
rho_hat <- opt$minimum
rho_hat
From scorelines to 1X2 probabilities
Once you have a scoreline probability matrix P, calculate:
- Home win: sum of probabilities where home goals > away goals
- Draw: sum of the diagonal
- Away win: sum where home goals < away goals
p1x2_from_matrix <- function(P) {
max_g <- nrow(P) - 1
xs <- 0:max_g
ys <- 0:max_g
p_home <- 0
p_draw <- 0
p_away <- 0
for (i in seq_along(xs)) {
for (j in seq_along(ys)) {
if (xs[i] > ys[j]) p_home <- p_home + P[i, j]
if (xs[i] == ys[j]) p_draw <- p_draw + P[i, j]
if (xs[i] < ys[j]) p_away <- p_away + P[i, j]
}
}
tibble(p_home = p_home, p_draw = p_draw, p_away = p_away)
}
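The double loop above is fine for clarity. For speed, the same three sums can be read straight off the matrix triangles (rows index home goals, columns index away goals); this vectorized variant is a sketch, not part of the original workflow:

```r
# Same three sums via matrix triangles: "home win" cells sit below
# the diagonal (row index > column index), draws on the diagonal.
p1x2_fast <- function(P) {
  c(p_home = sum(P[lower.tri(P)]),
    p_draw = sum(diag(P)),
    p_away = sum(P[upper.tri(P)]))
}
p1x2_fast(matrix(1/9, 3, 3))   # uniform 3x3 matrix -> 1/3 each
```

Both versions should agree to floating-point precision on any normalized score matrix.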
predict_1x2 <- function(df, rho = 0, max_goals = 10) {
out <- vector("list", nrow(df))
for (k in seq_len(nrow(df))) {
P <- score_matrix(df$lambda_home[k], df$lambda_away[k], rho = rho, max_goals = max_goals)
out[[k]] <- p1x2_from_matrix(P)
}
bind_rows(out)
}
test_1x2 <- bind_cols(
test_pred,
predict_1x2(test_pred, rho = rho_hat, max_goals = 10)
)
test_1x2 %>%
select(date, home, away, hg, ag, lambda_home, lambda_away, p_home, p_draw, p_away) %>%
head(10)
Odds, Implied Probabilities, and Value
Betting decisions should be driven by expected value (EV). If the market offers decimal odds O and your model probability is p, then:
- Expected value per unit stake: EV = p*(O - 1) - (1 - p)
- Value condition: p > 1/O
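A quick numeric check of the EV formula: at decimal odds 2.10 the break-even probability is 1/2.10 ≈ 0.476, so a model probability of 0.50 clears the value condition with a positive edge.

```r
# EV per 1 unit stake at decimal odds o with model probability p
ev <- function(p, o) p * (o - 1) - (1 - p)
ev(0.50, 2.10)   # 0.5 * 1.10 - 0.50 = 0.05, i.e. a +5% edge
```

The same `ev()` helper is used for bet selection further below.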
Usually you have an odds feed. Below we assume a file odds.csv with columns:
date, home, away, odds_home, odds_draw, odds_away.
odds <- readr::read_csv("odds.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
transmute(
date,
home = as.character(home),
away = as.character(away),
o_home = as.numeric(odds_home),
o_draw = as.numeric(odds_draw),
o_away = as.numeric(odds_away)
)
df <- test_1x2 %>%
mutate(home = as.character(home), away = as.character(away)) %>%
left_join(odds, by = c("date","home","away"))
# Implied probs (no vig removal yet)
df <- df %>%
mutate(
imp_home = 1 / o_home,
imp_draw = 1 / o_draw,
imp_away = 1 / o_away,
overround = imp_home + imp_draw + imp_away
)
# Simple vig removal by normalization
df <- df %>%
mutate(
mkt_home = imp_home / overround,
mkt_draw = imp_draw / overround,
mkt_away = imp_away / overround
)
# EV per 1 unit stake
ev <- function(p, o) p*(o - 1) - (1 - p)
df <- df %>%
mutate(
ev_home = ev(p_home, o_home),
ev_draw = ev(p_draw, o_draw),
ev_away = ev(p_away, o_away)
)
df %>%
select(date, home, away, p_home, p_draw, p_away, o_home, o_draw, o_away, ev_home, ev_draw, ev_away) %>%
head(10)
Choose bets with thresholds (avoid noise)
# Practical filters: require at least some edge and avoid tiny probabilities
EDGE_MIN <- 0.02 # 2% EV edge
P_MIN <- 0.05 # avoid extreme longshots unless you model them well
df_bets <- df %>%
mutate(
pick = case_when(
ev_home == pmax(ev_home, ev_draw, ev_away, na.rm = TRUE) ~ "H",
ev_draw == pmax(ev_home, ev_draw, ev_away, na.rm = TRUE) ~ "D",
TRUE ~ "A"
),
p_pick = case_when(pick == "H" ~ p_home, pick == "D" ~ p_draw, TRUE ~ p_away),
o_pick = case_when(pick == "H" ~ o_home, pick == "D" ~ o_draw, TRUE ~ o_away),
ev_pick = case_when(pick == "H" ~ ev_home, pick == "D" ~ ev_draw, TRUE ~ ev_away)
) %>%
filter(!is.na(o_pick)) %>%
filter(p_pick >= P_MIN, ev_pick >= EDGE_MIN)
df_bets %>% count(pick)
Backtest: Flat Bet vs. Kelly
Backtesting is where most people fool themselves. Use a time-based split, realistic bet-selection rules, and conservative stake sizing.
Compute the bet outcomes
# Outcome label
df_bets <- df_bets %>%
mutate(
result = case_when(
hg > ag ~ "H",
hg == ag ~ "D",
TRUE ~ "A"
),
win = as.integer(pick == result)
)
# Profit per 1 unit stake
df_bets <- df_bets %>%
mutate(
profit_flat = if_else(win == 1, o_pick - 1, -1)
)
df_bets %>% summarise(
n_bets = n(),
hit_rate = mean(win),
avg_odds = mean(o_pick),
roi_flat = mean(profit_flat)
)
Kelly staking (fractional Kelly recommended)
Full Kelly is often too aggressive. Use fractional Kelly (e.g. 0.25 Kelly). If you want a deeper treatment of Kelly and Bayesian uncertainty, see:
Bayesian sports analysis (book/product)
.
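As a sanity check on the Kelly formula f = (b·p − q)/b with net odds b = O − 1 (the odds and probability below are illustrative):

```r
# Hand check of the Kelly fraction at decimal odds 2.20 and p = 0.50
b <- 2.20 - 1          # net odds
p <- 0.50
q <- 1 - p
f <- (b * p - q) / b
round(f, 4)            # full Kelly, about 0.0833
round(0.25 * f, 4)     # quarter Kelly: stake ~2.1% of bankroll
```

Note how even a 5-point edge over the break-even probability translates into a single-digit percentage of bankroll under fractional Kelly.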
kelly_fraction <- function(p, o) {
# Decimal odds o. Net odds b = o - 1
b <- o - 1
q <- 1 - p
f <- (b*p - q) / b
pmax(0, f)
}
KELLY_MULT <- 0.25 # fractional Kelly
df_bets <- df_bets %>%
mutate(
f_kelly = kelly_fraction(p_pick, o_pick),
stake_kelly = KELLY_MULT * f_kelly
)
# Simulate bankroll
simulate_bankroll <- function(df, start_bankroll = 100, stake_col = "stake_kelly") {
br <- start_bankroll
path <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
stake <- df[[stake_col]][i]
stake_amt <- br * stake
# Profit = stake_amt*(o-1) if win else -stake_amt
prof <- if (df$win[i] == 1) stake_amt*(df$o_pick[i] - 1) else -stake_amt
br <- br + prof
path[i] <- br
}
path
}
df_bets <- df_bets %>% arrange(date)
# Flat staking: e.g., 1% bankroll per bet
df_bets <- df_bets %>% mutate(stake_flat = 0.01)
df_bets$br_flat <- simulate_bankroll(df_bets, start_bankroll = 100, stake_col = "stake_flat")
df_bets$br_kelly <- simulate_bankroll(df_bets, start_bankroll = 100, stake_col = "stake_kelly")
tail(df_bets %>% select(date, home, away, pick, o_pick, win, br_flat, br_kelly), 10)
Plot the bankroll curves
plot_df <- df_bets %>%
  select(date, br_flat, br_kelly) %>%
  pivot_longer(cols = c(br_flat, br_kelly), names_to = "strategy", values_to = "bankroll")
ggplot(plot_df, aes(x = date, y = bankroll, group = strategy)) +
  geom_line() +
  labs(x = NULL, y = "Bankroll", title = "Backtest Bankroll: Flat vs Fractional Kelly")
Calibration diagnostics
Profit is noisy. Calibration tells you whether your probabilities are sensible. A good scoring rule for 1X2 is multiclass log loss. Below we compute the log loss for the 1X2 probabilities on the test set.
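Alongside log loss, the ranked probability score (RPS) is widely used for 1X2 forecasts because the outcomes are ordered (home, draw, away) and it rewards putting mass near the true result. This helper is a sketch, not part of the original workflow:

```r
# Ranked probability score for one 1X2 forecast; lower is better.
# probs must be ordered c(pH, pD, pA); outcome is "H", "D", or "A".
rps_1x2 <- function(probs, outcome) {
  y  <- as.numeric(c("H", "D", "A") == outcome)  # one-hot truth vector
  cp <- cumsum(probs)
  cy <- cumsum(y)
  sum((cp - cy)[1:2]^2) / 2                      # K - 1 = 2 components
}
rps_1x2(c(0.5, 0.3, 0.2), "H")   # 0.145
```

Averaging `rps_1x2` over the test set gives a single number you can compare across model versions, just like the log loss below.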
# Create a probability matrix and truth labels
df_eval <- df %>%
filter(!is.na(p_home), !is.na(p_draw), !is.na(p_away)) %>%
mutate(
truth = case_when(hg > ag ~ "H", hg == ag ~ "D", TRUE ~ "A"),
truth = factor(truth, levels = c("H","D","A"))
)
# Log loss (manual)
log_loss_1x2 <- function(pH, pD, pA, y) {
eps <- 1e-15
p <- ifelse(y=="H", pH, ifelse(y=="D", pD, pA))
-mean(log(pmax(p, eps)))
}
ll <- log_loss_1x2(df_eval$p_home, df_eval$p_draw, df_eval$p_away, df_eval$truth)
ll
Reliability plot (binning)
# Example: calibration for HOME-win probability
calib_home <- df_eval %>%
mutate(bin = ntile(p_home, 10)) %>%
group_by(bin) %>%
summarise(
p_mean = mean(p_home),
freq = mean(truth == "H"),
n = n(),
.groups = "drop"
)
ggplot(calib_home, aes(x = p_mean, y = freq)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
  labs(x = "Predicted P(Home win)", y = "Observed frequency", title = "Calibration (Home win)")
General upgrade path:
Add xG features (if you have them) and/or time decay (recent matches matter more), then re-check calibration. Don't chase ROI without checking probability quality.
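One simple way to implement the time-decay upgrade, in the spirit of the exponential down-weighting in the original Dixon-Coles paper, is to pass per-match weights to glm(). The helper name and the xi value below are illustrative; xi should be tuned out of sample.

```r
# Exponential time-decay weights for match observations (sketch).
# xi is a decay rate you would tune via time-based validation.
decay_weights <- function(match_dates, ref_date, xi = 0.005) {
  days_ago <- as.numeric(ref_date - as.Date(match_dates))
  exp(-xi * days_ago)
}
w <- decay_weights(c("2026-01-01", "2025-01-01"), ref_date = as.Date("2026-01-01"))
w   # today's match gets weight 1; a year-old match far less
# Then fit with per-match weights, e.g.:
# glm(goals ~ is_home + att + def, data = train_long,
#     family = poisson(), weights = w)
```

Halving xi roughly doubles the effective memory of the model, so validate a small grid of values rather than guessing.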
Production: weekly pipeline
Here's a practical skeleton you can run weekly: pull the latest matches → refit the model → generate probabilities for upcoming fixtures → compare with market odds.
# 1) Load historical matches (from worldfootballR or your CSV)
matches <- readr::read_csv("matches.csv") %>% janitor::clean_names()
# 2) Train up to cutoff date (e.g., yesterday)
cutoff_date <- Sys.Date() - 1
hist <- matches %>%
mutate(date = as.Date(date)) %>%
filter(date <= cutoff_date) %>%
transmute(date, home = home_team, away = away_team, hg = home_goals, ag = away_goals) %>%
arrange(date)
teams <- sort(unique(c(hist$home, hist$away)))
# 3) Fit attack/defense model
train_long <- bind_rows(
hist %>%
transmute(goals = hg, is_home = 1L,
att = factor(paste0("att_", home), levels = paste0("att_", teams)),
def = factor(paste0("def_", away), levels = paste0("def_", teams))),
hist %>%
transmute(goals = ag, is_home = 0L,
att = factor(paste0("att_", away), levels = paste0("att_", teams)),
def = factor(paste0("def_", home), levels = paste0("def_", teams)))
)
ad_mod <- glm(goals ~ is_home + att + def, data = train_long, family = poisson())
# 4) Predict for upcoming fixtures (you need a fixtures table)
fixtures <- readr::read_csv("fixtures.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
transmute(date, home = home_team, away = away_team)
# Compute lambdas
fixtures2 <- fixtures %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams)
)
fixtures_pred <- predict_lambdas(
df = fixtures2 %>% mutate(hg = 0L, ag = 0L), # placeholders
model = ad_mod,
teams = teams
)
# 5) Estimate rho (optional) on recent history only (faster)
hist_pred <- hist %>% mutate(home = factor(home, levels=teams), away=factor(away, levels=teams))
hist_pred <- predict_lambdas(hist_pred, ad_mod, teams)
opt <- optimize(
  f = function(r) -dc_loglik(r, hist_pred, max_goals = 10),
  interval = c(-0.2, 0.2)
)
rho_hat <- opt$minimum
# 6) Convert to 1X2
fixtures_1x2 <- bind_cols(
fixtures_pred,
predict_1x2(fixtures_pred, rho = rho_hat, max_goals = 10)
) %>%
select(date, home, away, lambda_home, lambda_away, p_home, p_draw, p_away)
write_csv(fixtures_1x2, "model_probs.csv")
At this point you have model_probs.csv ready to merge with bookmaker odds and turn into a shortlist of bets. In your content strategy, link this post to your broader methods pages:
Sports analysis with R
and the relevant sports hubs (NFL, tennis, boxing) to strengthen internal linking.
Frequently asked questions
Is a Poisson model “good enough” for football betting?
It’s a strong baseline. It captures team strength and home field advantage with minimal complexity. Many upgrades (xG, time decay, Bayesian partial pooling) improve robustness, but the baseline can already be useful.
How do I prevent overfitting?
Use time-based validation, keep features simple, and prioritize calibration and log loss. Don't tune thresholds on the same data you're evaluating on.
What is the simplest value betting rule?
Only bet when p_model > p_implied with a buffer (e.g., EV > 2%), then stake conservatively (flat 0.5%–1% of bankroll, or fractional Kelly).
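For instance, at decimal odds 2.00 the implied probability is 0.50, so a model probability of 0.55 clears both the raw value condition and a 2% EV buffer (the numbers are illustrative):

```r
# Value rule check: bet only if p_model > 1/odds and EV exceeds a buffer
p_model <- 0.55
odds    <- 2.00
ev      <- p_model * (odds - 1) - (1 - p_model)   # 0.55 - 0.45 = 0.10
p_model > 1 / odds && ev > 0.02                   # TRUE: qualifies as a bet
```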
Where can I learn more advanced Bayesian sports modeling in R?
If you want Bayesian approaches, uncertainty-aware deployment, and a deeper treatment of the Kelly criterion, see:
Bayesian sports analysis (book/product)
.