Time to finally patch a hole in the leaking roof of my knowledge: what are generalized linear models anyway?
Groundwork: what are linear models anyway?
Generalized linear models (GLMs) are a short step up from linear models, provided that you have the right understanding of linear models. There is also some statistical jargon to get a grip on. So we start with an intuitive explanation of linear models.
With a linear model we have a quantity (the response variable) that we try to predict from some other quantities (the predictors). The predictors are known things that you can measure with good precision, such as age and weight. The response variable is something with natural variation and/or measurement uncertainty, such as maximum jump height. Different people of the same age and weight will have different maximum jumps. We want to predict the average jumping height for every age and weight combination.
The natural variation usually (but not always) follows a normal distribution for given predictors. That is, we expect a bell curve of jump heights for people aged 80 years and weighing 80 kg. For people aged 18 and weighing 60 kg we expect a bell curve with a different (probably higher) mean but the same variance. That the variance in jump height does not change for different ages and weights is important; we will come back to that later.
What we do in a linear model is come up with a weighted sum of our predictors (a linear combination, in the lingo) that best predicts the mean of that bell curve. Here is its general form:

E(y) = β₀ + β₁x₁ + β₂x₂ + …

Here x₁ can be age and x₂ can be weight. The β values are the intercept and coefficients that we learn. E(y) means the expectation of y, which in this case is simply the mean. Strictly speaking, we should say that we predict the mean conditioned on the predictors, because it is not the mean of all observed maximum jump heights; it is the mean of the maximum jump heights for a certain age and weight. So we should write E(y | x₁, x₂).
To determine those coefficients, we use an algorithm such as ordinary least squares (OLS), which finds the coefficients that give the means with the smallest squared residuals, that is, minimal squared distance between the observations and the predicted mean for the given predictors. We don't need to know much about how this works today, only that it is efficient and useful.
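To make this concrete, here is a minimal sketch in R: we simulate jump heights from made-up "true" coefficients for age and weight (all numbers here are invented for illustration), then let `lm` recover them with OLS.

```r
# Simulate jump heights from age and weight, then recover the
# coefficients with ordinary least squares via lm().
set.seed(1)
n <- 500
age    <- runif(n, 18, 80)
weight <- runif(n, 50, 100)
# "True" model: E(y) = 120 - 0.8 * age - 0.3 * weight, constant noise
jump <- 120 - 0.8 * age - 0.3 * weight + rnorm(n, sd = 5)

fit <- lm(jump ~ age + weight)
coef(fit)  # estimates close to 120, -0.8 and -0.3
```

Because the noise has the same variance everywhere, OLS is exactly the right tool here.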
Next level: the GLM
With that under our belt, GLMs are not that scary. There are only two new things to understand.
Firstly, a normal distribution is not always suitable for the natural variation. What if, instead of jump height, our response variable were something discrete, such as the number of pets? You cannot have a non-integer or negative number of pets. We would like to swap the normal distribution for a discrete distribution from the same exponential family, most likely a Poisson distribution.
That leads directly to the second thing: the mean of a Poisson distribution must be positive, so we have to transform the result of that weighted sum in some way that maps it into the valid range of Poisson means. This transformation is called the link function. The general form becomes:

g(E(y)) = β₀ + β₁x₁ + β₂x₂ + …

where g is the link function and E(y) is the mean of the target distribution. If we had a normal distribution and no need to transform the result of the weighted sum, g would just be the identity function. For a Poisson distribution the usual choice is g = log, because its inverse, exp, maps every real number to a positive number.
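A quick sanity check of that last claim: whatever value the weighted sum produces, applying the inverse of the log link turns it into a valid (positive) Poisson mean.

```r
# The log link and its inverse: exp() maps any real-valued linear
# predictor onto a strictly positive mean, and log() maps it back.
eta <- c(-3, 0, 2.5)  # example linear predictor values (any real number)
mu  <- exp(eta)       # implied Poisson means: all positive
mu
log(mu)               # applying the link recovers the linear predictor
```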
For different types of data, you choose a different type of distribution:
- Poisson for counts
- Binomial for binary outcomes
- Gamma for positive continuous data
- Normal for normally distributed data
Each of these has a “canonic” (standard) link function:
- Poisson: Log
- Binomial: Logit
- Gamma: Inverse
- Normal: Identity
You can choose a different link function, but let’s leave that for another day.
Remember that the linear model assumed the variance was constant? GLMs do not need this, because for distributions other than the normal, the variance is a function of the mean. The technical term for this property is heteroscedasticity (while constant variance is homoscedasticity).
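The Poisson distribution is the simplest example of this mean-variance coupling: its variance equals its mean, which a quick simulation confirms.

```r
# For a Poisson distribution the variance equals the mean, so groups
# with larger predicted means automatically have larger spread.
set.seed(42)
draws <- rpois(1e5, lambda = 4)
mean(draws)  # close to 4
var(draws)   # also close to 4
```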
Another important consequence: because of the link function we do not necessarily have an average of the response variable that varies linearly with the predictors. In other words, we can model non-linear relationships, which is of course powerful.
Playtime
Time to learn by playing. We can use a quirky dataset compiled in the late 1800s by the quirky Russian statistician Ladislaus von Bortkiewicz: deaths by horse kick per year for regiments in the Prussian army. In February 2025, Antony Unwin and Bill Venables updated the dataset with additional data and realized von Bortkiewicz's original dream of publishing it on CRAN.
library(tidyverse)
library(knitr)
library(Horsekicks)
hk.tidy <- hkdeaths |>
pivot_longer(
c(kick, drown, fall),
names_to = "death.type",
values_to = "death.count"
)
hk.tidy |> head() |> kable()

Table 1: sample of the horse kick data (tidy).
| 1875 | G | 8 | 0 | 0 | kick | 0 |
| 1875 | G | 8 | 0 | 0 | drown | 3 |
| 1875 | G | 8 | 0 | 0 | fall | 0 |
| 1875 | I | 6 | 0 | 0 | kick | 0 |
| 1875 | I | 6 | 0 | 0 | drown | 5 |
| 1875 | I | 6 | 0 | 0 | fall | 0 |
Unwin and Venables added deaths by drowning and deaths from falling off a horse, among other things. The data cover 14 corps, each with a fixed number of regiments, for 33 years. A regiment is around 500 soldiers, as far as I can tell. I have no idea where Prussia was, or where it went.
Let us explore the data visually. Plotting deaths over time shows that horse-kick deaths seem quite stable, but deaths by drowning have decreased. Let's fit a GLM to this time series.
prussian.colours <- c("#333333", "#F9BE11", "#BE0007")
hk.year <- hk.tidy |>
group_by(year, death.type) |>
summarise(death.count = sum(death.count))
ggplot(hk.year) +
aes(x = year, y = death.count, group = death.type, colour = death.type) +
geom_line() +
scale_colour_manual(values = prussian.colours, name = "Death type") +
labs(x = "Year", y = "Number of deaths") +
theme_minimal()
Figure 1: Accidental deaths per type over time.
Note that for every type of death and every year we actually have a distribution over all regiments, not just a single number. We could choose to plot the distributions as a series of box plots.
ggplot(hk.tidy) +
  aes(x = year, y = death.count, group = year, fill = death.type) +
  facet_wrap(~death.type, dir = "v") +
  geom_boxplot(alpha = 0.5) +
  scale_fill_manual(values = prussian.colours, name = "Death type") +
  theme_minimal() +
  labs(x = "Year", y = "Number of deaths")

Figure 2: Distributions of deaths for each year and death type.
Time for that model. Because you cannot literally be kicked half to death, at least not as far as Prussian statisticians were concerned, all death counts are whole numbers. That makes the Poisson distribution the obvious choice. We have no reason to prefer a different link function, so we just take the canonical link function for Poisson (log).
R makes this trivial to do. glm is a built-in function:
glm(
  formula = drown ~ year,  # i.e. drownings are the response variable, year is the predictor
  family = poisson,
  data = hkdeaths
)
Call:  glm(formula = drown ~ year, family = poisson, data = hkdeaths)

Coefficients:
(Intercept)         year
   44.05820     -0.02256

Degrees of Freedom: 461 Total (i.e. Null);  460 Residual
Null Deviance:      854.3
Residual Deviance: 766.9    AIC: 2170
The result is an intercept (which we don't care much about) and a coefficient for year of -0.02256, which tells us the model predicts that drownings decrease by about 2.2% per year.
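That 2.2% comes from undoing the log link: exponentiating the coefficient gives the multiplicative change in the expected count per year.

```r
# The year coefficient lives on the log scale; exponentiating it gives
# the multiplicative change per year (coefficient taken from the fit above).
exp(-0.02256) - 1  # about -0.022, i.e. roughly a 2.2% decrease per year
```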
ggplot2 makes it even easier by letting us bundle the above model into stat_smooth and fit it for every type of death.
ggplot(hk.year) +
aes(x = year, y = death.count, group = death.type, colour = death.type) +
geom_point() +
stat_smooth(
method = "glm",
formula = y ~ x,
method.args = list(family = poisson)
) +
scale_colour_manual(values = prussian.colours, name = "Death type") +
theme_minimal() +
labs(x = "Year", y = "Number of deaths")
Figure 3: Deaths per type over time, with GLM fits.
Et voilà. That's really all there is to it!