So this is a blog post about making and finding a mistake in the modeling strategy, in a seemingly simple situation.
I saw this post (toot?) on Mastodon by Daniel Lakens who mentions in passing: “On the other hand, the number of homicides around the world is well modeled as exp(k * income_gini_coefficient).” I thought this was worth checking; So I did! It turns out to be more or less correct.
Here’s a graph I drew of murders per 100,000 residents on the vertical axis versus income or consumption inequality on the horizontal axis. Especially in poorer countries, consumption is the best way to measure poverty and inequality. Therefore, this data, which comes from the World Bank, is used when available.
It’s not a major point, but the model implied by Professor Lakens would be a straight diagonal line, which is what you would get if you just used geom_smooth(method = lm, formula = y ~ x - 1). I have shown not his, but two slightly more complex models.
Here’s the code that downloads the data, fits those models, and draws the graph. A few things to note:
- It takes a bit of complicated footwork to determine which countries I label on the map, and I also ended up choosing a few manually and ad hoc because I was interested in them.
- Although the y-axis in this chart uses a logarithm transformation, for the model itself I did not transform the response variable with a logarithm, but with a BoxCox transformation which still works if countries have zero homicides in a given year. I played with alternatives, such as using a generalized linear model with a log link (which still works if the data is observed to be zero, as it is predicted/expected to be slightly more), but decided to avoid this so I could use
lmewhich allows me to specify autocorrelated error terms. - Since the model and plot are both a bit complicated, I set up my own prediction grid and calculated the expected values and rough (i.e. assuming asymptotic normality) confidence intervals instead of using
marginaleffectspackage to take something out of the box. - The drawn ribbons show the expected murder rate for the country with a random country effect closest to zero, which happens to be the Czech Republic with the data at the time of writing (October 17, 2025).
- The ribbons are also based on the assumption that the hypothetical country they refer to has an average inequality (over all years for which observations exist) as shown on the horizontal axis. This point is quite crucial.
library(tidyverse)
library(WDI)
library(nlme)
library(forecast) # for chosing box cox parameters
library(AICcmodavg) # for predictSE with lmer
library(ggrepel)
library(GGally)
# Analysis to follow up this remark:
# https://mastodon.sdf.org/@dlakelan/115055789165555934
# "On the other hand, the homicide rate across the world
# is well modeled as exp(k * income_gini_coefficient)."
#--------Download World Bank data and prep it------------
# These searches were used to find series with a lot of data for country-year combinations:
# WDIsearch("homicide")
# WDIsearch("Gini")
# Metadata for SI.POV.GINI at https://data360files.worldbank.org/data360-data/metadata/WB_WDI/WB_WDI_SI_POV_GINI.pdf
# Key points include:
# * individuals (not household)
# * adjusted for household size
# * can be income or consumption depending on what was available
d <- WDI(indicator = c(homicide = "VC.IHR.PSRC.P5",
gini = "SI.POV.GINI")) |>
as_tibble()
# Countries we are going to want to highlight in our chart
highlights <- c("United States", "Russian Federation",
"Samoa", "Australia", "United Kingdom",
"Fiji", "Mexico", "Papua New Guinea")
# There are a few country-years with zero homicides so can't use a simple
# logarithm transformation, but can have a Box-Cox that has a similar result
# Choose a lambda that gives a distribution after the transformation where
# changes are relatively stable in absolute terms:
lambda <- forecast::BoxCox.lambda(d$homicide)
# version of the data we will use for plotting and modelling:
d2 <- d |>
drop_na() |>
group_by(country) |>
mutate(type = ifelse(year == max(year), "Latest", "Earlier")) |>
mutate(ctry_avg_gini = mean(gini)) |>
ungroup() |>
mutate(label = ifelse(type == "Latest" & (gini > 53 | homicide > 35 |
gini < 26 | homicide < 0.26 | country %in% highlights),
country, ""),
hom_tran = BoxCox(homicide, lambda = lambda),
country = as.factor(country)) |>
arrange(year)
#------------Modelling and predictions (for drawing 95% confidence intervals on charts)----
# you could have random slopes too but a fair number of countries have only one
# or two observations
# see https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2017.00058/full
# for example which says you want at least 6 observations per group to have
# a random slope. Not sure how many you need, but 1 isn't enough :)
# Barchart of number of observations per country, used later in the blog post
p1 <- d2 |>
count(country, name = "n_obs") |>
count(n_obs, name = "n_countries") |>
ggplot(aes(x = n_obs, y = n_countries)) +
geom_col(fill = "steelblue") +
labs(x = "Number of observations (ie years)",
y = "Number of countries",
title = "Complete data for Gini coefficient and homicide rates")
print(p1)
# super simple model, not mentioned in the actual blog post
model1 <- lm(hom_tran ~ gini, data = filter(d2, type == "Latest"))
# multilevel model with a country random intercept, and inequality only at the lowest level
model2 <- lme(hom_tran ~ gini, random = ~1 | country,
data = d2, correlation = corAR1())
# multilevel model with country random intercept and countries' average inequality, as well
# as the lowest level (country-year) granularity inequality:
model3 <- lme(hom_tran ~ gini + ctry_avg_gini, random = ~1 | country,
data = d2, correlation = corAR1())
# Fairly high (and similar) coefficients, about 0.11, but for differ
summary(model1) # 0.12 for gini
summary(model3) # 0.11 for ctry_avg_gini; gini not significant
# Relatively low coefficients - the country randomness
# soaks up a lot of the randomness:
summary(model2) # gini not significant
# Is the average random country effect basically zero? - check:
stopifnot(round(mean(ranef(model2)[[1]]), 10) == 0)
stopifnot(round(mean(ranef(model3)[[1]]), 10) == 0)
# Find the country with random effect closest to zero. Needed for predictions
# to draw an average country ribbon on the chart, even though the country effect
# is actually not taken into account in predictSE
avg_country <- ranef(model3) |>
arrange(abs(`(Intercept)`)) |>
slice(1) |>
row.names()
pred_grid <- tibble(gini = 20:65,
ctry_avg_gini = 20:65,
country = avg_country)
pse1 <- predict(model1, newdata = pred_grid, se = TRUE)
pse2 <- predictSE(model2, newdata = pred_grid, se = TRUE)
pse3 <- predictSE(model3, newdata = pred_grid, se = TRUE)
pred_grid <- pred_grid |>
mutate(predicted1 = pse1$fit,
lower1 = predicted1 - 1.96 * pse1$se.fit,
upper1 = predicted1 + 1.96 * pse1$se.fit) |>
mutate(predicted2 = pse2$fit,
lower2 = predicted2 - 1.96 * pse2$se.fit,
upper2 = predicted2 + 1.96 * pse2$se.fit) |>
mutate(predicted3 = pse3$fit,
lower3 = predicted3 - 1.96 * pse3$se.fit,
upper3 = predicted3 + 1.96 * pse3$se.fit) |>
mutate(across(predicted1:upper3, function(x){InvBoxCox(x, lambda = lambda)}))
#------------------Draw chart--------------------
mod_cols <- c("purple", "darkgreen", "brown", "pink")
p2 <- d2 |>
ggplot(aes(x = gini, y = homicide)) +
scale_y_log10() +
xlim(18, 70) +
scale_shape_manual(values = c(1, 19)) +
scale_colour_manual(values = c("steelblue", "black")) +
labs(x = "Inequality (Gini coefficient), based on individual income or in some cases, consumption. Higher means more inequality.",
y = "Homicide rate (per 100,000)",
colour = "Observation year:",
shape = "Observation year:",
title = "Higher inequality countries have more homicides.",
caption = "Source: World Bank, World Development Indicators, series VC.IHR.PSRC.P5 and SI.POV.GINI. Analysis by freerangestats.info.")
p2a <- p2 +
# model2 - just country-year level data, country random effect
geom_ribbon(data = pred_grid, aes(ymin = lower2, ymax = upper2, y = NA),
fill = mod_cols[2], alpha = 0.2) +
#model3 - country-year but also country average data, country random effect
geom_ribbon(data = pred_grid, aes(ymin = lower3, ymax = upper3, y = NA),
fill = mod_cols[3], alpha = 0.2) +
geom_point(aes(shape = type, colour = type)) +
geom_label_repel(aes(label = label), max.overlaps = Inf, colour = "grey10", size = 2.8,
seed = 123, label.size = unit(0, "mm"), fill = rgb(0,0,0, 0.04)) +
# annotations - text and arrows - for models 2 and 3
annotate("text", x = 61.5, y = 0.9, colour = mod_cols[2], hjust = 0, vjust = 1,
label = str_wrap("Model with no such average country inequality effect.", 24)) +
annotate("text", x = 61.5, y = 200, colour = mod_cols[3], hjust = 0, vjust = 1,
label = str_wrap("Model including an effect for average inequality in each
country over time.", 26)) +
annotate("segment", x = 64, xend = 64, y = 1, yend = 2, colour = mod_cols[2],
arrow = arrow(length = unit(2, "mm"))) +
annotate("segment", x = 64, xend = 64, y = 75, yend = 45, colour = mod_cols[3],
arrow = arrow(length = unit(2, "mm"))) +
labs(subtitle = "Selected countries highlighted.
Shaded ribbons show 95% confidence interval of mixed effects models with random country effect and autocorrelated error terms.",
)
print(p2a)Actually, at first I only had that green line…model2 in the code, which is quite flat. Of course, I noticed that it doesn’t seem to go through the data points. At first I thought this was because I was smart and built in a random effect at the country level, allowing for autocorrelation over time for the multiple observations of each country. That is, I thought I had revealed a fundamental truth about a non-relationship that more naive models seemed to demonstrate a relationship.
To extract that model (the one with the green line prominently below the data) from the chaos of code above, it is model2 and it looks like this:
model2 <- lme(hom_tran ~ gini, random = ~1 | country,
data = d2, correlation = corAR1())This seemed the most obvious to me. Why not just do something ordinary with least squares? Because many of the countries have multiple observations, as we see in this plot (which was created in the code snippet we already saw above):
I would say that what is most often done in this situation is to filter the data to just one observation per country – perhaps taking the last observation for each country, or the observation closest to a year chosen precisely because most countries have an observation in that year or close to it.
That seemed like a sad waste of data to me (although you can see in my code above that I fit such a model, model1and if you want to look at it, it avoids the problem I’m about to talk about!). So my idea was to use all the data in a mixed-effects model, with a random intercept at the country level to take into account the fact that any new observation about a country, while certainly worth somethingis not worth as much as an independent observation because it will be comparable to the other observations for that country. To do this even better, in addition to the country-level random intercept, I throw in an autocorrelation of the residuals, which shows that the values in one year are expected to be correlated with those in the previous year (which indeed turns out to be the case).
That’s all well and good, but in this case it backfired on me. When I first drew this plot, with just the green line of model2 When we showed this, I thought, “Oh, that’s cool, if we correct for the fact that many of these observations are low-value repeats of the previous observations from the same country, it turns out that there is very little connection between inequality and murder anyway.” But more thinking, and looking at some residuals, and comparing it with the more positive results of the much simpler ones model1quickly made me realize I was barking up the wrong tree.
The problem is best illustrated by this map I drew somewhere along the way. I realized that since I have a random intercept that is different for each country, the expected value for that country from the model will vary greatly from country to country. But then the modeling of homicide inequality will start from that point – the model will try to explain the variance in homicide rates. in this particular country based on his observations.
If we also had a random slope, the result would look closer to the dark gray lines in these graphs – a different relationship for each country. But I had not done this because I felt that there were simply not enough observations per country for a random slope (many countries actually have only one observation: economic inequality is difficult and expensive to measure, and requires a special survey of household consumption if possible). So my models look more like the light red ribbon shown here:
What happens is that there are two things going on:
- Inequality between countries do a pretty good job of explaining the differences in homicide rates.
- Inside in any country that makes observations over several years, there is much less correlation between the two variables; and where there is a connection, it varies from country to country. Judging from the black dot representing the most recent point, Australia and the United States appear to have become more unequal over time, but have lower homicide rates; Mexico has seen less inequality and more murders; whereas Russia, Ukraine and Britain have seen both inequality and homicide rates decline. In other words, for Russia, Ukraine, and Britain, the within-country data are consistent with the positive between-country relationship between inequality and homicide, but Australia, the US, and Mexico have a negative relationship between the two variables.
Another way to look at this is to say that the random intercept of countries absorbs variation in homicide rates, which could instead be explained by that country’s persistent average inequality. But we do not have “permanent, average inequality” as a variable in model2. If we calculate a proxy for that in the obvious way (the average of the inequality values we have for that country, regardless of the fact that they come from different years for each country), we can look at the relationship between this “permanent inequality” and the country that intercepts it. model2and we see a strong positive correlation. That’s the middle left panel (a scatterplot) in the graph below, as well as the middle top correlation (0.608, clearly different from zero).
The other variable in that scatter plot matrix is the country effects model3which is specified as follows:
model3 <- lme(hom_tran ~ gini + ctry_avg_gini, random = ~1 | country,
data = d2, correlation = corAR1())The difference of model2 is that the case now ctry_avg_gini there as our estimate of persistent average inequality across countries. We still have that giniwhat is the value of inequality that varies year by year for this country. This will strengthen the average effect within the countries.
In the scatterplot matrix above, we can see that once we include each country’s average inequality in the model, there is no longer a relationship between that variable and the country-level random effects. This is to be expected, precisely because the model adjustment process is designed for this.
Here you will find the summary results for model3:
> summary(model3)
Linear mixed-effects model fit by REML
Data: d2
AIC BIC logLik
2542.301 2574.914 -1265.15
Random effects:
Formula: ~1 | country
(Intercept) Residual
StdDev: 0.8371227 0.7065144
Correlation Structure: AR(1)
Formula: ~1 | country
Parameter estimate(s):
Phi
0.7574314
Fixed effects: hom_tran ~ gini + ctry_avg_gini
Value Std.Error DF t-value p-value
(Intercept) -3.193199 0.3950271 1554 -8.083494 0.0000
gini 0.007850 0.0056463 1554 1.390213 0.1647
ctry_avg_gini 0.113641 0.0117143 141 9.701081 0.0000
Correlation:
(Intr) gini
gini 0.001
ctry_avg_gini -0.858 -0.482
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-11.47176745 -0.44184320 -0.04595294 0.42362704 3.06647091
Number of Observations: 1698
Number of Groups: 143
We see the strong effect of the average inequality effect at the country level, and no significant evidence of an effect of inequality within countries. Interesting! But it is beyond my scope to go into that today.
Finally, I’d like to conclude with a clean plot of my best model and the phenomenon in question: the strong empirical relationship between a country’s economic inequality and its homicide rate.
Here’s the code for that final clean plot:
p2b <- p2 +
#model3 - country-year but also country average data, country random effect
geom_ribbon(data = pred_grid, aes(ymin = lower3, ymax = upper3, y = NA),
fill = mod_cols[3], alpha = 0.2) +
geom_point(aes(shape = type, colour = type)) +
geom_label_repel(aes(label = label), max.overlaps = Inf, colour = "grey10", size = 2.8,
seed = 123, label.size = unit(0, "mm"), fill = rgb(0,0,0, 0.04))
print(p2b)Note that this builds on the object p2 that was the basis of the first plot, which showed both model2 And model3to prevent recurrence.
And here is the code for playing with residuals at different levels and drawing that facet plot version with 12 selected countries:
#---------------residuals from model2 and model3---------------
# Individual level residuals - plot not used in blog
p3 <- d2 |>
mutate(`Residuals from Model 2` = residuals(model2),
`Residuals from Model 3` = residuals(model3)) |>
select(gini, `Residuals from Model 2`, `Residuals from Model 3`) |>
gather(variable, value, -gini) |>
ggplot(aes(x = gini, y = value)) +
facet_wrap(~variable) +
geom_point() +
labs(x = "Inequality",
y = "Residuals (on Box-Cox transformed scale)",
title = "Residuals from two mixed-effects models of homicide",
subtitle = "Residuals at the most granular level ie year-country are uncorrelated with inequality")
svg_png(p3, "..https://freerangestats.info/img/0303-residuals", w = 8, h = 4)
# out of curiousity what are those low outlier residuals? - Iceland and Malta
round(sort(residuals(model2)),2)[1:5]
round(sort(residuals(model3)),2)[1:5]
# Country level effects plot - pairs plot used in blog
p4 <- function(){
p <- d2 |>
distinct(country, ctry_avg_gini) |>
arrange(country) |>
mutate(`Country effects in Model 2` = ranef(model2)[[1]],
`Country effects in Model 3` = ranef(model3)[[1]]) |>
select(-country) |>
rename(`Countries' average inequality` = ctry_avg_gini) |>
ggpairs() +
labs(title = "Comparison of country-level random effects in two models",
subtitle = "Model 3 has a fixed effect for countries' average inequality; Model 2 doesn't.
The country-level residuals are correlated with this variable in Model 2, but not Model 3.")
print(p)
}
p4()
#---------------further illustrations - facets----------------------
d3 <- d2 |>
filter(country %in% c(highlights, "South Africa", "France", "Japan", "Ukraine")) |>
group_by(country) |>
summarise(r = coef(lm(log10(homicide) ~ gini))[2]) |>
mutate(r = replace_na(r, 0)) |>
mutate(country = fct_reorder(country, r) |> fct_drop()) |>
arrange(country)
pred_grid2 <- d2 |>
filter(country %in% d3$country) |>
distinct(country, ctry_avg_gini) |>
expand_grid(gini = 20:70)
# Note that predictSE excludes random country effect....
more_preds2 <- predictSE(model2, newdata = pred_grid2, se = TRUE)
more_preds3 <- predictSE(model3, newdata = pred_grid2, se = TRUE)
pred_grid2 <- pred_grid2 |>
mutate(predicted2 = more_preds2$fit,
lower2 = predicted2 - 1.96 * more_preds2$se.fit,
upper2 = predicted2 + 1.96 * more_preds2$se.fit,
predicted3 = more_preds3$fit,
lower3 = predicted3 - 1.96 * more_preds3$se.fit,
upper3 = predicted3 + 1.96 * more_preds3$se.fit) |>
mutate(across(predicted2:upper3, function(x){InvBoxCox(x, lambda = lambda)}))
p5 <- d2 |>
mutate(country = factor(country, levels = levels(d3$country))) |>
drop_na() |>
ggplot(aes(x = gini, y = homicide)) +
scale_y_log10(label = comma, limits = c(0.1, 100)) +
xlim(18, 70) +
scale_shape_manual(values = c(1, 19)) +
scale_colour_manual(values = c("steelblue", "black")) +
geom_ribbon(data = pred_grid2, aes(ymin = lower3, ymax = upper3, y = NA),
fill = mod_cols[3], alpha = 0.2) +
geom_smooth(method = lm, se = FALSE, colour = "grey50", fullrange = TRUE, linewidth = 0.5) +
geom_point(aes(shape = type, colour = type)) +
facet_wrap(~country, ncol = 3) +
labs(x = "Inequality (Gini coefficient), based on individual income or in some cases, consumption. Higher means more inequality.",
y = "Homicide rate (per 100,000)",
colour = "Observation year:",
shape = "Observation year:",
title = "Higher inequality countries have more homicides.",
subtitle = "But for any given country, the relationship over time may well be the reverse.
Pale ribbons show a model's confidence interval for homicide for a country's average over time level of inequality.
Dark line shows simple model fit to just this country's data.",
caption = "Source: World Bank, World Development Indicators, series VC.IHR.PSRC.P5 and SI.POV.GINI. Analysis by freerangestats.info.")
print(p5)The complete code set, in order, can be found on GitHub if the above presentation is confusing.
Related
#Inequality #murder #countries #ellis2013nz #bloggers


