Do you want to share your content on R-bloggers? Click here if you have a blog, or here If you don’t.
If this message is useful for you, I kindly request a minimum donation Buy a coffee for me. It will be used to continue my open source efforts. The complete explanation is here: A personal message from an Open Source employee.
You can send me questions for the blog using This form And subscribe to receive an e -mail when there is a new message.
I made a new R -package with the name Valious Correlations This is intended to help educators explain why correlation does not implies a causal link. What I had in mind were AP statistics courses and introductory statistics courses at university level.
The package contains a data set with 15 false correlations. You can install it from Cran with:
# install.packages("spuriouscorrelations", repos = "https://cran.rstudio.com")
library(spuriouscorrelations)Let’s turn off one of the false correlations, for example the correlation between the number of people drowned by falling into a pool and the number of films that Nicolas Cage appeared in:
library(dplyr) library(ggplot2) spurious_correlations %>% distinct(var1_short, var2_short)
# A tibble: 15 × 2 var1_short var2_short1 US spending on science Suicides 2 Falling into a pool drownings Nicholas Cage 3 Cheese consumed Bedsheet tanglings 4 Divorce rate in Maine Margarine consumed 5 Age of Miss America Murders by steam 6 Arcade revenue Computer science doctorates 7 Space launches Sociology doctorates 8 Mozzarella cheese consumption Engineering doctorates 9 Fishing boat deaths Kentucky marriages 10 US crude oil imports from Norway Railway train collisions 11 Chicken consumption US crude oil imports 12 Swimming pool drownings Nuclear power plants 13 Japanese cars sold Suicides by crashing 14 Spelling bee letters Spider deaths 15 Math doctorates Uranium stored
nic_cage <- filter(spurious_correlations, var2_short == "Nicholas Cage") cor(nic_cage$var1_value, nic_cage$var2_value)
ggplot(nic_cage, aes(x = var1_value, y = var2_value)) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
title = "Spurious Correlation: Drownings vs. Nicolas Cage Films",
x = "Number of Drownings by Falling into a Pool",
y = "Number of Films Nicolas Cage Appeared In"
) +
theme_minimal()
With a correlation of 67%, we can even fit a linear model:
lm_model <- lm(var2_value ~ var1_value, data = nic_cage) summary(lm_model)
Call:
lm(formula = var2_value ~ var1_value, data = nic_cage)
Residuals:
Min 1Q Median 3Q Max
-0.9308 -0.5926 -0.1020 0.4836 1.6026
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.37515 2.86726 -1.875 0.0936 .
var1_value 0.07620 0.02845 2.678 0.0253 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8678 on 9 degrees of freedom
Multiple R-squared: 0.4436, Adjusted R-squared: 0.3817
F-statistic: 7.174 on 1 and 9 DF, p-value: 0.02527With a declared variance of 38% (adapted R-squared) we can say that the number of drowning is a statistically significant predictor of the number of films in which Nicolas Cage appeared. However, this is a false correlation, and there is no causal relationship between these two variables, even with a p-value of 0.025.
Now let’s compare with a double y-axle plot:
# install.packages("tintin", repos = "https://cran.rstudio.com")
library(tidyr)
library(tintin)
cor_val <- cor(nic_cage$var1_value, nic_cage$var2_value)
# Align the two series visually
v1 <- nic_cage$var1_value
v2 <- nic_cage$var2_value
fun_adjust <- function(v1, v2) {
s1 <- sd(v1, na.rm = TRUE)
s2 <- sd(v2, na.rm = TRUE)
a <- ifelse(s2 == 0, 1, s1 / s2)
b <- mean(v1, na.rm = TRUE) - a * mean(v2, na.rm = TRUE)
c(a = as.numeric(a), b = as.numeric(b))
}
adjust <- fun_adjust(v1, v2)
scale_a <- adjust["a"]
scale_b <- adjust["b"]
y1_title <- as.character(unique(nic_cage$var1))
y2_title <- as.character(unique(nic_cage$var2))
nic_cage_long <- nic_cage %>%
select(year, var1_value, var2_value) %>%
pivot_longer(
cols = c(var1_value, var2_value),
names_to = "variable",
values_to = "value"
) %>%
mutate(
variable_label = case_when(
variable == "var1_value" ~ y1_title,
variable == "var2_value" ~ y2_title
),
# apply transform to var2 for plotting: plot_value = a * var2 + b
plot_value = ifelse(variable == "var2_value", value * scale_a + scale_b, value)
)
# make a double y axis plot with year on the x axis
ggplot(nic_cage_long, aes(x = year)) +
geom_line(aes(y = plot_value, color = variable_label, group = variable_label), linewidth = 1.5) +
geom_point(aes(y = plot_value, color = variable_label), size = 3) +
labs(
x = "Year",
y = y1_title,
title = sprintf("%s\nvs\n%s\n", y1_title, y2_title),
subtitle = sprintf("Correlation: %.2f", cor_val),
color = ""
) +
# display all years on the x axis
scale_x_continuous(breaks = nic_cage$year) +
# primary y axis is the var1 scale
# secondary shows var2 original scale by inverse-transforming
scale_y_continuous(
sec.axis = sec_axis(~ (. - scale_b) / scale_a, name = y2_title)
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top") +
# use tintin color palette
scale_colour_manual(
values = tintin_pal(option = "the black island")(2),
name = ""
) +
# center title and subtitle
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0.5)
)
What about other false correlations? Here there is one:
engineering_doctorates <- filter(spurious_correlations, var2_short == "Engineering doctorates")
cor_val <- cor(engineering_doctorates$var1_value, engineering_doctorates$var2_value)
v1 <- engineering_doctorates$var1_value
v2 <- engineering_doctorates$var2_value
adjust <- fun_adjust(v1, v2)
scale_a <- adjust["a"]
scale_b <- adjust["b"]
y1_title <- as.character(unique(engineering_doctorates$var1))
y2_title <- as.character(unique(engineering_doctorates$var2))
engineering_doctorates_long <- engineering_doctorates %>%
select(year, var1_value, var2_value) %>%
pivot_longer(
cols = c(var1_value, var2_value),
names_to = "variable",
values_to = "value"
) %>%
mutate(
variable_label = case_when(
variable == "var1_value" ~ y1_title,
variable == "var2_value" ~ y2_title
),
# apply transform to var2 for plotting: plot_value = a * var2 + b
plot_value = ifelse(variable == "var2_value", value * scale_a + scale_b, value)
)
# make a double y axis plot with year on the x axis
ggplot(engineering_doctorates_long, aes(x = year)) +
geom_line(aes(y = plot_value, color = variable_label, group = variable_label), linewidth = 1.5) +
geom_point(aes(y = plot_value, color = variable_label), size = 3) +
labs(
x = "Year",
y = y1_title,
title = sprintf("%s\nvs\n%s\n", y1_title, y2_title),
subtitle = sprintf("Correlation: %.2f", cor_val),
color = ""
) +
# display all years on the x axis
scale_x_continuous(breaks = nic_cage$year) +
# primary y axis is the var1 scale
# secondary shows var2 original scale by inverse-transforming
scale_y_continuous(
sec.axis = sec_axis(~ (. - scale_b) / scale_a, name = y2_title)
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top") +
# use tintin color palette
scale_colour_manual(
values = tintin_pal(option = "the black island")(2),
name = ""
) +
# center title and subtitle
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0.5)
)
I can go ad -sickness With these false correlations. The point is that Correlation does not mean causal connectionAnd we must be careful when interpreting correlations.
Related
#Valish #correlations #Correlation #causal #connection #RBloggers

