Valish correlations in R – Correlation is not a causal connection | R-Bloggers

Valish correlations in R – Correlation is not a causal connection | R-Bloggers

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report problems here about the content on this page)


Do you want to share your content on R-bloggers? Click here if you have a blog, or here If you don’t.

If this message is useful for you, I kindly request a minimum donation Buy a coffee for me. It will be used to continue my open source efforts. The complete explanation is here: A personal message from an Open Source employee.

You can send me questions for the blog using This form And subscribe to receive an e -mail when there is a new message.

I made a new R -package with the name Valious Correlations This is intended to help educators explain why correlation does not implies a causal link. What I had in mind were AP statistics courses and introductory statistics courses at university level.

The package contains a data set with 15 false correlations. You can install it from Cran with:

# install.packages("spuriouscorrelations", repos = "https://cran.rstudio.com")
library(spuriouscorrelations)

Let’s turn off one of the false correlations, for example the correlation between the number of people drowned by falling into a pool and the number of films that Nicolas Cage appeared in:

library(dplyr)
library(ggplot2)

spurious_correlations %>%
  distinct(var1_short, var2_short)
# A tibble: 15 × 2
   var1_short                       var2_short                 
                                                     
 1 US spending on science           Suicides                   
 2 Falling into a pool drownings    Nicholas Cage              
 3 Cheese consumed                  Bedsheet tanglings         
 4 Divorce rate in Maine            Margarine consumed         
 5 Age of Miss America              Murders by steam           
 6 Arcade revenue                   Computer science doctorates
 7 Space launches                   Sociology doctorates       
 8 Mozzarella cheese consumption    Engineering doctorates     
 9 Fishing boat deaths              Kentucky marriages         
10 US crude oil imports from Norway Railway train collisions   
11 Chicken consumption              US crude oil imports       
12 Swimming pool drownings          Nuclear power plants       
13 Japanese cars sold               Suicides by crashing       
14 Spelling bee letters             Spider deaths              
15 Math doctorates                  Uranium stored             
nic_cage <- filter(spurious_correlations, var2_short == "Nicholas Cage")

cor(nic_cage$var1_value, nic_cage$var2_value)
ggplot(nic_cage, aes(x = var1_value, y = var2_value)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Spurious Correlation: Drownings vs. Nicolas Cage Films",
    x = "Number of Drownings by Falling into a Pool",
    y = "Number of Films Nicolas Cage Appeared In"
  ) +
  theme_minimal()

With a correlation of 67%, we can even fit a linear model:

lm_model <- lm(var2_value ~ var1_value, data = nic_cage)
summary(lm_model)
Call:
lm(formula = var2_value ~ var1_value, data = nic_cage)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9308 -0.5926 -0.1020  0.4836  1.6026 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -5.37515    2.86726  -1.875   0.0936 .
var1_value   0.07620    0.02845   2.678   0.0253 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8678 on 9 degrees of freedom
Multiple R-squared:  0.4436,    Adjusted R-squared:  0.3817 
F-statistic: 7.174 on 1 and 9 DF,  p-value: 0.02527

With a declared variance of 38% (adapted R-squared) we can say that the number of drowning is a statistically significant predictor of the number of films in which Nicolas Cage appeared. However, this is a false correlation, and there is no causal relationship between these two variables, even with a p-value of 0.025.

Now let’s compare with a double y-axle plot:

# install.packages("tintin", repos = "https://cran.rstudio.com")
library(tidyr)
library(tintin)

cor_val <- cor(nic_cage$var1_value, nic_cage$var2_value)

# Align the two series visually
v1 <- nic_cage$var1_value
v2 <- nic_cage$var2_value

fun_adjust <- function(v1, v2) {
  s1 <- sd(v1, na.rm = TRUE)
  s2 <- sd(v2, na.rm = TRUE)
  a <- ifelse(s2 == 0, 1, s1 / s2)
  b <- mean(v1, na.rm = TRUE) - a * mean(v2, na.rm = TRUE)
  c(a = as.numeric(a), b = as.numeric(b))
}

adjust <- fun_adjust(v1, v2)
scale_a <- adjust["a"]
scale_b <- adjust["b"]

y1_title <- as.character(unique(nic_cage$var1))
y2_title <- as.character(unique(nic_cage$var2))

nic_cage_long <- nic_cage %>%
  select(year, var1_value, var2_value) %>%
  pivot_longer(
    cols = c(var1_value, var2_value),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable_label = case_when(
      variable == "var1_value" ~ y1_title,
      variable == "var2_value" ~ y2_title
    ),
  # apply transform to var2 for plotting: plot_value = a * var2 + b
  plot_value = ifelse(variable == "var2_value", value * scale_a + scale_b, value)
  )

# make a double y axis plot with year on the x axis
ggplot(nic_cage_long, aes(x = year)) +
  geom_line(aes(y = plot_value, color = variable_label, group = variable_label), linewidth = 1.5) +
  geom_point(aes(y = plot_value, color = variable_label), size = 3) +
  labs(
    x = "Year",
    y = y1_title,
    title = sprintf("%s\nvs\n%s\n", y1_title, y2_title),
    subtitle = sprintf("Correlation: %.2f", cor_val),
    color = ""
  ) +
  # display all years on the x axis
  scale_x_continuous(breaks = nic_cage$year) +
  # primary y axis is the var1 scale
  # secondary shows var2 original scale by inverse-transforming
  scale_y_continuous(
    sec.axis = sec_axis(~ (. - scale_b) / scale_a, name = y2_title)
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top") +
  # use tintin color palette
  scale_colour_manual(
    values = tintin_pal(option = "the black island")(2), 
    name = ""
  ) +
  # center title and subtitle
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )

What about other false correlations? Here there is one:

engineering_doctorates <- filter(spurious_correlations, var2_short == "Engineering doctorates")

cor_val <- cor(engineering_doctorates$var1_value, engineering_doctorates$var2_value)

v1 <- engineering_doctorates$var1_value
v2 <- engineering_doctorates$var2_value

adjust <- fun_adjust(v1, v2)
scale_a <- adjust["a"]
scale_b <- adjust["b"]

y1_title <- as.character(unique(engineering_doctorates$var1))
y2_title <- as.character(unique(engineering_doctorates$var2))

engineering_doctorates_long <- engineering_doctorates %>%
  select(year, var1_value, var2_value) %>%
  pivot_longer(
    cols = c(var1_value, var2_value),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable_label = case_when(
      variable == "var1_value" ~ y1_title,
      variable == "var2_value" ~ y2_title
    ),
  # apply transform to var2 for plotting: plot_value = a * var2 + b
  plot_value = ifelse(variable == "var2_value", value * scale_a + scale_b, value)
  )

# make a double y axis plot with year on the x axis
ggplot(engineering_doctorates_long, aes(x = year)) +
  geom_line(aes(y = plot_value, color = variable_label, group = variable_label), linewidth = 1.5) +
  geom_point(aes(y = plot_value, color = variable_label), size = 3) +
  labs(
    x = "Year",
    y = y1_title,
    title = sprintf("%s\nvs\n%s\n", y1_title, y2_title),
    subtitle = sprintf("Correlation: %.2f", cor_val),
    color = ""
  ) +
  # display all years on the x axis
  scale_x_continuous(breaks = nic_cage$year) +
  # primary y axis is the var1 scale
  # secondary shows var2 original scale by inverse-transforming
  scale_y_continuous(
    sec.axis = sec_axis(~ (. - scale_b) / scale_a, name = y2_title)
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top") +
  # use tintin color palette
  scale_colour_manual(
    values = tintin_pal(option = "the black island")(2), 
    name = ""
  ) +
  # center title and subtitle
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )

I can go ad -sickness With these false correlations. The point is that Correlation does not mean causal connectionAnd we must be careful when interpreting correlations.


#Valish #correlations #Correlation #causal #connection #RBloggers

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *