Efficiently merging multiple data sets | R-Bloggers

Efficiently merging multiple data sets | R-Bloggers

5 minutes, 53 seconds Read

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report problems here about the content on this page)


Do you want to share your content on R-bloggers? Click here if you have a blog, or here If you don’t.

Due to delays with my stock market payment, if this message is useful for you, I kindly request a minimal donation Buy a coffee for me That will be used to continue my open source efforts. If you need an R package or a shiny dashboard for your team, you can e -mail me or ask me Fiverr. The complete explanation is here: A personal message from an Open Source employee

You can send me questions for the blog using This form And subscribe to receive an e -mail when there is a new message.

I tend to see that my students repeat a lot of code when we have divided data over different tables. For example, when we have to read multiple CSV files that result in a structure comparable to the following tables made by deliberately subscribing the original Penguins set set set:

library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':

    penguins, penguins_raw
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(purrr)

species <- penguins %>%
  distinct(species) %>%
  pull()

map(
  species,
  function(s) {
    assign(
      paste0("penguins_", tolower(s)),
      filter(penguins, species == s),
      envir = .GlobalEnv
    )
  }
)
[[1]]
# A tibble: 152 Ă— 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                 
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex , year 

[[2]]
# A tibble: 124 Ă— 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                              
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex , year 

[[3]]
# A tibble: 68 Ă— 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                
 1 Chinstrap Dream            46.5          17.9               192        3500
 2 Chinstrap Dream            50            19.5               196        3900
 3 Chinstrap Dream            51.3          19.2               193        3650
 4 Chinstrap Dream            45.4          18.7               188        3525
 5 Chinstrap Dream            52.7          19.8               197        3725
 6 Chinstrap Dream            45.2          17.8               198        3950
 7 Chinstrap Dream            46.1          18.2               178        3250
 8 Chinstrap Dream            51.3          18.2               197        3750
 9 Chinstrap Dream            46            18.9               195        4150
10 Chinstrap Dream            51.3          19.9               198        3700
# ℹ 58 more rows
# ℹ 2 more variables: sex , year 
penguins_adelie
# A tibble: 152 Ă— 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                 
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex , year 
penguins_chinstrap
# A tibble: 68 Ă— 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                
 1 Chinstrap Dream            46.5          17.9               192        3500
 2 Chinstrap Dream            50            19.5               196        3900
 3 Chinstrap Dream            51.3          19.2               193        3650
 4 Chinstrap Dream            45.4          18.7               188        3525
 5 Chinstrap Dream            52.7          19.8               197        3725
 6 Chinstrap Dream            45.2          17.8               198        3950
 7 Chinstrap Dream            46.1          18.2               178        3250
 8 Chinstrap Dream            51.3          18.2               197        3750
 9 Chinstrap Dream            46            18.9               195        4150
10 Chinstrap Dream            51.3          19.9               198        3700
# ℹ 58 more rows
# ℹ 2 more variables: sex , year 
penguins_gentoo
# A tibble: 124 Ă— 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                              
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex , year 

A common operation that I have seen as my students use DPLYR is a variation on this:

penguins_adelie %>%
  bind_rows(penguins_chinstrap) %>% 
  bind_rows(penguins_gentoo)
# A tibble: 344 Ă— 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                 
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex , year 
bind_rows(penguins_adelie, penguins_chinstrap, penguins_gentoo)
# A tibble: 344 Ă— 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                 
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex , year 

A similar pattern appears when they decide to use base R with something like this

rbind(penguins_adelie, penguins_chinstrap, penguins_gentoo)
# A tibble: 344 Ă— 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                                 
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex , year 

There are many ways to achieve the same result in R, and an option is to use the operator “{}”:

# either
{penguins_adelie; penguins_chinstrap; penguins_gentoo} %>%
  list() %>%
  bind_rows()
# A tibble: 124 Ă— 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                              
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex , year 
# or
{
  penguins_adelie
  penguins_chinstrap
  penguins_gentoo
} %>%
  bind_rows()
# A tibble: 124 Ă— 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                                              
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex , year 

The earlier examples quickly become impractical when dealing with a large number of data sets. In such cases we can purrr Package to streamline the process.

A practical example is the UN Comtrade data, which are divided per country and year into thousands of CSV files (at some point I saved them as RDS to save disk space). To be used map_df To efficiently merge multiple data sets without repeating code, we can do the following:

# list Canadian data 1990-1999
canada_files <- list.files("~/Documents/uncomtrade/input/data", pattern = "_CAN_199",
  full.names = TRUE)

head(canada_files)

canada <- map_df(
  canada_files,
  readRDS
)

canada

> canada
# A tibble: 1,971,197 Ă— 34
   dataset_code      type_code freq_code ref_period_id ref_year ref_month period
                                             
 1 20124199001052100 C         A         19900101      1990     52        1990  
 2 20124199001052100 C         A         19900101      1990     52        1990  
 3 20124199001052100 C         A         19900101      1990     52        1990  
 4 20124199001052100 C         A         19900101      1990     52        1990  
 5 20124199001052100 C         A         19900101      1990     52        1990  
 6 20124199001052100 C         A         19900101      1990     52        1990  
 7 20124199001052100 C         A         19900101      1990     52        1990  
 8 20124199001052100 C         A         19900101      1990     52        1990  
 9 20124199001052100 C         A         19900101      1990     52        1990  
10 20124199001052100 C         A         19900101      1990     52        1990  
# ℹ 1,971,187 more rows
# ℹ 27 more variables: reporter_code , flow_code ,
#   partner_code , partner2code , classification_search_code ,
#   classification_code , is_original_classification ,
#   cmd_code , customs_code , mos_code , mot_code ,
#   qty_unit_code , qty , is_qty_estimated ,
#   alt_qty_unit_code , alt_qty , is_alt_qty_estimated , …
# ℹ Use `print(n = ...)` to see more rows

Then it is possible to perform various operations, such as total bilateral export that FOB and CIF reports to see the mismatch between the exporter and the importer:

canada %>%
  select(partner_code, fobvalue, cifvalue) %>%
  mutate(
    fobvalue = as.numeric(fobvalue),
    cifvalue = as.numeric(cifvalue)
  ) %>%
  group_by(partner_code) %>%
  summarise(
    fobvalue = sum(fobvalue, na.rm = TRUE),
    cifvalue = sum(cifvalue, na.rm = TRUE)
  )

> canada %>%
+   select(partner_code, fobvalue, cifvalue) %>%
+   mutate(
+     fobvalue = as.numeric(fobvalue),
+     cifvalue = as.numeric(cifvalue)
+   ) %>%
+   group_by(partner_code) %>%
+   summarise(
+     fobvalue = sum(fobvalue, na.rm = TRUE),
+     cifvalue = sum(cifvalue, na.rm = TRUE)
+   )
# A tibble: 219 Ă— 3
   partner_code fobvalue cifvalue
                  
 1 0             1.34e13        0
 2 100           1.44e 9        0
 3 104           4.18e 8        0
 4 108           2.58e 7        0
 5 112           1.26e 8        0
 6 116           1.77e 8        0
 7 12            2.22e10        0
 8 120           7.34e 8        0
 9 132           6.95e 6        0
10 136           2.32e 8        0
# ℹ 209 more rows
# ℹ Use `print(n = ...)` to see more rows


#Efficiently #merging #multiple #data #sets #RBloggers

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *