gssrdoc updates | R bloggers

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers.]

Regular readers know that I maintain gssr and gssrdoc, two packages for R. The former makes the General Social Survey’s annual, cumulative, and panel datasets available in a way that’s easy to use in R. The latter makes the survey’s codebook available in R’s integrated help system in a way that documents each GSS variable as if it were a function or object in R, so you can query them in exactly the same way as you would any function from the R console or in the IDE of your choice. As a bonus, because I use pkgdown to document the packages, I get a website as a side effect. In the case of gssrdoc this means a searchable index of all GSS variables. The GSS is the Hubble Space Telescope of American social science; it is our longest-running representative look at many aspects of the character and opinions of American households. The data is freely available from NORC, but they distribute it in SPSS, SAS, and Stata formats. I wrote these packages in an effort to make the data more readily available in R. If you want to know the relationship between these different platforms, I’ve got you covered. But the most important thing is that R is a free and open source project, and the others are not.

This week I spent some time updating gssrdoc to clean up how the help pages look and to make some other improvements. For example, within R you can type ?govdook at the console and have this appear in the help:

Yes, govdook is an abbreviation of ‘Gov Do OK’, not ‘Dook’.

The package also includes gss_doc, a data frame that contains all the information that makes up the help pages. I included it because it can be useful to work with directly, for example when you want to extract summary information about a subset of variables.

library(tibble)
library(gssrdoc)

gss_doc
#> # A tibble: 6,694 × 10
#>    variable description                           question         value_labels var_yrtab yrballot_df module_df subject_df norc_id norc_url
#>                                                                                      
#>  1 year     GSS year for this respondent          "GSS year"       "[NA(d)] do…                 1 https:/…
#>  2 id       Respondent id number                  "Respondent id … ""                           2 https:/…
#>  3 wrkstat  labor force status                    "Last week were… "[1] workin…                  3 https:/…
#>  4 hrs1     number of hours worked last week      "Last week were… "[89] 89+ h…                 4 https:/…
#>  5 hrs2     number of hours usually work a week   "Last week were… "[89] 89+ h…                  5 https:/…
#>  6 evwork   ever work as long as one year         "Last week were… "[1] yes / …                  6 https:/…
#>  7 occ      R's census occupation code (1970)     "A. What kind o… "[NA(d)] do…                 7 https:/…
#>  8 prestige r's occupational prestige score(1970) "A. What kind o… "[NA(d)] do…                  8 https:/…
#>  9 wrkslf   r self-emp or works for somebody      "A. What kind o… "[1] self-e…                  9 https:/…
#> 10 wrkgovt  govt or private employee              "A. What kind o… "[1] govern…                 10 https:/…
#> # ℹ 6,684 more rows
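
As a minimal sketch of that kind of direct use, the snippet below builds a small stand-in for gss_doc from a few of the rows printed above and pulls out the variables whose descriptions mention work. The name toy_doc is hypothetical and the stand-in has only two columns; with the package loaded, the same base R subsetting should apply to gss_doc itself.

```r
# A hypothetical stand-in for gss_doc, using rows from the printout above.
toy_doc <- data.frame(
  variable = c("wrkstat", "hrs1", "hrs2", "evwork", "wrkslf", "wrkgovt"),
  description = c(
    "labor force status",
    "number of hours worked last week",
    "number of hours usually work a week",
    "ever work as long as one year",
    "r self-emp or works for somebody",
    "govt or private employee"
  ),
  stringsAsFactors = FALSE
)

# Keep only the rows whose description mentions "work".
about_work <- toy_doc[grepl("work", toy_doc$description, ignore.case = TRUE), ]
about_work$variable
```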

The gss_doc object has regular columns, but also a series of list columns that (insert meme here, you know the one) put dataframes inside your dataframe. (They are labeled as “tibbles” here; it’s basically the same thing.)

What is a list column? What is a list? Well, a list is one of the basic ways to store data of any kind. Lists are useful because they can contain heterogeneous elements:

items <- list(
  todo_home = c("Laundry", "Clean bathroom", "Feed cat", "Bring out rubbish bins"),
  important_dates = as.Date(c("1776-07-04", "1788-06-21", "2025-01-18")),
  keycode = 8675309,
  storage_tiers = c(128, 256, 512, 1024)
)

items
#> $todo_home
#> [1] "Laundry"                "Clean bathroom"         "Feed cat"               "Bring out rubbish bins"
#> 
#> $important_dates
#> [1] "1776-07-04" "1788-06-21" "2025-01-18"
#> 
#> $keycode
#> [1] 8675309
#> 
#> $storage_tiers
#> [1]  128  256  512 1024

One thing to note with a list like this is that it doesn’t really make sense to display it as a table. This is partly because the elements of the list have different lengths, but really it is because, if we did represent it as a table, reading across the rows would mean nothing:

items_df
#> # A tibble: 4 × 4
#>   todo_home              important_dates keycode storage_tiers
#>                                          
#> 1 Laundry                1776-07-04      8675309           128
#> 2 Clean bathroom         1788-06-21           -            256
#> 3 Feed cat               2025-01-18           -            512
#> 4 Bring out rubbish bins -                    -           1024

The rows are not ‘cases’ of anything. We only have four unrelated categories with different pieces of information in them.
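
Base R makes the same point if we ask it to turn the list into a data frame. This is a small sketch that rebuilds the items list from above; the result name is only for illustration.

```r
items <- list(
  todo_home = c("Laundry", "Clean bathroom", "Feed cat", "Bring out rubbish bins"),
  important_dates = as.Date(c("1776-07-04", "1788-06-21", "2025-01-18")),
  keycode = 8675309,
  storage_tiers = c(128, 256, 512, 1024)
)

# Each element has its own type and its own length.
sapply(items, class)
lengths(items)

# Coercing to a data frame fails because the lengths do not line up.
# We catch the error and keep its message instead of stopping.
result <- tryCatch(
  as.data.frame(items),
  error = function(e) conditionMessage(e)
)
result
```

The element of length one would be recycled without complaint, but three dates cannot be stretched evenly across four rows, so R refuses.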

Lists are also useful because they lend themselves easily to being nested:

items <- list(
  todo_home = list(
    tasks = c("Laundry", "Clean bathroom", "Feed cat", "Bring out rubbish bins"),
    tobuy = c("Cat Food", "Burritos"), 
    wifi_password = "p@ssw0rd!"
  ),
  important_dates = as.Date(c("1776-07-04", "1788-06-21", "2025-01-18")),
  keycode = 8675309,
  storage_tiers = list(
    ssd = c(128, 256, 512, 1024),
    ram = c(1, 4, 8)
  )
)

items
#> $todo_home
#> $todo_home$tasks
#> [1] "Laundry"                "Clean bathroom"         "Feed cat"               "Bring out rubbish bins"
#> 
#> $todo_home$tobuy
#> [1] "Cat Food" "Burritos"
#> 
#> $todo_home$wifi_password
#> [1] "p@ssw0rd!"
#> 
#> 
#> $important_dates
#> [1] "1776-07-04" "1788-06-21" "2025-01-18"
#> 
#> $keycode
#> [1] 8675309
#> 
#> $storage_tiers
#> $storage_tiers$ssd
#> [1]  128  256  512 1024
#> 
#> $storage_tiers$ram
#> [1] 1 4 8
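
Getting things back out of a nested list is a matter of walking down the levels. A short sketch using the same items list:

```r
items <- list(
  todo_home = list(
    tasks = c("Laundry", "Clean bathroom", "Feed cat", "Bring out rubbish bins"),
    tobuy = c("Cat Food", "Burritos"),
    wifi_password = "p@ssw0rd!"
  ),
  important_dates = as.Date(c("1776-07-04", "1788-06-21", "2025-01-18")),
  keycode = 8675309,
  storage_tiers = list(
    ssd = c(128, 256, 512, 1024),
    ram = c(1, 4, 8)
  )
)

# Walk down the nesting one level at a time with $ ...
items$todo_home$tobuy

# ... or with [[ ]], which also accepts names held in variables.
items[["storage_tiers"]][["ram"]]

# Which top-level elements are themselves lists?
is_nested <- vapply(items, is.list, logical(1))
names(items)[is_nested]
```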

Essentially, R is a LISP/Scheme-like list processing language, combined with features of classical array languages such as APL. This is because in the world of data analysis we are constantly dealing with rectangular tables, or arrays, where rows are cases and columns are different types of variables. The wrinkle is that, unlike a beautiful set of pure numbers, each column can measure something (a date, a true/false answer, a location, a score, a nationality) that we’d rather not represent directly as a number. Sure, at the bottom of the computer everything is just ones and zeros. (Or more accurately, electromagnetic patterns in a physical substrate that we can interpret as ones and zeros.) And if we want to do any kind of data analysis that requires us to treat our table as a matrix, then we need numerical representations of all the columns. But for many applications we would like to see ‘France’ or ‘Strongly agree’ instead of ‘33’ or ‘5’. What we want is just a table with rows and columns, where different kinds of things can appear in different columns, but everything within a given column is the same kind of thing.
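
One way R handles the tension between labels and numbers is the factor. The sketch below is illustrative only: the answer codes, labels, and country names are made up, not taken from the GSS.

```r
# Hypothetical survey answers stored as numeric codes.
codes <- c(5, 1, 4, 5)
agree_labels <- c("Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree")

# A factor keeps integer codes underneath but displays the labels.
answers <- factor(codes, levels = 1:5, labels = agree_labels)
as.character(answers)
as.integer(answers)

# A data frame can hold differently typed columns side by side ...
df <- data.frame(
  country = c("France", "Ireland", "France", "Spain"),
  answer = answers,
  stringsAsFactors = FALSE
)
sapply(df, class)

# ... but a matrix cannot, so everything is coerced to a single type.
typeof(as.matrix(df))
```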

Such a rectangular table is called a data frame. One way to think of a data frame is as a special case of a list. A data frame is a list in which you can place all the list elements side by side and treat them as columns, and where all these elements are vectors of the same length. In addition, it is a list in which the nth element of each vector refers to a property of the same underlying entity, i.e. the thing in the row, or case; the thing whose columns show you its dimensions or properties. You can have blank entries if necessary, such as when a piece of data is missing. The important thing is that each column has as many slots as there are rows, and that the values for each row go in the same slot in every column. When you look at a table of data, one of your first questions should always be, “What is a row in this table?” In the case of gss_doc, each row is a variable in the full GSS data set and each column describes a property of that variable.

library(tibble)
library(gssrdoc)

gss_doc
#> # A tibble: 6,694 × 10
#>    variable description                           question         value_labels var_yrtab yrballot_df module_df subject_df norc_id norc_url
#>                                                                                      
#>  1 year     GSS year for this respondent          "GSS year"       "[NA(d)] do…                 1 https:/…
#>  2 id       Respondent id number                  "Respondent id … ""                           2 https:/…
#>  3 wrkstat  labor force status                    "Last week were… "[1] workin…                  3 https:/…
#>  4 hrs1     number of hours worked last week      "Last week were… "[89] 89+ h…                 4 https:/…
#>  5 hrs2     number of hours usually work a week   "Last week were… "[89] 89+ h…                  5 https:/…
#>  6 evwork   ever work as long as one year         "Last week were… "[1] yes / …                  6 https:/…
#>  7 occ      R's census occupation code (1970)     "A. What kind o… "[NA(d)] do…                 7 https:/…
#>  8 prestige r's occupational prestige score(1970) "A. What kind o… "[NA(d)] do…                  8 https:/…
#>  9 wrkslf   r self-emp or works for somebody      "A. What kind o… "[1] self-e…                  9 https:/…
#> 10 wrkgovt  govt or private employee              "A. What kind o… "[1] govern…                 10 https:/…
#> # ℹ 6,684 more rows
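
The view of a data frame as a list of equal-length columns can be checked directly in base R. A minimal sketch with a small hypothetical table, where each row is one person:

```r
# A small hypothetical data frame: each row is one person.
people <- data.frame(
  name = c("Ada", "Grace", "Alan"),
  born = c(1815, 1906, 1912),
  stringsAsFactors = FALSE
)

is.list(people)    # a data frame is a list ...
length(people)     # ... whose elements are its columns
people[["born"]]   # so list-style extraction returns a whole column
lengths(people)    # every column has one slot per row

# Blank entries are allowed: a missing value still occupies its slot.
people$born[2] <- NA
nrow(people)
```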

Because R was designed by statisticians (R is a descendant of S, which, like everything else in the computing world, has its origins in Bell Labs), it has this concept of a data frame built into its core rather than bolted on after the fact, which is extremely useful. Normally data frames are just plain rectangles, but there’s no reason why a given column can’t itself be a list of something else. That’s what we have here. The var_yrtab column contains data frames that cross-tabulate the answers to each question by year. Except where it doesn’t (e.g. id), and this is fine because lists don’t have to be internally homogeneous. In the same way, yrballot_df holds a little table showing which ballots, or internal portions of the survey, a question was asked on in each year it was asked.
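
To make the idea concrete, here is a sketch of a list column in base R. The mini_doc object is a hypothetical miniature, not the real gss_doc, and the counts in it are made up for illustration.

```r
# A hypothetical miniature of gss_doc; the counts below are made up.
mini_doc <- data.frame(variable = c("wrkstat", "id"), stringsAsFactors = FALSE)

# A list column: one element per row, and each element can be anything.
mini_doc$var_yrtab <- list(
  data.frame(year = c(1972, 1973), working = c(10, 12), not_working = c(5, 6)),
  NA  # no cross-tabulation for id, which is fine in a list
)

# Pull out the little table stored in the wrkstat row.
tab <- mini_doc$var_yrtab[[which(mini_doc$variable == "wrkstat")]]
tab

# Check which rows actually hold a data frame.
has_table <- vapply(mini_doc$var_yrtab, is.data.frame, logical(1))
has_table
```

The outer table still has one row per variable; the inner table simply rides along in its row as a single list element.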

The result is that, once the gss_doc object is assembled, we can use it to generate nearly seven thousand pages of documentation on the GSS’s many, many questions over the years. We can build them as standard R help pages, like the one above. On the website that pkgdown builds for us, we get this:

Website display.

New in this version are the cross-references to other relevant variables in the ‘See also’ section. These draw on the GSS’s own information on survey modules and an ad hoc subject index they maintain for the variables. I’m only using a subset of the possible cross-references because, for example, we don’t want every single question in the GSS core to be cross-referenced to every other core question on a given help page. On the website, I collect these on a single page:

Subject index page.

The GSS has its own handy Data Explorer, which is very useful for quickly checking certain trends and getting a quick graph of what the data looks like, or a summary overview of the contents of certain variables. Every gssrdoc help page now links to the GSS Data Explorer page for that variable, in case you want to go there and take a look. Of course, the gssrdoc package is not intended to replace the Data Explorer; it’s just a different view of the same information, with a different use case in mind.
