Open REDATAM one year after publication | R bloggers

First of all, I would like to apologize for the lack of updates on the blog. My financing problems persist; I’m currently working two part-time jobs on top of the heavy burden of the PhD, and I’m barely managing to pay for rent and food. I appreciate the patience of those reading this space.

Exactly one year ago, Lital Barkai and I published our article about Open REDATAM. The path to publication was curious: the article was rejected by many journals that publish in Spanish and have a regional focus. The “paradox” is that it was ultimately accepted by Data & Policy, from Cambridge University Press (Q1), a journal without a regional focus that ranks much higher in the academic indexes than the journals that initially rejected it.

During this time we have received many emails from users around the world who are grateful for the software. We often receive questions such as: “I have this 1960 census and I cannot export a certain table with the original software or with Open REDATAM, what can I do?”. Fortunately, we were able to help most users resolve their issues and access their historical data.

Furthermore, I am pleased to report that, at the end of this first year, the formal review of the software at rOpenSci will continue, which will provide an additional guarantee of the quality and robustness of our work.

Article summary

For those who haven’t read “The REDATAM format and its challenges for data access and information creation in public policy”, here is a detailed summary of the central arguments.

What is REDATAM and what is the problem?

REDATAM (Retrieval of Data for Small Areas by Microcomputer) is ECLAC’s standard format for the dissemination of census microdata in Latin America. Countries such as Argentina, Chile, Colombia and Mexico have been using it for decades. The central problem is that REDATAM is a closed binary format: it has no official public specification, its files cannot be opened with a text editor the way a CSV can, and the official software (REDATAM R+SP) does not allow statistical analysis beyond simple tables. This means that it is not possible to perform regressions, hypothesis tests or advanced visualizations directly from REDATAM.

Data versus information

In the article we make an important conceptual distinction: data are raw facts, while information is processed data that can be used to make decisions. REDATAM creates a bottleneck in this transformation. Governments have already paid the costs of producing census data; using closed formats to distribute them prevents NGOs, advocacy groups and technical teams from converting them into useful information for designing public policies. The cost to governments of publishing in open formats would be marginal, but the benefit for citizen participation and evidence-based public policy would be enormous.

Security through obscurity and privacy risks

Another serious problem is that the lack of an official specification amounts to what in computer security is called ‘security by obscurity’: relying on the secrecy of the format as a protection mechanism. This practice is generally discouraged as it is only a matter of time before the format is reverse engineered, as was the case with the DVD format. Furthermore, when reviewing the 2011 Uruguay census, available on the ECLAC website, we found that the file was labeled “for internal use of the INE Uruguay,” suggesting that data that is supposed to be confidential could be inadvertently made public. The right solution is not the secrecy of the format, but the encryption and anonymization of individual data.

Incompatibility with modern tools

REDATAM is not compatible with R, Python, Excel, SPSS or Stata. To export variables you must use a graphical interface that is slow and does not scale well when multiple filters or variables are needed. The same software includes an SQL-like query language, but it does not allow statistical testing either. This severely limits the quantitative analysis that is now standard in the social sciences, economics and political science.

Open REDATAM as a solution

To solve this, we developed Open REDATAM, a cross-platform tool (Linux, Mac and Windows) written in C++ that converts REDATAM files to CSV. We built on the previous work of Pablo de Grande, who had created a C# converter for Windows, and rewrote it in C++ for complete portability. In addition, we created R and Python packages that allow data to be read directly in those environments without going through the command line, following Tidy Data principles. The software is licensed under the Apache License, which permits commercial use and derivative works as long as attribution of authorship is maintained.

Validation with IPUMS

To verify that our tool extracts the data correctly, we compared our results with those of the IPUMS International service, which provides harmonized census microdata for multiple countries. We did this for Bolivia (2012), Chile (2017), Dominican Republic (2002), Ecuador (2010), El Salvador (2007), Peru (2017) and Uruguay (2011). The observed differences are explainable: IPUMS works with a 10% sample and applies its own cleaning and harmonization processes, while Open REDATAM reads the data as distributed by each government. The results are consistent and provide confidence in the accuracy of the software.
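
To illustrate why a full count and a 10% sample can differ slightly while still being consistent, here is a minimal simulation in R. It is only a sketch: the population is simulated, the region codes are invented, and it assumes a simple random sample with an expansion weight of 10, which is a simplification and not a description of how IPUMS actually draws or weights its samples.

```r
library(dplyr)

set.seed(2017)

# Simulated "full count": 100,000 people in 5 made-up regions
# (hypothetical data, not taken from any census)
population <- tibble(
    region = sample(sprintf("%02d", 1:5), size = 100000, replace = TRUE)
)

# Totals from the full count
full_count <- count(population, region, name = "n_full")

# A 10% simple random sample, expanded with a weight of 10
sample_10 <- slice_sample(population, prop = 0.1)

estimate <- sample_10 %>%
    count(region, name = "n_sample") %>%
    mutate(n_estimated = n_sample * 10)

# Relative difference between the expanded sample and the full count
comparison <- full_count %>%
    inner_join(estimate, by = "region") %>%
    mutate(rel_diff = (n_estimated - n_full) / n_full)

print(comparison)
```

In a simulation like this, the relative differences per region are small but not zero, which is the kind of discrepancy one would expect from sampling alone, before any cleaning or harmonization is applied.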

Preservation of historical data

Another issue mentioned in the article is long-term storage. For example, the 2001 Argentine census installer no longer works on Windows 10, but it does work on Ubuntu 22.04 with Wine. As closed formats and proprietary software become outdated, historical census data may become inaccessible. This reinforces the need for standardized open specifications, such as those for XLSX (ISO 29500), which allow multiple tools to read the same format without being dependent on a single software vendor.

Example of use

For a particular census, such as the 2017 Chilean census, just run:

library(redatam)

chl17 <- read_redatam("input-dir/CPV2017-16.dicx")

This returns a list of data frames, one for each hierarchical level of the census (region, state, municipality, household, person, etc.), which can then be combined with dplyr to obtain aggregated tables.

For example, in the specific case of the 2017 Chilean census, we could obtain the number of people per region with the following code:

library(dplyr)

chl17$zonas %>%
    # the first two characters of geocodigo identify the region
    mutate(region = substr(as.character(geocodigo), 1, 2)) %>%
    select(region, geocodigo, zonaloc_ref_id) %>%
    inner_join(
        # people per zone: group only by the zone key, so that
        # cant_per is summed rather than used as a grouping variable
        chl17$viviendas %>%
            group_by(zonaloc_ref_id) %>%
            summarise(cant_per = sum(cant_per, na.rm = TRUE), .groups = "drop"),
        by = "zonaloc_ref_id"
    ) %>%
    # people per region
    group_by(region) %>%
    summarise(cant_per = sum(cant_per, na.rm = TRUE), .groups = "drop")
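
To try the same aggregation without downloading a census, the sketch below rebuilds the two levels as small toy tables. The column names follow the example above, but the values, the geocodes and the object names (zonas, viviendas, people_per_zone, people_per_region) are invented for illustration and do not come from the Chilean census.

```r
library(dplyr)

# Hypothetical zone table: the first two characters of geocodigo
# play the role of the region code
zonas <- tibble(
    zonaloc_ref_id = 1:4,
    geocodigo = c("01101011001", "01101011002", "13101011001", "13101021001")
)

# Hypothetical dwelling table: one row per dwelling, with its number
# of people (cant_per) and the zone it belongs to
viviendas <- tibble(
    zonaloc_ref_id = c(1L, 1L, 2L, 3L, 3L, 4L),
    cant_per = c(3, 2, 4, 1, NA, 5)
)

# Step 1: people per zone (missing counts are ignored)
people_per_zone <- viviendas %>%
    group_by(zonaloc_ref_id) %>%
    summarise(cant_per = sum(cant_per, na.rm = TRUE), .groups = "drop")

# Step 2: attach the region to each zone and add up per region
people_per_region <- zonas %>%
    mutate(region = substr(as.character(geocodigo), 1, 2)) %>%
    select(region, zonaloc_ref_id) %>%
    inner_join(people_per_zone, by = "zonaloc_ref_id") %>%
    group_by(region) %>%
    summarise(cant_per = sum(cant_per, na.rm = TRUE), .groups = "drop")

print(people_per_region)
```

With these toy values, region "01" adds up to 9 people and region "13" to 6. The key used to link the two levels (zonaloc_ref_id here) is the part that changes from one census to another.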

For a more detailed explanation, including how to bridge levels and calculate indicators such as overcrowding, see the official package vignette.

Technical updates

The software was improved in various respects both before and after the publication of the article. The most important improvement is expanded support for historical censuses that are kept by universities and institutional archives and are not always easy to find on the Internet. This has been an iterative process of trial and error: as users send us files with older variants of the format, we identify patterns that differ from the most recent censuses and extend the code to cover them, without changing the already verified core logic. The result is that today Open REDATAM can read a significantly wider range of files than the original version could.

In addition, together with users we have organized a collection of census microdata converted to CSV, which is now publicly available at github.com/pachadotdev/redatam-microdata. The repository contains censuses from different countries and years that users themselves have sent us, and which can now be downloaded directly without the need to install additional software. We hope this will further lower the barrier to entry for researchers and technical teams with limited resources.

Finally, on the most immediate technical side, I migrated the C++ interface of the R package from cpp11 to cpp4r. This decision was primarily aimed at improving the portability of the code and facilitating its long-term maintenance.

It is also worth noting that the redatamx package was recently published on CRAN. It takes a different approach from ours: instead of re-implementing the reading of the format, it acts as an interface to the official REDATAM application, which must be installed beforehand. This makes it officially available for Ubuntu, but in practice I’ve had trouble getting it to work: it won’t run on my laptop with Manjaro, nor have I been able to get it to work from virtual machines with Ubuntu. Open REDATAM, on the other hand, does not rely on a separately installed application and works on any platform where a C++ compiler is available.

Institutional updates

On the institutional side, we have tried on several occasions to reach out to ECLAC, the United Nations agency that develops and maintains REDATAM, to explore possible avenues for collaboration or for integrating Open REDATAM into the official work of the organization. Unfortunately, we have not received a response to any of these messages to date. However, we want to state publicly that the invitation remains open: if anyone at ECLAC or a national statistical institute would like to contact us to discuss how we can work together, we are happy to do so.

If you enjoyed this article, please consider donating to support my open source work: https://buymeacoffee.com/pacha.

