Introduction

This website is intended as a quick reference for some techniques that I think people may need to keep their code reproducible when cleaning, analyzing, or presenting data. The main audience is me, but I think it may be useful for other people as well, especially those doing work similar to mine: analyzing household survey data. It is not intended as a quick-start guide to R. For that, try the R for Social Science Data Carpentry Workshop, on which some of this website is based.

Why reproducibility?

My work is often one-off: I analyze the data, write the report or paper, and never look back. Why would I care about reproducibility then? First of all: most work isn’t that one-off. Parts of my code can always be reused in new projects. Making sure my code can easily be repurposed for other projects by changing a few parameters is thus worthwhile just because of the time it saves me. But this has an important positive side effect as well: it forces me to write code in a more general way that makes fewer assumptions about the data. Often I get some additional data after writing my code, or I make some changes to my cleaning script, and this breaks assumptions I made in my analysis code: perhaps a few observations are now dropped, a few outliers are now included, or a few variables are now missing. Any of these can change the results of my analysis, potentially without me noticing it! Because general code relies on fewer such assumptions, it is less likely to cause these errors.
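As a rough illustration of what "fewer assumptions" can look like, here is a minimal sketch in base R (used here only to keep the example free of dependencies). The data frame, the variable names, the helper `summarise_income()`, and the plausibility threshold are all made up for this example; they are not taken from any real survey.

```r
# Hypothetical household survey data; all names and values are invented.
survey <- data.frame(
  hh_id  = c(1, 2, 3, 4),
  income = c(1200, 950, NA, 40000)
)

# Fragile version: hard-codes which rows are "good".
# It silently gives wrong results as soon as rows are added or dropped.
# fragile_mean <- mean(survey$income[c(1, 2)])

# More general version: the assumptions become parameters and explicit checks.
summarise_income <- function(data, var = "income", max_plausible = 10000) {
  # Fail loudly if the input is not what we expect.
  stopifnot(is.data.frame(data), var %in% names(data))

  values <- data[[var]]
  is_outlier <- !is.na(values) & values > max_plausible
  keep <- !is.na(values) & !is_outlier

  # Report what was excluded, so changes in the data do not go unnoticed.
  list(
    mean       = mean(values[keep]),
    n_used     = sum(keep),
    n_missing  = sum(is.na(values)),
    n_outliers = sum(is_outlier)
  )
}

result <- summarise_income(survey)
```

Because the function returns how many observations were used, missing, or flagged as outliers, a change in the cleaning script shows up in the output instead of quietly changing the mean.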

Code written with reproducibility in mind is also easier to read and understand. This makes it easier for collaborators to join your project or review your work. This in turn improves the quality of your work, while reducing stress for you.

Without reproducibility, research and analysis can become opaque, error-prone, and difficult to validate. In short: you run the risk of sloppy science.

Packages used in this site

Throughout this site, I will mostly make use of tidyverse packages. While there are alternatives, such as data.table, which is faster, tidyverse makes writing and reading code easy and quick, which is important from a reproducibility perspective. Moreover, we are not so concerned with performance, since household survey datasets are almost never large enough for that to matter.

To provide easy access to the packages used in this site, as well as all data, everything is available on GitHub. So if you want all code and data, follow these steps:

  • Open RStudio
  • Click File, and select “New Project…”
  • Select Version Control, Git, and enter https://github.com/kleuveld/r_cheatsheet as the repository URL.
  • The Project will open.
  • In the console, type renv::restore() to install all packages.

(See how easy that was? Reproducible workflows, baby!)
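If you prefer the console over the RStudio menus, the same steps can be sketched roughly as follows. This is only a sketch under a few assumptions: git is installed and on your PATH, the renv package is already installed, and you have network access. The helper `setup_project()` and the destination folder are hypothetical names chosen for this example.

```r
# Console sketch of the RStudio steps above.
# Assumptions: git is on the PATH, renv is installed, network access is available.
repo_url <- "https://github.com/kleuveld/r_cheatsheet"
dest_dir <- file.path(tempdir(), "r_cheatsheet")  # hypothetical location; pick your own

setup_project <- function(url, dest) {
  # Clone the repository; system2() returns the exit status of the command.
  status <- system2("git", c("clone", url, shQuote(dest)))
  if (status != 0) stop("git clone failed with status ", status)

  # Install the package versions recorded in the project's lockfile.
  renv::restore(project = dest, prompt = FALSE)

  invisible(dest)
}

# Uncomment to run; this downloads the repository and installs packages.
# setup_project(repo_url, dest_dir)
```

The call at the end is left commented out on purpose, since it downloads the repository and may install many packages.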

Road map

The site is organized in two parts. The first covers advice on general best practices: project-oriented workflows, general R techniques that facilitate reproducibility (such as loops and functions), and RMarkdown. The chapters in part two each cover a part of the research cycle: design, data cleaning, and data analysis (including reporting the results). For each, there are specific code examples using survey data.