Data Exploration

The first step in working with data is exploring it. I use Stata data here because its variables carry labels, which makes it easy to get a sense of the data once you know how R handles labels.

Setup

Download the cars data set from the URL in the code below, or simply run the code:

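library(here)

# Create the data folder if needed, then download the file in binary mode
dir.create(here("data"), showWarnings = FALSE)
download.file(
  "https://raw.githubusercontent.com/kleuveld/r_cheatsheet/main/data/cars.dta",
  here("data/cars.dta"), mode = "wb"
)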
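If you re-run a script often, it can be handy to skip the download when the file is already on disk. The sketch below uses only base R; download_if_missing() is a hypothetical helper name of my own, not part of any package.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` does not exist yet.
# Returns TRUE (invisibly) if a download happened, FALSE otherwise.
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(invisible(FALSE))
  }
  # Make sure the target folder exists before writing to it
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  download.file(url, dest, mode = "wb")
  invisible(TRUE)
}

# Example call (not run here, since it needs a network connection):
# download_if_missing(
#   "https://raw.githubusercontent.com/kleuveld/r_cheatsheet/main/data/cars.dta",
#   "data/cars.dta"
# )
```

mode = "wb" is kept because .dta files are binary; without it the file can end up corrupted on Windows.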

Making a codebook from a Stata .dta

In Stata, variables have labels, which is great because they’re more informative than variable names. In R, accessing the labels of an imported .dta file can be a bit tricky, but making a codebook isn’t that hard.

First, load the cars data set:

library(tidyverse)
library(haven)
library(here)


cars <- read_dta(here("data/cars.dta"))

The variable labels are stored as attributes of the variables. The attributes() function returns all attributes:

attributes(cars$mpg)
## $label
## [1] "miles per gallon"
## 
## $format.stata
## [1] "%9.0g"

To see only the label use:

attributes(cars$mpg)$label
## [1] "miles per gallon"
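A single attribute can also be read directly with base R's attr(). The sketch below does not need the cars file: it builds a toy vector and attaches a "label" attribute by hand, which is an assumption standing in for what read_dta() does on import.

```r
# A plain numeric vector with a hand-made "label" attribute (toy values)
mpg <- c(18, 15, 36.1)
attr(mpg, "label") <- "miles per gallon"

# exact = TRUE switches off partial matching of attribute names
mpg_label <- attr(mpg, "label", exact = TRUE)
mpg_label

# A vector without the attribute gives NULL rather than an error
no_label <- attr(c(1, 2, 3), "label", exact = TRUE)
is.null(no_label)
```

The NULL case matters later: a function that expects a character label needs a fallback for unlabelled variables.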

To create a data frame with all variable labels, we can extract the label attribute from every variable using map_chr() from the purrr package:

codebook <- 
  tibble(var = colnames(cars),
         label = map_chr(cars,~attributes(.x)$label)) 

codebook
## # A tibble: 4 × 2
##   var   label                              
##   <chr> <chr>                              
## 1 mpg   miles per gallon                   
## 2 cyl   number of cylinders                
## 3 eng   engine displacement in cubic inches
## 4 wgt   vehicle weight in pounds

To make it slightly more useful, we can add some summary statistics. I can apply a list of functions to a single variable using map_dbl(), which returns a named vector:

list_of_functions <- list(mean=mean,sd=sd,min=min,max=max)

list_of_functions %>%
  map_dbl(~.x(cars$mpg, na.rm = TRUE))
##      mean        sd       min       max 
## 23.445918  7.805007  9.000000 46.599998
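The formula shorthand can hide what is going on: map_dbl() loops over the list of functions, and .x is the function being called on each pass. Here is the same idea spelled out with an explicit anonymous function, on a toy vector (the values are made up for illustration):

```r
library(purrr)

x <- c(1, 2, 3, NA)
funs <- list(mean = mean, sd = sd, min = min, max = max)

# `f` takes the role of `.x`: each element of `funs` is called on `x` in turn
res <- map_dbl(funs, function(f) f(x, na.rm = TRUE))
res
```

The names of the result come from the names of the list, which is why the list is written as mean=mean rather than just mean.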

To do this for every column in a data frame, I wrap the code above in a function and use map() to apply that function to each column.

stats_to_tibble <- function(var,funs) {
  funs %>%
    map_dbl(~ifelse(is.numeric(var),.x(var,na.rm = TRUE),NA)) %>%
    as_tibble_row()
}

summ_stats <- 
  cars %>%
    map(~stats_to_tibble(.x,list_of_functions)) %>%
    list_rbind()
summ_stats
## # A tibble: 4 × 4
##      mean     sd   min    max
##     <dbl>  <dbl> <dbl>  <dbl>
## 1   23.4    7.81     9   46.6
## 2    5.47   1.71     3    8  
## 3  194.   105.      68  455  
## 4 2978.   849.    1613 5140
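All columns in cars are numeric, so the is.numeric() check never kicks in above. The sketch below shows what I would expect for a non-numeric column, using a made-up tibble and a variant of the helper that uses a plain if/else instead of ifelse(); stats_or_na() is a name I picked for illustration.

```r
library(tidyverse)

funs <- list(mean = mean, max = max)

# Variant of stats_to_tibble(): returns NA for columns that are not numeric
stats_or_na <- function(var, funs) {
  funs %>%
    map_dbl(function(f) if (is.numeric(var)) f(var, na.rm = TRUE) else NA_real_) %>%
    as_tibble_row()
}

# Toy data: one numeric column with a missing value, one character column
toy <- tibble(x = c(1, 2, NA), name = c("a", "b", "c"))

res <- toy %>%
  map(function(col) stats_or_na(col, funs)) %>%
  list_rbind()
res
```

The character column gets a row of NA values, so the result still has one row per column and lines up with the codebook.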

I can bind that with the codebook I had before to get a nice overview of all the variables in my dataset:

bind_cols(codebook, summ_stats)   
## # A tibble: 4 × 6
##   var   label                                  mean     sd   min    max
##   <chr> <chr>                                 <dbl>  <dbl> <dbl>  <dbl>
## 1 mpg   miles per gallon                      23.4    7.81     9   46.6
## 2 cyl   number of cylinders                    5.47   1.71     3    8  
## 3 eng   engine displacement in cubic inches  194.   105.      68  455  
## 4 wgt   vehicle weight in pounds            2978.   849.    1613 5140
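bind_cols() pairs rows purely by position, which is fine here because both tables were built from the same columns in the same order. If you would rather match on the variable name, one option is to keep the name while row-binding and then join. This is a sketch on made-up data, assuming purrr 1.0 or later for list_rbind():

```r
library(tidyverse)

# Toy data standing in for `cars` (hypothetical values)
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 60))
funs <- list(mean = mean, max = max)

# names_to = "var" stores each column name next to its statistics
stats <- toy %>%
  map(function(col) as_tibble_row(map_dbl(funs, function(f) f(col, na.rm = TRUE)))) %>%
  list_rbind(names_to = "var")

# Labels deliberately listed in a different order than the columns
labels <- tibble(var = c("b", "a"), label = c("second", "first"))

joined <- left_join(labels, stats, by = "var")
joined
```

Because the join is by name, the statistics end up on the right row even though the label table is ordered differently.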

Here’s a reusable function that adds more columns, handles empty labels (using coalesce()) and rounds the output so it’s human-readable:

create_codebook <- function(.df,stats = list(mean=mean,sd=sd,min=min,max=max,
                                            prop_miss=prop_miss)) {
  labels <- tibble(var = colnames(.df),
                   label = map_chr(.df,function(x) coalesce(attributes(x)$label,"")),
                   type = map_chr(.df, typeof))


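  # Share of missing values. It is already referenced in the default for
  # `stats`; that works because R evaluates default arguments lazily.
  # na.rm is accepted (and ignored) so the call matches the other statistics.
  prop_miss <- function(x,na.rm = TRUE) {
    mean(is.na(x))
  }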

  stats_to_tibble <- function(var,stats) {
    map_dbl(stats,~ifelse(is.numeric(var),.x(var,na.rm = TRUE),NA)) %>%
    as_tibble_row()
  }

  sumstats <-
    .df %>%
    map(~stats_to_tibble(.x,stats)) %>%
    list_rbind() %>%
    mutate(across(where(is.numeric),
                ~round(.x,2)))  

  bind_cols(labels,sumstats)
}

create_codebook(cars) 
## # A tibble: 4 × 8
##   var   label                         type    mean     sd   min    max prop_miss
##   <chr> <chr>                         <chr>  <dbl>  <dbl> <dbl>  <dbl>     <dbl>
## 1 mpg   miles per gallon              doub… 2.35e1   7.81     9   46.6         0
## 2 cyl   number of cylinders           doub… 5.47e0   1.71     3    8           0
## 3 eng   engine displacement in cubic… doub… 1.94e2 105.      68  455           0
## 4 wgt   vehicle weight in pounds      doub… 2.98e3 849.    1613 5140           0
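The cars data has no missing values, so prop_miss is 0 everywhere. To see what the helper does when values are missing, here it is on its own with a toy vector:

```r
# Same helper as inside create_codebook(): is.na() gives TRUE/FALSE,
# and the mean of a logical vector is the share of TRUE values
prop_miss <- function(x, na.rm = TRUE) {
  mean(is.na(x))
}

# Two of the four values are missing
share <- prop_miss(c(1, NA, 3, NA))
share
```

Note that inside create_codebook() every statistic is guarded by is.numeric(), so as written the missing share of a character column is reported as NA rather than computed.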

Correlogram

Another great data exploration tool is the correlogram, which displays the correlations between many variables. To create one, I use ggpairs() from the GGally package:

library(GGally)

ggpairs(cars) 
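If you want the numbers behind the plot, base R's cor() returns the correlation matrix directly. The sketch below uses R's built-in mtcars data as a stand-in so it runs without the downloaded file; its columns only roughly correspond to those in cars.dta.

```r
# Pick four numeric columns from the built-in mtcars data
num_vars <- mtcars[, c("mpg", "cyl", "disp", "wt")]

# pairwise.complete.obs uses all available pairs when some values are missing
cor_mat <- round(cor(num_vars, use = "pairwise.complete.obs"), 2)
cor_mat
```

The matrix is symmetric with ones on the diagonal, so only one triangle needs reading.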

You can also split the correlogram by a variable, as I do with the number of cylinders below:

cars %>%
  ggpairs(columns = c(1,3,4), 
          ggplot2::aes(colour=factor(cyl)))
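A numeric counterpart to the split correlogram is to compute a correlation within each group. This is a base R sketch, again using the built-in mtcars data as a stand-in for the cars file:

```r
# One data frame per number of cylinders
by_cyl <- split(mtcars, mtcars$cyl)

# Correlation between fuel economy and weight within each group
cor_by_cyl <- sapply(by_cyl, function(d) cor(d$mpg, d$wt))
round(cor_by_cyl, 2)
```

The result is a named vector with one correlation per cylinder count, which makes it easy to check whether a relationship seen in the full data also holds within groups.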