2. Programming with R

Before diving into specific R techniques about managing social science research, this chapter will cover some general programming concepts that will help you make your R code more reusable. These techniques help you move away from the specifics of your data, and make generalizable solutions that you can then reapply to other data sets. This also prevents mistakes, where you copy-paste code and fail to make the necessary changes. But perhaps most importantly, it saves you time: these concepts allow you to do a lot of work with just a few simple building blocks of code.

This chapter will cover ways to iterate code (using either loops or the map() function), functions (which will let you easily reuse code snippets you often need), and making code more reusable with the box package.

Loops

Loops are a fundamental programming concept that allow you to repeat code multiple times without duplicating it. In R, a loop has the following structure:

for (i in 1:4){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4

This consists of two parts:

  • The for statement, which tells R to start a loop.
  • The code block within the curly braces {} that contains the code to be repeated.

The for statement should be read as: “for each element in the sequence 1 to 4, assign that element to the variable i, and then run the code block”. Therefore, this code is equivalent to:

print(1)
print(2)
print(3)
print(4)
## [1] 1
## [1] 2
## [1] 3
## [1] 4

But it’s much easier to maintain! Imagine having to do 86 iterations: you would need to copy paste the code 85 times every time you want to update something. Using loops, you don’t need to copy-paste a thing, leading to fewer errors.

While loops can be a perfectly fine way of doing repeated work, most R programmers prefer not to use loops, but instead use functions such as lapply() and map(). However, for this you will need to know how to write functions.

Functions

Basic Functions

A function is an R object that’s essentially a shortcut to run a bit of code. Usually, a function takes an argument (the input), and returns an output. The argument is always provided within parentheses. The sqrt() function, for example, takes a number as its argument and returns the square root of that number.

sqrt(4)
## [1] 2

The beauty of functions is that they allow you to define a set of operations in one place, and re-use it as often as you want. If you then need to update the operations, all you need to do is update the function, instead of hunting down every instance of the operations throughout your script.

Let’s say I have a report, where I run the same model many times, each time with different variables:

library(tidyverse)

df <- tibble(y = rnorm(50),x1 = rnorm(50),x2 = rnorm(50), x3 = rnorm(50))

# lots of code in between...

lm(y ~ x1, data = df)

# lots of code in between...

lm(y ~ x2, data = df)

# lots of code in between...

lm(y ~ x3, data = df)

# lots of code in between...

lm(y ~ x1 + x2 + x3 , data = df)
## 
## Call:
## lm(formula = y ~ x1, data = df)
## 
## Coefficients:
## (Intercept)           x1  
##    -0.01446     -0.16611  
## 
## 
## Call:
## lm(formula = y ~ x2, data = df)
## 
## Coefficients:
## (Intercept)           x2  
##    -0.03567     -0.09095  
## 
## 
## Call:
## lm(formula = y ~ x3, data = df)
## 
## Coefficients:
## (Intercept)           x3  
##    -0.03327      0.01276  
## 
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = df)
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##    -0.01480     -0.17073     -0.08891      0.02352

If I now want to slightly change the model (e.g. use robust standard errors), I would have to find the places the model is used, and change it each time. Instead, I can put the model code in a function:

my_model <- function(independent_variable){

  results <- lm(formula = reformulate(independent_variable, response = "y"), data = df)

  return(results)
}

This has the following structure: - The function statement tells R to create a function. It also list the arguments the function takes. In this case just one: independent_variable, which should be passed a string (that reformulate() can convert to a formula). - The code block within the curly braces {} contains the code that will be run when the function is called. It ends with the return() function, which tells R what to return as output of the function. This will be printed to to the console, or can be assigned to an object.

I can then run my model four times:

my_model("x1")

# lots of code in between...

my_model("x2")

# lots of code in between...

my_model("x3")

# lots of code in between...

my_model(c("x1","x2","x3"))
## 
## Call:
## lm(formula = reformulate(independent_variable, response = "y"), 
##     data = df)
## 
## Coefficients:
## (Intercept)           x1  
##    -0.01446     -0.16611  
## 
## 
## Call:
## lm(formula = reformulate(independent_variable, response = "y"), 
##     data = df)
## 
## Coefficients:
## (Intercept)           x2  
##    -0.03567     -0.09095  
## 
## 
## Call:
## lm(formula = reformulate(independent_variable, response = "y"), 
##     data = df)
## 
## Coefficients:
## (Intercept)           x3  
##    -0.03327      0.01276  
## 
## 
## Call:
## lm(formula = reformulate(independent_variable, response = "y"), 
##     data = df)
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##    -0.01480     -0.17073     -0.08891      0.02352

I don’t need to touch these anymore if I want to change something in the model. I only need to update the function definition.

Note that any variables we created in the function are kept within the function, and you can’t access them later:

model <- my_model("x3")

results
## Error: object 'results' not found

This keeps your working environment nice and clean, which again prevents problems.

Above I used the return function. That was purely to to make it explicit what the output is for explanation purposes, it is not actually needed. Anything on the last line of the function that would normally be returned to the console, is used a return value of the function:

my_model <- function(independent_variable){

  lm(formula = reformulate(independent_variable, response = "y"), data = df)

}

my_model("x1")
## 
## Call:
## lm(formula = reformulate(independent_variable, response = "y"), 
##     data = df)
## 
## Coefficients:
## (Intercept)           x1  
##    -0.01446     -0.16611

Functions and the tidyverse

Tidyverse can be a bit tricky when writing functions, because the unquoted variable names often break stuff.

Let’s say I want to have standard scatter plots like this:

df %>% 
  ggplot(aes(y = y, x = x1)) +
  geom_point()

But I don’t want to copy all that, so I make a function:

my_plot <- function(x_variable){
  df %>%
    ggplot(aes(y = y, x = x_variable)) +
    geom_point()
}

my_plot(x1)
## Error in `geom_point()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error:
## ! object 'x1' not found

Why can’t it find x1? It’s right there in the data frame! But when evaluating my_plot(x1), R doesn’t look for x1 in the dataframe (why would it?), it looks for x1 in the global environment. So we have to make sure R waits until it’s inside ggplot() to look for x1 in the data frame. This is actually quite easy to do, simply wrap the variable name in {{ }} in the function body:

my_plot <- function(x_variable){
  df %>%
    ggplot(aes(y = y, x = {{ x_variable }})) +
    geom_point()
}

my_plot(x1)

However easy that solution was, the explanation behind it is a bit more complex, see here. If that’s all black magic to you, just remember that if you want to use unquoted variable names in your functions, wrap them in {{ }} in the function body.

Map()

map() works just like loop, but generally faster. It takes two arguments:

  • An iterable object, like a vector or list
  • A function

It will then apply the function to each element of the object. It will return a list with the results of each iteration.

For example, to get the the square root for each of the numbers in a vector:

vector <- c(4,9,16) 
map(vector,sqrt)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 4

You may not quite like the fact that map returns a list, but quite a few things in R use lists. modelsummary() for example. Suppose I want to put all the results from my_model() above in a nice table. Suppose I want to regress the same y on a number of combinations of independent variables, I can put the combination of variables in a list, and then run my model for each element of the list.

library(modelsummary)


# define a list of models: 
# the left-hand sides are the labels
# the right-hand sides the independent variables I will pass to reformulate.

models <-
list("X1 only" = "x1",
     "X2 only" = "x2",
     "X3 only" = "x3",
     "Full" = c("x1", "x2", "x3"))


# map() works nicely in  dplyr pipe!

models %>%
  map(my_model) %>%
  modelsummary(output = "flextable")

X1 only

X2 only

X3 only

Full

(Intercept)

-0.014

-0.036

-0.033

-0.015

(0.129)

(0.129)

(0.130)

(0.131)

x1

-0.166

-0.171

(0.136)

(0.139)

x2

-0.091

-0.089

(0.132)

(0.135)

x3

0.013

0.024

(0.127)

(0.130)

Num.Obs.

50

50

50

50

R2

0.030

0.010

0.000

0.041

R2 Adj.

0.010

-0.011

-0.021

-0.021

AIC

135.6

136.6

137.1

139.0

BIC

141.3

142.4

142.9

148.6

Log.Lik.

-64.792

-65.318

-65.557

-64.513

F

1.503

0.471

0.010

0.657

RMSE

0.88

0.89

0.90

0.88

If you want to specify more arguments to the function you are using in map(), you can do so by using an anonymous function.

Let’s say I want to calculate the mean of a number of vectors, but I want to remove missing values.

vectors <- list(c(1,NA,3,-99,5),
               c(10,20,-99,30,40,50),
               c(-99,NA,200,300))

# the ~ creates a purrr-style anonymous function, where .x is current element of the list
vectors %>% 
  map(mean)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] 8.5
## 
## [[3]]
## [1] NA

That didn;t work, since mean doesn’t remove NAs by default. I can make a new function that does remove NAs by defaults:

mean_narm <- function(x){
  mean(x, na.rm = TRUE)
}

vectors %>% 
  map(mean_narm)
## [[1]]
## [1] -22.5
## 
## [[2]]
## [1] 8.5
## 
## [[3]]
## [1] 133.6667

Or, if I don’t want this new function cluttering up my environment, I can make an anonymous function on the fly:

vectors %>% 
  map(function(x) mean(x, na.rm = TRUE))
## [[1]]
## [1] -22.5
## 
## [[2]]
## [1] 8.5
## 
## [[3]]
## [1] 133.6667

Note it will be quite difficult to use map() with data frame columns to edit multiple variables. For this, use across(), as seen in the data cleaning chapter

Other variants of map()

map() is an entire family of functions. One that I like is pmap(), which allows you to iterate over a number of things at the same time.

Suppose, like before I want to run a model a few times, but now I want to not only change one thing (like the control), but more things, like the dependent variable.

First, I put the parameters of each specification in a data frame like this:

specs <- 
tribble(
  ~label, ~depvar, ~controls,
  "X1 only" ,"y", "x1",
  "X2 only" ,"y", "x2",
   "X3 only" , "y","x3",
  "Full" ,"y", c("x1", "x2", "x3")
)

I can print this in my markdown file, so it’s easily seen what I am doing:

library(flextable)
specs %>% rowwise %>% mutate(controls = paste(controls, collapse = ",")) 
## # A tibble: 4 × 3
## # Rowwise: 
##   label   depvar controls
##   <chr>   <chr>  <chr>   
## 1 X1 only y      x1      
## 2 X2 only y      x2      
## 3 X3 only y      x3      
## 4 Full    y      x1,x2,x3
  mutate(
    controls = map_chr(controls, ~ paste(.x, collapse = ","))
  ) %>% 
  flextable()
## Error: object 'controls' not found

The mutate bit there is a bit complex. The “controls” column contains lists, and those don’t print nicely: it will just say “[[character]]”. So I need to convert the lists to a string: paste(controls, collapse = ",") does that, but it will take the entire column; and I want it for each cell individually. To do that, I can either use rowwise(), or pass the column to map(). The map_chr() variant is especially handy, as it outputs a vector of strings (rather than a list), which is what we need for mutate to work.

Anyway, back to pmap(). The code blow throws my spec in pmap(), which will then iterate row-wise through the dataframe. For each row, it passes the columns as arguments to the function inside pmap:

specs %>% 
  pmap(
    function(depvar, controls, ... ){
      lm(reformulate(controls, response = depvar), data =df)  
    }
  ) %>%
  set_names(specs$label) %>% 
  modelsummary(output = "flextable", stars = TRUE)

X1 only

X2 only

X3 only

Full

(Intercept)

-0.014

-0.036

-0.033

-0.015

(0.129)

(0.129)

(0.130)

(0.131)

x1

-0.166

-0.171

(0.136)

(0.139)

x2

-0.091

-0.089

(0.132)

(0.135)

x3

0.013

0.024

(0.127)

(0.130)

Num.Obs.

50

50

50

50

R2

0.030

0.010

0.000

0.041

R2 Adj.

0.010

-0.011

-0.021

-0.021

AIC

135.6

136.6

137.1

139.0

BIC

141.3

142.4

142.9

148.6

Log.Lik.

-64.792

-65.318

-65.557

-64.513

F

1.503

0.471

0.010

0.657

RMSE

0.88

0.89

0.90

0.88

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Notes:

  • The column names of spec will be passed to the function as arguments.
  • If spec has columns that my function does not accept as arguments, it will throw an error. The ... prevents this, as that captures all unused arguments.

This particular function uses only 2 arguments, so map2() would have worked fine here as well, but this can be more easily expanded to more complex things than map2().

Making functions re-usable with box

Now that you have a bunch of functions, it’s a good idea to make them re-usable across projects. The box package makes this easy.

You define your functions in a script, say box_modules/functions.R:

prop_miss <- function(x,na.rm = TRUE) {
  box::use(r/core[...])  
  mean(is.na(x))
}

This is just a function that returns the proportion of missing values in a vector. box::use(r/core[...]) loads all functions from the core package, so you can use them in your function (box modules don’t load anything you don’t tell them to).
You can then use this function in any project loading the entire script:

box::use(./box_modules/functions)

This looks in the folder where your script is located for a folder called box_modules, and loads all functions in the functions.R script.

You can run the function by:

functions$prop_miss(c(1,2,NA,4))
## [1] 0.25

Box even allows you to load packages:

box::use(dplyr)


dplyr$starwars |>
  dplyr$filter(species == "Droid")
## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

In fact, when writing your own box modules, this is the only way to load packages (hence thr box::use(r/core[...] in the function above). Note that you have to use the $ operator to access functions and data from the package or module you loaded. If you dont want that you can also attach specific functions:

box::use(dplyr[filter, `%>%`])


dplyr$starwars %>%
  filter(species == "Droid")
## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Or just attach everything just like you would with library():

box::use(dplyr[...])

starwars %>%
  filter(species == "Droid")
## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Alright, let’s unload our own function, so we can load it in another way, using:

box::unload(functions)

Alternatively, you can specify a search path, where box will look for modules. I will put my box modules in C:/Users/koen/documents/box_modules.

The following code will allow box to look in that folder for modules:

options('box.path' = "C:/Users/koen/documents/") # <-- the parent folder!!

Note that I have used the parent folder of box_modules, because for whatever reason you have to do that (the .R files need to be in a subfolder). By adding this to your [.Rprofile](https://docs.posit.co/ide/user/ide/guide/environments/r/managing-r.html) file, it will be run any time you open R.

Now I can just use the following code in any project to access my functions:

box::use(box_modules/functions)

functions$prop_miss(c(1,2,NA,4,5,NA))
## [1] 0.3333333

Of course you are no longer working project-oriented: your box modules aren’t in your project folder!

If you want to use box modules within your project folder, best to add the root folder as search path:

options('box.path' = c(here::here(), getOption('box.path'))

The function getOption('box.path') gets your current box search path, and c() combines it with your project root folder returned by here().

One way I like to organize my box modules is to have them in a git repository and use it as a submodule in my projects. That way you can have a specific version or branch of all your box modules in every project. You can also merge changes you make to you box modules of one project, back to all your other active projects. It’s fantastic, but a bit finnicky to set up.