Programming with R
You can make your R code more reproducible by using programming features such as loops and functions. This prevents errors, saves you time and allows you to share your code more easily.
Loops
Loops repeat code a number of times. They have the following structure:
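A minimal sketch of that structure, assuming the vector 1:4 and a print() call in the body (both chosen to match the output shown here):

```r
# i takes each value of 1:4 in turn; the body runs once per value
for (i in 1:4) {
  print(i)
}
```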
## [1] 1
## [1] 2
## [1] 3
## [1] 4
The i is the iterator:
in each iteration of the loop it takes on the value of a different element of
the object that follows the in;
in this case, i takes on 1, 2, 3 and finally 4 as its values.
Therefore, this code is equivalent to:
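Written out by hand, and assuming the loop body is a single print() call, that would look something like:

```r
# one line per iteration, with the value of i filled in manually
print(1)
print(2)
print(3)
print(4)
```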
## [1] 1
## [1] 2
## [1] 3
## [1] 4
But it’s much easier to maintain! Imagine having to do 86 iterations: you would need to copy-paste the code 85 times every time you want to update something. Using loops, you don’t need to copy-paste a thing, leading to fewer errors.
Although loops are a perfectly fine way of doing repeated work, many R
programmers prefer not to use loops, but instead use functions such as
lapply() and map().
To use those, however, you first need to know how to write functions.
Functions
Basic Functions
A function is an R object that’s essentially a shortcut to run a bit of code.
Usually, a function takes an argument (the input), and returns an output. The
argument is always provided within parentheses. The sqrt() function, for
example, takes a number as its argument and returns the square root of that number.
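For instance, assuming 4 as the input (which matches the output shown here):

```r
# the argument goes inside the parentheses; the result is returned
root <- sqrt(4)
root
```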
## [1] 2
The beauty of functions is that they allow you to define a set of operations in one place, and re-use it as often as you want. If you then need to update the operations, all you need to do is update the function, instead of hunting down every instance of the operations throughout your script.
Take for example the following code that cleans missing data.
library(tidyverse) # needed for if_else()
# define an example variable
variable1 <- c(1, 2, 3, -99, 5, 6)
# change missing
variable1_cleaned <- if_else(variable1 == -99, NA, variable1)
variable1_cleaned
## [1] 1 2 3 NA 5 6
This works, but I need to do this at many different places in my code, and when copying the code to other places, it’s easy to make small mistakes. Moreover, if I notice my procedure for cleaning missing data is wrong, I will have to hunt down all the places I’ve copied this code to, and change it (which, again, is an error-prone process).
A common piece of advice is to put any code that you have to copy-paste more than twice in a function. Once I do that, I can just call the function wherever I need to clean data; if I want to change how I deal with missings, I now only have to make changes in one place:
clean_missing <- function(input_data) {
  returned_data <- if_else(input_data == -99, NA, input_data)
  return(returned_data)
}
I’ve pretty much copied the code I had above. Note that I could have used any
name for my argument (defined in the parentheses after function); I just thought
input_data made sense. I use the return() function to return data.
I can now just use clean_missing() wherever I want:
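A self-contained sketch of that call, repeating the definitions from above so it runs on its own. It loads dplyr rather than the whole tidyverse and uses NA_real_ instead of NA; both are small assumptions on my part, made so the sketch also works on older dplyr versions where if_else() is strict about types:

```r
library(dplyr)  # provides if_else()

# example variable, with -99 as the missing-data code
variable1 <- c(1, 2, 3, -99, 5, 6)

clean_missing <- function(input_data) {
  # NA_real_ is a numeric NA, so both branches have the same type
  if_else(input_data == -99, NA_real_, input_data)
}

cleaned <- clean_missing(variable1)
cleaned
```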
## [1] 1 2 3 NA 5 6
Note that any variables we created in the function are kept within the function, and you can’t access them later:
## Error: object 'returned_data' not found
This keeps your working environment nice and clean, which again prevents problems.
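A small base-R sketch of the same idea, using a hypothetical helper add_one() and a made-up variable name local_total, neither of which is part of the code above:

```r
# local_total only exists while add_one() is running
add_one <- function(x) {
  local_total <- x + 1
  local_total
}

result <- add_one(1)

# the function's result is available, its internal variable is not
local_total_visible <- exists("local_total")
local_total_visible
```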
Above I used the return() function. This is not needed: the value of the last expression evaluated in a function (whatever would normally be printed to the console) is used as the return value of the function:
clean_missing <- function(input_data) {
  if_else(input_data == -99, NA, input_data)
}
clean_missing(variable1)
## [1] 1 2 3 NA 5 6
Functions and the tidyverse
In the previous function I passed a vector to clean. What if I want to use it in a data cleaning pipe?
It’s useful to start with the code without a function, and then to generalize from there:
library(tidyverse)
data_frame <- tibble(variable1 = variable1)
data_frame %>%
  mutate(variable1 = if_else(variable1 == -99, NA, variable1))
## # A tibble: 6 × 1
## variable1
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 NA
## 5 5
## 6 6
Let’s try to put that in a function. The function will need to take a data frame and a variable name as its arguments. The result of one step in the pipe is passed as the first argument to the next step, so make sure the data is the first argument!
clean_missing_df <- function(input_dataframe, variablename){
  input_dataframe %>%
    mutate(variablename = if_else(variablename == -99, NA, variablename))
}
data_frame %>%
  clean_missing_df(variable1)
## # A tibble: 6 × 2
## variable1 variablename
## <dbl> <dbl>
## 1 1 1
## 2 2 2
## 3 3 3
## 4 -99 NA
## 5 5 5
## 6 6 6
Wait, something went wrong! R just created a new variable called “variablename”.
The technical reason is that mutate() takes the name to the left of = literally.
In short, we want
R to look at what’s in variablename, not just use it as is. To do so, wrap the
variable name in your function definition
in {{ and }}, and change the = in mutate to :=.
clean_missing_df <- function(input_dataframe, variablename){
  input_dataframe %>%
    mutate({{ variablename }} := if_else({{ variablename }} == -99,
                                         NA,
                                         {{ variablename }}))
}
data_frame %>%
  clean_missing_df(variable1)
## # A tibble: 6 × 1
## variable1
## <dbl>
## 1 1
## 2 2
## 3 3
## 4 NA
## 5 5
## 6 6
map()
map() (from the purrr package, part of the tidyverse) works much like a loop, but is more compact. It takes two arguments:
- An iterable object, like a vector or list
- A function
It will then apply the function to each element of the object. It will return a list with the results of each iteration.
For example, to get the square root of each of the numbers in a vector:
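A minimal sketch, assuming the vector c(4, 9, 16) (inferred from the output shown here) and map() from the purrr package:

```r
library(purrr)

# apply sqrt() to each element; the result is a list with one entry per element
roots <- map(c(4, 9, 16), sqrt)
roots
```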
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
You may not quite like the fact that map() returns a list, but quite a few things in
R use lists, modelsummary() for example. Suppose I want to regress the same y on
a number of combinations of independent variables. I can put the combinations of
variables in a list, and then run my model for each element of the list.
I use reformulate() to convert strings into a formula lm() can use. Also,
I use ~ to create
a purrr-style inline anonymous function
within map().
library(modelsummary)
# generate some sample data
df <- tibble(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
# define a list of models:
# the left-hand sides are the labels
# the right-hand sides the independent variables I will pass to reformulate.
models <-
  list("X1 only" = "x1",
       "X2 only" = "x2",
       "Both" = c("x1", "x2"))
# map() works nicely in dplyr pipe!
# the ~ creates a purrr-style anonymous function, where .x is current element of the list
models %>%
  map(~lm(reformulate(.x, response = "y"), data = df)) %>%
  modelsummary(output = "flextable")
|  | X1 only | X2 only | Both |
|---|---|---|---|
| (Intercept) | -0.022 | -0.022 | -0.025 |
|  | (0.145) | (0.146) | (0.147) |
| x1 | 0.124 |  | 0.124 |
|  | (0.150) |  | (0.151) |
| x2 |  | 0.042 | 0.042 |
|  |  | (0.156) | (0.157) |
| Num.Obs. | 50 | 50 | 50 |
| R2 | 0.014 | 0.001 | 0.016 |
| R2 Adj. | -0.006 | -0.019 | -0.026 |
| AIC | 148.3 | 148.9 | 150.2 |
| BIC | 154.0 | 154.7 | 157.9 |
| Log.Lik. | -71.149 | -71.469 | -71.112 |
| F | 0.690 | 0.071 | 0.374 |
| RMSE | 1.00 | 1.01 | 1.00 |
Making functions re-usable with box
Now that you have a bunch of functions, it’s a good idea to make them re-usable across projects. The box package makes this easy.
You define your functions in a script, say box_modules/functions.R:
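A minimal sketch of what such a module file might contain. The function name prop_missing is hypothetical, and I have left the box::use(r/core[...]) line mentioned in the text as a comment, because this particular function only needs mean() and is.na() from base R:

```r
# box_modules/functions.R

# The module described in the text starts with this line:
# box::use(r/core[...])

#' Proportion of missing values in a vector (hypothetical example function)
#' @export
prop_missing <- function(x) {
  # is.na() gives TRUE/FALSE per element; the mean of that is the proportion
  mean(is.na(x))
}
```

The #' @export comment is what tells box that the function should be visible to whoever loads the module.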
This is just a function that returns the proportion of missing values in a vector.
box::use(r/core[...]) loads all functions from the core package,
so you can use them in your function (box modules don’t load anything you don’t tell them to).
You can then use this function in any project by loading the entire script with box::use(./box_modules/functions).
This looks in the folder where your script is located for a folder called box_modules, and loads all functions in the functions.R script.
You can run the function by:
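A self-contained sketch of both steps (loading and calling). Everything about the setup is an assumption for illustration: the temporary project folder, the hypothetical function name prop_missing, and the call to box::set_script_path(), which I use only so that ./ resolves inside the temporary folder:

```r
# Throwaway project folder standing in for a real project (assumption)
project_dir <- file.path(tempdir(), "box_demo_project")
dir.create(file.path(project_dir, "box_modules"),
           recursive = TRUE, showWarnings = FALSE)

# Write a one-function module; prop_missing is a hypothetical name
writeLines(
  c("#' @export",
    "prop_missing <- function(x) mean(is.na(x))"),
  file.path(project_dir, "box_modules", "functions.R")
)

# Tell box which script is "running", so ./ is resolved inside project_dir
script_file <- file.path(project_dir, "analysis.R")
file.create(script_file)
box::set_script_path(script_file)

# Load the module relative to the script, then call its function via $
box::use(./box_modules/functions)
prop <- functions$prop_missing(c(1, NA, 3, 4))
prop
```

In a real project the module file already exists next to your script, so only the box::use() line and the call are needed.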
## [1] 0.25
Box even allows you to load packages:
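A sketch of what that can look like, assuming the dplyr package and its starwars data (inferred from the output shown here); the filter on species is my guess at how the droids were selected:

```r
# load dplyr as an object called dplyr, without attaching anything
box::use(dplyr)

# functions and data are reached through the $ operator
droids <- dplyr$filter(dplyr$starwars, species == "Droid")
droids
```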
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculi…
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculi…
## 3 R5-D4 97 32 <NA> white, red red NA none masculi…
## 4 IG-88 200 140 none metal red 15 none masculi…
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine
## 6 BB8 NA NA none none black NA none masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
In fact, when writing your own box modules,
this is the only way to load packages (hence the box::use(r/core[...]) in the function above).
Note that you have to use the $ operator to access functions and data
from the package or module you loaded.
If you don’t want that, you can also attach specific functions:
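A sketch under the same assumptions as before (dplyr, its starwars data, and a filter on species that I inferred from the output):

```r
# attach only the names listed between the square brackets
box::use(dplyr[filter, starwars])

# the attached names can now be used without the dplyr$ prefix
droids <- filter(starwars, species == "Droid")
droids
```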
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculi…
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculi…
## 3 R5-D4 97 32 <NA> white, red red NA none masculi…
## 4 IG-88 200 140 none metal red 15 none masculi…
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine
## 6 BB8 NA NA none none black NA none masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Or attach everything, just like you would with library():
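Again a sketch that assumes dplyr and the starwars data; the [...] is box syntax for "attach all exported names":

```r
# attach everything dplyr exports, similar to library(dplyr)
box::use(dplyr[...])

droids <- filter(starwars, species == "Droid")
droids
```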
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculi…
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculi…
## 3 R5-D4 97 32 <NA> white, red red NA none masculi…
## 4 IG-88 200 140 none metal red 15 none masculi…
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine
## 6 BB8 NA NA none none black NA none masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Alright, let’s unload our own module with box::unload(), so we can load it in another way.
Alternatively, you can specify a search path,
where box will look for modules.
I will put my box modules in C:/Users/koen/documents/box_modules.
The following code will allow box to look in that folder for modules:
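Assuming the folder named above, the option to set would be box.path:

```r
# point box at the PARENT folder of box_modules
options(box.path = "C:/Users/koen/documents")
```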
Note that I have used the parent folder of box_modules,
because for whatever reason you have to do that
(the .R files need to be in a subfolder).
By adding this to your [.Rprofile](https://docs.posit.co/ide/user/ide/guide/environments/r/managing-r.html) file,
it will be run any time you open R.
Now I can just use the following code in any project to access my functions:
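A self-contained sketch of the search-path approach. To keep it runnable anywhere, a temporary folder stands in for my documents folder, and the module contains the hypothetical prop_missing function; both are assumptions for illustration:

```r
# Temporary folder standing in for the parent folder of box_modules (assumption)
modules_parent <- file.path(tempdir(), "box_search_path_demo")
dir.create(file.path(modules_parent, "box_modules"),
           recursive = TRUE, showWarnings = FALSE)

# Write a one-function module; prop_missing is a hypothetical name
writeLines(
  c("#' @export",
    "prop_missing <- function(x) mean(is.na(x))"),
  file.path(modules_parent, "box_modules", "functions.R")
)

# Add the parent folder to the box search path
options(box.path = modules_parent)

# No leading ./ here: box looks the module up on the search path
box::use(box_modules/functions)
prop <- functions$prop_missing(c(1, NA, 3))
prop
```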
## [1] 0.3333333
Of course you are no longer working project-oriented: your box modules aren’t in your project folder!
If you want to use box modules within your project folder, it is best to add the project root folder to the search path:
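A sketch built from the pieces described in the next sentences, assuming the here package is installed:

```r
# keep whatever is already on the search path and add the project root
options(box.path = c(getOption("box.path"), here::here()))
```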
The function getOption('box.path') gets your current box search path,
and c() combines it with your project root folder returned by here().
One way I like to organize my box modules is to have them in a git repository and use it as a submodule in my projects. That way you can have a specific version or branch of all your box modules in every project. You can also merge changes you make to your box modules in one project back into all your other active projects. It’s fantastic, but a bit finicky to set up.