Data Wrangling
Make sure you have the tidyverse
installed, and the SAFI data set
downloaded to your data folder by running the code from the Set-up section
Basic Data Manipulation
library(tidyverse)
library(here)
read_csv(here("data/SAFI_clean.csv"), na = "NULL") %>%
filter(village == "Chirodzo") %>%
mutate(people_per_room = no_membrs / rooms,
years_liv = if_else(years_liv > 90, NA, years_liv),
respondent_wall_type = as_factor(respondent_wall_type),
respondent_wall_type = fct_recode(respondent_wall_type,
"Burned bricks" = "burntbricks",
"Mud Daub" = "muddaub",
"Sun bricks" = "sunbricks"),
conflict_yn = case_when(affect_conflicts == "frequently" ~ 1,
affect_conflicts == "more_once" ~ 1,
affect_conflicts == "once" ~ 1,
affect_conflicts == "never" ~ 0,
.default = NA),
day = day(interview_date),
month = month(interview_date),
year = year(interview_date)) %>%
select(key_ID:rooms, day:people_per_room, -village) %>%
filter(interview_date > "2016-11-16" & interview_date < '2017-01-01')
## # A tibble: 10 × 9
## key_ID interview_date no_membrs years_liv respondent_wall_type rooms
## <dbl> <dttm> <dbl> <dbl> <fct> <dbl>
## 1 10 2016-12-16 00:00:00 12 23 Burned bricks 5
## 2 34 2016-11-17 00:00:00 8 18 Burned bricks 3
## 3 35 2016-11-17 00:00:00 5 45 Mud Daub 1
## 4 36 2016-11-17 00:00:00 6 23 Sun bricks 1
## 5 37 2016-11-17 00:00:00 3 8 Burned bricks 1
## 6 43 2016-11-17 00:00:00 7 29 Mud Daub 1
## 7 44 2016-11-17 00:00:00 2 6 Mud Daub 1
## 8 45 2016-11-17 00:00:00 9 7 Mud Daub 1
## 9 46 2016-11-17 00:00:00 10 42 Burned bricks 2
## 10 47 2016-11-17 00:00:00 2 2 Mud Daub 1
## # ℹ 3 more variables: day <int>, conflict_yn <dbl>, people_per_room <dbl>
Pivoting (or reshaping)
In tidyverse, reshaping is called pivoting. Here’s how you pivot a household roster (reshape wider) so you can merge it with the household data.
Creating fake data
First, I create sa fake household roster, based on the SAFI data,
making sure that the household roster has a number of lines
for each household that is equal to the household size, and has two
randomly generated variables: female
and age
. Note that age
may
be -99
, which should be considered missing.
long_data <-
read_csv(here("data/SAFI_clean.csv"), na = "NULL") %>%
select(key_ID,no_membrs ) %>%
uncount(no_membrs) %>%
group_by(key_ID) %>%
mutate(member_ID = row_number()) %>%
rowwise() %>%
mutate(female = sample(0:1,1),
age = case_when(member_ID == 1 ~ sample(18:86,1),
.default = sample(c(0:86,-99),1))) %>%
ungroup()
long_data
## # A tibble: 942 × 4
## key_ID member_ID female age
## <dbl> <int> <int> <dbl>
## 1 1 1 1 49
## 2 1 2 0 31
## 3 1 3 0 28
## 4 2 1 1 77
## 5 2 2 1 80
## 6 2 3 0 45
## 7 2 4 0 54
## 8 2 5 1 68
## 9 2 6 0 68
## 10 2 7 0 19
## # ℹ 932 more rows
Pivoting long to wide
To merge this into our main data set, we need to make sure we go back to having 1 observation per household. We will do this by using pivot_wider()
:
wide_data <-
long_data %>%
pivot_wider(names_from = member_ID,
values_from = !ends_with("_ID")) %>%
relocate(key_ID, ends_with(paste0("_",1:19)))
wide_data
## # A tibble: 131 × 39
## key_ID female_1 age_1 female_2 age_2 female_3 age_3 female_4 age_4 female_5
## <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>
## 1 1 1 49 0 31 0 28 NA NA NA
## 2 2 1 77 1 80 0 45 0 54 1
## 3 3 0 78 0 76 1 23 0 47 0
## 4 4 1 29 0 71 1 84 1 55 0
## 5 5 1 73 0 48 1 39 0 4 1
## 6 6 0 37 0 75 1 7 NA NA NA
## 7 7 1 85 0 20 1 61 1 42 0
## 8 8 0 31 0 16 1 23 1 45 1
## 9 9 1 33 1 52 0 -99 1 27 0
## 10 10 0 24 0 44 1 50 0 81 1
## # ℹ 121 more rows
## # ℹ 29 more variables: age_5 <dbl>, female_6 <int>, age_6 <dbl>,
## # female_7 <int>, age_7 <dbl>, female_8 <int>, age_8 <dbl>, female_9 <int>,
## # age_9 <dbl>, female_10 <int>, age_10 <dbl>, female_11 <int>, age_11 <dbl>,
## # female_12 <int>, age_12 <dbl>, female_13 <int>, age_13 <dbl>,
## # female_14 <int>, age_14 <dbl>, female_15 <int>, age_15 <dbl>,
## # female_16 <int>, age_16 <dbl>, female_17 <int>, age_17 <dbl>, …
We only needed to specify two options:
- names_from: this is the column that contains the names (or usually numbers)
for each of our units of analysis. In this case, the
member_ID
. - values_from: the variables containing the data. All variables you specify
here, will get one column for each possible value of names_from.
In our case, these variables
female
andage
. I used tidy select syntax to specify all variables except the ones ending in_ID
. - I use
relocate()
to reorder the variables.
Pivoting wide to long
If we had started with wide data, and had wanted to transform to
long data, we’d have to use pivot_longer()
:
## # A tibble: 4,978 × 3
## key_ID name value
## <dbl> <chr> <dbl>
## 1 1 female_1 1
## 2 1 age_1 49
## 3 1 female_2 0
## 4 1 age_2 31
## 5 1 female_3 0
## 6 1 age_3 28
## 7 1 female_4 NA
## 8 1 age_4 NA
## 9 1 female_5 NA
## 10 1 age_5 NA
## # ℹ 4,968 more rows
This was easy since the syntax of pivot_longer()
is the exact opposite of
pivot_wider()
, but the result is pretty useless:
- The
name
column contains two things: a variable name and amember_ID
; - The data is too long: I’d like
age
andfemale
to be two separate variables ; and - There’s many empty rows: there’s and age and female row for 19 possible members for each household, but most households are smaller than that.
I could use separate_wider_delim()
,
pivot_wider()
, and filter(!is.na())
to address those, but that’s not elegant
at all.
I can do all of this within the pivot_longer()
call by using the names_to
and names_sep
options:
wide_data %>%
pivot_longer(!key_ID,
names_to = c(".value", "member_ID"),
names_sep="_",
values_drop_na = TRUE,
names_transform = list(member_ID = as.integer))
## # A tibble: 942 × 4
## key_ID member_ID female age
## <dbl> <int> <int> <dbl>
## 1 1 1 1 49
## 2 1 2 0 31
## 3 1 3 0 28
## 4 2 1 1 77
## 5 2 2 1 80
## 6 2 3 0 45
## 7 2 4 0 54
## 8 2 5 1 68
## 9 2 6 0 68
## 10 2 7 0 19
## # ℹ 932 more rows
In this case, the syntax is a bit harder to understand. It’s good to think first
what the original data looks like, and how I intend to transform it.
The wide data has columns key_ID, age_1-19 and female_1-19.
I don’t really want to touch the key_ID column.
I want to turn the columns age_1-19 and female_1-19 into three columns:
female
, age
and member_ID
.
This translates to the options we passed to pivot_longer()
as follows:
!key_ID
: We want to pivot the data that’s in all columns except key_ID.names_to = c(".value", "member_ID")
: this specifies the new columns we want to create. It basically says that the existing column names consist of two parts: one part (i.e. female and age) that we wish to keep as column names of variables that will contain my values, and one part (i.e. the numbers 1-19) which should be put into a new column which we will “member_ID”.names_sep=
: this indicates how the two parts mentioned above are separated. In more difficult cases, you’ll have to use thenames_pattern
option. This requires some knowledge of regular expressions, so here’s two examples:- If there is no seperator (
age1
,female1
etc…):names_pattern = "(.*\\D)([0-9]+)$"
. In this regular expression,.*\\D
matches a string of any length, of any characters, as long as it ends with something other than a digit. The[0-9]+$
matches any number of digits at the end of the string. The parentheses indicate how the string should be separated to form variable names and member_ID. - If the separator is used in other places in variable names (
member_age_1
etc…):names_pattern = "(.*)_([0-9]+)$"
.
- If there is no seperator (
values_drop_na = TRUE
: tells R to drop rows that have missing data for all variables. This prevents the issue where we hadd too many rows.names_transform
: by default, allname
columns will be character types, butmember_ID
only contains integers, so we transform it to integer. This is completely optional.
Joining (or merging) data
Tidyverse has four functions to join (or merge, as Stata calls it) two
data sets. The functions that differ in the way they treat observations that are in one data set but not the other.
Consider the diagram below.
It has two data sets, x
(in Stata terms, this is the master data set) and y
(the using
data set in Stata terms). They have overlapping rows (area B), but also
rows that are only in x
(area A) or only in y
(area C).
The four join functions work as follows:
inner_join(x,y)
will only keep area B.left_join(x,y)
will keep areas A and B.right_join(x,y)
will keep areas B and C.full_join(x,y)
will keep areas A, B, and C.
In our case, the data sets match perfectly, i.e. we only have an area B, so there is no practical difference. I chose left_join()
so the number of
observations in my household survey is guaranteed to remain the same.
To merge the roster to the household data, we use the join_by function:
## Rows: 131 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): village, respondent_wall_type, memb_assoc, affect_conflicts, items...
## dbl (6): key_ID, no_membrs, years_liv, rooms, liv_count, no_meals
## dttm (1): interview_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Joining with `by = join_by(key_ID)`
## # A tibble: 131 × 52
## key_ID village interview_date no_membrs years_liv respondent_wall_type
## <dbl> <chr> <dttm> <dbl> <dbl> <chr>
## 1 1 God 2016-11-17 00:00:00 3 4 muddaub
## 2 2 God 2016-11-17 00:00:00 7 9 muddaub
## 3 3 God 2016-11-17 00:00:00 10 15 burntbricks
## 4 4 God 2016-11-17 00:00:00 7 6 burntbricks
## 5 5 God 2016-11-17 00:00:00 7 40 burntbricks
## 6 6 God 2016-11-17 00:00:00 3 3 muddaub
## 7 7 God 2016-11-17 00:00:00 6 38 muddaub
## 8 8 Chirodzo 2016-11-16 00:00:00 12 70 burntbricks
## 9 9 Chirodzo 2016-11-16 00:00:00 8 6 burntbricks
## 10 10 Chirodzo 2016-12-16 00:00:00 12 23 burntbricks
## # ℹ 121 more rows
## # ℹ 46 more variables: rooms <dbl>, memb_assoc <chr>, affect_conflicts <chr>,
## # liv_count <dbl>, items_owned <chr>, no_meals <dbl>, months_lack_food <chr>,
## # instanceID <chr>, female_1 <int>, age_1 <dbl>, female_2 <int>, age_2 <dbl>,
## # female_3 <int>, age_3 <dbl>, female_4 <int>, age_4 <dbl>, female_5 <int>,
## # age_5 <dbl>, female_6 <int>, age_6 <dbl>, female_7 <int>, age_7 <dbl>,
## # female_8 <int>, age_8 <dbl>, female_9 <int>, age_9 <dbl>, …
Note that we didn’t specify identifiers, like we would in Stata. R
assumed that the variables that appear in both data frames are the
identifiers, in this case key_ID
. Use the by
option to change this.
Going the other way around, joining the household data to the roster data, is equally easy:
long_data %>%
left_join(
read_csv(here("data/SAFI_clean.csv"), na = "NULL") %>%
select(key_ID,village,interview_date))
## Rows: 131 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): village, respondent_wall_type, memb_assoc, affect_conflicts, items...
## dbl (6): key_ID, no_membrs, years_liv, rooms, liv_count, no_meals
## dttm (1): interview_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Joining with `by = join_by(key_ID)`
## # A tibble: 942 × 6
## key_ID member_ID female age village interview_date
## <dbl> <int> <int> <dbl> <chr> <dttm>
## 1 1 1 1 49 God 2016-11-17 00:00:00
## 2 1 2 0 31 God 2016-11-17 00:00:00
## 3 1 3 0 28 God 2016-11-17 00:00:00
## 4 2 1 1 77 God 2016-11-17 00:00:00
## 5 2 2 1 80 God 2016-11-17 00:00:00
## 6 2 3 0 45 God 2016-11-17 00:00:00
## 7 2 4 0 54 God 2016-11-17 00:00:00
## 8 2 5 1 68 God 2016-11-17 00:00:00
## 9 2 6 0 68 God 2016-11-17 00:00:00
## 10 2 7 0 19 God 2016-11-17 00:00:00
## # ℹ 932 more rows
Note that here I only merged in two variables, by using select
and a pipe within the left_join()
function.
Summarizing over groups (or collapsing data)
To compute summary statistics (sums, counts, means etc.) over a group,
we use the group_by()
and summarize()
functions. For example, to
compute the household size, number of women and average age in each household.
But before doing anything, I make sure the -99
s in the age
variable are treated
as missing, using a simple mutate()
to conver them to NA
.
long_data %>%
group_by(key_ID) %>%
mutate(age = if_else(age == -99,NA,age)) %>%
summarize(hh_size = n(), num_women = sum(female), mean_age = mean(age, na.rm = TRUE))
## # A tibble: 131 × 4
## key_ID hh_size num_women mean_age
## <dbl> <int> <int> <dbl>
## 1 1 3 1 36
## 2 2 7 3 58.7
## 3 3 10 4 54.5
## 4 4 7 3 48.1
## 5 5 7 4 37.4
## 6 6 3 1 39.7
## 7 7 6 3 51.5
## 8 8 12 8 45.8
## 9 9 8 5 51.1
## 10 10 12 4 37.7
## # ℹ 121 more rows
Row-wise Operations
Suppose we wanted to run an operation over multiple variables. For example to
get the household size, number of women and average age from our wide data.
The easiest, and probably best, way to do this in R is by reshaping to long,
and then use summarize, like we did above. But in Stata you would probably use some sort of
egen
function, so that may come natural.
You can do similar things in R. It’s just a bit more complex than in Stata.
across(): doing the same operations on multiple variables using across
First, we need to make sure we update -99
to NA
in all age_
variables.
I use the across()
function, which takes two arguments: a column specifcation
(for which I use tidy select
syntax), and a function: basically, the syntax is the same as in the mutate()
step above, but with a tilde (~
) in front of ifelse()
and .x
instead of the variable name.
## # A tibble: 131 × 39
## key_ID female_1 age_1 female_2 age_2 female_3 age_3 female_4 age_4 female_5
## <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>
## 1 1 1 49 0 31 0 28 NA NA NA
## 2 2 1 77 1 80 0 45 0 54 1
## 3 3 0 78 0 76 1 23 0 47 0
## 4 4 1 29 0 71 1 84 1 55 0
## 5 5 1 73 0 48 1 39 0 4 1
## 6 6 0 37 0 75 1 7 NA NA NA
## 7 7 1 85 0 20 1 61 1 42 0
## 8 8 0 31 0 16 1 23 1 45 1
## 9 9 1 33 1 52 0 NA 1 27 0
## 10 10 0 24 0 44 1 50 0 81 1
## # ℹ 121 more rows
## # ℹ 29 more variables: age_5 <dbl>, female_6 <int>, age_6 <dbl>,
## # female_7 <int>, age_7 <dbl>, female_8 <int>, age_8 <dbl>, female_9 <int>,
## # age_9 <dbl>, female_10 <int>, age_10 <dbl>, female_11 <int>, age_11 <dbl>,
## # female_12 <int>, age_12 <dbl>, female_13 <int>, age_13 <dbl>,
## # female_14 <int>, age_14 <dbl>, female_15 <int>, age_15 <dbl>,
## # female_16 <int>, age_16 <dbl>, female_17 <int>, age_17 <dbl>, …
Notes:
- You can use
across(.cols = where(is.numeric), .fn = ...)
to apply a function to all numeric variables. - You can also combine
across()
withsummarize()
to summarize multiple variables more easily. See the section on faceting for an example.
c_across(): passing values from multiple variables to summarizing functions
Then I have to pass all the female_
and age_
variables to sum()
and mean()
:
wide_data %>%
mutate(across(.cols = starts_with("age_"),
.fn = ~if_else(.x == -99,NA,.x))) %>%
rowwise() %>%
mutate(mean_age = mean(c_across(starts_with("age_")),
na.rm=TRUE),
num_women = sum(c_across(starts_with("female_")),
na.rm=TRUE),
hh_size = sum(!is.na(c_across(starts_with("female_"))))) %>%
select(key_ID,hh_size,num_women, mean_age) %>%
ungroup()
## # A tibble: 131 × 4
## key_ID hh_size num_women mean_age
## <dbl> <int> <int> <dbl>
## 1 1 3 1 36
## 2 2 7 3 58.7
## 3 3 10 4 54.5
## 4 4 7 3 48.1
## 5 5 7 4 37.4
## 6 6 3 1 39.7
## 7 7 6 3 51.5
## 8 8 12 8 45.8
## 9 9 8 5 51.1
## 10 10 12 4 37.7
## # ℹ 121 more rows
The key trick here is the combination of rowwise()
and c_across()
.
rowwise()
ensures all summaries are computed per row, and c_across()
allows you to use tidy select syntax within the mean()
and sum()
functions.
Splitting multi-response variable into dummies
The SAFI data contains a number of columns that contain all responses selected
in a multiple response questions. For example, the variables items_owned
can
contain something like "bicycle;television;solar_panel;table"
. We want to
split this into dummies: one for each possible answers. There’s a number of
ways to do this, but the most convenient is using sepate_longer()
read_csv(here("data/SAFI_clean.csv"), na = "NULL") %>%
separate_longer_delim(items_owned, delim = ";") %>%
mutate(value = 1) %>%
pivot_wider(names_from = items_owned,
values_from = value,
names_glue = "owns_{items_owned}",
values_fill = 0) %>%
left_join(read_csv(here("data/SAFI_clean.csv"), na = "NULL") %>% select(key_ID,items_owned)) %>%
select(items_owned, starts_with("owns_")) %>%
head()
## # A tibble: 6 × 19
## items_owned owns_bicycle owns_television owns_solar_panel owns_table
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 bicycle;television;s… 1 1 1 1
## 2 cow_cart;bicycle;rad… 1 0 1 1
## 3 solar_torch 0 0 0 0
## 4 bicycle;radio;cow_pl… 1 0 1 0
## 5 motorcyle;radio;cow_… 0 0 0 0
## 6 <NA> 0 0 0 0
## # ℹ 14 more variables: owns_cow_cart <dbl>, owns_radio <dbl>,
## # owns_cow_plough <dbl>, owns_solar_torch <dbl>, owns_mobile_phone <dbl>,
## # owns_motorcyle <dbl>, owns_NA <dbl>, owns_fridge <dbl>,
## # owns_electricity <dbl>, owns_sofa_set <dbl>, owns_lorry <dbl>,
## # owns_sterio <dbl>, owns_computer <dbl>, owns_car <dbl>
Note that the original items_owned
variable is lost during the separate_longer_delim()
step, so I used left_join()
to merge it back in for demonstration purposes.
Renaming many variables
Let’s say we need to rename a bunch of variables.
One way is to use a named list:
# this follows the pattern new_name = old_name
newnames <- c("no_livestock" = "liv_count",
"no_rooms" = "rooms",
"affect_conflicts" = "ffect_conflicts")
read_csv(here("data/SAFI_clean.csv"), na = "NULL") %>%
rename(any_of(newnames)) %>%
glimpse()
## Rows: 131
## Columns: 14
## $ key_ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ village <chr> "God", "God", "God", "God", "God", "God", "God", …
## $ interview_date <dttm> 2016-11-17, 2016-11-17, 2016-11-17, 2016-11-17, …
## $ no_membrs <dbl> 3, 7, 10, 7, 7, 3, 6, 12, 8, 12, 6, 7, 6, 10, 5, …
## $ years_liv <dbl> 4, 9, 15, 6, 40, 3, 38, 70, 6, 23, 20, 20, 8, 20,…
## $ respondent_wall_type <chr> "muddaub", "muddaub", "burntbricks", "burntbricks…
## $ no_rooms <dbl> 1, 1, 1, 1, 1, 1, 1, 3, 1, 5, 1, 3, 1, 3, 2, 1, 1…
## $ memb_assoc <chr> NA, "yes", NA, NA, NA, NA, "no", "yes", "no", "no…
## $ affect_conflicts <chr> NA, "once", NA, NA, NA, NA, "never", "never", "ne…
## $ no_livestock <dbl> 1, 3, 1, 2, 4, 1, 1, 2, 3, 2, 2, 2, 3, 3, 3, 4, 1…
## $ items_owned <chr> "bicycle;television;solar_panel;table", "cow_cart…
## $ no_meals <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2…
## $ months_lack_food <chr> "Jan", "Jan;Sept;Oct;Nov;Dec", "Jan;Feb;Mar;Oct;N…
## $ instanceID <chr> "uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef", "uui…
Note that "affect_conflicts" = "ffect_conflicts"
didn’t do anything,
as there is no ffect_conflicts
variable.
any_of()
just ignored any variables not present in the data.
This can be useful in data pipelines using multiple files,
where some files have mistyped variable names.
What if we want to do the same thing to many variables? For example, turning them all to uppercase?
We could use newnames = c(KEY_ID = key_ID, VILLAGE = village)
etc. etc.,
but that’d be extremely tedious.
Instead, there is rename_with()
, which takes two arugments:
.fn
: A function that takes the variable names of your dataset as an argument..cols
: a tidy select statement defining which columns to change. Defaults to all columns.
This is what it’d look like:
## Rows: 131
## Columns: 14
## $ KEY_ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ VILLAGE <chr> "God", "God", "God", "God", "God", "God", "God", …
## $ INTERVIEW_DATE <dttm> 2016-11-17, 2016-11-17, 2016-11-17, 2016-11-17, …
## $ NO_MEMBRS <dbl> 3, 7, 10, 7, 7, 3, 6, 12, 8, 12, 6, 7, 6, 10, 5, …
## $ YEARS_LIV <dbl> 4, 9, 15, 6, 40, 3, 38, 70, 6, 23, 20, 20, 8, 20,…
## $ RESPONDENT_WALL_TYPE <chr> "muddaub", "muddaub", "burntbricks", "burntbricks…
## $ ROOMS <dbl> 1, 1, 1, 1, 1, 1, 1, 3, 1, 5, 1, 3, 1, 3, 2, 1, 1…
## $ MEMB_ASSOC <chr> NA, "yes", NA, NA, NA, NA, "no", "yes", "no", "no…
## $ AFFECT_CONFLICTS <chr> NA, "once", NA, NA, NA, NA, "never", "never", "ne…
## $ LIV_COUNT <dbl> 1, 3, 1, 2, 4, 1, 1, 2, 3, 2, 2, 2, 3, 3, 3, 4, 1…
## $ ITEMS_OWNED <chr> "bicycle;television;solar_panel;table", "cow_cart…
## $ NO_MEALS <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2…
## $ MONTHS_LACK_FOOD <chr> "Jan", "Jan;Sept;Oct;Nov;Dec", "Jan;Feb;Mar;Oct;N…
## $ INSTANCEID <chr> "uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef", "uui…
The variables names are used as the argument to toupper()
,
which returns them in upper case.
But what if we need to provide more arguments?
Let’s say we want to append _0
to all columns,
except the ID columns,
to indicate this is baseline data.
We use paste0()
which takes two arguments,
the variable name and string we wish to append:
read_csv(here("data/SAFI_clean.csv"), na = "NULL") %>%
rename_with(.fn = ~paste0(.x, "_0"), .cols = !key_ID) %>%
glimpse()
## Rows: 131
## Columns: 14
## $ key_ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ village_0 <chr> "God", "God", "God", "God", "God", "God", "God"…
## $ interview_date_0 <dttm> 2016-11-17, 2016-11-17, 2016-11-17, 2016-11-17…
## $ no_membrs_0 <dbl> 3, 7, 10, 7, 7, 3, 6, 12, 8, 12, 6, 7, 6, 10, 5…
## $ years_liv_0 <dbl> 4, 9, 15, 6, 40, 3, 38, 70, 6, 23, 20, 20, 8, 2…
## $ respondent_wall_type_0 <chr> "muddaub", "muddaub", "burntbricks", "burntbric…
## $ rooms_0 <dbl> 1, 1, 1, 1, 1, 1, 1, 3, 1, 5, 1, 3, 1, 3, 2, 1,…
## $ memb_assoc_0 <chr> NA, "yes", NA, NA, NA, NA, "no", "yes", "no", "…
## $ affect_conflicts_0 <chr> NA, "once", NA, NA, NA, NA, "never", "never", "…
## $ liv_count_0 <dbl> 1, 3, 1, 2, 4, 1, 1, 2, 3, 2, 2, 2, 3, 3, 3, 4,…
## $ items_owned_0 <chr> "bicycle;television;solar_panel;table", "cow_ca…
## $ no_meals_0 <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3,…
## $ months_lack_food_0 <chr> "Jan", "Jan;Sept;Oct;Nov;Dec", "Jan;Feb;Mar;Oct…
## $ instanceID_0 <chr> "uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef", "u…
In this case, we specify the function as a
a purrr-style inline anonymous function
(i.e. preceded by a ~
), and we supply the variable names as .x
.
(This all works the same as across()
above.)
Now, let’s replace all instances of membrs
or memb
in variable names with
members
(so no_membrs
becomes no_members
,
and memb_assoc
becomes members_assoc
.
For this I will use gsub()
,
which takes three arguments:
- A pattern (expressed as a regular expression) to look for;
- a replacement for all matches of the pattern; and,
- the data to look in (in this case the varaible names).
read_csv(here("data/SAFI_clean.csv"), na = "NULL") %>%
rename_with(.fn = ~gsub("membrs|memb", "members",.x)) %>%
glimpse()
## Rows: 131
## Columns: 14
## $ key_ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ village <chr> "God", "God", "God", "God", "God", "God", "God", …
## $ interview_date <dttm> 2016-11-17, 2016-11-17, 2016-11-17, 2016-11-17, …
## $ no_members <dbl> 3, 7, 10, 7, 7, 3, 6, 12, 8, 12, 6, 7, 6, 10, 5, …
## $ years_liv <dbl> 4, 9, 15, 6, 40, 3, 38, 70, 6, 23, 20, 20, 8, 20,…
## $ respondent_wall_type <chr> "muddaub", "muddaub", "burntbricks", "burntbricks…
## $ rooms <dbl> 1, 1, 1, 1, 1, 1, 1, 3, 1, 5, 1, 3, 1, 3, 2, 1, 1…
## $ members_assoc <chr> NA, "yes", NA, NA, NA, NA, "no", "yes", "no", "no…
## $ affect_conflicts <chr> NA, "once", NA, NA, NA, NA, "never", "never", "ne…
## $ liv_count <dbl> 1, 3, 1, 2, 4, 1, 1, 2, 3, 2, 2, 2, 3, 3, 3, 4, 1…
## $ items_owned <chr> "bicycle;television;solar_panel;table", "cow_cart…
## $ no_meals <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2…
## $ months_lack_food <chr> "Jan", "Jan;Sept;Oct;Nov;Dec", "Jan;Feb;Mar;Oct;N…
## $ instanceID <chr> "uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef", "uui…