Dplyr
{dplyr}, a combination of 'data' and 'plier', is a package that allows manipulation of data in an easy and efficient way. Whether you want to drop columns, drop rows, create new variables, or change old variables, {dplyr} allows you to do this in an intuitive way that requires little code (especially compared to base R). If you have not already installed and loaded {dplyr}, you can do so with:
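For example (install.packages() only needs to be run once per machine):

```r
# Install {dplyr} once, then load it at the start of each session
# install.packages("dplyr")
library(dplyr)
```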
In this section, we will work with the starwars dataset, which is loaded automatically when you load {dplyr}.
# A tibble: 5 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth Va… 202 136 none white yellow 41.9 male mascu…
5 Leia Org… 150 49 brown light brown 19 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Let’s load the starwars data into our Global Environment so that we can easily access it:
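A minimal sketch of this step:

```r
library(dplyr)

# Assign the built-in starwars tibble to an object in our Global Environment
sw <- starwars
```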
All {dplyr} functions that take a data frame or tibble start with the .data argument. Therefore, in all examples below, we first specify our .data as sw before supplying the other arguments of the function.
Renaming and relocating
Before we can do anything to the data itself, we should make sure we like the names of the variables. Three of the variables contain the word ‘color’ (hair_color, skin_color, eye_color). However, we might think it is intuitive enough that these variables indicate colour, so maybe we want to change those names. For this we can use the function rename(). In rename(), we specify any number of arguments, where the left-hand side of each argument indicates the new column name and the right-hand side indicates the column that needs to be changed (and therefore the old column name). The right-hand side can also be the position of the column (e.g., 2 for the second column).
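A sketch of such a renaming step; the shortened names (hair, skin, eye) are my choice here, matching the names used in the summarise() example later in this section:

```r
library(dplyr)
sw <- starwars

# rename(): new_name = old_name
sw <- rename(sw,
             hair = hair_color,
             skin = skin_color,
             eye  = eye_color)
```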
Many functions, among which {dplyr} functions like rename(), have an argument called .... This ellipsis simply means that you can supply any number of arguments of the type that the function uses. For rename(), this means we can supply an endless list of variables to be renamed.
Now our data has some changed names, but we are not done yet. It might also be preferable that sex and gender are mentioned immediately after the name. To do this, we can simply use relocate(). In relocate(), we specify a group of columns and specify before (argument .before) or after (argument .after) which column they should be placed.
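For instance, to move sex and gender directly after name (a sketch following the text above):

```r
library(dplyr)
sw <- starwars

# Place sex and gender immediately after the name column
sw <- relocate(sw, sex, gender, .after = name)
```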
Filtering and selecting
Now that our data is a bit more how we (or I?) wanted it, we can also determine whether there are some columns or rows that we do not actually want or need. First off, because I am not interested in any data from outside the starwars universe, we can remove the column films. To do this, we use select(), which allows us to either specify the columns we want to keep or the columns we want to drop. To keep columns, we simply name them; to drop columns, we name them with a dash/minus sign in front (-).
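Dropping films could then look like this (a sketch):

```r
library(dplyr)
sw <- starwars

# Drop the films column by prefixing it with a minus sign
sw <- select(sw, -films)
```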
Additionally, I am not interested in any character (i.e., row) of whom we do not know the mass or who is from Tatooine. Tatooine is full of sand and I don’t like sand. It’s coarse and rough. To remove rows, we can use the filter() function. In filter(), we supply conditions to which rows must adhere to stay.
Here, we use is.na() to see what rows are missing in the column mass and then take the reverse (!). We do the same for height. Additionally (&), the homeworld should not equal (!=) Tatooine.
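Put together, that filter step could look like this (a sketch of the conditions described above):

```r
library(dplyr)
sw <- starwars

# Keep rows with a known mass and height, from any homeworld except Tatooine
sw <- filter(sw,
             !is.na(mass) & !is.na(height) & homeworld != "Tatooine")
```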
Mutating and conditionals
Now that we have the data that we want, we might want to create some new variables. For instance, it would be interesting to know the Body Mass Index (BMI) of the characters, which we can calculate because we have height and mass. We can do that with mutate(). In this function, we can specify any number of arguments, with the left-hand side of each argument being the name of the new variable and the right-hand side being the value it should take.
A great characteristic of mutate() is that we can immediately use a variable we just calculated within the same mutate() call. For instance, if we calculate BMI, we can immediately create a variable that indicates whether, according to the commonly used BMI dichotomisation for humans, a character classifies as overweight or not.
# Calculate BMI and overweight indicator
sw <- mutate(sw,
# Change height to meters
height = height / 100,
# Calculate BMI
bmi = mass / height ^ 2,
# Create indicator for overweight
overweight = ifelse(bmi >= 25, 1, 0))
# Show summary of BMI
summary(sw[["bmi"]])
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.89 19.48 23.83 32.40 25.86 443.43
# Show counts of overweight
table(sw[["overweight"]])
 0  1
32 16
What we used to create the overweight variable was an if-else clause, which is commonly used in programming. Simply put, it checks whether a condition is true. If the condition is true, then it does A; otherwise it does B. In this case, ifelse() checks the condition bmi >= 25. If that condition is TRUE, then the variable overweight is assigned a 1, otherwise a 0.
If we want to be able to assign more than two values based on a single condition, we can use two approaches. First, we can nest multiple ifelse() statements. Below, we code a variable weight_status which indicates underweight (BMI < 18, 1), normal weight (BMI 18-25, 2), overweight (BMI 25-30, 3), and obesity (BMI > 30, 4).
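One way to write those nested ifelse() calls (a sketch; the cut-offs follow the categories above):

```r
library(dplyr)

# Start from starwars and compute BMI first
sw <- starwars %>%
  mutate(bmi = mass / (height / 100) ^ 2,
         # Nested ifelse(): conditions are evaluated from the outside in
         weight_status = ifelse(bmi < 18, 1,
                         ifelse(bmi <= 25, 2,
                         ifelse(bmi <= 30, 3, 4))))
```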
This can quickly become unreadable if more groups should be made. Luckily, {dplyr} offers us case_when() and case_match(). Using case_when(), we only need to call one function and then supply all conditions. In case_when(), this is written as condition ~ result:
# Create new variable weight status
sw <- mutate(sw,
# Create indicator for weight status
weight_status = case_when(bmi < 18 ~ 1,
bmi >= 18 & bmi <= 25 ~ 2,
bmi > 25 & bmi <= 30 ~ 3,
.default = 4))
We also specify .default = 4, which means that any row without a matched condition should receive the value 4. However, be careful if you have missing data, as .default will also fill in those values. If .default is not supplied, any row not matching a condition will get an NA.
case_when() is a great function and especially useful when rows can be matched based on multiple columns or when we want to use conditions (>, <=, etc.). Nonetheless, if we want to use only a single variable and do not need conditions, we can also use case_match(), where we first specify the variable and then the matches, without having to respecify the variable in every match:
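For example, recoding sex into a numeric code (a hypothetical recoding for illustration; sex_num is not part of the original data):

```r
library(dplyr)

sw <- starwars %>%
  # case_match(): name the variable once, then list value ~ result pairs
  mutate(sex_num = case_match(sex,
                              "female" ~ 1,
                              "male"   ~ 2,
                              # Everything else (including NA) becomes 3
                              .default = 3))
```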
Working across columns
There might be a case where we want to change all numeric variables to character. In such a case, we could write an extensive mutate call as follows:
# Change numerics to character
sw <- mutate(sw,
# Change height
height = as.character(height),
# Change mass
mass = as.character(mass),
# Change birth year
birth_year = as.character(birth_year),
# Change BMI
bmi = as.character(bmi))
Even without the way of annotating that I try to circulate, this code would be inefficient to write. {dplyr} offers us across() to make our lives easier in such cases. With across(), we can apply multiple functions to multiple columns at once. If we want to change all those columns to character and check how many characters each value has, we could simply do:
# Change numerics to character and show last 7 columns
mutate(sw, across(c(height, mass, birth_year, bmi), list(char = as.character, charn = nchar)))[18:24]
# A tibble: 48 × 7
height_charn mass_char mass_charn birth_year_char birth_year_charn bmi_char
<int> <chr> <int> <chr> <int> <chr>
1 4 32 2 33 2 34.722222…
2 3 49 2 19 2 21.777777…
3 4 77 2 57 2 23.245984…
4 4 112 3 200 3 21.545090…
5 3 80 2 29 2 24.691358…
6 4 74 2 44 2 24.725182…
7 4 1358 4 600 3 443.42857…
8 3 77 2 21 2 26.643598…
9 3 110 3 <NA> NA 33.950617…
10 3 75 2 82 2 25.951557…
# ℹ 38 more rows
# ℹ 1 more variable: bmi_charn <int>
The pipe: %>%
Before we continue with other useful {dplyr} functions, we should talk about the pipe operator %>%. Note that %>% is not originally from {dplyr} but is imported from {magrittr}, which we will discuss after this section.
Normally, if we wanted to apply multiple functions (for instance, filter() and select() after one another), we would nest the functions:
# Keep only the birth year column and then keep only birth years above 100
filter(select(starwars, birth_year), birth_year > 100)
# A tibble: 5 × 1
birth_year
<dbl>
1 112
2 200
3 600
4 896
5 102
However, this quickly becomes unreadable. With %>%, we can transform this into more readable code. What %>% does is take the outputted result from the previous function and feed it into an argument in the next function called .data or data (if this argument is available). We will go into more detail in {magrittr}.
So, with %>%, we can rewrite the above code as:
# Take starwars data
starwars %>%
# Keep only birth year column
select(birth_year) %>%
# Keep only birth years above 100
filter(birth_year > 100)
# A tibble: 5 × 1
birth_year
<dbl>
1 112
2 200
3 600
4 896
5 102
Some more about annotation: with pipe operators, we can make our code much clearer and with that also our annotations. My personal preference is to annotate each new function after a pipe, so that it becomes even more readable. These annotations might just describe what we are doing, or give more details about why we are doing it. This makes it understandable to people who might not understand R equally well or know the functions we are using and allows better understanding of choices we made.
Grouping and slicing
Now that we know about pipe operators, we can apply them to some nice {dplyr} functions. Sometimes, we might want to calculate some statistics per group in our data. To do this, we can group the data on the variable(s) that define the groups and then run our analyses. Before we group, we must make sure the data are sorted so that all rows are already ordered in groups. We can do this with arrange(), which arranges variables in the order that you supply them to the function. If you want to arrange a variable in descending order, you can wrap it in desc(). Subsequently, we make the groups with group_by(), which makes groups in the order that the variables are supplied. When we are done working in the groups, we can ungroup the data again with ungroup(). For example, if we want to calculate the mean BMI per sex, we can do the following:
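A sketch of that arrange/group/ungroup cycle (using na.rm = TRUE because mass and height contain missing values here):

```r
library(dplyr)

sw_mean <- starwars %>%
  # Calculate BMI
  mutate(bmi = mass / (height / 100) ^ 2) %>%
  # Arrange for grouping
  arrange(sex) %>%
  # Group on sex
  group_by(sex) %>%
  # Mean BMI per sex, ignoring missing values
  mutate(mean_bmi = mean(bmi, na.rm = TRUE)) %>%
  # Ungroup again
  ungroup()
```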
However, I am actually interested in the mean BMI per group of sex and gender and only want to keep one row per group. For this, we can use slice(), which allows us to subset rows based on their position.
# Calculate mean BMI in strata of sex
sw %>%
# Arrange for grouping
arrange(sex, gender) %>%
# Group on sex and then gender
group_by(sex, gender) %>%
# Calculate mean BMI
mutate(mean_bmi = mean(bmi)) %>%
# Keep one row per group
slice(1L) %>% # Note that 1L means that 1 is an integer
# Ungroup again
ungroup() %>%
# Keep only sex, gender, and mean BMI
select(sex, gender, mean_bmi)
# A tibble: 5 × 3
sex gender mean_bmi
<chr> <chr> <dbl>
1 female feminine 17.8
2 hermaphroditic masculine 443.
3 male masculine 24.6
4 none masculine 34.7
5 <NA> <NA> 24.6
slice() is a general function but has some more specific variants such as slice_head() and slice_tail(), which you can read about in the slice() help file (?slice).
Grouping functions
When we group our data, we can also extract some metadata about the groups we created with the following functions:
group_data(): metadata that defines the grouping structure
group_keys(): metadata that describes the groups
group_rows(): location of the rows in the original data per group
group_indices(): a vector with the group to which each row belongs
group_vars(): names of the grouping variables as a character vector
groups(): names of the grouping variables as a list
group_size(): the size of each group
n_groups(): the number of groups
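A quick illustration of a few of these:

```r
library(dplyr)

# Group starwars on sex and inspect the grouping metadata
grouped <- group_by(starwars, sex)
n_groups(grouped)    # number of groups
group_size(grouped)  # number of rows per group
group_vars(grouped)  # names of the grouping variables
```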
Summarising groups
Of course it is great to be able to calculate variables within groups, but there is an easier way to quickly get information at the group level: summarise(). This function creates a new data frame with any summary statistic that we specify per group. For instance, to calculate the mean and standard deviation of the birth year and the proportion of individuals without hair per sex, we can do:
# Use starwars
sw %>%
# Arrange for grouping
arrange(sex) %>%
# Group on sex
group_by(sex) %>%
# Calculate summary statistics
summarise(# Mean
mean = mean(birth_year, na.rm = TRUE),
# Standard deviation
sd = sd(birth_year, na.rm = TRUE),
# Proportion of no hair
prop_no_hair = sum(hair == "none", na.rm = TRUE) / n())
# A tibble: 5 × 4
sex mean sd prop_no_hair
<chr> <dbl> <dbl> <dbl>
1 female 42.2 14.5 0.375
2 hermaphroditic 600 NA 0
3 male 57.6 42.5 0.6
4 none 33 NA 0
5 <NA> NaN NA 0.333
The code for calculating the proportion of no hair was kindly supplied by Vera Broek.
Note that the NAs in the output result from the fact that we cannot take the standard deviation of a single observation. One of the functions we used was n(). This is one of the context functions supplied by {dplyr}, which allow us to quickly compute group-specific values. The available context functions are:
n(): the number of rows/observations in the group
cur_group(): the group keys (grouping variables)
cur_group_id(): gives the group a unique identifier
cur_group_rows(): gives the row indices in the ungrouped data for the current group
cur_column(): gives the name of the current column (only works in across())
Count and tally
tally() and count() are two wrappers around summarise():
tally(): equal to summarise() but automatically calls n() (for a first tally) or sum(n) (for a second tally), and is therefore an easy shorthand for summarise(n = n()).
count(): equal to group_by() followed by summarise(n = n()) and closed with ungroup().
Joins
There might be times when we work with data that are stored in different data sets. If we want to combine these data sets, e.g., x and y, into one bigger data set, we will often want to do this on the basis of one or more variables, such as a unique study participant identifier and/or a visit number. In this case, we can use joins from {dplyr}, which offers four joins. These joins all add the data together but differ in what they do with observations that are not matched in either dataset.
| Function | Will keep all observations in | Can drop observations in |
|---|---|---|
| inner_join() | neither | x and y |
| left_join() | x | y |
| right_join() | y | x |
| full_join() | x and y | neither |
To use a join function, we use a pipe operator after the data we want to join on (x) and then specify on what column we want to join the data.
# Get some columns from starwars
starwars_x <- starwars[, 1:3]
# Get some more columns from starwars
starwars_y <- starwars[, c(1, 4:6)]
# Join data together based on the name
starwars_z <- starwars_x %>%
# Join some more columns from starwars
left_join(starwars_y, "name")
If we want to join on more columns than only name, we can put the joining columns in a vector with c().
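For example, joining on both name and homeworld (hypothetical column splits for illustration):

```r
library(dplyr)

# Split starwars into two overlapping pieces
x <- starwars[, c("name", "homeworld", "height")]
y <- starwars[, c("name", "homeworld", "mass")]

# Join on the combination of name and homeworld
z <- left_join(x, y, by = c("name", "homeworld"))
```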
Understanding pipelines
With the {dplyr} functions we have seen so far, we can manipulate our data quite easily and quickly. However, it might be confusing what some functions do, especially when a longer pipeline (i.e., multiple pipe operators) is used, such as the example below. To understand it better, we could run the code line by line and view the data frame. However, external tools are also available to help us.
# Extensive pipeline
starwars %>%
# Arrange for grouping
arrange(sex) %>%
# Group on sex
group_by(sex) %>%
# Create new variable BMI based on mass and height
mutate(bmi = mass / (height / 100) ^ 2) %>%
# Drop individuals with missing BMI
filter(!is.na(bmi))
Tidy data tutor
With Tidy Data Tutor, we can run a pipeline on data that can be loaded into R from packages, such as starwars from {dplyr}, and visualize what happens at each pipe operator. We simply paste the pipeline, load the available data, and run it to get a visualization. An example for the above code can be found here.
Microsoft datamations
Another possibility is {datamations}, which gives us animations of how the pipeline operates. To use {datamations}, you can first install it with:
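{datamations} is not on CRAN, so installation goes through GitHub; the repository name microsoft/datamations is an assumption here, so check the project page if it has moved:

```r
# Install {remotes} first if needed: install.packages("remotes")
# Repository name assumed; verify on the project page
remotes::install_github("microsoft/datamations")
```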
Next, we write the pipeline as we normally would. Subsequently, we put the whole pipeline between quotation marks to turn it into a string and use datamation_sanddance() to generate the animation:
"# Pipeline
starwars %>%
# Remove hermaphroditic sex as this gives a large outlier
filter(sex != 'hermaphroditic') %>%
# Calculate BMI
mutate(bmi = mass / (height / 100) ^ 2) %>%
# Group on sex
group_by(sex) %>%
# Get mean BMI per group
summarise(mean_bmi = mean(bmi))" %>%
# Get animation
datamation_sanddance()
Note that {datamations} cannot use every {dplyr} function, however. More information on {datamations} can be found here.
Other useful functions
We have now discussed some of the most useful functions in {dplyr}, but there are other useful functions available. This list is not meant to make you remember everything, but rather to introduce you to these functions once, so that you might recognize when you are trying to perform an operation for which {dplyr} has an efficient implementation. Moreover, this list is not exhaustive, but just a selection of useful functions; for everything available in {dplyr}, see here.
Selecting parts of our data
pull(): extract a single column, similar to subsetting with the $ operator.
pick(): inside a function such as mutate() or summarise(), use pick() to subset multiple columns.
slice_sample(): select random rows. This is especially useful if you have a rather large data set and first want to test your code on a random subset of your data before running it on the complete data, which might take longer.
distinct(): keep only one row per unique combination of the columns supplied to the function. This function drops all columns not specified, unless .keep_all = TRUE.
Getting information on and in our data
lag(): take a column’s value from the previous row (will be NA for the first row).
lead(): take a column’s value from the next row (will be NA for the last row).
first(), last(), and nth(): take the first, last, or nth (can be specified in the function) value in a column.
n_distinct(): count the total number of unique values.
glimpse(): get a quick glimpse of your data, similar to pressing the blue button in front of the data in the Global Environment.
Manipulating our data
rows(): perform row-based modifications on a data frame using information in another data frame. See ?rows for a detailed explanation.
bind_rows(): add data frames together based on rows. This is a more efficient implementation of rbind() and does not require the different data frames to all have the same columns (the final data frame will have all columns of the individual data frames that are bound).
bind_cols(): similar to bind_rows() but for columns, improving on cbind().
rowwise(): perform a function per row in the data. This is useful when a vectorized function (i.e., a function that operates over the entire vector) is not available. For a more detailed explanation, see ?rowwise.
na_if(): replace a certain value y in vector x with NA.
if_else(): an improved version of ifelse() that can also take missings into account.
Extensions
{dtplyr}
{data.table} is a package that works with data frames at a much higher speed than other R packages. Although {data.table} is discussed in more detail in another section, here we briefly discuss the {dplyr} integration with {data.table}.
We can use lazy_dt() to create a data table that tracks any operations and then transform it using as.data.table(), as.data.frame(), or as_tibble().
Although {dtplyr} is not as fast as {data.table} itself, it does allow using the readable {dplyr} syntax while approaching the speed of {data.table}.
{dbplyr}
Sometimes, our data is not stored locally but in a remote database, such as SQL. In that case, you can use {dbplyr} to manipulate the data as if it were in memory using {dplyr} functionality.
Using dbConnect() from {DBI}, we can connect a SQL database to an object in R and then manipulate it. Behind the scenes, our R code is translated into SQL code. To get started, you can look up the vignette (vignette("sql")) or see here and here.
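A self-contained sketch using an in-memory SQLite database (requires {DBI}, {RSQLite}, and {dbplyr} to be installed):

```r
library(DBI)
library(dplyr)

# Connect to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() returns a remote table that understands {dplyr} verbs
result <- tbl(con, "mtcars") %>%
  filter(cyl == 6) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  # collect() runs the translated SQL and brings the result into R
  collect()

dbDisconnect(con)
```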
{sparklyr}
Apache Spark is a powerful framework for processing data quickly and efficiently. It is popular due to its speed with big data. Although not officially a {dplyr} extension, {dplyr} and other R code can be used to access Apache Spark through {sparklyr}. You can find more information here.
Wrapping up
We have now seen the majority of functions that {dplyr} offers us to manipulate our data into the shape we want it to take. If you ever want a quick overview of the available functions, without the explanations in this tutorial, you can check out the {dplyr} reference guide or the documentation.
Exercises
1. Subset data
From the data mtcars, keep only cars with a weight (wt) of 3 or higher and store this in an object dat.
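Answer (one possible solution):

```r
library(dplyr)

# Keep only cars with a weight of 3 or higher
dat <- mtcars %>%
  filter(wt >= 3)
```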
2. Create weight category
Create a column wt_cat categorizing wt into smaller than 4 (0) and 4 or greater (1) in dat.
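Answer (one possible solution):

```r
library(dplyr)

# Continue from exercise 1
dat <- filter(mtcars, wt >= 3)

# 0 for weight below 4, 1 for 4 or greater
dat <- mutate(dat, wt_cat = ifelse(wt < 4, 0, 1))
```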
3. Drop columns
Remove the hp and vs columns from dat.
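Answer (one possible solution):

```r
library(dplyr)

# Continue from exercises 1 and 2
dat <- mtcars %>%
  filter(wt >= 3) %>%
  mutate(wt_cat = ifelse(wt < 4, 0, 1))

# Drop the hp and vs columns
dat <- select(dat, -c(hp, vs))
```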
4. Get group statistics
Determine the number of cars for each level of cyl
Answer
# Solution 1: using group_by and summarise
dat %>%
# Group per level of cylinders
group_by(cyl) %>%
# Count the rows per group
summarise(n = n())
# A tibble: 3 × 2
cyl n
<dbl> <int>
1 4 2
2 6 4
3 8 14
Answer
# Solution 2: using group_by and tally
dat %>%
# Group per level of cylinders
group_by(cyl) %>%
# tally() is shorthand for summarise(n = n())
tally()
# A tibble: 3 × 2
cyl n
<dbl> <int>
1 4 2
2 6 4
3 8 14
Answer
# Solution 3: using count
dat %>%
# count() groups, counts, and ungroups in one call
count(cyl)
cyl n
1 4 2
2 6 4
3 8 14
Answer
# Solution 4: using group_by and mutate
dat %>%
# Arrange for grouping
arrange(cyl) %>%
# Group per level of cylinders
group_by(cyl) %>%
# Per group, calculate number of cylinders
mutate(n = n()) %>%
# Per group, keep one row
slice(1L) %>%
# Ungroup again
ungroup() %>%
# Keep only relevant columns
select(cyl, n)
# A tibble: 3 × 2
cyl n
<dbl> <int>
1 4 2
2 6 4
3 8 14
5. Put it all in a pipeline
Now perform exercises 1 through 4, but use one continuous pipeline.
Answer
# You can combine any of the solutions of exercise 1 through 4, here I show just one:
dat <- mtcars %>%
# Keep only cars with weight >= 3
filter(wt >= 3) %>%
# Create category for weight
mutate(wt_cat = ifelse(wt < 4, 0, 1)) %>%
# Remove columns hp and vs
select(-c(hp, vs)) %>%
# Arrange for grouping
arrange(cyl) %>%
# Count per group of cylinders
count(cyl)
Next topic
Now that we know the grammar of {dplyr}, we can see how {tidyr} assists us if the data we want to manipulate is not quite tidy enough yet.
Next: Tidyr