Dplyr

Dplyr

{dplyr}, a combination of data and plier, is a package that allows manipulation of data in an easy and efficient way. Whether you want to drop columns, drop rows, create new variables, or change old variables, {dplyr} allows you to do this in an intuitive way which requires little code (especially compared to base R). If you have not already installed and loaded {dplyr}, you can do so with:

# Load dplyr
pacman::p_load("dplyr")

In this section, we will work with the starwars dataset, that is automatically loaded in when you load {dplyr}.

# Show first 5 rows of starwars
head(starwars, n = 5)
# A tibble: 5 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Let’s load the starwars data into our Global Environment so that we can easily access it:

# Load starwars into Global Environment object called sw
sw <- starwars

All {dplyr} functions that take a data frame or tibble start with the .data argument. Therefore, in all examples below, we first specify our .data as sw before supplying arguments of the function.

Renaming and relocating

Before we can do anything to the data itself, we should make sure we like the names of the variables. Three of the variables contain the word ‘color’ (hair_color, skin_color, eye_color). However, we might think it is intuitive enough that these variables indicate colour, so maybe we want to change those names. For this we can use the function rename(). In rename(), we specify any number of arguments we want, where the left hand side of the argument indicates the new column name and the righth and side of the argument indicates the column that needs to be changed (and therefore the old column name). The right hand side can also be the position of the column (e.g., 2 for the second column).

# Rename variables
sw <- rename(sw, hair = hair_color, skin = skin_color, eye = eye_color)

Many functions, among which functions in {dplyr} like rename() have an argument called .... This ellipsis simply means that you can supply any number of arguments of the type that the function uses. For rename(), this means we can supply an endless list of variables to be renamed.

Now our data has some changed names, but we are not done yet. It might also be preferable that sex and gender are mentioned immediately after the name. To do this, we can simply use relocate(). In relocate(), we specify a group of columns and specify before (argument .before) or after (argument .after) which column they should be placed.

# Relocate variables
sw <- relocate(sw, sex:gender, .after = name)

# This has the same effect
sw <- relocate(sw, c(sex, gender), .before = height)

Filtering and selecting

Now that our data is a bit more how we (or I?) wanted it, we can also determine whether there are some columns or rows that we do not actually want or need. First of, because I am not interested in any data from outside the starwars universe, we can remove the column films. To do this, we use select(), which allows us to either specify the columns we want to keep, or specify the columns we want to drop. To keep columns, we simply name them and to drop columns we name them with a dash/minus sign before (-).

# Drop films column
sw <- select(sw, -films)

# This does the same thing
sw <- select(sw, name:species, vehicles:starships)

Additionally, I am not interested in any character (i.e., row) of who we do not know the mass or who is from Tatooine. Tatooine is full of sand and I don’t like sand. It’s coarse and rough. To remove rows, we can use the filter() function. In filter(), we supply conditions to which rows must adhere to stay.

# Remove rows with missing mass or with characters from Tatooine
sw <- filter(sw, !is.na(mass) & !is.na(height) & homeworld != "Tatooine")

Here, we use is.na() to see what rows are missing in the column mass and then take the reverse (!). We do the same for height. Additionally (&), the homeworld should not equal (!=) Tatooine.

Mutating and conditionals

Now that we have the data that we want, we might want to create some new variables. For instance, it would be interesting to know the Body Mass Index (BMI) of the characters, which we can calculate because we have height and mass. We can do that with mutate(). In this function, we can specify any number of arguments, with the left hand side of the argument being the name of the new variable and the right hand side being the value it should take.

A great characteristic of mutate() is that we can immediately use a variable we just calculted within the same mutate() function call. For instance, if we calculate BMI, we can immediately create a variable that indicates whether according to commonly used BMI dichotomisation for humans, a character classifies as overweight or not.

# Calculate BMI and overweight indicator
sw <- mutate(sw, 
             # Change height to meters
             height = height / 100,
             # Calculate BMI
             bmi = mass / height ^ 2,
             # Create indicator for overweight
             overweight = ifelse(bmi >= 25, 1, 0))

# Show summary of BMI
summary(sw[["bmi"]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.89   19.48   23.83   32.40   25.86  443.43 
# Show table of overweight
table(sw[["overweight"]])

 0  1 
32 16 

What we used to create the overweight variable was an if-else clause, which is a commonly used clause in programming. Simply, it checks whether a condition is true. If a condition is true, then it does A, otherwise it does B. In this case, ifelse() checks the condition bmi >= 25. If that condition is TRUE, then the variable overweight is assigned a 1, otherwise a 0.

If we want to be able to assign more than two values based on a single conditions, we can use two approaches. First, we can nest multiple ifelse() statements. Below, we code a variable weight_status which indicates underweight (BMI < 18, 1), normal weight (BMI 18-25, 2), overweight (BMI >25, 3), and obesity (BMI >30, 4).

# Create new variable weight status
sw <- mutate(sw, 
             # Create indicator for weight status
             weight_status = ifelse(bmi < 18, 1,
                                    ifelse(bmi >= 18 & bmi <= 25, 2,
                                           ifelse(bmi > 25 & bmi <= 30, 3, 4))))

This can quickly become unreadable if more groups should be made. Luckily, {dplyr} offers us case_when() and case_match(). Using case_when(), we only need to call one function and then supply all conditions. In case_when() this is written as condition ~ result:

# Create new variable weight status
sw <- mutate(sw, 
             # Create indicator for weight status
             weight_status = case_when(bmi < 18 ~ 1,
                                       bmi >= 18 & bmi <= 25 ~ 2,
                                       bmi > 25 & bmi <= 30 ~ 3,
                                       .default = 4))

We also specify .default = 4, which means that any row without a matched condition should receive the value 4. However, be careful if you have missing data, as .default will also fill in those values. If .default is not supplied, any row not matching a condition will get an NA.

case_when() is a great function and especially useful for multiple conditions that can be matched based on multiple columns or when we want to use conditions (>, <=, etc). Nonetheless, if we want to use only a single variable and do not need conditions, we can also use case_match(), where first the used variable is specified and then the matches, without having to respecify the variable in every match:

# Create new variable weight status
sw <- mutate(sw,
             # Create indicator for weight status
             weight_status = case_match(species,
                                        "Human" ~ "Humanoid",
                                        "Yoda's species" ~ "Small, green, and wrinkly with pointy ears",
                                        .default = species))

Working across columns

There might be a case where we want to change all numeric variables to character. In such case, we could write an extensive mutate argument as follows:

# Change numerics to character
sw <- mutate(sw,
             # Change height
             height = as.character(height),
             # Change mass
             mass = as.character(mass),
             # Change birth year
             birth_year = as.character(birth_year),
             # Change BMI
             bmi = as.character(bmi))

Even without the way of annotating that I try to circulate, this code would be unefficient to write. {dplyr} offers us across() to make our lives easier in such cases. With across(), we can apply multiple functions immediately to mulitple columns. If we want to change all those columns to character and check how many characters each has, we could simply do:

# Change numerics to character and show last 7 columns
mutate(sw, across(c(height, mass, birth_year, bmi), list(char = as.character, charn = nchar)))[18:24]
# A tibble: 48 × 7
   height_charn mass_char mass_charn birth_year_char birth_year_charn bmi_char  
          <int> <chr>          <int> <chr>                      <int> <chr>     
 1            4 32                 2 33                             2 34.722222…
 2            3 49                 2 19                             2 21.777777…
 3            4 77                 2 57                             2 23.245984…
 4            4 112                3 200                            3 21.545090…
 5            3 80                 2 29                             2 24.691358…
 6            4 74                 2 44                             2 24.725182…
 7            4 1358               4 600                            3 443.42857…
 8            3 77                 2 21                             2 26.643598…
 9            3 110                3 <NA>                          NA 33.950617…
10            3 75                 2 82                             2 25.951557…
# ℹ 38 more rows
# ℹ 1 more variable: bmi_charn <int>

The pipe: %>%

Before we continue with other useful {dplyr} functions, we should talk about the pipe operator %>%. Although %>% is not originally from {dplyr} but instead imported from {magrittr} that we will discuss after this section.

Normally, if we wanted to apply multiple functions at the same time (for instance, filter() and select() after one another), we would nest the functions:

# Keep only the birth year column and then keep only birth years above 100
filter(select(starwars, birth_year), birth_year > 100)
# A tibble: 5 × 1
  birth_year
       <dbl>
1        112
2        200
3        600
4        896
5        102

However, this quickly becomes unreadable. With %>%, we can transform this into more readable code. What %>% does is that it takes the outputted result from the previous function and feeds it into an argument in the next function called .data or data (if this argument is available). We will go into more detail in {magrittr}.

So, with %>%, we can rewrite the above code as:

# Take starwars data
starwars %>%
    # Keep only birth year column
    select(birth_year) %>%
    # Keep only birth years above 100
    filter(birth_year > 100)
# A tibble: 5 × 1
  birth_year
       <dbl>
1        112
2        200
3        600
4        896
5        102

Some more about annotation: with pipe operators, we can make our code much clearer and with that also our annotations. My personal preference is to annotate each new function after a pipe, so that it becomes even more readable. These annotations might just describe what we are doing, or give more details about why we are doing it. This makes it understandable to people who might not understand R equally well or know the functions we are using and allows better understanding of choices we made.

Grouping and slicing

Now that we know about pipe operators, we can apply them to some nice {dplyr} functions. Sometimes, we might want to calculate some statistics per group in our data. To do this, we can group the data on the variable(s) that define the groups and then run our analyses. Before we group, we must make sure the data are sorted so that all variables are already ordered in groups. We can do this with arrange(), which arranges variables in the order that you supply them to the function. If you want to arrange a variable in descending order, you can wrap it in desc(). Subsequently, we make the groups with group_by(), which makes groups in the order that the variables are supplied. When we are done working in the groups, we can ungroup the data again with ungroup(). For example, if we want to calculate the mean BMI per sex, we can do the following:

# Calculate mean BMI in strata of sex
sw <- sw %>%
    # Arrange for grouping
    arrange(sex) %>%
    # Group on sex
    group_by(sex) %>%
    # Calculate mean BMI
    mutate(mean_bmi = mean(bmi)) %>%
    # Ungroup again
    ungroup()

However, I am actually interested in the mean BMI per groups of sex and gender and only want to keep one row per group. For this, we can use slice(), which allows us subset rows based on their position.

# Calculate mean BMI in strata of sex
sw %>%
    # Arrange for grouping
    arrange(sex, gender) %>%
    # Group on sex and then gender
    group_by(sex, gender) %>%
    # Calculate mean BMI
    mutate(mean_bmi = mean(bmi)) %>%
    # Keep one row per group
    slice(1L) %>%           # Note that 1L means that 1 is an integer
    # Ungroup again
    ungroup() %>%
    # Keep only sex, gender, and mean BMI
    select(sex, gender, mean_bmi)
# A tibble: 5 × 3
  sex            gender    mean_bmi
  <chr>          <chr>        <dbl>
1 female         feminine      17.8
2 hermaphroditic masculine    443. 
3 male           masculine     24.6
4 none           masculine     34.7
5 <NA>           <NA>          24.6

slice() is a general function but has some more specifications such as slice_head() and slice_tail() which you can read about in the slice() help function (?slice).

Grouping functions

When we group our data, we can also extract some metadata about the groups we created with the following functions:

  • group_data(): metadata that defines the grouping structure

  • group_keys(): metadata that describes the groups

  • group_rows(): location of the rows in the original data per group

  • group_indices(): a vector with the group to which each row belongs

  • group_vars(): names of the grouping variables as a character

  • groups(): names of the grouping variables as a list

  • group_size(): size of each group

  • n_groups(): the number of groups

Summarising groups

Of course it is great to be able to calculate variables within groups, but there is an easier way to quickly get information on a group level with summarise(). This function creates a new data frame with any summary statistic that we specify per group. For instance, to calculate the mean and standard deviation of the birth year and the proportion of individuals without hair per planet, we can do:

# Use starwars
sw %>%
    # Arrange for grouping
    arrange(sex) %>%
    # Group on sex
    group_by(sex) %>%
    # Calculate summary statistics
    summarise(# Mean
              mean = mean(birth_year, na.rm = TRUE),
              # Standard deviation
              sd = sd(birth_year, na.rm = TRUE),
              # Proportion of no hair
              prop_no_hair = sum(hair == "none", na.rm = TRUE) / n())
# A tibble: 5 × 4
  sex             mean    sd prop_no_hair
  <chr>          <dbl> <dbl>        <dbl>
1 female          42.2  14.5        0.375
2 hermaphroditic 600    NA          0    
3 male            57.6  42.5        0.6  
4 none            33    NA          0    
5 <NA>           NaN    NA          0.333

The code for calculating the proportion of no hair was kindly supplied by Vera Broek.

Note that the NAs in the output result from the fact that we cannot take the standard deviation of a single observation. One of the functions we used was n(). This is one of the context functions supplied by {dplyr}, which allows us to quickly compute a group-specific variable. The available context functions are:

  • n(): the number of rows/observations in the group

  • cur_group(): the group keys (grouping variables)

  • cur_group_id(): gives the group a unique identifier

  • cur_group_rows(): gives the row indices in the ungrouped data for the current group

  • cur_column(): gives the name of the current column (only works in across())

Count and tally

tally() and count() are two wrappers around summarise:

  • tally() is equal to summarise but automatically calls n() (if it’s the first tally) or sum() (if it’s the second tally), and therefore an easy shorthand for summarise(n = n()).

  • count() is equal to group_by() followed by summarise(n = n()) and closed with ungroup().

Joins

There might be times where we work with data that is stored in different data sets. If we want to combine these data sets, e.g., x and y, into one bigger data set, we will often want to do this on the basis of one or more variables, such as a unique study participant identifier and/or a visit number. In this case, we can use joins from {dplyr}, which offers four joins. These joins all add the data together, but differ in what they do with individuals that are not matched in either dataset.

Function Will keep all observations in Can drop observations in
inner_join() neither x and y
left_join() x y
right_join() y x
full_join() x and y neither

To use a join function, we use a pipe operator after the data we want to be joined on (x) and then specify on what column we want to join the data.

# Get some columns from starwars
starwars_x <- starwars[, 1:3]

# Get some more columns from starwars
starwars_y <- starwars[, c(1, 4:6)]

# Join data together based on the name
starwars_z <- starwars_x %>%
    # Join some more columns from starwars
    left_join(starwars_y, "name")

If we would want to join on more columns than only name, we can put the joining columns in a vector with c().

Understanding pipelines

With the {dplyr} functions we have seen so far, we can manipulate our data quite easily and quickly. However, it might be confusing what some functions do, especially when a longer pipeline (i.e., multiple pipe operators) is used, such as the below example. If we want to better understand, we could run the code line for line and view the data frame to get a better understanding. However, external tools are also available to help us.

# Extensive pipeline
starwars %>%
    # Arrange for grouping
    arrange(sex) %>%
    # Group on sex
    group_by(sex) %>%
    # Create new variable BMI based on mass and height
    mutate(bmi = mass / (height / 100) ^ 2) %>%
    # Drop individuals with missing BMI
    filter(!is.na(bmi))

Tidy data tutor

With tidy data tutor, we can run your pipeline on data that can be loaded into R from packages, such as starwars from {dplyr}, and visualize what happens at each pipe operator. We can simply paste the pipeline and load in available data and run it to get a visualization. An example for the above code can be found here.

Microsoft datamations

Another possibliity is {datamations}, which gives us animations of how the pipeline operates. To use {datamations}, you can first install it with:

# Install datamations from github
devtools::install_github("microsoft/datamations")

Next, we can write your pipeline as you normally would. Subsequently, we put the whole pipeline between apostrophes to turn it into a string and use datamation_sanddance() to generate the animation:

"# Pipeline
starwars %>%
    # Remove hermaphroditic sex as this gives a large outlier
    filter(sex != 'hermaphroditic') %>%
    # Calculate BMI
    mutate(bmi =mass / (height / 100) ^ 2) %>%
     # Group on sex
    group_by(sex) %>%
    # Get mean BMI per group
    summarise(mean_bmi = mean(bmi))" %>% 
    # Get animation
    datamation_sanddance()

Note that {datamations} cannot use every {dplyr} function however. More information on {datamations} can be found here.

Other useful functions

We now discussed some of the most useful functions in {dplyr}, but there are other useful functions available. This list is not meant to make you remember everything, but more so to introduce you to them once, so that you might recognize when you are trying to perform an operation for which {dplyr} has an efficient implementation. Moreover, this list is not exhaustive, but just a selection of useful functions; for everything available in {dplyr}, see here.

Selecting parts of our data

  • pull(): we can extract a single column with pull(), which is similar to subsetting with the $ operator.

  • pick(): if we are inside a function such as mutate() or summarise(), we can use pick() to subset multiple columns.

  • slice_sample(): select random rows. This is especially useful if you have a rather large data set and first want to test your code on a random subset of your data before running it on the complete data, which might take longer.

  • distinct(): keep only one row per unique combination of columns supplied to the function. This function drops all columns not specified, unless .keep_all = TRUE.

Getting information on and in our data

  • lag(): take a column’s value from the previous row (will be NA for the first row).

  • lead(): take a column’s value from the next row (will be NA for the last row).

  • first(), last(), and nth(): take the first, last, or nth (can be specified in the function) value in a column.

  • n_distinct(): count the total number of unique values.

  • glimpse(): get a quick glimpse of your data, similar to pressing on the blue button in front of the data in the global environment.

Manipulating our data

  • rows(): perform row-based modifications on a data frame using information in another data frame. See ?rows for a detailed explanation.

  • bind_rows(): add data frames together based on rows. This is a more efficient implementation of rbind() and does not require the different data frames to all have the same columns (the final data frame will have all columns of the individual data frames that are bound).

  • bind_cols(): similar to bind_rows() but for columns, improving on cbind().

  • rowwise(): perform a function per row in the data. This is useful when a vectorized function (i.e., a function that operates over the entire vector) is not available. For more detailed explanation, see ?rowwise.

  • na_if(): replace a certain value y in vector x with NA.

  • if_else(): an improved version of ifelse() that can also take into account missings.

Extensions

{dtplyr}

{data.table} is a package that works with data frames at a much higher speed than other R packages. Although {data.table} is discussed in more detail in another section, here we shortly discuss the {dplyr} integration with {data.table}.

We can use lazy_dt() to create a data table that tracks any operations and then transform it using as.data.table(), as.data.frame(), or as_tibble().

Although {dtplyr} is not as fast as {data.table}, it does allow using the readable {dplyr} while approaching the speed of {data.table}.

{dbplyr}

Sometimes, our data is not stored locally but in a remote database, such as SQL. In that case, you can use {dbplyr} to manipulate the data as if it were in-memory using {dplyr} functionality.

Using the dbConnect() from {DBI}, we can connect a data base from SQL to an object in R and then manipulate it. Behind the scenes, your R code is translated into SQL code. To get started, you can look up the vignette (vignette(sql)) or see here and here.

{sparklyr}

Apache Spark is a powerful framework for processing data quickly and efficiently. It is popular due to it’s speed with big data. Although not officially a {dplyr} extension, {dplyr} and other R code can be used to access Apache Spark through {sparklyr}. You can find more information here.

Wrapping up

We now saw the majority of functions that {dplyr} offers us to manipulate our data into taking the shape we want it to take. If you ever want to have a quick look at what functions are available again, without the explanations in this tutorial, you can check out the {dplyr} reference guide or the documentation.

Exercises

1. Subset data

From the data mtcars, keep only cars with a weight (wt) of 3 or higher and store this into an object dat

Answer
dat <- filter(mtcars, wt >= 3)

2. Create weight category

Create a column wt_cat categorizing wt into smaller than 4 (0) and 4 or greater (1) in dat.

Answer
# Solution 1: using ifelse
dat <- mutate(dat, wt_cat = ifelse(wt < 4, 0, 1))

# Solution 2: using if_else
dat <- mutate(dat, wt_cat = if_else(wt < 4, 0, 1))

# Solution 3: using case_when
dat <- mutate(dat, wt_cat = case_when(wt < 4 ~ 0, .default = 1))

3. Drop columns

Remove the hp and vs columns from dat.

Answer
dat <- select(dat, -c(hp, vs))

4. Get group statistics

Determine the number of cars for each level of cyl

Answer
# Solution 1: using group_by and summarise
dat %>%
    # Arrange data for grouping
    arrange(cyl) %>% 
    # Group per level of cylinders
    group_by(cyl) %>% 
    # Get count per level of cylinders
    summarise(n = n())
# A tibble: 3 × 2
    cyl     n
  <dbl> <int>
1     4     2
2     6     4
3     8    14
Answer
# Solution 2: using group_by and tally
dat %>%
    # Arrange for grouping
    arrange(cyl) %>%
    # Group per level of cylinders
    group_by(cyl) %>%
    # Tally per group of cylinders
    tally()
# A tibble: 3 × 2
    cyl     n
  <dbl> <int>
1     4     2
2     6     4
3     8    14
Answer
# Solution 3: using count
dat %>%
    # Arrange for grouping
    arrange(cyl) %>%
    # Count per group of cylinders
    count(cyl)
  cyl  n
1   4  2
2   6  4
3   8 14
Answer
# Solution 4: using gropu_by and mutate
dat %>%
    # Arrange for grouping
    arrange(cyl) %>%
    # Group per level of cylinders
    group_by(cyl) %>%
    # Per group, calculate number of cylinders
    mutate(n = n()) %>%
    # Per group, keep one row
    slice(1L) %>%
    # Ungroup again
    ungroup() %>%
    # Keep only relevant columns
    select(cyl, n)
# A tibble: 3 × 2
    cyl     n
  <dbl> <int>
1     4     2
2     6     4
3     8    14

5. Put it all in a pipeline

Now perform exercises 1 through 4 but use one continuous pipeline

Answer
# You can combine any of the solutions of exercise 1 through 4, here I show just one:
dat <- mtcars %>%
    # Keep only cars with weight >= 3
    filter(wt >= 3) %>%
    # Create category for weight
    mutate(wt_cat = ifelse(wt < 4, 0, 1)) %>%
    # Remove columns hp and vs
    select(-c(hp, vs)) %>%
    # Arrange for grouping
    arrange(cyl) %>%
    # Count per group of cylinders
    count(cyl)

Next topic

Now that we know the grammar of {dplyr}, we can see how {tidyr} assists us if the data we want to manipulate is not quite tidy enough yet.

Next: Tidyr