{tidyr} is a package from the tidyverse that is meant to help you tidy up your data. What is considered ‘tidy’ data can be read in vignette("tidy-data"). In short, tidy data helps you use {dplyr} and other tidyverse tools in a way that lets you spend less time on structuring your data and more on performing your analyses.
Reshaping data
Wide to long
Some analyses require data in long format (i.e. one study participant can have multiple observations which are represented by multiple rows) while data might be retrieved in wide format (i.e. one study participant can have multiple observations which are represented by multiple columns). In such a case, {tidyr} offers us pivot_longer().
We can use pivot_longer() in a number of ways, but we must always supply the data in the argument data, and the columns to be pivoted in cols. In the simplest cases, this will suffice.
As an example, let’s look at billboard, a dataset available through {tidyr}. billboard contains songs ranking the top 100 in the year 2000 with their position in each week after entering the top 100 (wk).
# Load tidyrpacman::p_load("tidyr")# What does billboard look like?head(billboard)
We can see that the data now has much less columns (5 as opposed to 79 previously). Instead, the number of rows has been increased: each row now corresponds to a week for each song. The weeks were put into a column called ‘name’ by default, and the values that were in the original week columns went into a column called ‘value’ by default.
We can of course change the names of those columns:
# Pivot billboard to long formatbillboard_long <-pivot_longer(billboard, wk1:wk76, names_to ="week", values_to ="position")# Reprint billboardhead(billboard_long)
Currently, the week numbers are in the column called ‘week’, but are represented as a character string because the characters ‘wk’ are present in all columns. Although we will learn how to manipulate this in Regex, we can prevent this while calling the function:
# Pivot billboard to long formatbillboard_long <-pivot_longer(billboard, wk1:wk76, names_to ="week", values_to ="position", names_prefix ="wk")# Reprint billboardhead(billboard_long)
This kind of control gives us much less head-ache when, for instance, we want to use the data for plotting (for which we will discuss the code in Plotting):
# Load packagespacman::p_load("dplyr", # Data wrangling"ggplot2"# Data visualization)# Get only Britney Spears songsdat_plot <-filter(billboard_long, artist =="Spears, Britney"&!is.na(position))# Create plotggplot(dat_plot, aes(x = week, y = position, colour = track)) +# Geometriesgeom_line() +# Scalingscale_x_continuous(limits =c(0, 20)) +scale_y_continuous(limits =c(0, 100)) +scale_colour_manual(values =c("#d3b866", "#e09373", "#e586b4")) +# Aestheticstheme(panel.grid =element_blank(),panel.background =element_rect(colour ="#002b36", fill ="#002b36"),plot.background =element_rect(colour ="#002b36", fill ="#002b36"),axis.text =element_text(colour ="#dee2e6"),axis.title =element_text(colour ="#dee2e6"),axis.line =element_line(colour ="#dee2e6"),axis.ticks =element_line(colour ="#dee2e6"),legend.text =element_text(colour ="#dee2e6"),legend.title =element_blank(),legend.background =element_rect(colour ="#002b36", fill ="#002b36"),legend.position ="bottom")
Long to wide
We can also reshape data from longer to wider format using pivot_wider(). We can simply do this by defining which column contains the name and which column contains the values. We can append the final name of the columns using the names_prefix argument.
# Pivot data to wider formarbillboard_wide <-pivot_wider(billboard_long, names_from = week, values_from = position, names_prefix ="wk")# Preview datahead(billboard_wide)
This returns the data to its original state (note that missings cannot be compared):
# Compare all individual values in the data framestable(billboard == billboard_wide, useNA ="always")
TRUE <NA>
6258 18785
Filling in NA’s
Single value for a whole column
Sometimes, we have missing values in columns (i.e. NA’s), that we want to change. This can be due to some assumptions we have made (e.g. absence of a diagnostic code in a health record means absence of the corresponding disease), because we want to name our missings something different than NA, or because we otherwise know what these values should be (e.g. an artificial indicator).
To do this, we can use the function replace_na(), which allows us to specify what missing values should be for each individual column specified. As an example, we will change missing values in the billboard data to -1 for the weeks 20 and 21, to indicate the song was no longer in the top 100 in the week specified by that column.
# Change NAs in weeks 66 and 67billboard_replaced <-replace_na(billboard, list(wk20 =-1, wk21 =-1))# Show that these values now exist where missing data first existed for week 20table(billboard[["wk20"]], useNA ="always"); table(billboard_replaced[["wk20"]], useNA ="always")
In some cases, we want to copy the value of a certain column to other, missing rows in that column. As an example, this could be the case of biological sex at birth is recorded only in a first study visit, even though biological sex at birth remains constant over time (because birth only occurs once). For this, we can use the fill() function.
The fill() function allows us to fill the value in a column. We can specify one of four directions for how the filling occurrs. Note this example data:
# Create example data for fill()dat_fill <-tibble(id =c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),visit =rep(1:3, 4),bio_sex_birth =c("male", NA, NA, NA, "female", NA, NA, NA, NA, "male", NA, NA))# Show datadat_fill
We can see that for three individuals, we know their biological sex at birth, for one individual, this information was recored at visit 2 (ID 2), and for one individual (ID 3), this information is missing. To fill in these missing values, we can use fill():
# Fill missing values downwardsfill(dat_fill, bio_sex_birth, .direction ="down")
# A tibble: 12 × 3
id visit bio_sex_birth
<dbl> <int> <chr>
1 1 1 male
2 1 2 male
3 1 3 male
4 2 1 male
5 2 2 female
6 2 3 female
7 3 1 female
8 3 2 female
9 3 3 female
10 4 1 male
11 4 2 male
12 4 3 male
When the fill direction is down, we can see that each value is only filled in downwards. The downwards direction poses a problem for individual 2 that now switches biological sex at birth, which should not be possible.
# Fill missing values upwardsfill(dat_fill, bio_sex_birth, .direction ="up")
# A tibble: 12 × 3
id visit bio_sex_birth
<dbl> <int> <chr>
1 1 1 male
2 1 2 female
3 1 3 female
4 2 1 female
5 2 2 female
6 2 3 male
7 3 1 male
8 3 2 male
9 3 3 male
10 4 1 male
11 4 2 <NA>
12 4 3 <NA>
When the fill direction is up, we just fill upwards, which means that individual 4 does not get their missing values filled in.
# Fill missing values downwards then upwardsfill(dat_fill, bio_sex_birth, .direction ="downup")
# A tibble: 12 × 3
id visit bio_sex_birth
<dbl> <int> <chr>
1 1 1 male
2 1 2 male
3 1 3 male
4 2 1 male
5 2 2 female
6 2 3 female
7 3 1 female
8 3 2 female
9 3 3 female
10 4 1 male
11 4 2 male
12 4 3 male
The direction downup simply means that first we fill downwards and then upwards, meaning that when the filling upwards occurs, there are already some NA’s filled in by the downward filling.
# Fill missing values upwards then downwardsfill(dat_fill, bio_sex_birth, .direction ="updown")
# A tibble: 12 × 3
id visit bio_sex_birth
<dbl> <int> <chr>
1 1 1 male
2 1 2 female
3 1 3 female
4 2 1 female
5 2 2 female
6 2 3 male
7 3 1 male
8 3 2 male
9 3 3 male
10 4 1 male
11 4 2 male
12 4 3 male
The updown direction is alike the downup direction, but just the other way around.
Importantly however, all these directions fill also the missing values for individual 3, even though we do not know their actual value. This is why it is always important to evaluate whether you need to fill within a grouping structure. If we apply this here, we can see that both updown and downup give us the correct results:
# Fill missing valuesdat_fill %>%# Arrange for groupingarrange(id) %>%# Create groups out of individualsgroup_by(id) %>%# Fill values downwards then upwardsfill(bio_sex_birth, .direction ="downup") %>%# Remove grouping structureungroup()
# A tibble: 12 × 3
id visit bio_sex_birth
<dbl> <int> <chr>
1 1 1 male
2 1 2 male
3 1 3 male
4 2 1 female
5 2 2 female
6 2 3 female
7 3 1 <NA>
8 3 2 <NA>
9 3 3 <NA>
10 4 1 male
11 4 2 male
12 4 3 male
There are also cases where we are only interested in filling values in a single direction only, such as when a measurement should only be available from a certain point onwards.
Exercises
For these exercises, we will use the storms dataset available in {dplyr}.
1. Continue storm category
Imagine that we want to say that a storm keeps their hurricane category, even after it ceases to be a hurricane. Tidy the data to portray this.
Answer
# Fill the category downwards in grouped datastorms %>%# Arrange for groupingarrange(name) %>%# Group per stormgroup_by(name) %>%# Fill category downwards onlyfill(category, .direction ="down") %>%# Remove grouping structureungroup() %>%## Only to show what answer should look like# Show storm Gladys as an examplefilter(name =="Gladys") %>%# Print all rowsprint(n =46)
As an alternative, set category to -1 if the storm was not a hurricane (and therefore the category is NA).
Answer
# Set NA's to -1storms_replaced <-replace_na(storms, list(category =-1))# See categories in storms_replacedtable(storms_replaced[["category"]], useNA ="always")
-1 1 2 3 4 5 <NA>
14734 2548 993 593 553 116 0
Next topic
Now that we have seen how we tidy our data, the last thing left for us to discuss in the tidyverse is how to combine all these functions into efficient pipelines using {magrittr} and the pipe (%>%).