Tidyr

Tidyr

{tidyr} is a package from the tidyverse that is meant to help you tidy up your data. What is considered ‘tidy’ data can be read in vignette("tidy-data"). In short, tidy data helps you use {dplyr} and other tidyverse tools in a way that lets you spend less time on structuring your data and more on performing your analyses.

Reshaping data

Wide to long

Some analyses require data in long format (i.e. one study participant can have multiple observations which are represented by multiple rows) while data might be retrieved in wide format (i.e. one study participant can have multiple observations which are represented by multiple columns). In such a case, {tidyr} offers us pivot_longer().

We can use pivot_longer() in a number of ways, but we must always supply the data in the argument data, and the columns to be pivoted in cols. In the simplest cases, this will suffice.

As an example, let’s look at billboard, a dataset available through {tidyr}. billboard contains songs ranking the top 100 in the year 2000 with their position in each week after entering the top 100 (wk).

# Load tidyr
pacman::p_load("tidyr")

# What does billboard look like?
head(billboard)
# A tibble: 6 × 79
  artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
  <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac       Baby… 2000-02-26      87    82    72    77    87    94    99    NA
2 2Ge+her     The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
3 3 Doors Do… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
4 3 Doors Do… Loser 2000-10-21      76    76    72    69    67    65    55    59
5 504 Boyz    Wobb… 2000-04-15      57    34    25    17    17    31    36    49
6 98^0        Give… 2000-08-19      51    39    34    26    26    19     2     2
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
# Pivot billboard to long format
billboard_long <- pivot_longer(billboard, wk1:wk76)

# Reprint billboard
head(billboard_long)
# A tibble: 6 × 5
  artist track                   date.entered name  value
  <chr>  <chr>                   <date>       <chr> <dbl>
1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94

We can see that the data now has much less columns (5 as opposed to 79 previously). Instead, the number of rows has been increased: each row now corresponds to a week for each song. The weeks were put into a column called ‘name’ by default, and the values that were in the original week columns went into a column called ‘value’ by default.

We can of course change the names of those columns:

# Pivot billboard to long format
billboard_long <- pivot_longer(billboard, wk1:wk76, names_to = "week", values_to = "position")

# Reprint billboard
head(billboard_long)
# A tibble: 6 × 5
  artist track                   date.entered week  position
  <chr>  <chr>                   <date>       <chr>    <dbl>
1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1         87
2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2         82
3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3         72
4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4         77
5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5         87
6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6         94

Currently, the week numbers are in the column called ‘week’, but are represented as a character string because the characters ‘wk’ are present in all columns. Although we will learn how to manipulate this in Regex, we can prevent this while calling the function:

# Pivot billboard to long format
billboard_long <- pivot_longer(billboard, wk1:wk76, names_to = "week", values_to = "position", names_prefix = "wk")

# Reprint billboard
head(billboard_long)
# A tibble: 6 × 5
  artist track                   date.entered week  position
  <chr>  <chr>                   <date>       <chr>    <dbl>
1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   1           87
2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   2           82
3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   3           72
4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   4           77
5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   5           87
6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   6           94

Although we have numbers now, they are still of class ‘character’. However, we can specify the type they should be using names_transform:

# Pivot billboard to long format
billboard_long <- pivot_longer(billboard, wk1:wk76, names_to = "week", values_to = "position", names_prefix = "wk", names_transform = as.numeric)

# Reprint billboard
head(billboard_long)
# A tibble: 6 × 5
  artist track                   date.entered  week position
  <chr>  <chr>                   <date>       <dbl>    <dbl>
1 2 Pac  Baby Don't Cry (Keep... 2000-02-26       1       87
2 2 Pac  Baby Don't Cry (Keep... 2000-02-26       2       82
3 2 Pac  Baby Don't Cry (Keep... 2000-02-26       3       72
4 2 Pac  Baby Don't Cry (Keep... 2000-02-26       4       77
5 2 Pac  Baby Don't Cry (Keep... 2000-02-26       5       87
6 2 Pac  Baby Don't Cry (Keep... 2000-02-26       6       94

This kind of control gives us much less head-ache when, for instance, we want to use the data for plotting (for which we will discuss the code in Plotting):

# Load packages
pacman::p_load("dplyr",   # Data wrangling
               "ggplot2"  # Data visualization
)

# Get only Britney Spears songs
dat_plot <- filter(billboard_long, artist == "Spears, Britney" & !is.na(position))

# Create plot
ggplot(dat_plot, aes(x = week, y = position, colour = track)) +
    # Geometries
    geom_line() +
    # Scaling
    scale_x_continuous(limits = c(0, 20)) +
    scale_y_continuous(limits = c(0, 100)) +
    scale_colour_manual(values = c("#d3b866", "#e09373", "#e586b4")) + 
    # Aesthetics
    theme(panel.grid = element_blank(),
          panel.background = element_rect(colour = "#002b36", fill = "#002b36"),
          plot.background = element_rect(colour = "#002b36", fill = "#002b36"),
          axis.text = element_text(colour = "#dee2e6"),
          axis.title = element_text(colour = "#dee2e6"),
          axis.line = element_line(colour = "#dee2e6"),
          axis.ticks = element_line(colour = "#dee2e6"),
          legend.text = element_text(colour = "#dee2e6"),
          legend.title = element_blank(),
          legend.background = element_rect(colour = "#002b36", fill = "#002b36"),
          legend.position = "bottom")

Long to wide

We can also reshape data from longer to wider format using pivot_wider(). We can simply do this by defining which column contains the name and which column contains the values. We can append the final name of the columns using the names_prefix argument.

# Pivot data to wider formar
billboard_wide <- pivot_wider(billboard_long, names_from = week, values_from = position, names_prefix = "wk")

# Preview data
head(billboard_wide)
# A tibble: 6 × 79
  artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
  <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac       Baby… 2000-02-26      87    82    72    77    87    94    99    NA
2 2Ge+her     The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
3 3 Doors Do… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
4 3 Doors Do… Loser 2000-10-21      76    76    72    69    67    65    55    59
5 504 Boyz    Wobb… 2000-04-15      57    34    25    17    17    31    36    49
6 98^0        Give… 2000-08-19      51    39    34    26    26    19     2     2
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

This returns the data to its original state (note that missings cannot be compared):

# Compare all individual values in the data frames
table(billboard == billboard_wide, useNA = "always")

 TRUE  <NA> 
 6258 18785 

Filling in NA’s

Single value for a whole column

Sometimes, we have missing values in columns (i.e. NA’s), that we want to change. This can be due to some assumptions we have made (e.g. absence of a diagnostic code in a health record means absence of the corresponding disease), because we want to name our missings something different than NA, or because we otherwise know what these values should be (e.g. an artificial indicator).

To do this, we can use the function replace_na(), which allows us to specify what missing values should be for each individual column specified. As an example, we will change missing values in the billboard data to -1 for the weeks 20 and 21, to indicate the song was no longer in the top 100 in the week specified by that column.

# Change NAs in weeks 66 and 67
billboard_replaced <- replace_na(billboard, list(wk20 = -1, wk21 = -1))

# Show that these values now exist where missing data first existed for week 20
table(billboard[["wk20"]], useNA = "always"); table(billboard_replaced[["wk20"]], useNA = "always")

   2    3    4    5    6    7    9   10   11   12   14   15   16   18   19   20 
   2    2    3    2    1    2    1    4    1    2    4    4    1    1    1    3 
  21   22   23   25   26   28   30   33   34   35   39   40   42   43   45   46 
   1    2    3    4    3    1    3    1    2    1    1    1    2    1    1    1 
  48   49   50   51   53   54   56   58   59   61   62   66   67   68   69   70 
   1    1    2    1    1    2    1    2    3    1    3    2    1    2    1    3 
  71   72   74   77   78   79   82   83   84   85   86   87   88   89   90   91 
   1    1    1    2    2    2    1    1    1    1    4    5    2    6    4    3 
  92   93   94   95   96   97   98   99  100 <NA> 
   2    1    2    3    1    2    1    3    5  171 

  -1    2    3    4    5    6    7    9   10   11   12   14   15   16   18   19 
 171    2    2    3    2    1    2    1    4    1    2    4    4    1    1    1 
  20   21   22   23   25   26   28   30   33   34   35   39   40   42   43   45 
   3    1    2    3    4    3    1    3    1    2    1    1    1    2    1    1 
  46   48   49   50   51   53   54   56   58   59   61   62   66   67   68   69 
   1    1    1    2    1    1    2    1    2    3    1    3    2    1    2    1 
  70   71   72   74   77   78   79   82   83   84   85   86   87   88   89   90 
   3    1    1    1    2    2    2    1    1    1    1    4    5    2    6    4 
  91   92   93   94   95   96   97   98   99  100 <NA> 
   3    2    1    2    3    1    2    1    3    5    0 

Based on other values in column

In some cases, we want to copy the value of a certain column to other, missing rows in that column. As an example, this could be the case of biological sex at birth is recorded only in a first study visit, even though biological sex at birth remains constant over time (because birth only occurs once). For this, we can use the fill() function.

The fill() function allows us to fill the value in a column. We can specify one of four directions for how the filling occurrs. Note this example data:

# Create example data for fill()
dat_fill <- tibble(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
                   visit = rep(1:3, 4),
                   bio_sex_birth = c("male", NA, NA, NA, "female", NA, 
                                     NA, NA, NA, "male", NA, NA))

# Show data
dat_fill
# A tibble: 12 × 3
      id visit bio_sex_birth
   <dbl> <int> <chr>        
 1     1     1 male         
 2     1     2 <NA>         
 3     1     3 <NA>         
 4     2     1 <NA>         
 5     2     2 female       
 6     2     3 <NA>         
 7     3     1 <NA>         
 8     3     2 <NA>         
 9     3     3 <NA>         
10     4     1 male         
11     4     2 <NA>         
12     4     3 <NA>         

We can see that for three individuals, we know their biological sex at birth, for one individual, this information was recored at visit 2 (ID 2), and for one individual (ID 3), this information is missing. To fill in these missing values, we can use fill():

# Fill missing values downwards
fill(dat_fill, bio_sex_birth, .direction = "down")
# A tibble: 12 × 3
      id visit bio_sex_birth
   <dbl> <int> <chr>        
 1     1     1 male         
 2     1     2 male         
 3     1     3 male         
 4     2     1 male         
 5     2     2 female       
 6     2     3 female       
 7     3     1 female       
 8     3     2 female       
 9     3     3 female       
10     4     1 male         
11     4     2 male         
12     4     3 male         

When the fill direction is down, we can see that each value is only filled in downwards. The downwards direction poses a problem for individual 2 that now switches biological sex at birth, which should not be possible.

# Fill missing values upwards
fill(dat_fill, bio_sex_birth, .direction = "up")
# A tibble: 12 × 3
      id visit bio_sex_birth
   <dbl> <int> <chr>        
 1     1     1 male         
 2     1     2 female       
 3     1     3 female       
 4     2     1 female       
 5     2     2 female       
 6     2     3 male         
 7     3     1 male         
 8     3     2 male         
 9     3     3 male         
10     4     1 male         
11     4     2 <NA>         
12     4     3 <NA>         

When the fill direction is up, we just fill upwards, which means that individual 4 does not get their missing values filled in.

# Fill missing values downwards then upwards
fill(dat_fill, bio_sex_birth, .direction = "downup")
# A tibble: 12 × 3
      id visit bio_sex_birth
   <dbl> <int> <chr>        
 1     1     1 male         
 2     1     2 male         
 3     1     3 male         
 4     2     1 male         
 5     2     2 female       
 6     2     3 female       
 7     3     1 female       
 8     3     2 female       
 9     3     3 female       
10     4     1 male         
11     4     2 male         
12     4     3 male         

The direction downup simply means that first we fill downwards and then upwards, meaning that when the filling upwards occurs, there are already some NA’s filled in by the downward filling.

# Fill missing values upwards then downwards
fill(dat_fill, bio_sex_birth, .direction = "updown")
# A tibble: 12 × 3
      id visit bio_sex_birth
   <dbl> <int> <chr>        
 1     1     1 male         
 2     1     2 female       
 3     1     3 female       
 4     2     1 female       
 5     2     2 female       
 6     2     3 male         
 7     3     1 male         
 8     3     2 male         
 9     3     3 male         
10     4     1 male         
11     4     2 male         
12     4     3 male         

The updown direction is alike the downup direction, but just the other way around.

Importantly however, all these directions fill also the missing values for individual 3, even though we do not know their actual value. This is why it is always important to evaluate whether you need to fill within a grouping structure. If we apply this here, we can see that both updown and downup give us the correct results:

# Fill missing values
dat_fill %>%
    # Arrange for grouping
    arrange(id) %>%
    # Create groups out of individuals
    group_by(id) %>%
    # Fill values downwards then upwards
    fill(bio_sex_birth, .direction = "downup") %>%
    # Remove grouping structure
    ungroup()
# A tibble: 12 × 3
      id visit bio_sex_birth
   <dbl> <int> <chr>        
 1     1     1 male         
 2     1     2 male         
 3     1     3 male         
 4     2     1 female       
 5     2     2 female       
 6     2     3 female       
 7     3     1 <NA>         
 8     3     2 <NA>         
 9     3     3 <NA>         
10     4     1 male         
11     4     2 male         
12     4     3 male         

There are also cases where we are only interested in filling values in a single direction only, such as when a measurement should only be available from a certain point onwards.

Exercises

For these exercises, we will use the storms dataset available in {dplyr}.

1. Continue storm category

Imagine that we want to say that a storm keeps their hurricane category, even after it ceases to be a hurricane. Tidy the data to portray this.

Answer
# Fill the category downwards in grouped data
storms %>%
    # Arrange for grouping
    arrange(name) %>%
    # Group per storm
    group_by(name) %>%
    # Fill category downwards only
    fill(category, .direction = "down") %>%
    # Remove grouping structure
    ungroup() %>%
    ## Only to show what answer should look like
    # Show storm Gladys as an example
    filter(name == "Gladys") %>%
    # Print all rows
    print(n = 46)
# A tibble: 46 × 13
   name    year month   day  hour   lat  long status     category  wind pressure
   <chr>  <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>         <dbl> <int>    <int>
 1 Gladys  1975     9    22    18  10.3 -34.8 tropical …       NA    25     1012
 2 Gladys  1975     9    23     0  10.6 -35.8 tropical …       NA    25     1012
 3 Gladys  1975     9    23     6  11   -36.7 tropical …       NA    25     1012
 4 Gladys  1975     9    23    12  11.4 -37.4 tropical …       NA    25     1012
 5 Gladys  1975     9    23    18  11.7 -38.2 tropical …       NA    30     1010
 6 Gladys  1975     9    24     0  12.1 -38.8 tropical …       NA    30     1010
 7 Gladys  1975     9    24     6  12.4 -39.6 tropical …       NA    30     1010
 8 Gladys  1975     9    24    12  12.9 -40   tropical …       NA    30     1010
 9 Gladys  1975     9    24    18  13.5 -40.4 tropical …       NA    35     1005
10 Gladys  1975     9    25     0  14.2 -41   tropical …       NA    40     1005
11 Gladys  1975     9    25     6  14.8 -42   tropical …       NA    50     1000
12 Gladys  1975     9    25    12  15.4 -43   tropical …       NA    60     1000
13 Gladys  1975     9    25    18  15.8 -44   hurricane         1    65      990
14 Gladys  1975     9    26     0  16.2 -45   hurricane         1    65      990
15 Gladys  1975     9    26     6  16.4 -46.1 hurricane         1    65      990
16 Gladys  1975     9    26    12  16.6 -47.7 hurricane         1    65      990
17 Gladys  1975     9    26    18  16.8 -49.3 hurricane         1    65      990
18 Gladys  1975     9    27     0  17.1 -50.7 hurricane         1    65      990
19 Gladys  1975     9    27     6  17.6 -52.2 hurricane         1    65      990
20 Gladys  1975     9    27    12  18.2 -53.7 hurricane         1    65      990
21 Gladys  1975     9    27    18  18.8 -55.1 hurricane         1    65      990
22 Gladys  1975     9    28     0  19.4 -56.4 hurricane         1    65      992
23 Gladys  1975     9    28     6  19.6 -57.4 hurricane         1    65      992
24 Gladys  1975     9    28    12  19.8 -58.2 hurricane         1    65     1000
25 Gladys  1975     9    28    18  20.3 -59.3 hurricane         1    65      998
26 Gladys  1975     9    29     0  21.2 -60.3 hurricane         1    65      995
27 Gladys  1975     9    29     6  22.1 -61.4 hurricane         1    70      990
28 Gladys  1975     9    29    12  23   -62.6 hurricane         1    70      990
29 Gladys  1975     9    29    18  23.6 -63.9 hurricane         1    75      985
30 Gladys  1975     9    30     0  24.1 -65.2 hurricane         1    80      975
31 Gladys  1975     9    30     6  24.6 -66.5 hurricane         1    80      975
32 Gladys  1975     9    30    12  25.1 -67.9 hurricane         1    80      975
33 Gladys  1975     9    30    18  25.6 -69.3 hurricane         1    80      975
34 Gladys  1975    10     1     0  26.1 -70.6 hurricane         1    80      975
35 Gladys  1975    10     1     6  26.8 -71.7 hurricane         1    80      975
36 Gladys  1975    10     1    12  27.9 -72.4 hurricane         1    80      975
37 Gladys  1975    10     1    18  29.4 -73   hurricane         2    90      969
38 Gladys  1975    10     2     0  31   -73   hurricane         3   100      954
39 Gladys  1975    10     2     6  32.9 -72.1 hurricane         3   110      942
40 Gladys  1975    10     2    12  35.3 -69.8 hurricane         4   120      939
41 Gladys  1975    10     2    18  37.8 -67   hurricane         4   120      939
42 Gladys  1975    10     3     0  40.8 -62.6 hurricane         3   110      950
43 Gladys  1975    10     3     6  43.7 -57   hurricane         2    85      960
44 Gladys  1975    10     3    12  46.6 -50.6 hurricane         2    85      960
45 Gladys  1975    10     3    18  50.5 -45.5 extratrop…        2    75      975
46 Gladys  1975    10     4     0  55   -40   extratrop…        2    65      980
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>

2. Indicate non-hurricanes

As an alternative, set category to -1 if the storm was not a hurricane (and therefore the category is NA).

Answer
# Set NA's to -1
storms_replaced <- replace_na(storms, list(category = -1))

# See categories in storms_replaced
table(storms_replaced[["category"]], useNA = "always")

   -1     1     2     3     4     5  <NA> 
14734  2548   993   593   553   116     0 

Next topic

Now that we have seen how we tidy our data, the last thing left for us to discuss in the tidyverse is how to combine all these functions into efficient pipelines using {magrittr} and the pipe (%>%).

Next: Magrittr