Tidyverse
What is the tidyverse?
R is a great language for statistical programming, but some operations can be strenuous to perform. The tidyverse is a collection of packages that aims to make these operations easier. They range from data manipulation and visualization to working specifically with dates. The tidyverse allows these operations to be written in an easy-to-read and easy-to-write style, with all packages integrating fluently with one another (I swear, this is not an advertisement).
Several packages form the core of the tidyverse, all of which are discussed in this tutorial:
Package | Focuses on | Discussed in |
---|---|---|
{tibble} | Better data frames | Tidyverse |
{dplyr} | Data manipulation | Dplyr |
{tidyr} | Data tidying | Tidyr |
{readr} | Reading in data | Data |
{purrr} | Programming with functions | Functions |
{stringr} | Working with strings | Regex |
{ggplot2} | Data visualization | Plotting |
{forcats} | Working with factors | Plotting |
Besides these packages, the tidyverse also contains other packages that support these packages or add other functionality.
Installing tidyverse
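To install all packages of the tidyverse, we can simply run install.packages("tidyverse"). Attaching it afterwards, for example with library(tidyverse), prints which core packages were loaded and which functions they mask: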
── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to
become errors
However, note that loading {tidyverse} only loads the core packages, as we see in the output. If we want to use packages from the tidyverse that are not part of the core set, we need to load those packages separately.
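As a minimal sketch of this, assume we need {readxl}, which is installed together with the tidyverse but is not part of the core set. The choice of {readxl} is only an illustration; any non-core tidyverse package is loaded the same way:

```r
# Attach the core tidyverse packages
library(tidyverse)

# {readxl} is installed with the tidyverse but is not attached by
# library(tidyverse), so it needs its own library() call
library(readxl)

# Check which packages are currently attached
"package:readxl" %in% search()
```

Alternatively, a single function from a non-core package can be called without attaching the package by prefixing it with the package name, as in readxl::read_excel().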
Shiver my timbers tibbles
Although R normally works with data frames, the tidyverse works with tibbles. Tibbles are an enhanced type of data frame that try to accomplish two things:
They try to do less
They complain more
As stated by the documentation, this is useful because it: “…forces you to confront problems earlier, typically leading to cleaner, more expressive code”.
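To make "do less" and "complain more" a bit more concrete, here is a small sketch that compares a data frame and a tibble. The objects df and tb and the column name abc are made up for this illustration:

```r
library(tibble)

df <- data.frame(abc = 1:3)
tb <- tibble(abc = 1:3)

# Complain more: a data frame silently matches partial column names,
# whereas a tibble returns NULL and warns about the unknown column
df$a
tb$a

# Do less: a data frame changes column names it considers invalid,
# whereas a tibble keeps them exactly as given
names(data.frame(`my var` = 1))  # "my.var"
names(tibble(`my var` = 1))      # "my var"

# Do less: selecting a single column with [ drops a data frame to a vector,
# whereas a tibble always stays a tibble
class(df[, "abc"])  # "integer"
class(tb[, "abc"])  # "tbl_df" "tbl" "data.frame"
```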
Tidyverse automatically creates tibbles, but you can also make tibbles yourself. Similar to data.frame(), which we saw before, a tibble can be created with tibble(). Additionally, pre-existing data frames can be transformed to tibbles with as_tibble().
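Both routes could look as follows. This is only a sketch: the tibble people and its values are invented for the example, while iris is the dataset that ships with R:

```r
library(tibble)

# Create a tibble column by column, much like data.frame()
people <- tibble(
  name = c("Ada", "Grace", "Linus"),
  age = c(36, 45, 28),
  age_next_year = age + 1  # a column can refer to columns defined before it
)

# Transform an existing data frame into a tibble
iris_tbl <- as_tibble(iris)

# A tibble is still a data frame, with some extra classes on top
class(iris_tbl)  # "tbl_df" "tbl" "data.frame"
```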
Besides better functionality, tibbles also print cleaner. Compare printing the first 15 rows of the data frame iris to printing all rows of the tibble iris:
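The two printouts that follow can be reproduced along these lines. This is a sketch that assumes head() for the first 15 rows and as_tibble() for the conversion; the object names are only for illustration:

```r
library(tibble)

# A data frame prints every row it is given, so we limit it to 15 rows
iris_head <- head(iris, 15)
print(iris_head)

# A tibble prints only its first 10 rows by default,
# together with its dimensions and the type of each column
iris_tbl <- as_tibble(iris)
print(iris_tbl)
```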
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
When a tibble is printed, numerical columns are labelled double (<dbl>) rather than numeric. In practice, there is no difference: the class of such a column (class(as_tibble(iris)[["Sepal.Length"]])) will still be numeric. More information about column types in tibbles can be found in the {tibble} documentation on column types.
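We can check this ourselves with a short sketch, in which sepal_length is just a helper name for the extracted column:

```r
library(tibble)

# Extract one numerical column from the tibble version of iris
sepal_length <- as_tibble(iris)[["Sepal.Length"]]

class(sepal_length)       # "numeric"
typeof(sepal_length)      # "double", the storage type the tibble header shows
is.numeric(sepal_length)  # TRUE
```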
Next topic
Next, we will take a good look at an important core package of the tidyverse: dplyr.