library(tidyr)
Introduction
Tidy datasets are all alike, but every messy dataset is messy in its own way.
— Hadley Wickham
The Tidyverse revolves around tidy data. Data can be represented in several ways, but to be tidy it must have the following properties:
- Each variable corresponds to exactly one column.
- Each observation corresponds to exactly one row.
- Each value corresponds to exactly one cell.
Since these three rules are related, they can be simplified to two practical recommendations:
- Put each dataset in a tibble.
- Put each variable in a column.
If you are interested in all the details about tidy data, make sure to check out this article.
If a dataset is already tidy, operations using Tidyverse packages are easy and straightforward. If this is not the case, the tidyr package provides useful functions to transform a messy dataset into a tidy one.
Here are some examples for tidy and messy data frames representing the same dataset consisting of four variables country, year, population, and cases. These examples are available in the tidyr package, so let’s activate it first:
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4b
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
All data frames represent exactly the same data, but only table1
is tidy. Let’s determine how each of the messy data frames are not tidy:
table2
contains values of two variables (cases
andpopulation
) in its last column; this violates the rule that each variable must have its own column.table3
combines two variables (cases
andpopulation
) in one column (rate
). In addition, the rate is represented as a character column, which makes it impossible to directly perform numeric calculations.table4a
andtable4b
are spread across two data frames, although they each contain a variable for the same observations.
Here are some examples that illustrate how easy it is to work with table1
:
library(dplyr)
# compute rate per 10000
|>
table1 mutate(rate=cases / population * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
# compute cases per year
|>
table1 count(year, wt=cases)
# A tibble: 2 × 2
year n
<dbl> <dbl>
1 1999 250740
2 2000 296920
# plot changes over time
library(ggplot2)
ggplot(data=table1, mapping=aes(x=year, y=cases, color=country)) +
geom_line() +
geom_point()
Pivoting
There are two main things to consider if a dataset is not tidy. First, it is important to determine what the variables and observations are. This sounds easier as it might be in practice, so sometimes it is necessary to do a fair amount of research before completing this step. Second, we need to make sure that the rules of tidy data apply. This means that each variable should correspond to one column, and each observation should correspond to one row. If this is not the case, we can pivot the data so that it fulfills these requirements. In most cases, we only need two functions from the tidyr package, namely pivot_longer()
and pivot_wider()
.
Longer
A single variable is often spread across two or more columns. This means that column titles contain values of a variable (instead of variable names). For example, table4a
(and table4b
) contains two columns named `1999`
and `2000`
:
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
These column titles are actually two values of the variable year
. The values in these columns are values of the variable cases
, and therefore one row corresponds to two observations. We can tidy this representation by pivoting the two columns into a single column using pivot_longer()
. To do this, we need to specify three pieces of information:
- The columns which contain values instead of names (
`1999`
and`2000`
in this case). - A (new) variable name to move the column names to (we choose
year
in this example). - A (new) variable name to move the values to (here we choose
cases
).
|>
table4a pivot_longer(c(`1999`, `2000`), names_to="year", values_to="cases")
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
Note that because these two columns contain names that would be a syntax error (names in R must not start with a digit), we have to enclose them with backticks. Furthermore, the type of the resulting year
column is character, because this is the default behavior for columns created with the names_to
argument. If we wanted an integer column, we could use the names_transform
argument:
= table4a |>
tidy4a pivot_longer(
c(`1999`, `2000`),
names_to="year",
values_to="cases",
names_transform=list(year=as.integer)
)
The resulting data frame is now tidy. We can perform the same pivoting for table4b
containing data on the population
variable:
= table4b |>
tidy4b pivot_longer(
c(`1999`, `2000`),
names_to="year",
values_to="population",
names_transform=list(year=as.integer)
) tidy4b
# A tibble: 6 × 3
country year population
<chr> <int> <dbl>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
The function is called pivot_longer()
because it generally results in a data frame consisting of more rows and/or fewer columns than the original one (so the data frame gets “longer”).
Finally, we would like to combine tidy4a
and tidy4b
to a single data frame. We will not cover how to do that (you can read about it in the relational data chapter in R for Data Science), but just provide the relevant dplyr function call:
left_join(tidy4a, tidy4b)
# A tibble: 6 × 4
country year cases population
<chr> <int> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Wider
The opposite operation to pivot_longer()
is pivot_wider()
. It is useful in the case where one observation is represented in multiple rows, so our goal is to put the values into separate columns. In general, the resulting data frame is wider because it consists of more columns and/or fewer rows.
Here’s an example of a data frame with multiple rows per observation:
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
To separate rows into columns, we pass two arguments to pivot_wider()
, namely the column containing the names of the variables and the column containing the values:
|>
table2 pivot_wider(names_from=type, values_from=count)
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Exercises
Separating and uniting
TODO