7 – Data wrangling

Introduction to R

Author

Clemens Brunner

Published

December 16, 2024

Reshaping data

There are (at least) two different ways to represent the same data in a table, either in wide format or in long format. The wide format is better suited for calculating descriptive statistics, but the long format is necessary for other tasks. Therefore, it is useful to know how to convert one representation into the other.

A wide data frame has a separate column for each variable. A long data frame has only one column that contains all values, and one or more columns with indicator variables that define the meaning of the values. The following table shows example data in wide format:

Person Age Weight Height
Bob 32 98 188
Al 24 61 176
Sue 64 87 174

There are three value columns (Age, Weight, and Height), as well as a column that identifies the person (Person). The same data in long format looks like this (note that there is only one value column called Value):

Person Variable Value
Bob Age 32
Bob Weight 98
Bob Height 188
Al Age 24
Al Weight 61
Al Height 176
Sue Age 64
Sue Weight 87
Sue Height 174

The tidyr package in R allows you to switch between these two formats, i.e., if the data is in one format, you can relatively easily produce the other format. First, let’s create a data frame with data from the previous example in wide format as follows:

library(tibble)

(df = tibble(
    Person=c("Bob", "Al", "Sue"),
    Age=c(32, 24, 64),
    Weight=c(98, 61, 87),
    Height=c(188, 176, 174)
))
# A tibble: 3 × 4
  Person   Age Weight Height
  <chr>  <dbl>  <dbl>  <dbl>
1 Bob       32     98    188
2 Al        24     61    176
3 Sue       64     87    174

The tidyr package (part of the Tidyverse) includes the functions pivot_longer() and pivot_wider(), which can convert data frames from wide to long and vice versa. Our data frame df (which is in wide format) can be converted to long format as follows:

library(tidyr)

(df_long = pivot_longer(
    df,
    Age:Height,
    names_to="Variable",
    values_to="Value"
))
# A tibble: 9 × 3
  Person Variable Value
  <chr>  <chr>    <dbl>
1 Bob    Age         32
2 Bob    Weight      98
3 Bob    Height     188
4 Al     Age         24
5 Al     Weight      61
6 Al     Height     176
7 Sue    Age         64
8 Sue    Weight      87
9 Sue    Height     174

Here, we first pass the wide-format data (df), followed by a selection of the (value) columns that we want to convert to long format – in our example, these are the columns from Age to Height. For this column selection, we can write the column names without quotes, and the colon works like in number sequences (it creates a range from Age to Height). We could also use the column indices here, i.e., 2:4 instead of Age:Height. Finally, we specify the desired name of the indicator column with names_to (in our example, this should be Variable) and the name of the new value column with values_to (Value). The resulting data frame df_long is now in long format.

Conversely, the pivot_wider() function can split a (value) column into multiple columns, i.e., convert from long to wide format:

(df_wide = pivot_wider(
    df_long,
    id_cols=Person,
    names_from=Variable,
    values_from=Value
))
# A tibble: 3 × 4
  Person   Age Weight Height
  <chr>  <dbl>  <dbl>  <dbl>
1 Bob       32     98    188
2 Al        24     61    176
3 Sue       64     87    174

Here, we first specify the long-format data (df_long). Then we provide the column names that identify the individual cases (in our example, this is only one column named Person) with id_cols. Next, we specify the column that contains the variable names (i.e., the indicator column Variable) with names_from. Finally, we define the column that contains the values (Value) with values_from.

These two functions can convert much more complex structures into the respective formats – all of this is documented in the help, including many examples that demonstrate the usage.

Note

The Tidyverse is based on a representation of data called “tidy.” Essentially, this corresponds to the wide format, where each variable has its own column and each data point has its own row (more details can be found in this article).

Filtering rows with subset()

Often, it is desirable to use only a subset of rows from an existing data frame. For example, it might be necessary to separately analyze male and female participants to address a specific research question. This information is encoded the sex column with values m and f in the following example:

(df = tibble(
    name=c("Bob", "Al", "Sue", "John", "Mary", "Ann"),
    age=c(32, 24, 64, 44, 21, 75),
    weight=c(98, 61, 87, 82, 73, 66),
    height=c(188, 176, 174, 182, 181, 159),
    sex=factor(c("m", "m", "f", "m", "f", "f"))
))
# A tibble: 6 × 5
  name    age weight height sex  
  <chr> <dbl>  <dbl>  <dbl> <fct>
1 Bob      32     98    188 m    
2 Al       24     61    176 m    
3 Sue      64     87    174 f    
4 John     44     82    182 m    
5 Mary     21     73    181 f    
6 Ann      75     66    159 f    

We have already learned that we can filter specific rows by indexing. With the following command, we create a new data frame consisting of all female participants:

df[df$sex == "f",]
# A tibble: 3 × 5
  name    age weight height sex  
  <chr> <dbl>  <dbl>  <dbl> <fct>
1 Sue      64     87    174 f    
2 Mary     21     73    181 f    
3 Ann      75     66    159 f    

This notation is relatively cumbersome for data frames, especially because the underlying data frame df has to be written twice. Fortunately, there is a more intuitive alternative using the subset() function. This function can create subsets of existing vectors or data frames. In the case of a data frame, this means that you can select a subset of rows and/or columns.

Let’s start with selecting rows (this operation is also called “filtering”). You pass the original data frame as the first argument to the subset() function. The second argument (called subset, not to be confused with the function name) then determines which rows should be selected (filtered). The example above can be rewritten as follows:

subset(df, sex == "f")
# A tibble: 3 × 5
  name    age weight height sex  
  <chr> <dbl>  <dbl>  <dbl> <fct>
1 Sue      64     87    174 f    
2 Mary     21     73    181 f    
3 Ann      75     66    159 f    

Note that you can directly mention the column name without having to prepend df$. The result is the same as in the previous example.

You can also perform more complex filtering operations by passing a more complex comparison to the subset argument, e.g., by combining multiple comparisons:

subset(df, age > 40 & weight <= 73)
# A tibble: 1 × 5
  name    age weight height sex  
  <chr> <dbl>  <dbl>  <dbl> <fct>
1 Ann      75     66    159 f    

Here you can use all kinds of comparison or logical operators (& and, | or, xor() exclusive or).

Selecting columns with subset()

Sometimes it is useful to use only specific columns from a data frame. Classically, this can be solved again by indexing. The following example creates a subset consisting of the columns name, age, and sex:

df[, c("name", "age", "sex")]
# A tibble: 6 × 3
  name    age sex  
  <chr> <dbl> <fct>
1 Bob      32 m    
2 Al       24 m    
3 Sue      64 f    
4 John     44 m    
5 Mary     21 f    
6 Ann      75 f    

Alternatively, you can use the subset() function with the third argument (called select):

subset(df, select=c(name, age, sex))
# A tibble: 6 × 3
  name    age sex  
  <chr> <dbl> <fct>
1 Bob      32 m    
2 Al       24 m    
3 Sue      64 f    
4 John     44 m    
5 Mary     21 f    
6 Ann      75 f    

Note that column names can also be specified without quotes. It is even possible to address a range with a : as follows (as we have already seen with pivot_longer() or pivot_wider()):

subset(df, select=name:weight)
# A tibble: 6 × 3
  name    age weight
  <chr> <dbl>  <dbl>
1 Bob      32     98
2 Al       24     61
3 Sue      64     87
4 John     44     82
5 Mary     21     73
6 Ann      75     66

This command selects all columns from name to weight. Of course, you can also combine row and column selections in a single command:

subset(df, subset=age > 30, select=c(name, age, sex))
# A tibble: 4 × 3
  name    age sex  
  <chr> <dbl> <fct>
1 Bob      32 m    
2 Sue      64 f    
3 John     44 m    
4 Ann      75 f    

Creating new columns with transform()

Another important task in data analysis is to add new columns to an existing data frame. If the values of these new columns are based on existing columns, this is called a transformation. Consider the airquality data frame available in R:

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Temperature values in the Temp column are given in Fahrenheit. We can now add a new column called celsius by transforming the existing Temp column as follows:

aq = transform(airquality, celsius=(Temp - 32) * (5/9))
head(aq)
  Ozone Solar.R Wind Temp Month Day  celsius
1    41     190  7.4   67     5   1 19.44444
2    36     118  8.0   72     5   2 22.22222
3    12     149 12.6   74     5   3 23.33333
4    18     313 11.5   62     5   4 16.66667
5    NA      NA 14.3   56     5   5 13.33333
6    28      NA 14.9   66     5   6 18.88889

Note that we can reuse existing columns directly in the transformation (in the example Temp) without having to use quotes or airquality$. Also, the function always creates a new data frame; you could even assign it to the original name if you want (so airquality in the example above instead of aq).

Important

The result of the transform() function is a data frame, even if you use a tibble as the source data frame. If the result should be a tibble, you must explicitly convert the return value of transform() to a tibble with tibble::as_tibble().

The pipe operator |>

R gained a new operator with version 4.1, the so-called pipe operator, written as |>. The principle behind it is as simple as it is ingenious. Although it does not add any new functionality (i.e., all operations can also be implemented without the pipe operator), many operations become easier and/or more intuitive.

In principle, you can write a function call f(x) as x |> f(). This is a pure syntactic alternative, i.e., both variants do exactly the same thing. For example:

x = 1:10
mean(x)  # classic
[1] 5.5
x |> mean()  # pipe
[1] 5.5
Tip

The pipe operator pulls the first argument in front of the function call.

In this example, this makes little sense, because mean(x) is much shorter and easier to read. However, it gets more interesting when you want to use the result of a function call directly as an argument (input) for another function call. Classically, you would write this as:

g(f(x))

In this case, it is more difficult to see in which order the calculation is actually performed: first f(x) is calculated, and its result is passed as an argument to the function g(). R performs the calculations from the inside out. With the pipe operator, the same operation would look like this:

x |> f() |> g()

This makes the order of operations immediately clear: x is first passed to the function f(), and the result of that is passed to the function g(). Therefore, the pipe operator is particularly useful when you have a sequence of operations that you want to perform one after the other. The following example first calculates the mean of a vector x and then the logarithm of this mean:

log(mean(x))
[1] 1.704748
x |> mean() |> log()
[1] 1.704748

Using the pipe operator is usually more intuitive. Moreover, it becomes even clearer if each step in the pipeline is written on a separate line (this is possible in R with all commands in principle):

x |>
    mean() |>
    log()
[1] 1.704748

Data wrangling

Combining everything we have learned in this chapter (namely subset(), transform(), and |>), we get a set of tools that allows for very intuitive data frame transformations (so-called data wrangling). Let’s look at the airquality data set again. Suppose we want to investigate the temperatures (in °C) in the month of July. We could use the following pipeline:

library(tibble)

airquality |>
    transform(celsius=(Temp - 32) * (5/9)) |>
    subset(Month == 7) |>
    subset(select=-c(Month, Day)) |>
    as_tibble()
# A tibble: 31 × 5
   Ozone Solar.R  Wind  Temp celsius
   <int>   <int> <dbl> <int>   <dbl>
 1   135     269   4.1    84    28.9
 2    49     248   9.2    85    29.4
 3    32     236   9.2    81    27.2
 4    NA     101  10.9    84    28.9
 5    64     175   4.6    83    28.3
 6    40     314  10.9    83    28.3
 7    77     276   5.1    88    31.1
 8    97     267   6.3    92    33.3
 9    97     272   5.7    92    33.3
10    85     175   7.4    89    31.7
# ℹ 21 more rows
Note

In this example, we have converted the final result into a tibble with as_tibble(). Otherwise, we would have ended up with a data frame because we used the transform() function.

To continue working with this data frame, you should assign the result to a (new) variable, e.g.:

aq = airquality |>
    transform(celsius=(Temp - 32) * (5/9)) |>
    subset(Month == 7) |>
    subset(select=-c(Month, Day))

mean(aq$celsius)  # mean temperature (in °C) in July
[1] 28.83513

Finally, it should be noted that the Tidyverse offers many more functions that can be elegantly combined with the pipe operator. In particular, grouped summary statistics (e.g., the average temperatures for all months) become similarly intuitive as in the examples shown here. With base R, this is also possible, but not as elegant and concise.

Exercises

  1. Install and activate the tidyr package. It contains the data set table2. Create a new data frame from this data set, which contains the values in the columns type and count in two columns cases and population.

  2. The tidyr package also contains a data set table4a. Combine the two columns 1999 and 2000 into a value column named count and an indicator column named year.

    Note

    The two column names must be enclosed in backticks (`), because names starting with numbers would otherwise be interpreted as numbers by R, i.e., `1999` and `2000`.

  3. Create a new data frame called mtcars1 from the existing data frame mtcars, which only contains rows where the column mpg has values greater than 25. How many rows and columns does mtcars have, and how many rows and columns does mtcars1 have? Use the subset() function!

  4. Install and activate the nycflights13 package. We use the flights data set from this package in this exercise. First, read the documentation for flights. Then perform the following tasks (using the pipe operator and the subset() or transform() functions):

    • Create a new data frame that contains all flights on January 1, 2013. Only the columns year, month, day, dep_time, arr_time, and tailnum should be retained. How many flights are there?
    • Create two new columns hours (the flight time in hours; the column air_time contains the flight time in minutes) and speed (the flight speed in km/h; the column distance contains the distance traveled in miles, so an additional column km would be helpful). Output the new data frame with only the columns month, day, carrier, tailnum, and speed!
    • Create a new data frame that contains all early flights (flights that departed before 6:00).
    • Create a new data frame that contains only those flights that were faster than planned (i.e., where the delay at arrival was smaller than at departure).