Introduction

Data wrangling in R using the Tidyverse

Author

Clemens Brunner

Published

March 25, 2024

Tidyverse

The Tidyverse is a collection of R packages designed to simplify data analysis and visualization. Their fundamental assumption is that data is represented in the so-called “tidy” format. This basically means that it can be represented by a table, in which a row corresponds to an observation and a column corresponds to a variable (more details are available in this article).

A typical data science workflow can be visualized as follows:

A typical data science workflow (adapted from here).

The Tidyverse offers packages that cover all aspects of this workflow. In this workshop, we will focus on data wrangling (the green box). In addition, we will also discuss importing and visualizing data. Modeling data and communicating results are outside the scope of this workshop.

Prerequisites

To get started with the Tidyverse, we need to install both R and RStudio. Although RStudio is technically not required, I strongly recommend it, because it makes working with R much more pleasant.

Once you have these programs installed on your computer, start RStudio and install the tidyverse package, which is basically a meta-package consisting of all packages that are part of the Tidyverse. We will also need the nycflights13 and palmerpenguins packages, because they provide nice datasets that we are going to explore.

We will now go through some basic R commands and workflows. Ideally, you should already be familiar with most of these topics. If not, this hopefully serves as a quick tour and should get you up to speed.

Note

Throughout the course material, we show R code in gray boxes and corresponding output/results as follows:

mean(c(1, 2, 3))
[1] 2

You can copy code from a gray box and paste it into the R console.

This workshop is based on selected chapters from the book “R for Data Science” by Hadley Wickham and Garrett Grolemund.

R for Data Science (freely available here).

R basics

Packages

A package contains additional functions and/or datasets that extend the capabilities of R. We install a package with the install.packages() function. In this workshop, we are going to use the tidyverse, nycflights13, and palmerpenguins packages. To install them, run the following commands in the R console (note that package names must be surrounded by single or double quotes):

install.packages("tidyverse")
install.packages("nycflights13")
install.packages("palmerpenguins")

Alternatively, you can use the “Packages” pane in RStudio (bottom right pane in the default layout) to install/update/uninstall R packages.
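
To check from the console whether the packages are already installed, something along the following lines should work. This is only a sketch; the variable names pkgs and installed are chosen for illustration.

```r
# Package names used in this workshop
pkgs = c("tidyverse", "nycflights13", "palmerpenguins")

# installed.packages() returns a matrix with one row per installed package;
# its row names are the package names
installed = pkgs %in% rownames(installed.packages())
names(installed) = pkgs
installed  # TRUE for every package that is already installed
```

Any package that shows FALSE here still needs to be installed with install.packages().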

Once installed, we need to activate a package with the library() function in each R session. If we don’t activate a package, we do not have (direct) access to the objects it contains. Here’s how we activate the packages that we’ve just installed:

library(tidyverse)
library(nycflights13)
library(palmerpenguins)

Tip

We can use functions without activating their corresponding package by prefixing the function name with the package name followed by two colons. For example, nycflights13::flights accesses the flights data frame contained in the nycflights13 package without having to activate the package first. In contrast, library(nycflights13) enables us to access this data frame with flights directly (together with all other objects contained in this package).
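
The same syntax works for any installed package. The following sketch uses the stats and utils packages, which ship with R, so it runs without installing anything. Both packages are attached by default, so the prefix is redundant here, but the syntax is identical for packages that have not been activated.

```r
# stats and utils ship with R, so no installation is needed for this sketch
stats::median(c(3, 1, 2))  # call a function via package::name, returns 2
utils::head(letters, 3)    # letters is a built-in vector, returns "a" "b" "c"
```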

Working directory

When R runs commands, it resolves all relative file paths against the so-called working directory. Unless you specify otherwise, R expects all data files that you want to import to be located in (or relative to) this directory. The working directory can be any directory on your computer, and there are several options to change it.

The function getwd() returns the current working directory. The subtitle of the “Console” pane in RStudio (bottom left in the default layout) also shows the working directory.

The function setwd() sets the working directory to the folder passed as an argument, for example setwd("C:/Users/myuser/R") or setwd("~/R") (the tilde symbol is an abbreviation for the current user’s home directory; note that on Windows, R typically expands it to the user’s Documents folder, so the two calls are not necessarily equivalent there). Alternative methods to change the working directory with RStudio include the “Session” – “Set Working Directory” menu and the “More” icon in the “Files” pane (bottom right). Also, if you double-click an R script in Windows Explorer or macOS Finder, RStudio will open it and automatically set the working directory to the corresponding file location.
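
For interactive use in the console, getwd() and setwd() can be combined as in the following sketch. It switches to the temporary directory of the current R session (via tempdir()) and back again; the variable name old is chosen for illustration.

```r
old = getwd()      # remember the current working directory
setwd(tempdir())   # switch to the temporary directory of this R session
getwd()            # now reports the temporary directory
setwd(old)         # switch back to where we started
getwd()            # same directory as before
```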

Note

Paths should be separated with a forward slash / even on Windows (which normally uses a backslash \). If you really want to use backslashes, you need to enter double backslashes, for example:

setwd("C:\\Users\\myuser\\R")

Important

Never ever set the working directory in a script! Instead, always refer to files with relative paths (relative to the location of the script). This makes the script portable across different computers, because when you use setwd(), it is very unlikely that another person has the exact same directory that you are trying to set.
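
As a minimal sketch of what a relative path looks like, assume a project in which the script sits next to a folder called data containing a file measurements.csv (both names are hypothetical). The base R function file.path() builds the path with forward slashes on every platform:

```r
# "data" and "measurements.csv" are hypothetical names used for illustration
data_file = file.path("data", "measurements.csv")
data_file               # "data/measurements.csv"
file.exists(data_file)  # TRUE only if the file exists relative to the working directory
```

Because the path contains no user-specific directory, it works unchanged on any computer where the project folder has the same layout.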

R code

Typically, we enter R commands in the console. A prompt symbol > indicates that R is ready for our input (note that we do not show the prompt symbol in the gray code boxes). Once we hit Enter, R will immediately evaluate what we just typed and print the result. This workflow is called REPL (read-eval-print loop), and it is a convenient way to interactively work with R and try out new things. Here’s an example of some commands with their outputs (try running these commands in your console):

1 + 9  # this is a comment
[1] 10
x = 1:10  # the value of x is not printed in an assignment
sum(x)  # function call
[1] 55

Notice that when creating a new object with the assignment operator = (or equivalently <-), R does not automatically print its value. However, it is often useful to assign a new object and then immediately inspect its value. We could do this with two lines of code:

x = 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10

A more convenient way is to enclose the assignment in parentheses, which will both create the object and print its value:

(x = 1:10)
 [1]  1  2  3  4  5  6  7  8  9 10

Whenever R prints a vector, it automatically adds the index of the first element of each line in square brackets to the output. We only saw [1] in the previous outputs, because the values fitted into one line. If you print a longer vector, each line will show the index of its first element:

(y = 1:100)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

The console is nice for trying out commands, fixing errors, and playing around with code. However, if we want to store a sequence of R commands for later use, we can put them into so-called R scripts. An R script is a plain-text file (ending in .R) containing R commands, usually one command per line. RStudio includes an editor (top left pane in the default layout), which can be used to write, edit, and run (parts of) a script.

Importantly, a data analysis project stored as an R script can be run over and over again. This means if another person wants to reproduce your analysis, all you need to do is share your script and data files. The person then runs your entire script, for example by clicking on the “Source” icon, which fully reproduces the entire analysis and all results.
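
Running a whole script is also possible from the console with the base R function source(). The following sketch writes a tiny two-line script to a temporary file (so it does not touch any of your own files) and then runs it; the names script and total are chosen for illustration.

```r
script = tempfile(fileext=".R")                      # temporary file ending in .R
writeLines(c("x = 1:10", "total = sum(x)"), script)  # a tiny two-line script
source(script)                                       # run all commands in the script
total                                                # 55, created by the script
```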

RStudio keyboard shortcuts

RStudio includes many useful keyboard shortcuts. It really pays off to remember some of them, because your workflow will become faster and more efficient. An overview of all keyboard shortcuts is available in the “Help” – “Keyboard Shortcuts Help” menu item.

Here are four important shortcuts that I think everyone should know:

  • The Up and Down arrow keys access the command history in the console. You can also edit any command before running it again.
  • If you are searching for a previously entered command starting with specific characters, enter the characters in the console and press Cmd+Up on macOS or Ctrl+Up on Windows and Linux.
  • Hitting Cmd+Enter (macOS) or Ctrl+Enter (Windows and Linux) in the editor runs the command under the cursor (or the selected commands) in the console.
  • The shortcut for the pipe operator |> (more on that later) is Cmd+Shift+M (macOS) or Ctrl+Shift+M (Windows and Linux).
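
As a brief preview of the pipe operator, here is a minimal sketch, assuming R 4.1 or later (the version that introduced |>). The pipe passes the value on its left as the first argument to the function call on its right:

```r
# Requires R 4.1 or later, where the native pipe |> was introduced
n = c(4, 5, 6) |> length()                # same as length(c(4, 5, 6))
n                                         # 3
m = c(25, 50, 100) |> mean() |> round(1)  # same as round(mean(c(25, 50, 100)), 1)
m                                         # 58.3
```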

Vectors

The most basic (atomic) data type in R is a vector. A vector is a collection of objects which all have the same type. Even a scalar number like 1 is really a vector in R. The c() function creates vectors consisting of two or more elements (c is short for “concatenate”). The length() function returns the number of elements in a vector.
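
Because all elements must have the same type, c() converts its arguments to a common type when they differ. The following sketch illustrates this behavior:

```r
length(1)         # 1: a single number is a vector of length one
c(1, 2.5, TRUE)   # logical TRUE is converted to the number 1
class(c(1, "a"))  # "character": the number 1 becomes the string "1"
```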

Important data types include numeric vectors, character vectors, logical vectors, factors, and datetime vectors. We can use the class() function to determine the type of a given object. Here are some examples:

x = 1
class(x)
[1] "numeric"
length(x)
[1] 1
y = c(4, 5.6, -7)
class(y)
[1] "numeric"
length(y)
[1] 3
c("Hello", "world!")  # character
[1] "Hello"  "world!"
c(TRUE, FALSE)  # logical
[1]  TRUE FALSE
y > 4  # a comparison evaluates to a logical vector
[1] FALSE  TRUE FALSE
factor(c("A", "A", "B", "A", "C", "C", "A", "B"))  # factor with three levels
[1] A A B A C C A B
Levels: A B C
as.Date(c("17.3.2020", "22.5.2020", "3.3.2021"), format="%d.%m.%Y")  # datetime
[1] "2020-03-17" "2020-05-22" "2021-03-03"

We can access individual elements of a vector using square brackets containing the indexes of all elements we would like to access:

(x = 11:20)
 [1] 11 12 13 14 15 16 17 18 19 20
x[5]  # fifth element
[1] 15
x[c(7, 1, 4)]  # elements with index 7, 1, and 4
[1] 17 11 14
x[x >= 15]  # all elements >= 15
[1] 15 16 17 18 19 20
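
The last example works because the comparison inside the brackets first produces a logical vector, which then selects the elements where it is TRUE. Here is a short sketch of this and a few other indexing variants (ranges and negative indexes):

```r
x = 11:20
x >= 15         # logical vector: FALSE for 11 to 14, TRUE for 15 to 20
x[2:4]          # elements 2 to 4: 12 13 14
x[-1]           # all elements except the first
x[x %% 2 == 0]  # all even elements: 12 14 16 18 20
sum(x >= 15)    # number of elements >= 15 (each TRUE counts as 1): 6
```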

Data frames

A data frame is a list of vectors (with identical lengths). These vectors correspond to the columns of the data frame. In other words, it represents a table consisting of rows and columns for storing rectangular data.

(df = data.frame(x=1:4, y=c(6, -9.5, 166, 8.8), z=c("A", "X", "X", "B")))
  x     y z
1 1   6.0 A
2 2  -9.5 X
3 3 166.0 X
4 4   8.8 B
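
Since a data frame is a list of column vectors, each column can be extracted as an ordinary vector. The following sketch recreates the data frame from above and shows two common ways to do that, assuming R 4.0 or later (where character columns are not converted to factors by default):

```r
df = data.frame(x=1:4, y=c(6, -9.5, 166, 8.8), z=c("A", "X", "X", "B"))
df$y        # column y as a numeric vector
df[["z"]]   # column z as a character vector
length(df)  # 3: the underlying list has one element per column
nrow(df)    # 4
ncol(df)    # 3
```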

The Tidyverse package tibble provides an improved data frame type called tibble. A tibble is a drop-in replacement for a data frame, so we can use tibbles (almost) everywhere data frames are expected.

tibble::tibble(x=1:5, y=c(6, -9.5, 166, 8.8, 0.112), z=c("A", "X", "X", "B", "A"))
# A tibble: 5 × 3
      x       y z    
  <int>   <dbl> <chr>
1     1   6     A    
2     2  -9.5   X    
3     3 166     X    
4     4   8.8   B    
5     5   0.112 A    

Note how data frames and tibbles print differently in the previous examples. Tibbles are more readable and include their dimensions (# A tibble: 5 × 3) as well as column data types (<int>, <dbl>, and <chr>, which are short for integer, double, and character). The str() function shows a convenient summary of the structure of a given data frame, which also contains the column data types:

str(df)
'data.frame':   4 obs. of  3 variables:
 $ x: int  1 2 3 4
 $ y: num  6 -9.5 166 8.8
 $ z: chr  "A" "X" "X" "B"

The glimpse() function (part of the dplyr package) provides a similar summary and works better with tibbles:

dplyr::glimpse(df)
Rows: 4
Columns: 3
$ x <int> 1, 2, 3, 4
$ y <dbl> 6.0, -9.5, 166.0, 8.8
$ z <chr> "A", "X", "X", "B"

RStudio offers a nice integrated data frame viewer in the form of the View() function, which visualizes any data frame or tibble in a spreadsheet-like table. For example, the previously created data frame df can be viewed with View(df) (note that the spreadsheet is read-only).

Functions

A function performs some pre-defined computations with (optional) input arguments and (optionally) returns a result. We routinely call functions that have been defined elsewhere, for example the c(), class(), and length() functions. A pair of parentheses () after a function name indicates that we are calling that function. We can also define our own functions, but this is outside the scope of this workshop.

Here are some examples for function calls:

c(1, 2, 3)  # 3 arguments
[1] 1 2 3
class("A")  # 1 argument
[1] "character"
length(c(4, 5, 6))  # two (nested) function calls
[1] 3

The last example shows two nested function calls. First, we call the c() function with three arguments, which we directly use as an argument in the length() function call. R tries to reduce all expressions to a single value, so it works its way from the innermost layer to the outermost one. Therefore, a nested function call is really two function calls in the following order:

(tmp = c(4, 5, 6))
[1] 4 5 6
length(tmp)
[1] 3
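
Arguments can be passed by position or by name. As a small sketch, the base R function seq() accepts the arguments from, to, and by, so the following three calls are equivalent:

```r
seq(1, 10, 3)             # arguments matched by position: 1 4 7 10
seq(from=1, to=10, by=3)  # same call with named arguments
seq(by=3, to=10, from=1)  # named arguments can appear in any order
```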

Missing values

R represents missing values as NA (“not available”). Missing values are contagious, which means that calculations involving missing values will generally result in NA. This makes sense if you think of a missing value as “unknown” (we don’t know what the value is). Here are some examples:

NA + 1
[1] NA
NA > 0
[1] NA
1 == NA
[1] NA
NA / 2
[1] NA

Even comparing NA with NA is again NA:

NA == NA
[1] NA

Let’s compute the mean of some numbers involving one missing value:

(x = c(25, 50, NA, 100))
[1]  25  50  NA 100
mean(x)
[1] NA

The mean is also unknown, because we cannot compute it due to the presence of a missing value. However, almost all aggregation functions support the na.rm argument, which by default (FALSE) does not remove missing values. If we set na.rm=TRUE, all missing values are removed before the actual value is computed:

mean(x, na.rm=TRUE)
[1] 58.33333
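
The na.rm argument works the same way for other aggregation functions. Here is a short sketch with sum(), max(), and median(), using the same vector as above:

```r
x = c(25, 50, NA, 100)
sum(x)                 # NA, because one value is missing
sum(x, na.rm=TRUE)     # 175
max(x, na.rm=TRUE)     # 100
median(x, na.rm=TRUE)  # 50
```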

Important

Trying to find missing values by comparing with NA does not work as expected:

x == NA
[1] NA NA NA NA

Instead, we have to use the is.na() function:

is.na(x)
[1] FALSE FALSE  TRUE FALSE
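
The logical vector returned by is.na() can be used to count, locate, or drop missing values, as in this sketch:

```r
x = c(25, 50, NA, 100)
sum(is.na(x))    # number of missing values: 1
which(is.na(x))  # index of the missing value: 3
x[!is.na(x)]     # all non-missing values: 25 50 100
```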

Help

You can view the documentation for any object by prefixing a ? to the object name and hitting Enter. For example, ?length shows the documentation for the length() function. You can also press F1 to display help for the object at the current cursor location.

Exercises

  1. Install the tidyverse, nycflights13, and palmerpenguins packages. After that, check if you have the packages readr, dplyr, ggplot2, and tidyr installed.
  2. What is your current working directory? Create a new directory called tidyverse-workshop in your home directory (use Windows Explorer, macOS Finder or the “Files” pane in RStudio to navigate and create the directory). Then set the working directory to this folder. Check again if the current working directory now points to the correct location.
  3. Compute the areas of circles with radii 5, 7, 19, and \(\pi^{-\frac{1}{2}}\). Put all radii into a vector r and compute the corresponding areas with one command!
  4. Create a vector with 100 random numbers drawn from a standard normal distribution using the function rnorm(). Then extract all positive numbers from this random vector – how many elements are positive?
  5. How many rows and columns does the built-in mtcars data frame have? What are the column data types? What does the drat column represent?