Introduction

Tidyverse

The Tidyverse is a collection of R packages designed to simplify and enhance data analysis and visualization. At its core lies the principle of tidy data – a standardized format where each variable forms a column, each observation forms a row, and each type of observational unit forms a table. This structure makes data easier to manipulate, analyze, and visualize consistently across the Tidyverse ecosystem.

A typical data science workflow can be visualized as follows:

A typical data science workflow (adapted from here).

The Tidyverse offers packages that cover all aspects of this workflow. In this course, we will focus on data wrangling (the green box). In addition, we will also discuss importing and visualizing data. Modeling data and communicating results are outside the scope of this workshop.

Prerequisites

To get started with the Tidyverse, we need to install both R and RStudio. Although RStudio is not strictly necessary, I highly recommend it, because it makes working with R much more pleasant.

We will now go through some basic R commands and workflows. Ideally, you should already be familiar with most of these topics. If not, this hopefully serves as a quick tour and should get you up to speed.

Note

Throughout the course material, we show R code in gray boxes and corresponding output/results as follows:

mean(c(1, 2, 3))

[1] 2

You can copy code from a gray box and paste it into the R console.

This workshop is based on selected chapters from the book “R for Data Science” by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.

R for Data Science (freely available here).

R basics

Packages

A package contains additional functions and/or datasets that extend the capabilities of R. We install a package with the install.packages() function. In this workshop, we are going to use the tidyverse, nycflights13, and palmerpenguins packages. To install them, run the following commands in the R console (note that package names must be surrounded by single or double quotes):

install.packages("tidyverse")
install.packages("nycflights13")
install.packages("palmerpenguins")

Alternatively, you can use the Packages pane in RStudio (bottom right pane in the default layout) to install/update/uninstall R packages.

Once installed, we need to activate a package with the library() function in each R session. If we don’t activate a package, we do not have (direct) access to the objects it contains. Here’s how we activate the packages that we’ve just installed for the current session:

library(tidyverse)
library(nycflights13)
library(palmerpenguins)

Tip

We can use functions without activating their corresponding package by prefixing the function name with the package name followed by two colons. For example, nycflights13::flights accesses the flights data frame contained in the nycflights13 package without having to activate the package first. In contrast, library(nycflights13) enables us to access this data frame with flights directly (together will all other objects contained in this package).

Working directory

When R runs commands, it performs all computations in the so-called working directory. R expects all data files that you want to import in (or relative to) this directory (if not otherwise specified). The working directory can be any directory on your computer, and there are several options to change it.

The function getwd() returns the current working directory. The subtitle of the Console pane in RStudio (bottom left in the default layout) also shows the working directory.

The function setwd() sets the working directory to the folder passed as an argument. For example, we can write setwd("C:/Users/myuser/R"), or use setwd("~/R"), where the tilde symbol is an abbreviation for the current user’s home directory. In RStudio, we can also change the working directory through the Session – Set Working Directory menu, or by clicking the More button in the Files pane, which is usually in the bottom right. Another handy option is to simply double-click an R script in Windows Explorer or macOS Finder – RStudio will open it and automatically set the working directory to the folder where the script is located.

Note

Paths should be separated with a forward slash / even on Windows (which normally uses a backslash \). If you really want to use backslashes, you need to enter double backslashes, for example:

setwd("C:\\Users\\myuser\\R")

Important

Never ever set the working directory in a script! Instead, always refer to files with relative paths (relative to the location of the script). This makes the script portable across different computers, because when you use setwd(), it is very unlikely that another person has the exact same directory that you are trying to set.

R code

Typically, we enter R commands in the console. A prompt symbol > indicates that R is ready to receive input (note that we do not include the prompt symbol in the course material). Once we hit ⏎, R will immediately evaluate what we just typed and print the result. This workflow is called REPL (read-eval-print loop), and it is a convenient way to interactively work with R and try out new things. Here’s an example of some commands with their outputs (try running these commands in your console):

1 + 9  # this is a comment

[1] 10

x = 1:10  # the value of x is not printed in an assignment
sum(x)  # function call

[1] 55

Notice that when creating a new object with the assignment operator = (or equivalently <-), R does not automatically print its value. However, it is often useful to assign a new object and then immediately inspect its value. We could do this with two lines of code:

x = 1:10
x

 [1]  1  2  3  4  5  6  7  8  9 10

A more convenient way is to enclose the assignment in parentheses, which will both create the object and print its value:

(x = 1:10)

 [1]  1  2  3  4  5  6  7  8  9 10

Whenever R prints a vector, it automatically adds the index of the first element of each line in square brackets to the output. We only saw [1] in the previous outputs, because all values fitted into a single line. If you print a longer vector, each line will show the index of its first element:

(y = 1:100)

  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

The console is nice for trying out commands, fixing errors, and playing around with code. However, if we want to store a sequence of R commands for later use, we can put them into so-called R scripts. An R script is a plain-text file (ending in .R) containing R commands, usually one command per line. RStudio includes an editor (top left pane in the default layout), which can be used to write, edit, and run (parts of) a script.

Importantly, a data analysis project stored as an R script can be run over and over again. This means if another person wants to reproduce your analysis, all you need to do is share your script and data files. The person then runs your entire script, for example by clicking on the “Source” icon, which fully reproduces the entire analysis and all results.

RStudio keyboard shortcuts

RStudio includes many useful keyboard shortcuts. It really pays off to remember some of them, because your workflow will become faster and more efficient. An overview of all keyboard shortcuts is available in the “Help” – “Keyboard Shortcuts Help” menu item.

Here are four important shortcuts that I think everyone should know:

The ↑ and ↓ arrow keys access the command history in the console. You can also edit any command before running it again.
If you are searching for a previously entered command starting with specific characters, enter the characters in the console and press ⌘↑ on macOS or Ctrl↑ on Windows and Linux.
Hitting ⌘⏎ (macOS) or Ctrl⏎ (Windows and Linux) in the editor runs the command under the cursor (or the selected commands) in the console.
The shortcut for the pipe operator |> (more on that later) is ⌘⇧M (macOS) or Ctrl⇧M (Windows and Linux).

Vectors

The most basic (atomic) data type in R is a vector. A vector is a collection of objects which all have the same type. Even a scalar number like 1 is represented by a vector in R. The c() function creates vectors consisting of two or more items (c is short for “concatenate”). The length() function returns the number of items in a vector.

Important data types include numeric vectors, character vectors, logical vectors, factors, and date vectors. We can use the class() function to determine the type of a given object. Here are some examples:

x = 1
class(x)

[1] "numeric"

length(x)

[1] 1

y = c(4, 5.6, -7)
class(y)

[1] "numeric"

length(y)

[1] 3

c("Hello", "world!")  # character

[1] "Hello"  "world!"

c(TRUE, FALSE)  # logical

[1]  TRUE FALSE

y > 4  # a comparison evaluates to a logical vector

[1] FALSE  TRUE FALSE

factor(c("A", "A", "B", "A", "C", "C", "A", "B"))  # factor with three levels

[1] A A B A C C A B
Levels: A B C

as.Date(c("17.3.2020", "22.5.2020", "3.3.2021"), format="%d.%m.%Y")  # date

[1] "2020-03-17" "2020-05-22" "2021-03-03"

We can access individual items of a vector using square brackets containing the indexes of all items we would like to access:

(x = 11:20)

 [1] 11 12 13 14 15 16 17 18 19 20

x[5]  # fifth item

[1] 15

x[c(7, 1, 4)]  # items with index 7, 1, and 4

[1] 17 11 14

x[x >= 15]  # all items >= 5

[1] 15 16 17 18 19 20

Data frames

A data frame is a data structure for storing tabular data in R. It consists of one or more columns, where each column is a vector, and all columns must have the same number of rows.

(df = data.frame(x=1:4, y=c(6, -9.5, 166, 8.8), z=c("A", "X", "X", "B")))

  x     y z
1 1   6.0 A
2 2  -9.5 X
3 3 166.0 X
4 4   8.8 B

The tibble package provides an improved data frame type called tibble. A tibble is a drop-in replacement for a data frame, so we can use tibbles (almost) everywhere data frames are expected.

tibble::tibble(x=1:5, y=c(6, -9.5, 166, 8.8, 0.112), z=c("A", "X", "X", "B", "A"))

# A tibble: 5 × 3
      x       y z    
  <int>   <dbl> <chr>
1     1   6     A    
2     2  -9.5   X    
3     3 166     X    
4     4   8.8   B    
5     5   0.112 A

Note how data frames and tibbles print differently in the previous examples. Tibbles are more readable and include their dimension (# A tibble: 5 x 3) as well as column data types (<int>, <dbl>, and <chr>, which is short for integer, double, and character). The str() function shows a convenient summary of the structure of a given data frame, which also contains the column data types:

str(df)

'data.frame':   4 obs. of  3 variables:
 $ x: int  1 2 3 4
 $ y: num  6 -9.5 166 8.8
 $ z: chr  "A" "X" "X" "B"

The glimpse() function (part of the dplyr package) provides a similar summary and works better with tibbles:

dplyr::glimpse(df)

Rows: 4
Columns: 3
$ x <int> 1, 2, 3, 4
$ y <dbl> 6.0, -9.5, 166.0, 8.8
$ z <chr> "A", "X", "X", "B"

RStudio offers a nice integrated data frame viewer in the form of the View() function, which visualizes any data frame or tibble in a spreadsheet-like table. For example, the previously created data frame df can be viewed with View(df) (note that the spreadsheet is read-only).

Functions

A function performs some pre-defined computations with (optional) input arguments and (optionally) returns a result. We routinely call functions that have been defined elsewhere, for example the c(), class(), and length() functions. A pair of parentheses () after a function name indicates that we are calling that function. We can also define our own functions, but this is outside the scope of this workshop.

Here are some examples for function calls:

c(1, 2, 3)  # 3 arguments

[1] 1 2 3

class("A")  # 1 argument

[1] "character"

length(c(4, 5, 6))  # two (nested) function calls

[1] 3

The last example shows two nested function calls. First, we call the c() function with three arguments, which we directly use as an argument in the length() function call. R tries to reduce all expressions to a single value, so it works its way from the innermost layer to the outermost one. Therefore, a nested function call is really two function calls in the following order:

(tmp = c(4, 5, 6))

[1] 4 5 6

length(tmp)

[1] 3

Missing values

R represents missing values as NA (“not available”). Missing values are contagious, which means that calculations involving missing values will result in NA. This makes sense if you think of a missing value as “unknown” (we don’t know what the value is). Here are some examples:

NA + 1

[1] NA

NA > 0

[1] NA

1 == NA

[1] NA

NA / 2

[1] NA

Even comparing NA with NA is again NA:

NA == NA

[1] NA

Let’s compute the mean of some numbers involving one missing value:

(x = c(25, 50, NA, 100))

[1]  25  50  NA 100

mean(x)

[1] NA

The mean is also unknown, because we cannot compute it due to the presence of a missing value. However, almost all aggregation functions support the na.rm argument, which by default is set to FALSE, meaning that the function does not remove missing values. However, if we set na.rm=TRUE, all missing values are removed before the actual value is computed:

mean(x, na.rm=TRUE)

[1] 58.33333

Important

Finding missing values by comparing with NA does not work:

x == NA

[1] NA NA NA NA

Instead, we have to use the is.na() function:

is.na(x)

[1] FALSE FALSE  TRUE FALSE

Help

You can view the documentation for any object by prefixing a ? to the object name and hitting ⏎. For example, ?length shows the documentation for the length function. You can also press F1 to display help for the object at the current cursor location.

Exercises

Install the tidyverse, nycflights13, and palmerpenguins packages. After that, check if you have the packages readr, dplyr, ggplot2, and tidyr installed.
What is your current working directory? Create a new directory called tidyverse-workshop in your home directory (use Windows Explorer, macOS Finder or the Files pane in RStudio to navigate and create the directory). Then set the working directory to the newly created folder. Check again if the current working directory now points to the correct location.
Compute the areas of circles with radii 5, 7, 19, and \(\pi^{-\frac{1}{2}}\). Put all radii into a vector r and compute the corresponding areas with one command!
Create a vector with 100 random numbers drawn from a standard normal distribution using the rnorm() function. Then extract all positive numbers from this random vector – how many items are positive?
How many rows and columns does the built-in mtcars data frame have? What are the column data types? What does the drat column represent?