The Tidyverse is a collection of R packages designed to simplify data analysis and visualization. Their fundamental assumption is that data is represented in the so-called “tidy” format. This basically means that it can be represented by a table, in which a row corresponds to an observation and a column corresponds to a variable (more details are available in this article).
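To make the idea of tidy data more concrete, here is a minimal sketch of a small tidy table built with base R. The column names and all values are invented purely for illustration; each row is one observation (a person) and each column is one variable.

```r
# Invented example data: one row per observation, one column per variable
people = data.frame(
  id = 1:3,
  height_cm = c(172, 181, 165),
  weight_kg = c(68.5, 80.2, 59.0)
)
people
nrow(people)  # number of observations
ncol(people)  # number of variables
```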
A typical data science workflow can be visualized as follows:
The Tidyverse offers packages that cover all aspects of this workflow. In this course, we will focus on data wrangling (the green box). In addition, we will also discuss importing and visualizing data. Modeling data and communicating results are outside the scope of this workshop.
To get started with the Tidyverse, we need to install both R and RStudio. Although RStudio is technically not required, I strongly recommend it, because it makes working with R much more pleasant.
Once you have these programs installed on your computer, start RStudio and install the tidyverse package, which is basically a meta-package consisting of all packages that are part of the Tidyverse. We will also need the nycflights13 and palmerpenguins packages, because they provide nice datasets that we are going to explore.
We will now go through some basic R commands and workflows. Ideally, you should already be familiar with most of these topics. If not, this hopefully serves as a quick tour and should get you up to speed.
Throughout the course material, we show R code in gray boxes and corresponding output/results as follows:
mean(c(1, 2, 3))
[1] 2
You can copy code from a gray box and paste it into the R console.
This workshop is based on selected chapters from the book “R for Data Science” by Hadley Wickham and Garrett Grolemund.
A package contains additional functions and/or datasets that extend the capabilities of R. We install a package with the install.packages() function. In this workshop, we are going to use the tidyverse, nycflights13, and palmerpenguins packages. To install them, run the following commands in the R console (note that package names must be surrounded by single or double quotes):
install.packages("tidyverse")
install.packages("nycflights13")
install.packages("palmerpenguins")
Alternatively, you can use the “Packages” pane in RStudio (bottom right pane in the default layout) to install/update/uninstall R packages.
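If you are not sure whether a package is already installed, you can check from the console. The following is only a sketch: it reports which of the three workshop packages are present, and the actual installation call is left commented out so that nothing is downloaded unless you want it to be.

```r
pkgs = c("tidyverse", "nycflights13", "palmerpenguins")

# TRUE for every package found in the local library
installed = pkgs %in% rownames(installed.packages())
names(installed) = pkgs
installed

# Uncomment to install only the packages that are missing:
# install.packages(pkgs[!installed])
```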
Once installed, we need to activate a package with the library() function in each R session. If we don’t activate a package, we do not have (direct) access to the objects it contains. Here’s how we activate the packages that we’ve just installed:
library(tidyverse)
library(nycflights13)
library(palmerpenguins)
We can use functions and other objects without activating their corresponding package by prefixing the object name with the package name followed by two colons. For example, nycflights13::flights accesses the flights data frame contained in the nycflights13 package without having to activate the package first. In contrast, library(nycflights13) enables us to access this data frame with flights directly (together with all other objects contained in this package).
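The same double-colon syntax works for any installed package. As a small sketch, the following uses the stats and utils packages, which ship with every R installation, so it runs even before the workshop packages are installed:

```r
# package::object works without calling library() first
m = stats::median(c(3, 1, 7))
m

first_letters = utils::head(letters, 3)
first_letters
```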
When R runs commands, it performs all computations in the so-called working directory. R expects all data files that you want to import in (or relative to) this directory (if not otherwise specified). The working directory can be any directory on your computer, and there are several options to change it.
The function getwd() returns the current working directory. The subtitle of the “Console” pane in RStudio (bottom left in the default layout) also shows the working directory.
The function setwd() sets the working directory to the folder passed as an argument, for example setwd("C:/Users/myuser/R") or, using a shorter notation, setwd("~/R") (the tilde symbol is an abbreviation for the current user’s home directory; note that R on Windows typically expands it to the user’s Documents folder, so the two calls are not always equivalent). Alternative methods to change the working directory with RStudio include the “Session” – “Set Working Directory” menu and the “More” icon in the “Files” pane (bottom right). Also, if you double-click an R script in Windows Explorer or macOS Finder, RStudio will open it and automatically set the working directory to the corresponding file location.
Path components should be separated with a forward slash / even on Windows (which normally uses a backslash \). If you really want to use backslashes, you need to enter double backslashes, for example:
setwd("C:\\Users\\myuser\\R")
Never ever set the working directory in a script! Instead, always refer to files with relative paths (relative to the location of the script). This makes the script portable across different computers, because it is very unlikely that another person has the exact same directory that you are trying to set with setwd().
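As a rough illustration of working with relative paths, the sketch below builds a path with file.path() instead of hard-coding an absolute location. The folder name data and the file name flights.csv are hypothetical; R resolves the resulting path relative to the current working directory.

```r
getwd()  # relative paths are resolved starting from this directory

# "data" and "flights.csv" are hypothetical names used for illustration
data_file = file.path("data", "flights.csv")
data_file

# TRUE only if such a file really exists below the working directory
file.exists(data_file)
```

Because file.path() always joins components with a forward slash, the same script works unchanged on Windows, macOS, and Linux.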
Typically, we enter R commands in the console. A prompt symbol > indicates that R is ready for our input (note that we do not show the prompt symbol in the gray code boxes). Once we hit ⏎, R will immediately evaluate what we just typed and print the result. This workflow is called REPL (read-eval-print loop), and it is a convenient way to interactively work with R and try out new things. Here’s an example of some commands with their outputs (try running these commands in your console):
1 + 9 # this is a comment
[1] 10
x = 1:10 # the value of x is not printed in an assignment
sum(x) # function call
[1] 55
Notice that when creating a new object with the assignment operator = (or equivalently <-), R does not automatically print its value. However, it is often useful to assign a new object and then immediately inspect its value. We could do this with two lines of code:
x = 1:10
x
[1] 1 2 3 4 5 6 7 8 9 10
A more convenient way is to enclose the assignment in parentheses, which will both create the object and print its value:
(x = 1:10)
[1] 1 2 3 4 5 6 7 8 9 10
Whenever R prints a vector, it automatically adds the index of the first element of each line in square brackets to the output. We only saw [1] in the previous outputs, because the values fit on one line. If you print a longer vector, each line will show the index of its first element:
(y = 1:100)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
The console is nice for trying out commands, fixing errors, and playing around with code. However, if we want to store a sequence of R commands for later use, we can put them into so-called R scripts. An R script is a plain-text file (ending in .R) containing R commands, usually one command per line. RStudio includes an editor (top left pane in the default layout), which can be used to write, edit, and run (parts of) a script.
Importantly, a data analysis project stored as an R script can be run over and over again. This means if another person wants to reproduce your analysis, all you need to do is share your script and data files. The person then runs your entire script, for example by clicking on the “Source” icon, which fully reproduces the entire analysis and all results.
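Running a whole script is also possible from the console with the source() function. The following sketch writes a tiny two-line script to a temporary file (so it does not touch your own files) and then runs it; the script contents are made up for illustration.

```r
# Create a throwaway script file containing two commands
script_file = tempfile(fileext = ".R")
writeLines(c("x = 1:10", "total = sum(x)"), script_file)

# Run every command in the file, from top to bottom
source(script_file)
total
```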
RStudio includes many useful keyboard shortcuts. It really pays off to remember some of them, because your workflow will become faster and more efficient. An overview of all keyboard shortcuts is available in the “Help” – “Keyboard Shortcuts Help” menu item.
Here are four important shortcuts that I think everyone should know:
The shortcut for inserting the pipe operator |> (more on that later) is ⌘⇧m (macOS) or Ctrl⇧m (Windows and Linux).
The most basic (atomic) data type in R is a vector. A vector is a collection of elements which all have the same type. Even a scalar number like 1 is really a vector in R. The c() function combines individual values into a vector (c is short for “concatenate”). The length() function returns the number of elements in a vector.
Important data types include numeric vectors, character vectors, logical vectors, factors, and datetime vectors. We can use the class() function to determine the type of a given object. Here are some examples:
x = 1
class(x)
[1] "numeric"
length(x)
[1] 1
y = c(4, 5.6, -7)
class(y)
[1] "numeric"
length(y)
[1] 3
c("Hello", "world!") # character
[1] "Hello" "world!"
c(TRUE, FALSE) # logical
[1] TRUE FALSE
y > 4 # a comparison evaluates to a logical vector
[1] FALSE TRUE FALSE
factor(c("A", "A", "B", "A", "C", "C", "A", "B")) # factor with three levels
[1] A A B A C C A B
Levels: A B C
as.Date(c("17.3.2020", "22.5.2020", "3.3.2021"), format="%d.%m.%Y") # datetime
[1] "2020-03-17" "2020-05-22" "2021-03-03"
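Because all elements of a vector must have the same type, R silently converts mixed inputs to a common type. The following short sketch illustrates this behavior with values chosen only for demonstration:

```r
# Mixing a number, a string, and a logical yields a character vector
mixed = c(1, "a", TRUE)
mixed
class(mixed)

# Mixing a number and a logical yields a numeric vector (TRUE becomes 1)
num_lgl = c(1, TRUE)
class(num_lgl)
```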
We can access individual elements of a vector using square brackets containing the indexes of all elements we would like to access:
(x = 11:20)
[1] 11 12 13 14 15 16 17 18 19 20
x[5] # fifth element
[1] 15
x[c(7, 1, 4)] # elements with index 7, 1, and 4
[1] 17 11 14
x[x >= 15] # all elements >= 15
[1] 15 16 17 18 19 20
A data frame is a list of vectors (with identical lengths). These vectors correspond to the columns of the data frame. In other words, it represents a table consisting of rows and columns for storing rectangular data.
(df = data.frame(x=1:4, y=c(6, -9.5, 166, 8.8), z=c("A", "X", "X", "B")))
x y z
1 1 6.0 A
2 2 -9.5 X
3 3 166.0 X
4 4 8.8 B
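Since each column of a data frame is an ordinary vector, we can pull out individual columns and inspect the dimensions of the table. Here is a brief sketch that recreates the data frame from above so that it runs on its own:

```r
df = data.frame(x=1:4, y=c(6, -9.5, 166, 8.8), z=c("A", "X", "X", "B"))

df$y         # extract column y as a numeric vector
df[["z"]]    # extract column z as a character vector
class(df$x)  # each column has its own type

nrow(df)     # number of rows
ncol(df)     # number of columns
length(df)   # a data frame is a list of columns, so this equals ncol(df)
```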
The Tidyverse package tibble provides an improved data frame type called tibble. A tibble is a drop-in replacement for a data frame, so we can use tibbles (almost) everywhere data frames are expected.
tibble::tibble(x=1:5, y=c(6, -9.5, 166, 8.8, 0.112), z=c("A", "X", "X", "B", "A"))
# A tibble: 5 × 3
x y z
<int> <dbl> <chr>
1 1 6 A
2 2 -9.5 X
3 3 166 X
4 4 8.8 B
5 5 0.112 A
Note how data frames and tibbles print differently in the previous examples. Tibbles are more readable and include their dimensions (# A tibble: 5 × 3) as well as column data types (<int>, <dbl>, and <chr>, which are short for integer, double, and character). The str() function shows a convenient summary of the structure of a given data frame, which also contains the column data types:
str(df)
'data.frame': 4 obs. of 3 variables:
$ x: int 1 2 3 4
$ y: num 6 -9.5 166 8.8
$ z: chr "A" "X" "X" "B"
The glimpse() function (part of the dplyr package) provides a similar summary and works better with tibbles:
dplyr::glimpse(df)
Rows: 4
Columns: 3
$ x <int> 1, 2, 3, 4
$ y <dbl> 6.0, -9.5, 166.0, 8.8
$ z <chr> "A", "X", "X", "B"
RStudio offers a nice integrated data frame viewer in the form of the View() function, which visualizes any data frame or tibble in a spreadsheet-like table. For example, the previously created data frame df can be viewed with View(df) (note that the spreadsheet is read-only).
A function performs some pre-defined computations with (optional) input arguments and (optionally) returns a result. We routinely call functions that have been defined elsewhere, for example the c(), class(), and length() functions. A pair of parentheses () after a function name indicates that we are calling that function. We can also define our own functions, but this is outside the scope of this workshop.
Here are some examples of function calls:
c(1, 2, 3) # 3 arguments
[1] 1 2 3
class("A") # 1 argument
[1] "character"
length(c(4, 5, 6)) # two (nested) function calls
[1] 3
The last example shows two nested function calls. First, we call the c() function with three arguments, and we directly use its result as an argument in the length() function call. R tries to reduce all expressions to a single value, so it works its way from the innermost layer to the outermost one. Therefore, a nested function call is really two function calls in the following order:
(tmp = c(4, 5, 6))
[1] 4 5 6
length(tmp)
[1] 3
R represents missing values as NA (“not available”). Missing values are contagious, which means that calculations involving missing values will result in NA. This makes sense if you think of a missing value as “unknown” (we don’t know what the value is). Here are some examples:
NA + 1
[1] NA
NA > 0
[1] NA
1 == NA
[1] NA
NA / 2
[1] NA
Even comparing NA with NA is again NA:
NA == NA
[1] NA
Let’s compute the mean of some numbers involving one missing value:
(x = c(25, 50, NA, 100))
[1] 25 50 NA 100
mean(x)
[1] NA
The mean is also unknown, because we cannot compute it due to the presence of a missing value. However, almost all aggregation functions support the na.rm argument, which by default (FALSE) does not remove missing values. If we set na.rm=TRUE, all missing values are removed before the actual value is computed:
mean(x, na.rm=TRUE)
[1] 58.33333
Trying to find missing values by comparing with NA does not work as expected:
x == NA
[1] NA NA NA NA
Instead, we have to use the is.na() function:
is.na(x)
[1] FALSE FALSE TRUE FALSE
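The logical vector returned by is.na() can be combined with the indexing and aggregation tools shown earlier. As a small sketch (recreating x so that the block is self-contained), we can count the missing values and keep only the known ones:

```r
x = c(25, 50, NA, 100)

n_missing = sum(is.na(x))  # TRUE counts as 1, so this counts the missing values
n_missing

known = x[!is.na(x)]       # keep only the elements that are not missing
known

mean(known)                # same result as mean(x, na.rm=TRUE)
```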
You can view the documentation for any object by prefixing a ? to the object name and hitting ⏎. For example, ?length shows the documentation for the length() function. You can also press F1 to display help for the object at the current cursor location.
Install the tidyverse, nycflights13, and palmerpenguins packages. After that, check if you have the packages readr, dplyr, ggplot2, and tidyr installed.
Create a directory called tidyverse-workshop in your home directory (use Windows Explorer, macOS Finder or the “Files” pane in RStudio to navigate and create the directory). Then set the working directory to this folder. Check again if the current working directory now points to the correct location.
[…] r and compute the corresponding areas with one command!
[…] rnorm(). Then extract all positive numbers from this random vector – how many elements are positive?
How many rows and columns does the mtcars data frame have? What are the column data types? What does the drat column represent?