Visualization

Introduction

Data needs to be imported, tidied, and transformed before any visualization can be created. We will cover these steps, which are necessary for almost all real-world datasets, in the following chapters. However, because we can use tidy example datasets that are already available in R, we will focus on creating visualizations first.

Specifically, we will use the ggplot2 package, which is part of the Tidyverse. It is based on a layered “grammar of graphics” described in here. In this chapter, we will discuss only the most basic commands that enable you to quickly create simple (but beautiful) visualizations.

Of course, the first step is to activate the package:

library(ggplot2)

Note that you could also use library(tidyverse) to activate all core Tidyverse packages, but I recommend to activate only the specific packages you actually need.

Basic usage

The ggplot2 package contains a toy data frame called mpg. It is a good idea to read its documentation (?mpg) before moving on to the next step. Let’s take a look at the data:

mpg

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

Here’s the first question we want to answer from this dataset: do cars with big engines consume more fuel than cars with small engines? We can use the variables displ (engine size in liters) and hwy (fuel efficiency on the highway in miles per gallon) to try to address this question.

Let’s create our first plot using the following code (just copy and run this code for now):

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy))

This scatter plot shows a clear negative relationship, meaning that larger engines tend to have lower fuel efficiency.

Every ggplot2 plot starts with the ggplot() function, to which we pass a data frame that contains the data to be plotted (using the data argument). This function call on its own just produces an empty plot, because we haven’t specified any details about our plot yet.

To add something interesting to the plot, we literally add another layer. In this example, the new layer should contain points, which we can create with the geom_point() function. The ggplot2 package contains many geom functions (functions that create geometric objects representing the data) that enable us to create many different kinds of plots. They all have a mapping argument, which specifies how columns in the data frame should be represented in the plot. In our example, we map the displ column to the x axis and the hwy column to the y axis, resulting in a scatter plot.

Note

In general, we need to wrap mappings that depend on values in the data inside an aes() function call.

We can use the code from this example as a template for ggplot2 plots. It looks as follows:

ggplot(data=<DATA>) +
    <GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))

Note that this template is not valid R code! We need to replace all code in angle brackets with our actual data, geom function, and mappings.

Exercises

What does ggplot(data=mpg) produce?
How many rows and columns does mpg have? What are the column types?
What does the drv column describe?
Efficient cars should consume less fuel on both highways and in cities than inefficient cars. Create a scatter plot to investigate this hypothesis!
How does highway fuel efficiency correlate with the number of cylinders?
Why is the scatter plot of class versus drv not very useful?
Do you see any problem with the scatter plot displ versus hwy that we produced previously?

Aesthetic mappings

Aesthetics are visual properties of a plot. We have already used the x and y aesthetics in our previous mpg example where we mapped these aesthetics to variables (columns) displ and hwy in geom_point(mapping=aes(x=displ, y=hwy)). We have also noted that we need to wrap aesthetic mappings with the aes() function.

There are other aesthetics that we could use. For example, we could map to the color aesthetic to visualize a third variable in the data such as class, which contains the type (or class) of each car:

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy, color=class))

The scatter plot now conveys additional information on the types of vehicles. Notice that ggplot2 automatically adds a legend. Instead of color, we could also use the shape, size, or alpha aesthetics (try it out and see what the result looks like). The documentation of the geom_point() function lists all supported aesthetics. It is also possible to use multiple aesthetics for the same variable.

If you ever want to manually set an aesthetic to a fixed value, such as plotting all points in blue, you need to specify this outside of aes():

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy), color="blue")

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-1.

Facets

Facets can be used to split a plot into multiple subplots, each of which shows a subset of the original data. There are two functions that create facets based on variables (columns) in the data:

facet_wrap() facets by one variable; use ~ <VARIABLE> as its first argument.
facet_grid() facets by two variables; use <VARIABLE_ROWS> ~ <VARIABLE_COLS> as its first argument.

It’s very easy to add facetting to an existing plot by adding one of these functions as if it was a new layer. Here are two examples:

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy)) +
    facet_wrap(~ class, nrow=2)

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy)) +
    facet_grid(drv ~ cyl)

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-2.

Composing plots

Facets create subplots based on subsets of the data. If instead you want to combine multiple independent plots, I recommend the patchwork package. Using operators such as +, |, and /, it is intuitive to create custom arrangements of existing ggplot objects (which we assign to variables for easier access). Here’s an example:

library(patchwork)

p1 = ggplot(data=mpg) +
    geom_point(mapping=aes(x=hwy, y=cty))
p2 = ggplot(data=mtcars) +
    geom_boxplot(mapping=aes(x=cyl, y=qsec, group=cyl))
p3 = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy))

(p1 | p2) / p3

Notice how | arranges plots side by side, whereas / places plots in separate rows.

Geometric objects

Consider the following two plots:

p1 = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy))
p2 = ggplot(data=mpg) +
    geom_smooth(mapping=aes(x=displ, y=hwy))
p1 | p2

These plots show the same data (displ versus hwy), but they use different visual representations. Whereas the left plot visualizes each data point, the right plot represents the same data with a smoothing function. In ggplot2, we call these different visual representations geoms (geometric objects). This Data Visualization cheat sheet lists some of the over 40 available geoms in ggplot2.

Every geom function takes a mapping argument where you specify which data columns should be mapped to which aesthetics. Note that any given geom supports only aesthetics that make sense (check the documentation to find out). For example, you can set the shape of a point, but the shape of a line does not make sense. However, you could use the linetype aesthetic instead.

p1 = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy, shape=drv))
p2 = ggplot(data=mpg) +
    geom_smooth(mapping=aes(x=displ, y=hwy, linetype=drv))
p1 | p2

Importantly, we can use multiple geoms in a plot by adding each one as a new layer:

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy)) +
    geom_smooth(mapping=aes(x=displ, y=hwy))

Because both geoms use the same mapping argument, we can also specify a global mapping inside the ggplot() function call to avoid repetition. All geoms will inherit this global mapping, and the resulting figure is identical to the previous one:

ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
    geom_point() +
    geom_smooth()

Global and local mappings can be combined; if you don’t specify a mapping for a particular geom function, the function will use the global mapping. However, if you do specify a local mapping by passing a mapping argument in a specific geom function, it will extend (not replace) the global one:

ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
    geom_point(mapping=aes(color=class)) +
    geom_smooth()

We can also apply this concept of global and local settings to the data argument. This is useful if a particular layer is based on a different data (sub)set.

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-3.

Statistical transformations

Consider the following bar chart created with the geom_bar() function:

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class))

Notice how we only specified a mapping for the x aesthetic (the class column). Interestingly, the y-axis shows counts, a variable that is not contained in the data.

Although many plots show the raw data (as we saw with scatter plots), other kinds of plots transform the original data to show counts of the binned data (like in the bar chart), summary statistics, or predictions from a model fitted to the data. This works automatically, because each geom function is associated with a default statistical transformation. You can check the documentation of a geom to see its default transformation listed as the default argument for the stat parameter. For example, geom_point() uses stat="identity" (no transformation), whereas geom_bar() uses stat="count".

Sometimes, it is necessary to change the stat of a geom. For example, let’s assume that we had the count data directly available in a data frame as follows:

library(tibble)

mpg_count = tribble(
    ~class,   ~count,
    "2seater",     5,
    "compact",    47,
    "midsize",    41,
    "minivan",    11,
    "pickup",     33,
    "subcompact", 35,
    "suv",        62
)

If we want to visualize this as a bar chart, we need to pass stat="identity" to avoid the default "count" transformation:

ggplot(data=mpg_count) +
    geom_bar(mapping=aes(x=class, y=count), stat="identity")

Tip

Alternatively, we could use geom_col() directly if the data already represents counts:

ggplot(data=mpg_count) +
    geom_col(mapping=aes(x=class, y=count))

Finally, since each geom function has a default stat function, and each stat function has a default geom function, we can use these functions interchangeably. For example, geom_bar() uses stat_count() and stat_count() uses geom_bar() by default. Therefore, we could also produce the bar chart we have encountered previously with the following code:

ggplot(data=mpg) +
    stat_count(mapping=aes(x=class))

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-4.

Position adjustments

Geoms have a position argument, which influences how charts are visualized when certain grouping aesthetics are used. Consider the following bar chart:

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class, fill=drv))

This is a stacked bar chart, because it uses position="stack" by default. If you don’t want a stacked bar chart, you can also use "identity", "dodge", or "fill":

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class, color=drv), position="identity", fill=NA)

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class, fill=drv), position="dodge")

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class, fill=drv), position="fill")

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-5.

Coordinate systems

Most ggplot2 plots use the Cartesian coordinate system spanned by two orthogonal axes x and y. However, it is possible to change the coordinate system with so-called coord functions. We won’t go into any detail here, mostly because working with coordinate systems is quite an advanced topic. However, there is one handy function that you should remember: coord_flip() flips the x and y axes. You can simply add it like an additional layer to an existing plot:

ggplot(data=mpg, mapping=aes(x=class, y=hwy)) +
    geom_boxplot() +
    coord_flip()

Themes

If you prefer a different look than gray background with white grid lines for your plots, ggplot2 comes with several predefined themes (consult ?theme_gray for a list of all included themes). You can set a theme by adding it to an existing plot, for example:

ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
    geom_point() +
    theme_minimal()

It is also possible to set a theme globally using the theme_set() function. For example, to use theme_dark() for all subsequent plots, include the following expression before you create any figures:

theme_set(theme_dark())

Conclusion

With this set of ggplot2 building blocks, you can create a basic version of almost any kind of data visualization. In addition to the Export button in the RStudio Plots pane, you can use ggsave() to save a plot to a file.