8 – Plots

Introduction

The R ecosystem offers several excellent packages for creating data visualizations. However, plots created with different plotting systems generally cannot be combined in a single figure. Therefore, we have to decide on a plotting package before creating a visualization.

The graphics package is also known as the base plotting system. It is part of R and does not need to be installed separately. The base plotting system is very well suited for quick visualizations – and with some additional effort, you can also create beautiful graphics for publications.

The following examples use the airquality dataset included in R (a brief description is available in ?airquality):

str(airquality)

'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
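
The output above already hints at missing values (NA), which matter for several of the plots in this chapter. As a quick check (a minimal sketch, not part of the original examples; the variable name na_counts is only illustrative), we can count the missing values per column:

```r
# Count missing values (NA) in each column of airquality
na_counts = colSums(is.na(airquality))
na_counts
```

Only the Ozone and Solar.R columns contain missing values.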

The `plot()` function

One of the most important functions in the base plotting system is plot(). This function creates appropriate visualizations depending on the data to be displayed.

If you pass a numerical vector, the function produces a dot plot. The x-axis shows the index of the data points, and the y-axis shows the data values:

plot(airquality$Ozone)

If you pass two numerical vectors, you get a so-called scatter plot. The first argument is displayed on the x-axis, and the second argument is displayed on the y-axis:

plot(airquality$Wind, airquality$Ozone)

The function can also handle date values. First, let’s create a new column named date from the columns Month and Day (from the data description, we know that these were recorded in 1973):

airquality$date = as.Date(
    paste(airquality$Month, airquality$Day, "1973"),
    format="%m %d %Y"
)

Tip

The paste() function concatenates multiple character vectors element-wise into a single vector (by default, it inserts a space between the elements):

paste("Hello", "World")

[1] "Hello World"

The individual arguments can be vectors with any number of elements:

paste(c("Hello", "Sup"), c("Jane", "John"))

[1] "Hello Jane" "Sup John"

The result looks like this:

paste(airquality$Month, airquality$Day, "1973")

  [1] "5 1 1973"  "5 2 1973"  "5 3 1973"  "5 4 1973"  "5 5 1973"  "5 6 1973"  "5 7 1973"  "5 8 1973"  "5 9 1973" 
 [10] "5 10 1973" "5 11 1973" "5 12 1973" "5 13 1973" "5 14 1973" "5 15 1973" "5 16 1973" "5 17 1973" "5 18 1973"
 [19] "5 19 1973" "5 20 1973" "5 21 1973" "5 22 1973" "5 23 1973" "5 24 1973" "5 25 1973" "5 26 1973" "5 27 1973"
 [28] "5 28 1973" "5 29 1973" "5 30 1973" "5 31 1973" "6 1 1973"  "6 2 1973"  "6 3 1973"  "6 4 1973"  "6 5 1973" 
 [37] "6 6 1973"  "6 7 1973"  "6 8 1973"  "6 9 1973"  "6 10 1973" "6 11 1973" "6 12 1973" "6 13 1973" "6 14 1973"
 [46] "6 15 1973" "6 16 1973" "6 17 1973" "6 18 1973" "6 19 1973" "6 20 1973" "6 21 1973" "6 22 1973" "6 23 1973"
 [55] "6 24 1973" "6 25 1973" "6 26 1973" "6 27 1973" "6 28 1973" "6 29 1973" "6 30 1973" "7 1 1973"  "7 2 1973" 
 [64] "7 3 1973"  "7 4 1973"  "7 5 1973"  "7 6 1973"  "7 7 1973"  "7 8 1973"  "7 9 1973"  "7 10 1973" "7 11 1973"
 [73] "7 12 1973" "7 13 1973" "7 14 1973" "7 15 1973" "7 16 1973" "7 17 1973" "7 18 1973" "7 19 1973" "7 20 1973"
 [82] "7 21 1973" "7 22 1973" "7 23 1973" "7 24 1973" "7 25 1973" "7 26 1973" "7 27 1973" "7 28 1973" "7 29 1973"
 [91] "7 30 1973" "7 31 1973" "8 1 1973"  "8 2 1973"  "8 3 1973"  "8 4 1973"  "8 5 1973"  "8 6 1973"  "8 7 1973" 
[100] "8 8 1973"  "8 9 1973"  "8 10 1973" "8 11 1973" "8 12 1973" "8 13 1973" "8 14 1973" "8 15 1973" "8 16 1973"
[109] "8 17 1973" "8 18 1973" "8 19 1973" "8 20 1973" "8 21 1973" "8 22 1973" "8 23 1973" "8 24 1973" "8 25 1973"
[118] "8 26 1973" "8 27 1973" "8 28 1973" "8 29 1973" "8 30 1973" "8 31 1973" "9 1 1973"  "9 2 1973"  "9 3 1973" 
[127] "9 4 1973"  "9 5 1973"  "9 6 1973"  "9 7 1973"  "9 8 1973"  "9 9 1973"  "9 10 1973" "9 11 1973" "9 12 1973"
[136] "9 13 1973" "9 14 1973" "9 15 1973" "9 16 1973" "9 17 1973" "9 18 1973" "9 19 1973" "9 20 1973" "9 21 1973"
[145] "9 22 1973" "9 23 1973" "9 24 1973" "9 25 1973" "9 26 1973" "9 27 1973" "9 28 1973" "9 29 1973" "9 30 1973"

This vector can then be converted into a date vector using the as.Date() function with the argument format="%m %d %Y", which we assign to the date column in the data frame:

class(airquality$date)

[1] "Date"
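
As an aside, the same dates can probably be built without a custom format by assembling strings in the layout YYYY-MM-DD, which as.Date() parses by default. The following sketch uses sprintf() for zero-padding; the variable names dates_paste and dates_iso are only illustrative:

```r
# Variant 1: paste() plus an explicit format (as in the text)
dates_paste = as.Date(
    paste(airquality$Month, airquality$Day, "1973"),
    format="%m %d %Y"
)

# Variant 2: build strings such as "1973-05-01", which need no format
dates_iso = as.Date(sprintf("1973-%02d-%02d", airquality$Month, airquality$Day))

all(dates_paste == dates_iso)  # compare the two results element-wise
range(dates_iso)               # first and last date
```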

Now we can use this date column for the x-axis. In this example, the names of the five months are automatically displayed on the x-axis:

plot(airquality$date, airquality$Ozone)
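
If the automatic date labels are not what we want, one option is to suppress the default x-axis with xaxt="n" and draw our own with axis.Date(). This is only a sketch; the tick positions and the label format are arbitrary choices, and the dates are rebuilt locally so that the block runs on its own:

```r
# Build the dates locally so that this block is self-contained
dates = as.Date(sprintf("1973-%02d-%02d", airquality$Month, airquality$Day))

# Suppress the default x-axis and draw one tick per month instead
plot(dates, airquality$Ozone, xaxt="n", xlab="", ylab="Ozone (ppb)")
ticks = seq(as.Date("1973-05-01"), as.Date("1973-10-01"), by="month")
axis.Date(1, at=ticks, format="%d %b")  # e.g., day and abbreviated month
```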

If you pass a factor to the function, you get a bar chart with the frequencies of the individual levels:

plot(factor(airquality$Month))
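
The bar heights correspond to the number of rows per month. A small sketch of how these frequencies could be computed explicitly with table() and then passed to barplot() (the variable name counts is only illustrative):

```r
# Frequencies of the individual levels (number of rows per month)
counts = table(airquality$Month)
counts

# Draws essentially the same bar chart from the precomputed counts
barplot(counts, xlab="Month", ylab="Count")
```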

Histograms

The hist() function creates a histogram of a vector:

hist(airquality$Ozone)

A histogram visualizes the distribution of the values in a vector. You can request a number of bins with the breaks argument:

hist(airquality$Ozone, breaks=30)

Note

A single number passed to the breaks argument is only a suggestion. R chooses "pretty" (round) breakpoints, so the actual number of bins may differ from the requested value.
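
To see which breakpoints were actually used, we can inspect the object returned by hist(). If exact bins are needed, one option is to pass a vector of breakpoints instead of a single number. The following is a minimal sketch; the bin width of 10 and the variable names are arbitrary choices:

```r
ozone = airquality$Ozone

# A single number is only a suggestion; inspect what hist() actually used
h = hist(ozone, breaks=30, plot=FALSE)
length(h$breaks) - 1  # actual number of bins

# A vector of breakpoints is used exactly as given (here: bins of width 10)
edges = seq(0, max(ozone, na.rm=TRUE) + 10, by=10)
h10 = hist(ozone, breaks=edges, main="Ozone", xlab="Ozone (ppb)")
h10$breaks
```

Note that a vector of breakpoints must cover the full range of the data, which is why the upper limit is derived from the maximum.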

Boxplots

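Another way to graphically represent the distribution of values is the boxplot() function. A boxplot shows the median as a thick line and the interquartile range (IQR) as a box. By default, the whiskers extend to the most extreme values that lie within 1.5 times the IQR from the box; values beyond the whiskers are drawn as individual points (outliers). If there are no outliers, the whiskers therefore end at the minimum and maximum. Here is an example of a boxplot for the Temp column: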

boxplot(airquality$Temp)
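
To look at the numbers behind the figure, we can call boxplot.stats(), which computes the statistics used for drawing. This is a small sketch for illustration; the variable name s is arbitrary:

```r
# Five numbers used for drawing: lower whisker, lower hinge (about Q1),
# median, upper hinge (about Q3), upper whisker
s = boxplot.stats(airquality$Temp)
s$stats

# Values beyond the whiskers, which would be drawn as individual points
s$out
```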

A single boxplot is relatively uninformative. If you pass a formula instead of a vector, you can create multiple boxplots in a single graph.

Note

A formula in R is defined by the tilde character (~) with expressions on its left and right sides. An example of a formula is y ~ x with y on the left side and x on the right side. The meaning of a formula depends on the specific function. Many functions require formulas as arguments, and we will use formulas extensively when calculating linear models.

The following example creates boxplots for airquality$Temp separately for all levels of airquality$Month:

boxplot(airquality$Temp ~ airquality$Month)

In this case, the left side of the formula determines the values on the y-axis, and the right side determines the values on the x-axis.
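
Functions with a formula interface usually also accept a data argument, so the column names can be written without the airquality$ prefix. A sketch of the same plot in this style (the axis labels are added here only for illustration):

```r
# Same boxplots, but with column names resolved via the data argument
b = boxplot(
    Temp ~ Month,
    data=airquality,
    xlab="Month",
    ylab="Temperature (°F)"
)
b$names  # one box per level of Month
b$n      # number of observations per box
```

The value returned by boxplot() contains the statistics of each box, which can be handy for checking what is shown.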

Adjusting plots

Often, we need to adjust various plot properties, such as the line type, colors, symbols, titles, axis labels, and so on. Many parameters can be passed as arguments when creating the graphic. An adapted version of the previous scatter plot example is:

plot(
    airquality$Wind,
    airquality$Ozone,
    xlab="Wind (mph)",  # x-axis label
    ylab="Ozone (ppb)",  # y-axis label
    main="New York City air quality (1973)",  # title
    pch=21,  # circle with border
    bg="lightblue"  # fill color of the symbol
)

We can improve the visualization of the ozone values over time by setting the following arguments (here, the type argument specifies the plot type; in this example, we set it to "l", which corresponds to a line plot):

plot(
    airquality$date,
    airquality$Ozone,
    xlab="",
    ylab="Ozone (ppb)",
    main="",  # no title
    type="l",  # line plot
    col="orange"
)

This line plot also clearly shows missing values in the data (where the line is interrupted).
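
With type="l", a value whose neighbors are both missing is not drawn at all, because no line segment can be attached to it. One possible way around this (a sketch, not the only option) is type="o", which draws points in addition to lines. The dates are rebuilt locally so that the block runs on its own:

```r
# Number of missing ozone measurements
n_missing = sum(is.na(airquality$Ozone))
n_missing

# Build the dates locally so that this block is self-contained
dates = as.Date(sprintf("1973-%02d-%02d", airquality$Month, airquality$Day))

# Lines plus points: isolated values between gaps remain visible
plot(
    dates,
    airquality$Ozone,
    type="o",
    pch=20,
    col="orange",
    xlab="",
    ylab="Ozone (ppb)"
)
```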

The various plot functions have many common parameters that can be used to influence the appearance of the plots. In the previous examples, we adjusted the following parameters:

xlab: x-axis label
ylab: y-axis label
type: plot type (lines, dots, both, …)
pch: symbol (circle, triangle, square, …)
main: title
col: color
bg: fill color (for the symbols 21 to 25)

The par() function can be used to query or globally set all relevant parameters (the documentation ?par provides a description of all supported parameters). If you call this function without arguments, you get the current values of all graphical parameters. You can also globally change individual parameters, i.e., after a change, new plots will always be created with the new parameters.

The following example demonstrates the use of par(). First, we query the current value of the col parameter (i.e., the color). This works with the $ notation (which is similar to extracting columns from a data frame):

par()$col

[1] "black"

We can observe that the color is set to black. This is also confirmed by a small example plot, which consists of black elements by default:

plot(sin(seq(0, 2*pi, length.out=50)), type="o", xlab="", ylab="")

After globally setting the color to red, all subsequent plots will use this value as a new default:

par(col="red")
par()$col

[1] "red"

plot(sin(seq(0, 2*pi, length.out=50)), type="o", xlab="", ylab="")

Before creating additional plots, we will reset the color to black:

par(col="black")
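
Resetting by hand assumes that we know the previous value. A common alternative is to store the value returned by par() when setting parameters, since it contains the old settings, and to pass it back to par() afterwards. A minimal sketch (the variable name old and the choice of lty=2 are only illustrative):

```r
# When setting parameters, par() invisibly returns their previous values
old = par(col="red", lty=2)
plot(sin(seq(0, 2*pi, length.out=50)), type="o", xlab="", ylab="")

# Restore exactly what was set before, whatever it was
par(old)
par("col")  # query a single parameter by name
```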

Other commonly used parameters include lty (line type) and cex.axis (size of the axis tick labels):

plot(
    sin(seq(0, 2*pi, length.out=50)),
    type="l",
    xlab="",
    ylab="",
    cex.axis=0.6,  # size of the tick labels
    lty=2,  # line type
    main="Sine"
)

The following line types are possible values for lty:

`lty`	Type
0	empty (no line)
1	solid (default)
2	dashed
3	dotted
4	dot–dash
5	long dash
6	short dash–long dash

The documentation of the points() function lists all available symbols for pch. The cex.axis parameter sets a scaling factor for the axis annotation (the tick labels); the default is 1, values less than 1 shrink the tick labels, and values greater than 1 enlarge them. The size of the axis titles set with xlab and ylab is controlled separately by cex.lab.
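
To get a visual overview, we can draw the line types and symbols ourselves. The following sketch creates two simple reference figures; the layout of these figures and the variable names are arbitrary choices:

```r
line_types = 1:6
symbols = 0:25

# Empty canvas with one horizontal line per line type
plot(c(0, 1), c(0.5, 6.5), type="n", xlab="", ylab="lty", xaxt="n")
for (i in line_types) {
    abline(h=i, lty=i)
}

# All standard plotting symbols; 21 to 25 can be filled via bg
plot(symbols, rep(1, length(symbols)), pch=symbols, bg="lightblue",
     xlab="pch", ylab="", yaxt="n")
```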

Adding elements to plots

The base plotting system allows us to create a plot and then add additional graphical elements. To do this, we use specific functions that we will discuss below.

We can add a title with title():

with(airquality, plot(Wind, Ozone))
title(main="Ozone and Wind in NYC")

Tip

This example also uses the with() function. This function allows us to use column names from a data frame directly within the parentheses. For example, instead of airquality$Ozone, we can write Ozone. This means that the first line of the example is equivalent to plot(airquality$Wind, airquality$Ozone). However, note how the default axis labels depend on the x and y arguments!

We can add points with points(). This can be used, for example, to display groups of data in different colors. We start with an empty plot (type="n") and then add points with different colors and symbols. The legend() function adds a legend:

with(airquality, plot(Wind, Ozone, main="", type="n"))
with(subset(airquality, Month != 5), points(Wind, Ozone, col="red", pch=20))
with(subset(airquality, Month == 5), points(Wind, Ozone, col="blue", pch=17))
legend(
    "topright",
    pch=c(17, 20),
    col=c("blue", "red"),
    legend=c("May", "Other Months")
)
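
Since col and pch accept vectors with one entry per point, the same figure can also be produced with a single plot() call. This is an alternative sketch, not a replacement for the approach above; the helper variables is_may, point_col, and point_pch are only illustrative:

```r
# One color and one symbol per row, derived from the Month column
is_may = airquality$Month == 5
point_col = ifelse(is_may, "blue", "red")
point_pch = ifelse(is_may, 17, 20)

plot(
    airquality$Wind,
    airquality$Ozone,
    col=point_col,
    pch=point_pch,
    xlab="Wind",
    ylab="Ozone"
)
legend(
    "topright",
    pch=c(17, 20),
    col=c("blue", "red"),
    legend=c("May", "Other Months")
)
```

Starting with an empty plot and adding points() per group remains useful when the groups need different drawing order or come from different data frames.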

We can add a regression line with the abline() function:

with(airquality, plot(Wind, Ozone, main="", pch=20))
model = lm(Ozone ~ Wind, airquality)
abline(model, lwd=2, col="blue")

Here, we pass the line as a linear model computed by lm(), which we will learn more about later in this course. For now, it is sufficient to know that this function expects a formula of the form y ~ x, where you specify the column names corresponding to the x and y axes.
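
Under the hood, abline() only needs an intercept and a slope, which it takes from the model. As a sketch of what this amounts to, we can extract both values with coef() and pass them explicitly; the dashed horizontal reference line at the mean is an extra added for illustration:

```r
model = lm(Ozone ~ Wind, airquality)
coef(model)  # intercept and slope of the fitted line

with(airquality, plot(Wind, Ozone, main="", pch=20))
# Draws the same line as abline(model): intercept (a) and slope (b)
abline(a=coef(model)[1], b=coef(model)[2], lwd=2, col="blue")
# abline() can also draw horizontal (h) and vertical (v) lines
abline(h=mean(airquality$Ozone, na.rm=TRUE), lty=2)
```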

We can add text and arrows with text() and arrows(), respectively:

with(airquality, plot(Wind, Ozone, main="", pch=20))
text(15, 100, "Label")
arrows(14.5, 90, 14, 75, length=0.1)
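
Instead of hard-coding coordinates, the positions can be derived from the data. The following sketch labels the point with the highest ozone value; the offsets and the label text are arbitrary choices:

```r
with(airquality, plot(Wind, Ozone, main="", pch=20))

# Row with the highest ozone value (which.max() skips missing values)
i = which.max(airquality$Ozone)
x = airquality$Wind[i]
y = airquality$Ozone[i]

# Place the label next to the point (pos=4 puts the text to the right of
# the given coordinates) and draw an arrow from there toward the point
text(x + 3, y - 10, "Maximum", pos=4)
arrows(x + 3, y - 10, x + 0.3, y - 1, length=0.1)
```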

Displaying raw data

There are many ways to visualize the distribution of a (numerical) variable (e.g., histograms and boxplots). In principle, you should always display the raw data in the plot in addition to summary statistics (such as mean, median, dispersion, and so on). This is possible with the stripchart() function.

Let’s take the airquality$Ozone column as an example. You could display only its mean in a bar chart, but this would not be very informative:

barplot(mean(airquality$Ozone, na.rm=TRUE))

A boxplot would be slightly better:

boxplot(airquality$Ozone)

The best option is to include the raw data in the plot with the stripchart() function:

boxplot(airquality$Ozone)
stripchart(
    airquality$Ozone,
    vertical=TRUE,
    add=TRUE,
    pch=19,
    col=rgb(0, 0, 0, 0.25)
)

Note that add=TRUE must be passed if you want to add the points from stripchart() to an existing plot (otherwise, the function creates a new plot). You could make further improvements here, such as jittering the points a little with method="jitter".
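
A sketch of the jittered variant could look as follows. Two assumptions go beyond the text: outline=FALSE prevents boxplot() from drawing outliers that stripchart() draws anyway, and ylim is then set explicitly so that all points fit into the figure. The jitter amount of 0.1 is an arbitrary choice:

```r
set.seed(1)  # jitter is random; fix the seed for a reproducible figure
ozone_range = range(airquality$Ozone, na.rm=TRUE)

boxplot(airquality$Ozone, outline=FALSE, ylim=ozone_range)
stripchart(
    airquality$Ozone,
    vertical=TRUE,
    add=TRUE,
    method="jitter",  # spread points horizontally to reduce overlap
    jitter=0.1,  # amount of jitter
    pch=19,
    col=rgb(0, 0, 0, 0.25)
)
```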

Here is a complete example with boxplots for each month and the underlying raw data:

with(airquality, boxplot(Ozone ~ Month))
stripchart(
    airquality$Ozone ~ airquality$Month,
    vertical=TRUE,
    add=TRUE,
    pch=19,
    col=rgb(0, 0, 0, 0.25)
)

Combining multiple plots

We can display multiple plots side by side (or in any rectangular grid) using the global mfrow or mfcol parameters. Specifically, we set the desired parameter to a vector with two elements, which indicates the number of rows and columns to reserve for subsequent plots. For example, mfrow=c(3, 2) corresponds to three rows and two columns, resulting in a total of six plots. Then we can create the corresponding number of plots with various functions such as plot(), hist(), boxplot(), and so on. We use mfrow if we want to fill the grid row by row or mfcol if we want to fill it column by column.

par(mfrow=c(1, 2))  # 1 row, 2 columns
with(airquality, plot(Wind, Ozone, main="Ozone and Wind", pch=20))  # plot 1
with(airquality, plot(Solar.R, Ozone, main="Ozone and Solar Radiation", pch=20))  # plot 2

Once the plots are complete, it is advisable to reset the global parameter so that the next plot consists of a single visualization:

par(mfrow=c(1, 1))

We can also use the layout() function to create more complex layouts. Here, we specify a matrix that contains the identifiers of the plots to be displayed. For example, to display three plots in two rows and two columns, with the first plot spanning both columns in the first row, we define the matrix as follows:

(grid = matrix(c(1, 1, 2, 3), nrow=2, ncol=2, byrow=TRUE))

     [,1] [,2]
[1,]    1    1
[2,]    2    3

Now we apply this layout using the layout() function and create the three plots one after the other:

layout(grid)
plot(
    airquality$date,
    airquality$Temp,
    type="l",
    xlab="",
    ylab="Temperature (°F)",
    main="Temperature"
)
plot(
    factor(airquality$Month),
    main="Measurements per month",
    xlab="Month",
    ylab="Count",
    col="lightblue"
)
plot(
    airquality$Temp,
    airquality$Ozone,
    type="n",
    xlab="Temperature (°F)",
    ylab="Ozone (ppb)",
    main="Ozone vs. Temperature"
)
abline(lm(Ozone ~ Temp, airquality), col="orange", lwd=2)
points(airquality$Temp, airquality$Ozone, pch=16, col=rgb(0, 0, 0, 0.5))

Note

Specifying the color as col=rgb(0, 0, 0, 0.5) in the last example defines the color black using the first three values (RGB, i.e., red, green, and blue) and the transparency using the fourth value (1 means opaque and 0 means completely transparent – 0.5 is therefore semi-transparent).
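
Colors created this way are ordinary character strings, so they can be stored and reused. As a small sketch, the following lines show the string produced by rgb() and one way to make a named color semi-transparent with adjustcolor() (the variable names are only illustrative):

```r
# rgb() returns a hex string; the last two digits encode the transparency
semi_black = rgb(0, 0, 0, 0.5)
semi_black

# adjustcolor() adds transparency to an existing (e.g., named) color
semi_orange = adjustcolor("orange", alpha.f=0.5)
semi_orange
```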

After creating the plot, you should reset the parameter, either as shown above with par(mfrow=c(1, 1)) or with:

layout(1)
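
The layout() function also accepts widths and heights arguments for rows and columns of different sizes, and layout.show() previews the resulting regions. A minimal sketch, assuming we want the second row to be twice as tall as the first:

```r
grid = matrix(c(1, 1, 2, 3), nrow=2, ncol=2, byrow=TRUE)

# Relative row heights 1:2; layout() returns the number of figures
n = layout(grid, heights=c(1, 2))
layout.show(n)  # outline the regions and their numbers

layout(1)  # back to a single plot per figure
```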

Extending base plotting with `tinyplot`

The base plotting system is a powerful tool, but it can feel a bit clunky at times. If you want to add more advanced features to your plots, consider using the tinyplot package. It is designed as a drop-in replacement with no additional dependencies, and it allows you to create more complex plots with less code. For example, the scatter plot with regression line from above can be created as follows:

library(tinyplot)

tinytheme("minimal")  # optionally set a theme

plt(
    Ozone ~ Wind,
    data=airquality,
    alpha=0.5,
    main="Ozone and Wind in NYC",
    sub="May–September 1973"
)
plt_add(type="lm", col="blue", lwd=2)

See the tinyplot documentation for more information and examples.

Exercises

1. Load the penguins dataset from the palmerpenguins package and create a scatter plot with the bill_length_mm column on the x-axis and the bill_depth_mm column on the y-axis. Label the axes meaningfully!
2. Recreate the scatter plot from Exercise 1, but this time display the points of the three species in different colors and add an appropriate legend. You can, for example, first create an empty plot with the argument type="n" and then use points() to add data points for the three species in different colors.
3. Inspect the ToothGrowth dataset (make sure to read its documentation) and create a meaningful plot. Use functions we have discussed in this session (e.g., plot(), hist(), or boxplot()). Multiple plots per figure are also encouraged (using par(mfrow=...) or layout())!
4. Use the mtcars dataset and create a boxplot of the mpg variable depending on cyl. Which vehicles consume more or less fuel? Pay attention to the correct interpretation of fuel consumption in MPG (miles per gallon)!
5. Combine the following three plots in a single figure using the mtcars dataset:
    1. Scatter plot of mpg against drat
    2. Boxplot of mpg depending on cyl (see Exercise 4)
    3. Histogram of mpg

    Use a suitable arrangement of the three plots (e.g., using layout())!
