4 – Tables

Introduction

Vectors represent one-dimensional data such as a sequence of numbers. In practice, data is often available as a table consisting of two dimensions (rows and columns). R features two data types suitable to represent tabular data: matrix (or more generally array) and data.frame. Whereas a matrix is basically just a slightly enhanced vector (remember that all elements must have the same data type), data frames are more versatile, because they can hold columns with different data types.

Matrix

Relationship to vectors

Matrices are vectors with a special dimension attribute. We can access this attribute with the dim() function:

(v = 1:20)

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

dim(v)

NULL

length(v)

[1] 20

A plain vector does not have a dimension attribute, so dim(v) returns NULL. Every vector does have a length, however, and length(v) yields the number of elements in that vector.

We can now set the dimension attribute to the desired number of rows and columns (the product of the number of rows and columns must be equal to the total number of elements):

dim(v) = c(4, 5)  # 4 rows, 5 columns
dim(v)

[1] 4 5

attributes(v)

$dim
[1] 4 5

v

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

class(v)

[1] "matrix" "array"

This example demonstrates that the underlying data does not change when we modify the dimension attribute. In fact, it is just the representation (or interpretation) of the data that is different.
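To make this concrete, here is a small sketch (the variable w is only illustrative) that gives the same 20 values two different shapes and then removes the dimension attribute again. The values themselves should stay identical throughout.

```r
v = 1:20
dim(v) = c(4, 5)        # interpret the 20 values as 4 rows and 5 columns

w = v
dim(w) = c(2, 10)       # same values, now 2 rows and 10 columns

# The underlying values are identical; only the dim attribute differs
identical(as.vector(v), as.vector(w))

dim(v) = NULL           # removing the attribute gives back a plain vector
is.matrix(v)
```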

Creating matrices

In addition to changing the dimension attribute of a vector, we can directly create a matrix with the matrix() function:

(m = matrix(1:20, 4, 5))

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

The first argument specifies the data (a vector), whereas the second and third arguments set the number of rows and columns of the matrix, respectively. By default, the data is pushed into the matrix column by column. Alternatively, byrow=TRUE switches to row-wise creation:

matrix(1:20, 4, 5, byrow=TRUE)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20

Note

A matrix is an array with two dimensions. We could also create arrays with an arbitrary number of dimensions, but these are rarely needed in statistical data analysis.
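As a brief illustration of an array with more than two dimensions, the following sketch uses array() with an arbitrary example shape of 2 rows, 3 columns, and 4 layers. Indexing then needs one index per dimension.

```r
# A three-dimensional array: 2 rows, 3 columns, 4 layers
a = array(1:24, dim=c(2, 3, 4))
dim(a)

a[2, 3, 4]   # one index per dimension; this is the last element
a[, , 1]     # the first layer is an ordinary 2 x 3 matrix
```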

Row and column names

Similar to a named vector, we can provide names for rows and columns of a matrix:

rownames(m) = c("w", "x", "y", "z")
colnames(m) = c("A", "B", "C", "D", "E")
m

  A B  C  D  E
w 1 5  9 13 17
x 2 6 10 14 18
y 3 7 11 15 19
z 4 8 12 16 20
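Row and column names can also be supplied when the matrix is created, using the dimnames argument of matrix(). The sketch below uses a separate variable m2 (a name chosen only for this example) so that m stays untouched.

```r
m2 = matrix(
    1:20, 4, 5,
    dimnames=list(c("w", "x", "y", "z"), c("A", "B", "C", "D", "E"))
)
rownames(m2)
colnames(m2)
dimnames(m2)  # a list holding the row names first, then the column names
```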

Indexing

Accessing individual elements of a matrix works just like with vectors. However, because matrices consist of rows and columns, we need to provide two indices (separated by a comma) inside the square brackets: a row index and a column index.

Either index can also be omitted, which is a concise way to access an entire row (when omitting the column index) or an entire column (when omitting the row index). Here are some examples:

m[1, 4]  # row 1, column 4

[1] 13

m[, 3]  # column 3

 w  x  y  z 
 9 10 11 12

m[3,]  # row 3

 A  B  C  D  E 
 3  7 11 15 19

m[c(2, 4),]  # rows 2 and 4

  A B  C  D  E
x 2 6 10 14 18
z 4 8 12 16 20

m[c(1, 3), c(1, 2, 5)]

  A B  E
w 1 5 17
y 3 7 19

m[, "C"]  # column C

 w  x  y  z 
 9 10 11 12

m[m[, "A"] > 2,]  # rows where column A > 2

  A B  C  D  E
y 3 7 11 15 19
z 4 8 12 16 20
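Two further indexing details are worth trying out. Selecting a single row or column returns a plain vector by default, whereas drop=FALSE keeps the result a matrix, and negative indices exclude rows or columns. The variable names col3 and col3_matrix in this sketch are only illustrative.

```r
m = matrix(1:20, 4, 5)

col3 = m[, 3]                      # a single column is simplified to a plain vector
dim(col3)                          # NULL, because the result is not a matrix

col3_matrix = m[, 3, drop=FALSE]   # drop=FALSE keeps the result a 4 x 1 matrix
dim(col3_matrix)

m[-1, ]                            # a negative index excludes row 1
```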

Coercion

What happens if we try to add a new column of type character to an existing (numeric) matrix? Just like with vectors, R coerces all matrix elements to a type that can accommodate both numbers and characters (here, character, which is why every value in the output is quoted):

subjects = c("Joe", "Tracy", "Steven", "Chloe")
cbind(subjects, m)

  subjects A   B   C    D    E   
w "Joe"    "1" "5" "9"  "13" "17"
x "Tracy"  "2" "6" "10" "14" "18"
y "Steven" "3" "7" "11" "15" "19"
z "Chloe"  "4" "8" "12" "16" "20"

This example also demonstrates the use of cbind(), which binds matrices or vectors by columns (in the previous example, we combine the vector subjects with the matrix m into a new, larger matrix). Analogously, rbind() combines matrices or vectors by rows.
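The coercion can be checked directly with typeof(), and rbind() can be tried in the same way as cbind(). This is a minimal sketch; the extra row 21:25 and the variable r are made up for illustration.

```r
m = matrix(1:20, 4, 5)
subjects = c("Joe", "Tracy", "Steven", "Chloe")

typeof(m)                      # "integer"
typeof(cbind(subjects, m))     # "character": every element was coerced

# rbind() appends rows instead of columns
r = rbind(m, 21:25)
dim(r)                         # 5 rows, 5 columns
```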

Working with matrices

Operations with matrices are performed elementwise (again, just like with vectors). In addition, there are several useful functions that only make sense with tabular data, for example to compute sums and means across rows or columns:

rowSums(m)

 w  x  y  z 
45 50 55 60

colSums(m)

 A  B  C  D  E 
10 26 42 58 74

rowMeans(m)

 w  x  y  z 
 9 10 11 12

colMeans(m)

   A    B    C    D    E 
 2.5  6.5 10.5 14.5 18.5
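The elementwise behavior of arithmetic operators is easiest to see with a small matrix. The sketch below uses a hypothetical 2 x 2 matrix a and also shows that a matrix product in the linear algebra sense requires the separate %*% operator.

```r
a = matrix(1:4, 2, 2)

a + 10      # adds 10 to every element
a * a       # multiplies corresponding elements (not a matrix product)
a %*% a     # the matrix product uses the separate %*% operator
```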

A matrix is most useful for purely numerical data. Since matrices are vectors, they can only accommodate elements of the same data type. In practice, we often need to deal with variables of different types, and this is where data frames really shine.

Data frames

Like matrices, data frames are two-dimensional data structures consisting of rows and columns. In contrast to a matrix, a data frame can contain columns of different data types. For example, a data frame can consist of numerical columns, character columns, categorical columns, and so on. We can think of a data frame as a collection of columns, where each column is represented by a vector.

Note

Technically, data frames are lists of vectors, where each vector corresponds to a column. We will not cover lists in this course and focus only on data frames instead.
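Without going into lists themselves, the following sketch shows how this structure can be observed. The data frame d is a made-up example, and the reported column types assume R 4.0 or later, where character vectors are not converted to factors by default.

```r
d = data.frame(x=1:3, y=c("a", "b", "c"))

is.list(d)        # a data frame is a list ...
length(d)         # ... whose length is the number of columns
sapply(d, class)  # each column is a vector with its own type
```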

Creating data frames

The function data.frame() creates a data frame from a set of vectors, which are usually passed as named arguments as follows:

data.frame(x=1:5, id=c("X", "c1", "V", "RR", "7G"), value=c(12, 18, 19, 3, 8))

  x id value
1 1  X    12
2 2 c1    18
3 3  V    19
4 4 RR     3
5 5 7G     8

This automatically assigns column names corresponding to argument names.

Without named arguments, data.frame() works similarly to cbind() and combines all arguments by column:

(df = data.frame(subjects, m))

  subjects A B  C  D  E
w      Joe 1 5  9 13 17
x    Tracy 2 6 10 14 18
y   Steven 3 7 11 15 19
z    Chloe 4 8 12 16 20

Like with matrices, we can use colnames() to get and set column names:

colnames(df)

[1] "subjects" "A"        "B"        "C"        "D"        "E"

colnames(df) = c("patient", "age", "weight", "bp", "rating", "test")
df

  patient age weight bp rating test
w     Joe   1      5  9     13   17
x   Tracy   2      6 10     14   18
y  Steven   3      7 11     15   19
z   Chloe   4      8 12     16   20

Although rownames() works with data frames, renaming rows is rarely ever needed and should be avoided.
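A few related base R functions are handy alongside colnames(). This sketch uses a small made-up data frame d to show names(), nrow(), ncol(), and dim().

```r
d = data.frame(patient=c("Joe", "Tracy"), age=c(34, 17))

names(d)     # for a data frame, this gives the same result as colnames(d)
nrow(d)      # number of rows
ncol(d)      # number of columns
dim(d)       # both at once: rows first, then columns
```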

Printing data frames

The three functions str(), head(), and tail() are essential for getting a quick overview of a data frame. In particular, str() summarizes the structure of an object:

df = data.frame(
    patient=c("Joe", "Tracy", "Steven", "Chloe"),
    age=c(34, 17, 26, 44),
    weight=c(77, 60, 83, 64),
    height=c(175, 169, 185, 170)
)
str(df)

'data.frame':   4 obs. of  4 variables:
 $ patient: chr  "Joe" "Tracy" "Steven" "Chloe"
 $ age    : num  34 17 26 44
 $ weight : num  77 60 83 64
 $ height : num  175 169 185 170

Note

In the previous example, we created a data frame and used line breaks for better readability. If a command gets too long, consider inserting line breaks at suitable locations (such as after each argument).

The function head() displays the first six rows of a data frame, whereas tail() displays the last six rows. If you want to show a different number of rows, pass the desired value as the argument n.

The following example shows these functions in action for a long data frame:

l = data.frame(a=rnorm(5000), b=rpois(5000, 2), x=rep(letters, length.out=5000))
str(l)

'data.frame':   5000 obs. of  3 variables:
 $ a: num  0.644 -0.305 -1.527 -1.183 -0.548 ...
 $ b: int  3 1 1 0 1 2 0 0 2 1 ...
 $ x: chr  "a" "b" "c" "d" ...

head(l)

           a b x
1  0.6442736 3 a
2 -0.3050923 1 b
3 -1.5265347 1 c
4 -1.1833522 0 d
5 -0.5484042 1 e
6 -0.7408292 2 f

tail(l, n=4)

               a b x
4997  0.08232772 1 e
4998  0.89888686 2 f
4999  0.56983446 2 g
5000 -2.10436755 0 h
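Because the data frame above contains random numbers, its output differs on every run. The following sketch uses a deterministic data frame d (made up for this example) to show the argument n in more detail, including negative values, which drop rows from the opposite end instead of selecting them.

```r
d = data.frame(id=1:20, square=(1:20)^2)

head(d, n=3)     # first 3 rows
tail(d, n=2)     # last 2 rows
head(d, n=-15)   # a negative n drops the last 15 rows, so rows 1 to 5 remain
```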

Tip

The View() function will show the entire data frame in a spreadsheet. This is very convenient, but note that this view is read-only!

Indexing

We can access individual columns of a data frame with $ followed by the column name:

df$patient

[1] "Joe"    "Tracy"  "Steven" "Chloe"

df$weight

[1] 77 60 83 64

Note that the result is a vector (because columns in a data frame are vectors).

This syntax also works for adding new columns. For example, we can add a new column called new as follows:

df$new = c("yes", "no", "no", "yes")
df

  patient age weight height new
1     Joe  34     77    175 yes
2   Tracy  17     60    169  no
3  Steven  26     83    185  no
4   Chloe  44     64    170 yes

Note

Alternatively, we can use cbind() and rbind() to add new columns and rows, respectively.
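As a minimal sketch of this alternative, the following example adds one column and one row to a small data frame d. The data frame, the column smoker, and the added values are all made up for illustration.

```r
d = data.frame(patient=c("Joe", "Tracy"), age=c(34, 17))

# Add a column: the new vector needs one value per existing row
d = cbind(d, smoker=c(FALSE, TRUE))

# Add a row: the new row is a one-row data frame with matching column names
d = rbind(d, data.frame(patient="Steven", age=26, smoker=FALSE))
d
```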

Columns in a data frame can also be removed by assigning the special value NULL:

df$new = NULL  # remove column "new"
df

  patient age weight height
1     Joe  34     77    175
2   Tracy  17     60    169
3  Steven  26     83    185
4   Chloe  44     64    170

Tip

We can also access individual columns as follows (note the double square brackets):

df[["patient"]]

[1] "Joe"    "Tracy"  "Steven" "Chloe"

df[["height"]]

[1] 175 169 185 170

If we would like to access a subset of the data frame consisting of arbitrary rows and columns, we use indexing within square brackets. Similar to matrices, we need to provide both row and column indexes (separated by a comma). Omitting the row or column index selects all elements in that dimension. The following examples show how to grab entire rows of a data frame:

df[1,]

  patient age weight height
1     Joe  34     77    175

df[2:3,]

  patient age weight height
2   Tracy  17     60    169
3  Steven  26     83    185

Similarly, here is how we can get entire columns:

df[, 1]

[1] "Joe"    "Tracy"  "Steven" "Chloe"

df[, 4]

[1] 175 169 185 170

In addition to column numbers (positions), we can also provide column names (within quotes):

df[, "patient"]

[1] "Joe"    "Tracy"  "Steven" "Chloe"

df[, "height"]

[1] 175 169 185 170

Combining both row and column indexes is also entirely possible:

df[1:2, c(1, 3:4)]

  patient weight height
1     Joe     77    175
2   Tracy     60    169
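Just like with matrices, the row index can also be a logical vector, which selects rows based on a condition. The sketch below recreates the data frame df from this section so that it runs on its own; the chosen thresholds are arbitrary.

```r
df = data.frame(
    patient=c("Joe", "Tracy", "Steven", "Chloe"),
    age=c(34, 17, 26, 44),
    weight=c(77, 60, 83, 64),
    height=c(175, 169, 185, 170)
)

df[df$age > 30, ]                         # rows for patients older than 30
df[df$age > 30, c("patient", "age")]      # same rows, but only two columns
df[df$height >= 170 & df$weight < 80, ]   # conditions can be combined with &
```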

Tibbles

Data frames are one of the most widely used data types in R. However, getting a quick and clear overview of their contents can be cumbersome for two reasons. First, R prints a large number of rows (often all of them), which makes it difficult to get an overview of the data. Second, if a data frame has many columns, R will usually print all of them even if they do not fit within the width of the R console.

Tip

To illustrate the problem with displaying a data frame with many rows, try displaying a data frame with 2000 rows and 5 columns:

data.frame(matrix(1:10000, 2000, 5))

Similarly, display the following data frame with many columns (this one consists of 100 rows and 100 columns):

data.frame(matrix(1:10000, 100, 100))

Another minor annoyance with data frames is that column types are only displayed when explicitly calling str() – they are not shown in the normal representation of the data frame.

The tibble package addresses these (and other) issues by providing an “extended” data frame type – the so-called tibble. A tibble is a drop-in replacement for a data frame, which means that it can be used everywhere a data frame is expected (because it basically is a data frame with some special behavior).

In contrast to data frames, tibbles are not part of base R, so we need to install the tibble package (once, for example with install.packages("tibble")) and then activate it in every session:

library(tibble)

To create a tibble, we can use tibble() instead of data.frame():

(t = tibble(
    subjects=c("Hans", "Birgit", "Ferdinand", "Johanna"),
    A=1:4,
    B=5:8,
    C=9:12,
    D=13:16,
    E=17:20
))

# A tibble: 4 × 6
  subjects      A     B     C     D     E
  <chr>     <int> <int> <int> <int> <int>
1 Hans          1     5     9    13    17
2 Birgit        2     6    10    14    18
3 Ferdinand     3     7    11    15    19
4 Johanna       4     8    12    16    20

Displaying a tibble includes column types and also uses subtle colors in other places (at least in the R console) to make it easier to view the associated data values. In this example, <chr> means character and <int> refers to integer (a numeric vector consisting of integer numbers only). The first line of the output also includes the dimensions of the tibble.

Let’s take a look at the airquality data frame, which is part of base R:

str(airquality)

'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

Because all 153 rows of this data frame are displayed when inspecting the object (by typing airquality in the console), we need to use head() or tail() to get a glimpse of the data (in addition to str()). This is typical for a data frame.

We can convert any data frame to a tibble with as_tibble():

(airquality_tibble = as_tibble(airquality))

# A tibble: 153 × 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    NA      NA  14.3    56     5     5
 6    28      NA  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    NA     194   8.6    69     5    10
# ℹ 143 more rows

As discussed previously, tibbles have a much nicer representation:

- Only the first 10 rows are shown.
- Column data types are included in the output.
- The dimensions are listed at the top of the table.
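That a tibble really is a data frame with some special behavior can be checked with a few base R functions. This sketch assumes the tibble package is installed; the variable name air is only illustrative.

```r
library(tibble)  # assumes the tibble package is installed

air = as_tibble(airquality)

class(air)            # includes "data.frame" alongside the tibble classes
is.data.frame(air)    # TRUE, so functions expecting a data frame accept it
dim(air)              # same dimensions as the original data frame
colMeans(air[, c("Wind", "Temp")])  # base R functions for data frames still work
```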

If you want to view all (or any other number of) rows of a tibble, use print() with a suitable argument n (the number of displayed rows):

print(airquality_tibble, n=Inf)  # n=Inf shows all rows

As discussed previously, View() creates a spreadsheet-like view of the data, which is useful for interactive exploration:

View(airquality_tibble)  # opens a spreadsheet

Exercises

Create a vector u consisting of integers from 50 to 99 and a vector v with integers from 0 to –49 (in descending order). Next, convert both u and v into two matrices with 10 rows each. Finally, create a new variable r containing both u and v concatenated horizontally (columnwise).
Answer the following questions about the matrix r from the previous exercise:
- What are the dimensions of r?
- How many elements does r have in total?
- What is the element in row 7 and column 9?
- What are the row and column means?
- What is the mean of elements in rows 3–7 and columns 4–6?
Create a data frame or tibble df with 10 rows and 3 columns with the following contents:
- The first column, called name, contains the names Ben, Emma, Frank, Mia, Paul, Hannah, Lucas, Sophia, Jonathan, and Emily.
- The second column, called gender, contains the gender of each person (abbreviated with m or f).
- The third column, called age, contains the age of each person (a number between 10 and 90).
Based on df from the previous exercise, create two data frames df_f and df_m with only female and male persons, respectively.
List four ways to access the first column (called name) of df_f created in the previous exercise.
Create a data frame mtcars1 based on the built-in dataset mtcars, where mtcars1 should only contain those rows of mtcars with values in column mpg greater than 25. Determine the number of rows for mtcars and mtcars1.
There are some differences between data frames and tibbles when it comes to indexing. Try to pinpoint the different behavior using the built-in airquality data frame and air = as_tibble(airquality). Compare the results when indexing the Ozone column using each of these four methods:
- [, 1]
- [, "Ozone"]
- $Ozone
- [["Ozone"]]
Which object (data.frame or tibble) is more consistent in terms of indexing?

Tip

A comprehensive overview of data frame subsetting operations is available here.