Vectors represent one-dimensional data such as a sequence of numbers. In practice, data is often available as a table consisting of two dimensions (rows and columns). R features two data types suitable to represent tabular data: matrix (or more generally array) and data.frame. Whereas a matrix is basically just a slightly enhanced vector (remember that all elements must have the same data type), data frames are more versatile, because they can hold columns with different data types.
Matrices are vectors with a special dimension attribute. We can access this attribute with the dim() function:
(v = 1:20)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
dim(v)
NULL
length(v)
[1] 20
A plain old vector does not have a dimension attribute, which is why we see NULL here. Of course, any vector has a length, which is why length(v) yields the number of elements in that vector.
We can now set the dimension attribute to the desired number of rows and columns (the product of the number of rows and columns must be equal to the total number of elements):
dim(v) = c(4, 5) # 4 rows, 5 columns
dim(v)
[1] 4 5
attributes(v)
$dim
[1] 4 5
v
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
class(v)
[1] "matrix" "array"
This example demonstrates that the underlying data does not change when we modify the dimension attribute. In fact, it is just the representation (or interpretation) of the data that is different.
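As a quick check of this point, the following sketch sets and then removes the dimension attribute and compares the result with the original data. The vector w and the variable result are hypothetical names introduced only for this illustration. The last lines show that R rejects dimensions whose product does not match the length of the vector.

```r
w = 1:20                # hypothetical example vector
dim(w) = c(4, 5)        # interpret the data as a 4 x 5 matrix
is.matrix(w)            # TRUE
dim(w) = NULL           # drop the dimension attribute again
identical(w, 1:20)      # TRUE: the underlying data never changed

# 3 * 5 is not 20, so this assignment fails; try() captures the error
# (inside a function call we use <- because = would name an argument)
result = try(dim(w) <- c(3, 5), silent=TRUE)
inherits(result, "try-error")  # TRUE
```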
In addition to changing the dimension attribute of a vector, we can directly create a matrix with the matrix() function:
(m = matrix(1:20, 4, 5))
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
The first argument specifies the data (a vector), whereas the second and third arguments set the number of rows and columns of the matrix, respectively. By default, the data is pushed into the matrix column by column. Alternatively, byrow=TRUE switches to row-wise creation:
matrix(1:20, 4, 5, byrow=TRUE)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
A matrix is an array with two dimensions. We could also create arrays with an arbitrary number of dimensions, but these are rarely needed in statistical data analysis.
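For completeness, here is a minimal sketch of an array with three dimensions, created with the base R function array(). The name a and the chosen dimensions are assumptions made for this example only.

```r
a = array(1:24, dim=c(2, 3, 4))  # 2 rows, 3 columns, 4 "layers"
dim(a)                # 2 3 4
a[2, 3, 4]            # 24, the last element (one index per dimension)
a[, , 1]              # the first layer
is.matrix(a[, , 1])   # TRUE: a single layer is an ordinary 2 x 3 matrix
```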
Similar to a named vector, we can provide names for rows and columns of a matrix:
rownames(m) = c("w", "x", "y", "z")
colnames(m) = c("A", "B", "C", "D", "E")
m
A B C D E
w 1 5 9 13 17
x 2 6 10 14 18
y 3 7 11 15 19
z 4 8 12 16 20
Accessing individual elements of a matrix works just like with vectors. However, because matrices consist of rows and columns, we need to provide two indices (separated by a comma) inside the square brackets: a row index and a column index.
Either index can also be omitted, which is a concise way to access an entire row (when omitting the column index) or an entire column (when omitting the row index). Here are some examples:
m[1, 4]  # row 1, column 4
[1] 13
m[, 3]  # column 3
w x y z
9 10 11 12
m[3,]  # row 3
A B C D E
3 7 11 15 19
m[c(2, 4),]  # rows 2 and 4
A B C D E
x 2 6 10 14 18
z 4 8 12 16 20
m[c(1, 3), c(1, 2, 5)]
A B E
w 1 5 17
y 3 7 19
m[, "C"]  # column C
w x y z
9 10 11 12
m[m[, "A"] > 2,]  # rows where column A > 2
A B C D E
y 3 7 11 15 19
z 4 8 12 16 20
What happens if we try to add a new column of type character to an existing (numeric) matrix? Just like with vectors, R coerces all matrix elements to the type that can accommodate both numbers and characters:
subjects = c("Joe", "Tracy", "Steven", "Chloe")
cbind(subjects, m)
subjects A B C D E
w "Joe" "1" "5" "9" "13" "17"
x "Tracy" "2" "6" "10" "14" "18"
y "Steven" "3" "7" "11" "15" "19"
z "Chloe" "4" "8" "12" "16" "20"
This example also demonstrates the use of cbind(), which binds matrices or vectors by columns (in the previous example, we combine the vector subjects with the matrix m into a new, larger matrix). Analogously, rbind() combines matrices or vectors by rows.
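The following sketch contrasts the two functions on a small matrix. The names m2, r2, and c2 are hypothetical and not used elsewhere in this text.

```r
m2 = matrix(1:6, 2, 3)          # hypothetical 2 x 3 matrix
r2 = rbind(m2, c(10, 20, 30))   # append a vector as a third row
dim(r2)                         # 3 3
c2 = cbind(m2, c(10, 20))       # append a vector as a fourth column
dim(c2)                         # 2 4
```

Note that the appended vector must match the number of columns (for rbind()) or rows (for cbind()) of the matrix.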
Operations with matrices are performed elementwise (again, just like with vectors). In addition, there are several useful functions that only make sense with tabular data, for example to compute sums and means across rows or columns:
rowSums(m)
w x y z
45 50 55 60
colSums(m)
A B C D E
10 26 42 58 74
rowMeans(m)
w x y z
9 10 11 12
colMeans(m)
A B C D E
2.5 6.5 10.5 14.5 18.5
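To illustrate elementwise operations, here is a small sketch with a hypothetical matrix m3 (a name chosen only for this example). Each operation is applied to every element separately.

```r
m3 = matrix(1:6, 2, 3)  # hypothetical 2 x 3 matrix
m3 * 2                  # every element doubled
m3 + m3                 # elementwise sum, same values as m3 * 2
m3 * m3                 # elementwise product (the matrix product uses %*% instead)
all(colMeans(m3) == colSums(m3) / nrow(m3))  # TRUE
```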
A matrix is most useful for purely numerical data. Since matrices are vectors, they can only accommodate elements of the same data type. In practice, we often need to deal with variables of different types, and this is where data frames really shine.
Like matrices, data frames are two-dimensional data structures consisting of rows and columns. In contrast to a matrix, a data frame can contain columns of different data types. For example, a data frame can consist of numerical columns, character columns, categorical columns, and so on. We can think of a data frame as a collection of columns, where each column is represented by a vector.
Technically, data frames are lists of vectors, where each vector corresponds to a column. We will not cover lists in this course and focus only on data frames instead.
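Without going into lists themselves, a short sketch can confirm this structure. The data frame d is a hypothetical example; the calls are all base R.

```r
d = data.frame(x=1:3, y=c("a", "b", "c"))  # hypothetical example
is.list(d)    # TRUE: a data frame is a list under the hood
length(d)     # 2, the number of columns
class(d$x)    # "integer"
class(d$y)    # "character"
```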
The function data.frame() creates a data frame from a set of vectors, which are usually passed as named arguments as follows:
data.frame(x=1:5, id=c("X", "c1", "V", "RR", "7G"), value=c(12, 18, 19, 3, 8))
x id value
1 1 X 12
2 2 c1 18
3 3 V 19
4 4 RR 3
5 5 7G 8
This automatically assigns column names corresponding to argument names.
Without named arguments, data.frame() works similarly to cbind(), combining all arguments by column:
(df = data.frame(subjects, m))
subjects A B C D E
w Joe 1 5 9 13 17
x Tracy 2 6 10 14 18
y Steven 3 7 11 15 19
z Chloe 4 8 12 16 20
As with matrices, we can use colnames() to get and set column names:
colnames(df)
[1] "subjects" "A" "B" "C" "D" "E"
colnames(df) = c("patient", "age", "weight", "bp", "rating", "test")
df
patient age weight bp rating test
w Joe 1 5 9 13 17
x Tracy 2 6 10 14 18
y Steven 3 7 11 15 19
z Chloe 4 8 12 16 20
Although rownames() works with data frames, renaming rows is rarely needed and should be avoided.
The three functions str(), head(), and tail() are essential for getting a quick overview of a data frame. In particular, str() summarizes the structure of an object:
df = data.frame(
    patient=c("Joe", "Tracy", "Steven", "Chloe"),
    age=c(34, 17, 26, 44),
    weight=c(77, 60, 83, 64),
    height=c(175, 169, 185, 170)
)
str(df)
'data.frame': 4 obs. of 4 variables:
$ patient: chr "Joe" "Tracy" "Steven" "Chloe"
$ age : num 34 17 26 44
$ weight : num 77 60 83 64
$ height : num 175 169 185 170
In the previous example, we created a data frame and used line breaks for better readability. If a command gets too long, consider inserting line breaks at suitable locations (such as after each argument).
The function head() displays the first six rows of a data frame, whereas tail() displays the last six rows. If you want to show a different number of rows, pass the desired value as the argument n.
The following example shows these functions in action for a long data frame:
l = data.frame(a=rnorm(5000), b=rpois(5000, 2), x=rep(letters, length.out=5000))
str(l)
'data.frame': 5000 obs. of 3 variables:
$ a: num 0.644 -0.305 -1.527 -1.183 -0.548 ...
$ b: int 3 1 1 0 1 2 0 0 2 1 ...
$ x: chr "a" "b" "c" "d" ...
head(l)
a b x
1 0.6442736 3 a
2 -0.3050923 1 b
3 -1.5265347 1 c
4 -1.1833522 0 d
5 -0.5484042 1 e
6 -0.7408292 2 f
tail(l, n=4)
a b x
4997 0.08232772 1 e
4998 0.89888686 2 f
4999 0.56983446 2 g
5000 -2.10436755 0 h
The View() function will show the entire data frame in a spreadsheet. This is very convenient, but note that this view is read-only!
We can access individual columns of a data frame with $ followed by the column name:
df$patient
[1] "Joe" "Tracy" "Steven" "Chloe"
df$weight
[1] 77 60 83 64
Note that the result is a vector (because columns in a data frame are vectors).
This syntax also works for adding new columns. For example, we can add a new column called new as follows:
df$new = c("yes", "no", "no", "yes")
df
patient age weight height new
1 Joe 34 77 175 yes
2 Tracy 17 60 169 no
3 Steven 26 83 185 no
4 Chloe 44 64 170 yes
Alternatively, we can use cbind() and rbind() to add new columns and rows, respectively.
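A minimal sketch of both functions applied to a data frame follows. It uses a hypothetical data frame d so that the example is self-contained. For rbind(), the new row is passed as a one-row data frame with matching column names, which keeps the column types intact.

```r
d = data.frame(patient=c("Joe", "Tracy"), age=c(34, 17))  # hypothetical example
d = cbind(d, weight=c(77, 60))   # add a column named weight
d = rbind(d, data.frame(patient="Steven", age=26, weight=83))  # add a row
dim(d)                           # 3 3
```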
Columns in a data frame can also be removed by assigning the special value NULL:
df$new = NULL  # remove column "new"
df
patient age weight height
1 Joe 34 77 175
2 Tracy 17 60 169
3 Steven 26 83 185
4 Chloe 44 64 170
We can also access individual columns as follows (note the double square brackets):
df[["patient"]]
[1] "Joe" "Tracy" "Steven" "Chloe"
df[["height"]]
[1] 175 169 185 170
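The two access styles return the same vector, as the following sketch with a hypothetical data frame d suggests. One practical difference is that double square brackets also accept a column name stored in a variable (here the hypothetical variable col).

```r
d = data.frame(patient=c("Joe", "Tracy"), age=c(34, 17))  # hypothetical example
identical(d$age, d[["age"]])   # TRUE: both return the same vector
col = "age"                    # column name stored in a variable
d[[col]]                       # 34 17
```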
If we would like to access a subset of the data frame consisting of arbitrary rows and columns, we use indexing within square brackets. Similar to matrices, we need to provide both row and column indexes (separated by a comma). Omitting the row or column index selects all elements in that dimension. The following examples show how to grab entire rows of a data frame:
df[1,]
patient age weight height
1 Joe 34 77 175
df[2:3,]
patient age weight height
2 Tracy 17 60 169
3 Steven 26 83 185
Similarly, here is how we can get entire columns:
df[, 1]
[1] "Joe" "Tracy" "Steven" "Chloe"
df[, 4]
[1] 175 169 185 170
In addition to column numbers (positions), we can also provide column names (within quotes):
df[, "patient"]
[1] "Joe" "Tracy" "Steven" "Chloe"
df[, "height"]
[1] 175 169 185 170
Combining both row and column indexes is also entirely possible:
df[1:2, c(1, 3:4)]
patient weight height
1 Joe 77 175
2 Tracy 60 169
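Just as with matrices, the row index can also be a logical condition. The sketch below rebuilds a small version of the patient data under the hypothetical name d and filters rows by comparing a column against a threshold. The thresholds and the names adults and tall_names are assumptions chosen for illustration.

```r
d = data.frame(
    patient=c("Joe", "Tracy", "Steven", "Chloe"),
    age=c(34, 17, 26, 44),
    height=c(175, 169, 185, 170)
)
adults = d[d$age >= 18, ]                   # rows where age is at least 18
nrow(adults)                                # 3
tall_names = d[d$height > 170, "patient"]   # "Joe" "Steven"
```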
Data frames are one of the most widely used data types in R. However, getting a quick and clear overview of their contents can be cumbersome for two reasons. First, R prints too many data frame rows, which makes it difficult to get an overview of the data. Second, if a data frame has many columns, R will usually print all of them even if they do not fit within the width of the R console.
To illustrate the problem with displaying a data frame with many rows, try displaying a data frame with 2000 rows and 5 columns:
data.frame(matrix(1:10000, 2000, 5))
Similarly, display the following data frame with many columns (this one consists of 100 rows and 100 columns):
data.frame(matrix(1:10000, 100, 100))
Another minor annoyance with data frames is that column types are only displayed when explicitly calling str(); they are not shown in the normal representation of the data frame.
The tibble package addresses these (and other) issues by providing an “extended” data frame type, the so-called tibble. A tibble is a drop-in replacement for a data frame, which means that it can be used everywhere a data frame is expected (because it basically is a data frame with some special behavior).
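The claim that a tibble basically is a data frame can be checked with a short sketch. It assumes that the tibble package is already installed (the guard simply skips the example otherwise), and the object name tb is hypothetical.

```r
# assumes the tibble package is installed; otherwise nothing is run
if (requireNamespace("tibble", quietly=TRUE)) {
    tb = tibble::tibble(x=1:3, y=c("a", "b", "c"))  # hypothetical example tibble
    print(is.data.frame(tb))   # TRUE: a tibble passes the data frame check
    print(class(tb))           # "tbl_df" "tbl" "data.frame"
    print(colMeans(tb["x"]))   # base R functions accept it like a data frame
}
```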
In contrast to data frames, tibbles are not part of base R, so we need to install and activate the tibble package:
library(tibble)
To create a tibble, we can use tibble() instead of data.frame():
(t = tibble(
    subjects=c("Hans", "Birgit", "Ferdinand", "Johanna"),
    A=1:4,
    B=5:8,
    C=9:12,
    D=13:16,
    E=17:20
))
# A tibble: 4 × 6
subjects A B C D E
<chr> <int> <int> <int> <int> <int>
1 Hans 1 5 9 13 17
2 Birgit 2 6 10 14 18
3 Ferdinand 3 7 11 15 19
4 Johanna 4 8 12 16 20
Displaying a tibble includes column types and also uses subtle colors in other places (at least in the R console) to make it easier to view the associated data values. In this example, <chr> means character and <int> refers to integer (a numeric vector consisting of integer numbers only). The first line of the output also includes the dimensions of the tibble.
Let’s take a look at the airquality data frame, which is part of base R:
str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
Because all 153 rows of this data frame are displayed when inspecting the object (by typing airquality in the console), we need to use head() or tail() to get a glimpse of the data (in addition to str()). This is typical for a data frame.
We can convert any data frame to a tibble with as_tibble():
(airquality_tibble = as_tibble(airquality))
# A tibble: 153 × 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
# ℹ 143 more rows
As discussed previously, tibbles have a much nicer representation.
If you want to view all (or any other number of) rows of a tibble, use print() with a suitable argument n (the number of displayed rows):
print(airquality_tibble, n=Inf)  # n=Inf shows all rows
As discussed previously, View() creates a spreadsheet-like view of the data, which is useful for interactive exploration:
View(airquality_tibble)  # opens a spreadsheet
Create a vector u consisting of integers from 50 to 99 and a vector v with integers from 0 to -49 (in descending order). Next, convert both u and v into two matrices with 10 rows each. Finally, create a new variable r containing both u and v concatenated horizontally (columnwise).
Answer the following questions about the matrix r from the previous exercise:
[...] r?
[...] r have in total?
Create a data frame or tibble df with 10 rows and 3 columns with the following contents:
name contains the names Ben, Emma, Frank, Mia, Paul, Hannah, Lucas, Sophia, Jonathan, and Emily.
gender contains the gender of each person (abbreviated with m or f).
age contains the age of each person (a number between 10 and 90).
Based on df from the previous exercise, create two data frames df_f and df_m with only female and male persons, respectively.
List four ways to access the first column name of df_f created in the previous exercise.
Create a data frame mtcars1 based on the built-in dataset mtcars, where mtcars1 should only contain those rows of mtcars with values in column mpg greater than 25. Determine the number of rows for mtcars and mtcars1.
There are some differences between data frames and tibbles when it comes to indexing. Try to pinpoint the different behavior using the built-in airquality data frame and air = as_tibble(airquality). Compare the results when indexing the Ozone column using each of these four methods:
[, 1]
[, "Ozone"]
$Ozone
[["Ozone"]]
Which object (data.frame or tibble) is more consistent in terms of indexing?
A comprehensive overview of data frame subsetting operations is available here.