library(readr)
Introduction
Data is stored in a wide variety of file formats, such as Excel spreadsheets, SPSS tables, or plain text files. R can import many different formats, which means that specialized reader functions make the data available as a data frame.
Several key properties make plain text files a popular choice for storing data:
- The format is simple and human-readable, which means that we can access the data by opening it with any text editor.
- Plain text files are supported by almost every data processing tool and programming language (including R).
- Changes can be efficiently tracked by version control systems such as Git.
We will discuss importing plain text files (as well as Excel and SPSS files) into R as well as exporting data frames to plain text files in the following sections.
Text files
Data in text files are often separated by commas – this allows representing values from multiple columns of a table. Such formatted text files often have the extension .csv
(“comma-separated values”). However, there are also other ways to separate values (or columns) from each other, e.g. with semicolons or tabs. Files with these separators often have the extensions .dat
or .tsv
(“tab-separated values”).
File extensions are just conventions and do not always reflect the actual content of the file. Text files often use extensions like .csv
, .tsv
, or .dat
, but a .csv
file might use tabs or semicolons instead of commas. The only way to know for sure is to open the file.
The Tidyverse includes the readr
package, which can import text files in a wide variety of formats. This often works better and faster than the functions that come with R by default. The data imported with the readr
package is also immediately available as a tibble. Therefore, in this course, we will only use functions from readr
to import text files.
Functions for importing text data from the readr
package all start with read_
, whereas functions that come with R start with read.
.
The key function in the readr
package is read_delim()
, which lets you adjust many parameters to match the file you want to import, such as the column separator, presence of a header row, comments, or missing values. For comma-separated data, there is the wrapper function read_csv()
, which uses reasonable default settings. For tab-separated data, you can use read_tsv()
. There is also read_csv2()
, which is specifically designed for semicolon-separated data and uses a comma as the decimal separator. However, the read_delim()
function can handle all of these cases with the appropriate arguments.
These “wrapper functions” are essentially just calling read_delim()
with specific arguments.
As an example, let’s look at a text file called lecturer.csv
. If you don’t know how the data in a text file is actually formatted, you can open its contents in the RStudio editor with the file.show()
function:
file.show("lecturer.csv")
name,birth_date,job,friends,alcohol,income,neurotic
Ben,7/3/1977,1,5,10,20000,10
Martin,5/24/1969,1,2,15,40000,17
Andy,6/21/1973,1,0,20,35000,14
Paul,7/16/1970,1,4,5,22000,13
Graham,10/10/1949,1,1,30,50000,21
Carina,11/5/1983,2,10,25,5000,7
Karina,10/8/1987,2,12,20,100,13
Doug,1/23/1989,2,15,16,3000,9
Mark,5/20/1973,2,12,17,10000,14
Zoe,11/12/1984,2,17,18,10,13
You can see that in this specific file, the columns are indeed separated by commas (so the .csv
extension matches the content). The first row is a header row with column names.
The read_delim()
function usually recognizes the column separator automatically, so you only need to pass the file name as an argument:
read_delim("lecturer.csv")
# A tibble: 10 × 7
name birth_date job friends alcohol income neurotic
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ben 7/3/1977 1 5 10 20000 10
2 Martin 5/24/1969 1 2 15 40000 17
3 Andy 6/21/1973 1 0 20 35000 14
4 Paul 7/16/1970 1 4 5 22000 13
5 Graham 10/10/1949 1 1 30 50000 21
6 Carina 11/5/1983 2 10 25 5000 7
7 Karina 10/8/1987 2 12 20 100 13
8 Doug 1/23/1989 2 15 16 3000 9
9 Mark 5/20/1973 2 12 17 10000 14
10 Zoe 11/12/1984 2 17 18 10 13
If the automatic detection fails, you can explicitly pass the separator with the delim
argument:
read_delim("lecturer.csv", delim=",")
# A tibble: 10 × 7
name birth_date job friends alcohol income neurotic
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ben 7/3/1977 1 5 10 20000 10
2 Martin 5/24/1969 1 2 15 40000 17
3 Andy 6/21/1973 1 0 20 35000 14
4 Paul 7/16/1970 1 4 5 22000 13
5 Graham 10/10/1949 1 1 30 50000 21
6 Carina 11/5/1983 2 10 25 5000 7
7 Karina 10/8/1987 2 12 20 100 13
8 Doug 1/23/1989 2 15 16 3000 9
9 Mark 5/20/1973 2 12 17 10000 14
10 Zoe 11/12/1984 2 17 18 10 13
The result (a tibble) looks correct – it has 10 rows and 7 columns with sensible data types in all columns. We can assign it directly to a variable to continue working with it:
= read_delim("lecturer.csv") df
Alternatively, we could have used the wrapper function read_csv()
, which assumes a comma as the column separator:
read_csv("lecturer.csv")
Let’s look at the same data, but this time it is separated by tabs in a .dat
file called lecturer.dat
. To read this file, we can either use the generic read_delim()
function with automatic detection (or if that doesn’t work, with delim="\t"
, where \t
represents the tab character) or directly use the specialized read_tsv()
function:
read_delim("lecturer.dat")
read_delim("lecturer.dat", delim="\t")
read_tsv("lecturer.dat")
Another important feature of such text files is the decimal separator for numbers. In English, we use a period for decimals (e.g., 12.3
or 3.1415
), whereas in German, we use a comma (e.g., 12,3
or 3,1415
). You can specify the decimal separator in read_delim()
using the locale
argument, which defaults to a period (as in read_csv()
and read_tsv()
). When numbers follow German notation, commas cannot also separate columns, so semicolons are typically used. For such files, set delim=";"
and locale=locale(decimal_mark=",")
, or use the read_csv2()
wrapper, which assumes semicolons and commas as defaults.
Regardless of how decimal separators are represented in text files, R always uses a period for decimal numbers!
SPSS data
If you want to read existing SPSS datasets (.sav
), you can use the read_sav()
function from the haven
package, resulting in a tibble. The haven
package can also import data from SAS and Stata. The following example imports data from the SPSS file lecturer.sav
:
library(haven)
= read_sav("lecturer.sav") df
Excel data
If data is available as an Excel file (extension .xlsx
or .xls
), you can use the read_excel()
function from the readxl
package. Since this package is part of the Tidyverse, it also produces a tibble, as shown in the following example using lecturer.xlsx
:
library(readxl)
= read_excel("lecturer.xlsx") df
Importing data with RStudio
RStudio provides a graphical dialog to import data in many formats. Conveniently, you always get the corresponding R code that correctly imports the data – this code can then be used in your own scripts.
To import data, you can click on the desired file in the Files view in the lower right corner and select Import Dataset. Alternatively, click on Import Dataset in the Environment view in the upper right corner, followed by the desired format (for text files, make sure to select “From Text (readr)…”). This functionality is also available in the File menu under the Import Dataset entry. A new window opens, showing a preview of the file to be imported. You can also tweak various parameters (depending on the file format to be imported) – the effect on the imported data is immediately visible in the preview. The corresponding R code is located at the bottom right. The following image shows this dialog for the previous example SPSS file.
Data preparation
Regardless of the file format, the data ultimately ends up in a data frame (or tibble), which we can then use in R. In most cases, the data frame will still need to be processed so that each column is in the desired format. Let’s look at our example data:
df = read_csv("lecturer.csv")) (
# A tibble: 10 × 7
name birth_date job friends alcohol income neurotic
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ben 7/3/1977 1 5 10 20000 10
2 Martin 5/24/1969 1 2 15 40000 17
3 Andy 6/21/1973 1 0 20 35000 14
4 Paul 7/16/1970 1 4 5 22000 13
5 Graham 10/10/1949 1 1 30 50000 21
6 Carina 11/5/1983 2 10 25 5000 7
7 Karina 10/8/1987 2 12 20 100 13
8 Doug 1/23/1989 2 15 16 3000 9
9 Mark 5/20/1973 2 12 17 10000 14
10 Zoe 11/12/1984 2 17 18 10 13
Apparently, all numerical values were correctly recognized as numbers (the abbreviation <dbl>
means “double” and corresponds to decimal numbers, i.e., a numerical vector). However, the third column job
has only two values, which actually represent two different professions (1
stands for “Lecturer”, 2
for “Student”). Here, a categorical column (a factor, more on this later) would be better suited than a numerical one. The first column name
was correctly recognized as chr
(character vector). The second column birth_date
was also recognized as a character vector, but there is a specific data type for date values in R (which allows calculations with such values) that we should prefer.
Factors
In R, categorical data — values that can only take a limited set of options — are represented using the factor
type. You can create these vectors with the factor()
function, which by default creates unordered (nominal) factors. To create an ordered (ordinal) factor, set the argument ordered=TRUE
.
In our example, the job
column is of type dbl
but should be interpreted as a factor. We can reassign this column accordingly (assuming we know which numbers correspond to the respective factor levels, i.e., 1
corresponds to “Lecturer” and 2
corresponds to “Student”):
$job = factor(df$job, levels=c(1, 2), labels=c("Lecturer", "Student"))
df df
# A tibble: 10 × 7
name birth_date job friends alcohol income neurotic
<chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 Ben 7/3/1977 Lecturer 5 10 20000 10
2 Martin 5/24/1969 Lecturer 2 15 40000 17
3 Andy 6/21/1973 Lecturer 0 20 35000 14
4 Paul 7/16/1970 Lecturer 4 5 22000 13
5 Graham 10/10/1949 Lecturer 1 30 50000 21
6 Carina 11/5/1983 Student 10 25 5000 7
7 Karina 10/8/1987 Student 12 20 100 13
8 Doug 1/23/1989 Student 15 16 3000 9
9 Mark 5/20/1973 Student 12 17 10000 14
10 Zoe 11/12/1984 Student 17 18 10 13
The three arguments are:
df$job
is the original vector.levels=c(1, 2)
specifies which values (levels) are present in the original data and should be used.labels=c("Lecturer", "Student")
assigns appropriate names (labels) to the levels.
The job
column now has the desired factor
type (<fct>
in the tibble representation). We can also explicitly check this with the class()
function:
class(df$job)
[1] "factor"
Dates
R has a specific data type for date values, making it easier to work with them. The as.Date()
function converts date values in text form into this type. You need to specify the format
argument, which describes the original date format. The following example converts the birth_date
column to date format:
$birth_date = as.Date(df$birth_date, format="%m/%d/%Y")
df df
# A tibble: 10 × 7
name birth_date job friends alcohol income neurotic
<chr> <date> <fct> <dbl> <dbl> <dbl> <dbl>
1 Ben 1977-07-03 Lecturer 5 10 20000 10
2 Martin 1969-05-24 Lecturer 2 15 40000 17
3 Andy 1973-06-21 Lecturer 0 20 35000 14
4 Paul 1970-07-16 Lecturer 4 5 22000 13
5 Graham 1949-10-10 Lecturer 1 30 50000 21
6 Carina 1983-11-05 Student 10 25 5000 7
7 Karina 1987-10-08 Student 12 20 100 13
8 Doug 1989-01-23 Student 15 16 3000 9
9 Mark 1973-05-20 Student 12 17 10000 14
10 Zoe 1984-11-12 Student 17 18 10 13
In this example, the format
argument "%m/%d/%Y"
indicates that the dates are originally in the month/day/year order, separated by /
. Here, %m
represents the month, %d
the day, and %Y
a four-digit year.
Let’s consider another example. Assume we have the following character vector dates
with date values:
dates = c("23.3.95", "17.7.96", "9.12.04", "1.1.10", "23.2.17")) (
[1] "23.3.95" "17.7.96" "9.12.04" "1.1.10" "23.2.17"
class(dates)
[1] "character"
The individual date components are separated by a .
. Now, let’s try to find out the meaning of the three numbers. By examining all values, we find that the order is day, month, and year (two-digit). The corresponding format
argument is therefore format="%d.%m.%y"
:
dates = as.Date(dates, format="%d.%m.%y")) (
[1] "1995-03-23" "1996-07-17" "2004-12-09" "2010-01-01" "2017-02-23"
class(dates)
[1] "Date"
The abbreviations like %d
, %m
, and %y
, along with many others, are explained in the documentation for as.Date()
(or actually its base function strptime()
). You can look them up by using ?strptime
, so there’s no need to memorize them.
Saving data
To save an existing data frame or tibble as a text file, the simplest method is to use write_delim()
(or write_csv()
and write_tsv()
) from the readr
package. These functions work similarly to the reading functions mentioned earlier, but you need to specify both the data frame and the file name.
Text files are best for small to medium-sized data sets. Large data sets can result in unnecessarily large files, so more efficient binary data formats like .RData
, which can be created with the save()
function and read with load()
, should be used. Even better alternatives are the highly efficient open formats Parquet and Feather, which can be used across R, Python, and many other languages.
Exercises
Inspect the help for the
read_delim()
function from thereadr
package. Which argument sets the column separator? Which argument sets the decimal separator? Which argument sets the character(s) for missing values?Import the file
homework.csv
and display the resulting tibble on the screen. Make sure to use the correct column and decimal separators! What are the data types of the four columns?The file
wahl16.csv
contains the results of the 2016 Austrian presidential election (after the first round with the six original candidates). Import this data into a tibble calledwahl16
and calculate the relative total number of votes for each candidate (the functionscolSums()
,rowSums()
, andsum()
might be helpful).The file
covid19.csv
contains data on daily Covid19 infections in Austria from February 2020 to June 2023. Import this file into R (make sure to use suitable arguments for the column and decimal separators). How many rows and columns does this data set have?NotePay attention to the column
SiebenTageInzidenzFaelle
– this column contains decimal numbers and should therefore be numeric!Convert the
Time
column to a date format (you can ignore the time in this column)!Given the following numeric vector
x
:= c(1, 2, 3) x
What happens if you convert this vector into a factor, but only want to use the levels
1
and2
(with the labels"one"
and"two"
, respectively)? How does R handle the (unused) value3
in this case?