5 – Importing data

Introduction to R

Author

Clemens Brunner

Published

December 16, 2024

Introduction

Data is stored in a wide variety of file formats, such as Excel spreadsheets, SPSS tables, or plain text files. R can import many different formats, which means that specialized reader functions make the data available as a data frame.

Several key properties make plain text files a popular choice for storing data:

  • The format is simple and human-readable, which means that we can access the data by opening it with any text editor.
  • Plain text files are supported by almost every data processing tool and programming language (including R).
  • Changes can be efficiently tracked by version control systems such as Git.

We will discuss importing plain text files (as well as Excel and SPSS files) into R as well as exporting data frames to plain text files in the following sections.

Text files

Data in text files are often separated by commas – this allows representing values from multiple columns of a table. Such formatted text files often have the extension .csv (“comma-separated values”). However, there are also other ways to separate values (or columns) from each other, e.g. with semicolons or tabs. Files with these separators often have the extensions .dat or .tsv (“tab-separated values”).

Note

File extensions are just conventions and do not always reflect the actual content of the file. Text files often use extensions like .csv, .tsv, or .dat, but a .csv file might use tabs or semicolons instead of commas. The only way to know for sure is to open the file.

The Tidyverse includes the readr package, which can import text files in a wide variety of formats. This often works better and faster than the functions that come with R by default. The data imported with the readr package is also immediately available as a tibble. Therefore, in this course, we will only use functions from readr to import text files.

library(readr)
Important

Functions for importing text data from the readr package all start with read_, whereas functions that come with R start with read..

The key function in the readr package is read_delim(), which lets you adjust many parameters to match the file you want to import, such as the column separator, presence of a header row, comments, or missing values. For comma-separated data, there is the wrapper function read_csv(), which uses reasonable default settings. For tab-separated data, you can use read_tsv(). There is also read_csv2(), which is specifically designed for semicolon-separated data and uses a comma as the decimal separator. However, the read_delim() function can handle all of these cases with the appropriate arguments.

Note

These “wrapper functions” are essentially just calling read_delim() with specific arguments.

As an example, let’s look at a text file called lecturer.csv. If you don’t know how the data in a text file is actually formatted, you can open its contents in the RStudio editor with the file.show() function:

file.show("lecturer.csv")
name,birth_date,job,friends,alcohol,income,neurotic
Ben,7/3/1977,1,5,10,20000,10
Martin,5/24/1969,1,2,15,40000,17
Andy,6/21/1973,1,0,20,35000,14
Paul,7/16/1970,1,4,5,22000,13
Graham,10/10/1949,1,1,30,50000,21
Carina,11/5/1983,2,10,25,5000,7
Karina,10/8/1987,2,12,20,100,13
Doug,1/23/1989,2,15,16,3000,9
Mark,5/20/1973,2,12,17,10000,14
Zoe,11/12/1984,2,17,18,10,13

You can see that in this specific file, the columns are indeed separated by commas (so the .csv extension matches the content). The first row is a header row with column names.

The read_delim() function usually recognizes the column separator automatically, so you only need to pass the file name as an argument:

read_delim("lecturer.csv")
# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13

If the automatic detection fails, you can explicitly pass the separator with the delim argument:

read_delim("lecturer.csv", delim=",")
# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13

The result (a tibble) looks correct – it has 10 rows and 7 columns with sensible data types in all columns. We can assign it directly to a variable to continue working with it:

df = read_delim("lecturer.csv")

Alternatively, we could have used the wrapper function read_csv(), which assumes a comma as the column separator:

read_csv("lecturer.csv")

Let’s look at the same data, but this time it is separated by tabs in a .dat file called lecturer.dat. To read this file, we can either use the generic read_delim() function with automatic detection (or if that doesn’t work, with delim="\t", where \t represents the tab character) or directly use the specialized read_tsv() function:

read_delim("lecturer.dat")
read_delim("lecturer.dat", delim="\t")
read_tsv("lecturer.dat")

Another important feature of such text files is the decimal separator for numbers. In English, we use a period for decimals (e.g., 12.3 or 3.1415), whereas in German, we use a comma (e.g., 12,3 or 3,1415). You can specify the decimal separator in read_delim() using the locale argument, which defaults to a period (as in read_csv() and read_tsv()). When numbers follow German notation, commas cannot also separate columns, so semicolons are typically used. For such files, set delim=";" and locale=locale(decimal_mark=","), or use the read_csv2() wrapper, which assumes semicolons and commas as defaults.

Important

Regardless of how decimal separators are represented in text files, R always uses a period for decimal numbers!

SPSS data

If you want to read existing SPSS datasets (.sav), you can use the read_sav() function from the haven package, resulting in a tibble. The haven package can also import data from SAS and Stata. The following example imports data from the SPSS file lecturer.sav:

library(haven)
df = read_sav("lecturer.sav")

Excel data

If data is available as an Excel file (extension .xlsx or .xls), you can use the read_excel() function from the readxl package. Since this package is part of the Tidyverse, it also produces a tibble, as shown in the following example using lecturer.xlsx:

library(readxl)
df = read_excel("lecturer.xlsx")

Importing data with RStudio

RStudio provides a graphical dialog to import data in many formats. Conveniently, you always get the corresponding R code that correctly imports the data – this code can then be used in your own scripts.

To import data, you can click on the desired file in the Files view in the lower right corner and select Import Dataset. Alternatively, click on Import Dataset in the Environment view in the upper right corner, followed by the desired format (for text files, make sure to select “From Text (readr)…”). This functionality is also available in the File menu under the Import Dataset entry. A new window opens, showing a preview of the file to be imported. You can also tweak various parameters (depending on the file format to be imported) – the effect on the imported data is immediately visible in the preview. The corresponding R code is located at the bottom right. The following image shows this dialog for the previous example SPSS file.

Data preparation

Regardless of the file format, the data ultimately ends up in a data frame (or tibble), which we can then use in R. In most cases, the data frame will still need to be processed so that each column is in the desired format. Let’s look at our example data:

(df = read_csv("lecturer.csv"))
# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13

Apparently, all numerical values were correctly recognized as numbers (the abbreviation <dbl> means “double” and corresponds to decimal numbers, i.e., a numerical vector). However, the third column job has only two values, which actually represent two different professions (1 stands for “Lecturer”, 2 for “Student”). Here, a categorical column (a factor, more on this later) would be better suited than a numerical one. The first column name was correctly recognized as chr (character vector). The second column birth_date was also recognized as a character vector, but there is a specific data type for date values in R (which allows calculations with such values) that we should prefer.

Factors

In R, categorical data — values that can only take a limited set of options — are represented using the factor type. You can create these vectors with the factor() function, which by default creates unordered (nominal) factors. To create an ordered (ordinal) factor, set the argument ordered=TRUE.

In our example, the job column is of type dbl but should be interpreted as a factor. We can reassign this column accordingly (assuming we know which numbers correspond to the respective factor levels, i.e., 1 corresponds to “Lecturer” and 2 corresponds to “Student”):

df$job = factor(df$job, levels=c(1, 2), labels=c("Lecturer", "Student"))
df
# A tibble: 10 × 7
   name   birth_date job      friends alcohol income neurotic
   <chr>  <chr>      <fct>      <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977   Lecturer       5      10  20000       10
 2 Martin 5/24/1969  Lecturer       2      15  40000       17
 3 Andy   6/21/1973  Lecturer       0      20  35000       14
 4 Paul   7/16/1970  Lecturer       4       5  22000       13
 5 Graham 10/10/1949 Lecturer       1      30  50000       21
 6 Carina 11/5/1983  Student       10      25   5000        7
 7 Karina 10/8/1987  Student       12      20    100       13
 8 Doug   1/23/1989  Student       15      16   3000        9
 9 Mark   5/20/1973  Student       12      17  10000       14
10 Zoe    11/12/1984 Student       17      18     10       13

The three arguments are:

  • df$job is the original vector.
  • levels=c(1, 2) specifies which values (levels) are present in the original data and should be used.
  • labels=c("Lecturer", "Student") assigns appropriate names (labels) to the levels.

The job column now has the desired factor type (<fct> in the tibble representation). We can also explicitly check this with the class() function:

class(df$job)
[1] "factor"

Dates

R has a specific data type for date values, making it easier to work with them. The as.Date() function converts date values in text form into this type. You need to specify the format argument, which describes the original date format. The following example converts the birth_date column to date format:

df$birth_date = as.Date(df$birth_date, format="%m/%d/%Y")
df
# A tibble: 10 × 7
   name   birth_date job      friends alcohol income neurotic
   <chr>  <date>     <fct>      <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    1977-07-03 Lecturer       5      10  20000       10
 2 Martin 1969-05-24 Lecturer       2      15  40000       17
 3 Andy   1973-06-21 Lecturer       0      20  35000       14
 4 Paul   1970-07-16 Lecturer       4       5  22000       13
 5 Graham 1949-10-10 Lecturer       1      30  50000       21
 6 Carina 1983-11-05 Student       10      25   5000        7
 7 Karina 1987-10-08 Student       12      20    100       13
 8 Doug   1989-01-23 Student       15      16   3000        9
 9 Mark   1973-05-20 Student       12      17  10000       14
10 Zoe    1984-11-12 Student       17      18     10       13

In this example, the format argument "%m/%d/%Y" indicates that the dates are originally in the month/day/year order, separated by /. Here, %m represents the month, %d the day, and %Y a four-digit year.

Let’s consider another example. Assume we have the following character vector dates with date values:

(dates = c("23.3.95", "17.7.96", "9.12.04", "1.1.10", "23.2.17"))
[1] "23.3.95" "17.7.96" "9.12.04" "1.1.10"  "23.2.17"
class(dates)
[1] "character"

The individual date components are separated by a .. Now, let’s try to find out the meaning of the three numbers. By examining all values, we find that the order is day, month, and year (two-digit). The corresponding format argument is therefore format="%d.%m.%y":

(dates = as.Date(dates, format="%d.%m.%y"))
[1] "1995-03-23" "1996-07-17" "2004-12-09" "2010-01-01" "2017-02-23"
class(dates)
[1] "Date"

The abbreviations like %d, %m, and %y, along with many others, are explained in the documentation for as.Date() (or actually its base function strptime()). You can look them up by using ?strptime, so there’s no need to memorize them.

Saving data

To save an existing data frame or tibble as a text file, the simplest method is to use write_delim() (or write_csv() and write_tsv()) from the readr package. These functions work similarly to the reading functions mentioned earlier, but you need to specify both the data frame and the file name.

Tip

Text files are best for small to medium-sized data sets. Large data sets can result in unnecessarily large files, so more efficient binary data formats like .RData, which can be created with the save() function and read with load(), should be used. Even better alternatives are the highly efficient open formats Parquet and Feather, which can be used across R, Python, and many other languages.

Exercises

  1. Inspect the help for the read_delim() function from the readr package. Which argument sets the column separator? Which argument sets the decimal separator? Which argument sets the character(s) for missing values?

  2. Import the file homework.csv and display the resulting tibble on the screen. Make sure to use the correct column and decimal separators! What are the data types of the four columns?

  3. The file wahl16.csv contains the results of the 2016 Austrian presidential election (after the first round with the six original candidates). Import this data into a tibble called wahl16 and calculate the relative total number of votes for each candidate (the functions colSums(), rowSums(), and sum() might be helpful).

  4. The file covid19.csv contains data on daily Covid19 infections in Austria from February 2020 to June 2023. Import this file into R (make sure to use suitable arguments for the column and decimal separators). How many rows and columns does this data set have?

    Note

    Pay attention to the column SiebenTageInzidenzFaelle – this column contains decimal numbers and should therefore be numeric!

    Convert the Time column to a date format (you can ignore the time in this column)!

  5. Given the following numeric vector x:

    x = c(1, 2, 3)

    What happens if you convert this vector into a factor, but only want to use the levels 1 and 2 (with the labels "one" and "two", respectively)? How does R handle the (unused) value 3 in this case?