After importing a file correctly (i.e., when the data is available in a data frame or tibble and all columns have a suitable data type), you can start with the statistical analysis. The first step is usually to create an overview of the available data using descriptive statistics.
As an example, we import the file lecturer.csv from the previous chapter:
We no longer need the name column for our subsequent considerations, so we remove it:
df$name =NULL
R provides a number of functions that calculate summary statistics for a variable (or a vector). Useful functions include mean(), sd(), var(), min(), max(), median(), range(), and quantile(). The mean of a vector (and thus also a column of the data frame df) can be calculated as follows:
mean(df$friends)
[1] 7.8
For factors, we might be interested in knowing how many levels they contain. Also, for numerical or text vectors with multiple occurrences of values, it is sometimes useful to find out the unique values. For this purpose, R providers the unique() function, which returns the unique elements of a vector:
unique(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2))
[1] 1 2
unique(df$friends)
[1] 5 2 0 4 1 10 12 15 17
unique(df$job)
[1] Lecturer Student
Levels: Lecturer Student
We would have to repeat these calculations for each column of interest (because these functions only work with vectors), which would be relatively cumbersome. Therefore, we will introduce the sapply() function, which applies a function to each column of a data frame individually. So, if you want to calculate the mean for each numerical column of df, you can do this as follows:
sapply(df[, -c(1, 2)], mean)
friends alcohol income neurotic
7.8 17.6 18511.0 13.1
The first argument of sapply() is the data frame, and the second argument is a function that should be applied to each column. Here, we provide a subset of the data frame df that excludes the first two columns (the name and birth_date columns). The function mean() is then applied to each column of this subset, resulting in a vector with column means.
Tip
Alternatively, you can use the colMeans() to calculate column means.
We can apply the same principle to calculate other summary statistics for each column, such as the standard deviation:
sapply(df[, -c(1, 2)], sd)
friends alcohol income neurotic
6.142746 7.042727 18001.354548 3.984693
However, there are also functions that can deal with data frames directly to calculate multiple statistical measures for all columns. In the following paragraphs, we will take a closer look at three of these functions, namely summary(), describe(), and stat.desc(). Only summary() is included in base R; the other two functions require additional packages.
The summary() function
The summary() function provides a suitable summary for each column of a data frame (or tibble). Numerical columns and date columns are described with six values:
Minimum
First quartile (25th percentile)
Median (50th percentile)
Third quartile (75th percentile)
Maximum
Arithmetic mean
For factors, the summary includes the levels and the number of cases per level.
summary(df)
birth_date job friends alcohol income neurotic
Min. :1949-10-10 Lecturer:5 Min. : 0.0 Min. : 5.00 Min. : 10 Min. : 7.00
1st Qu.:1971-04-01 Student :5 1st Qu.: 2.5 1st Qu.:15.25 1st Qu.: 3500 1st Qu.:10.75
Median :1975-06-27 Median : 7.5 Median :17.50 Median :15000 Median :13.00
Mean :1975-12-17 Mean : 7.8 Mean :17.60 Mean :18511 Mean :13.10
3rd Qu.:1984-08-10 3rd Qu.:12.0 3rd Qu.:20.00 3rd Qu.:31750 3rd Qu.:14.00
Max. :1989-01-23 Max. :17.0 Max. :30.00 Max. :50000 Max. :21.00
The summary() function also works for other column types and often provides a useful short description.
The describe() function
Another way to output even more statistical measures for numerical columns is provided by the describe() function from the psych package. This function cannot handle non-numeric columns properly, so you should only summarize numerical columns.
library(psych)describe(df[, 3:6])
vars n mean sd median trimmed mad min max range skew kurtosis se
friends 1 10 7.8 6.14 7.5 7.62 7.41 0 17 17 0.11 -1.75 1.94
alcohol 2 10 17.6 7.04 17.5 17.62 3.71 5 30 25 -0.03 -0.75 2.23
income 3 10 18511.0 18001.35 15000.0 16887.50 19940.97 10 50000 49990 0.45 -1.48 5692.53
neurotic 4 10 13.1 3.98 13.0 12.88 2.97 7 21 14 0.36 -0.66 1.26
Note
We can use functions from packages without activating them. To do this, we have to prefix the package name followed by ::. In the previous example, we could therefore omit library(psych) and still use the function with psych::describe().
The related describeBy() function provides summary statistics for individual groups (such as for individual levels of the factor df$job):
Descriptive statistics by group
group: Lecturer
vars n mean sd median trimmed mad min max range skew kurtosis se
friends 1 5 2.4 2.07 2 2.4 2.97 0 5 5 0.11 -2.03 0.93
alcohol 2 5 16.0 9.62 15 16.0 7.41 5 30 25 0.28 -1.72 4.30
income 3 5 33400.0 12561.85 35000 33400.0 19273.80 20000 50000 30000 0.10 -1.98 5617.83
neurotic 4 5 15.0 4.18 14 15.0 4.45 10 21 11 0.25 -1.72 1.87
------------------------------------------------------------------------------------------
group: Student
vars n mean sd median trimmed mad min max range skew kurtosis se
friends 1 5 13.2 2.77 12 13.2 2.97 10 17 7 0.23 -1.89 1.24
alcohol 2 5 19.2 3.56 18 19.2 2.97 16 25 9 0.66 -1.43 1.59
income 3 5 3622.0 4135.69 3000 3622.0 4299.54 10 10000 9990 0.48 -1.64 1849.54
neurotic 4 5 11.2 3.03 13 11.2 1.48 7 14 7 -0.37 -2.01 1.36
The first argument is the data frame, and the second argument is a vector which groups the data (usually a factor).
The stat.desc() function
The pastecs package includes the stat.desc() function for summarizing data frames. Since the output of the function is relatively cluttered, you should also specify how many decimal places should be displayed with the round() function. If the argument norm=TRUE is set, tests for normal distribution are performed for all columns.
R supports decimal numbers in the so-called scientific notation. This notation uses powers of ten, which can be entered with e (e can be read as “times ten to the power of”).
1e0# 1 times 10 to the power of 0
[1] 1
-4e0# -4 times 10 to the power of 0
[1] -4
1e1# 1 times 10 to the power of 1
[1] 10
3.5e2# 3.5 times 10 to the power of 2
[1] 350
1e-2# 1 times 10 to the power of -2
[1] 0.01
1e-15# 1 times 10 to the power of -15
[1] 1e-15
Tip
If you want to avoid the output of numbers in scientific notation as much as possible, you can set the following option at the beginning of a session:
options(scipen=100)
However, this does not guarantee that all numbers are displayed in decimal notation. Another option is to display numbers as normal decimal numbers using format(). This always results in a character vector, but does not require changing an option:
format(1e-15, scientific=FALSE)
[1] "0.000000000000001"
Grouping with by()
There is no direct equivalent for grouped data for the stat.desc() function (like describeBy() for describe()). However, R provides the generic by() function, which applies any given function to grouped data. The first argument is the data frame, the second argument is the grouping column (vector), and the third argument is the function to be applied to the grouped data.
df$job: Lecturer
vars n mean sd median trimmed mad min max range skew kurtosis se
friends 1 5 2.4 2.07 2 2.4 2.97 0 5 5 0.11 -2.03 0.93
alcohol 2 5 16.0 9.62 15 16.0 7.41 5 30 25 0.28 -1.72 4.30
income 3 5 33400.0 12561.85 35000 33400.0 19273.80 20000 50000 30000 0.10 -1.98 5617.83
neurotic 4 5 15.0 4.18 14 15.0 4.45 10 21 11 0.25 -1.72 1.87
------------------------------------------------------------------------------------------
df$job: Student
vars n mean sd median trimmed mad min max range skew kurtosis se
friends 1 5 13.2 2.77 12 13.2 2.97 10 17 7 0.23 -1.89 1.24
alcohol 2 5 19.2 3.56 18 19.2 2.97 16 25 9 0.66 -1.43 1.59
income 3 5 3622.0 4135.69 3000 3622.0 4299.54 10 10000 9990 0.48 -1.64 1849.54
neurotic 4 5 11.2 3.03 13 11.2 1.48 7 14 7 -0.37 -2.01 1.36
If you want to pass arguments to the function in the third argument itself (e.g., norm=TRUE for the stat.desc() function), you can do this with additional arguments at the very end:
df$job: Lecturer
friends alcohol income neurotic
Min. :0.0 Min. : 5 Min. :20000 Min. :10
1st Qu.:1.0 1st Qu.:10 1st Qu.:22000 1st Qu.:13
Median :2.0 Median :15 Median :35000 Median :14
Mean :2.4 Mean :16 Mean :33400 Mean :15
3rd Qu.:4.0 3rd Qu.:20 3rd Qu.:40000 3rd Qu.:17
Max. :5.0 Max. :30 Max. :50000 Max. :21
------------------------------------------------------------------------------------------
df$job: Student
friends alcohol income neurotic
Min. :10.0 Min. :16.0 Min. : 10 Min. : 7.0
1st Qu.:12.0 1st Qu.:17.0 1st Qu.: 100 1st Qu.: 9.0
Median :12.0 Median :18.0 Median : 3000 Median :13.0
Mean :13.2 Mean :19.2 Mean : 3622 Mean :11.2
3rd Qu.:15.0 3rd Qu.:20.0 3rd Qu.: 5000 3rd Qu.:13.0
Max. :17.0 Max. :25.0 Max. :10000 Max. :14.0
The stat.desc() function already provides the result of the Shapiro-Wilk test for normality (the entries normtest.W and normtest.p contain the the test statistic and the significance, respectively). If normtest.p is significant (e.g., less than 0.05), the null hypothesis of normality can be rejected. You can also call the Shapiro-Wilk test directly with the shapiro.test() function:
shapiro.test(df$income)
Shapiro-Wilk normality test
data: df$income
W = 0.89721, p-value = 0.2041
Using the by() function, you can also apply the test separately to different groups. For example, to test the normality of the income column grouped by the job column:
by(df$income, df$job, shapiro.test)
df$job: Lecturer
Shapiro-Wilk normality test
data: dd[x, ]
W = 0.93425, p-value = 0.6256
------------------------------------------------------------------------------------------
df$job: Student
Shapiro-Wilk normality test
data: dd[x, ]
W = 0.89439, p-value = 0.3796
The Kolmogorov-Smirnov test tests if given data follows any given distribution, including the normal distribution. However, the Shapiro-Wilk test is preferable for testing normality, as it is specifically tailored to the normal distribution and therefore has more statistical power.
Exact one-sample Kolmogorov-Smirnov test
data: df$income
D = 0.18182, p-value = 0.8389
alternative hypothesis: two-sided
Since the sample size in our example is very small, no reasonable statements can be made about the distribution of the data anyway.
Test for homogeneity of variances
Levene’s test assesses the equality of variances (homogeneity of variance or homoscedasticity) of two or more groups. It tests the null hypothesis that the variances are equal in all groups. You can run the test with the leveneTest() function from the car package. To illustrate this, we look at the example data Moore, which is automatically loaded with the car package.
partner.status conformity fcategory fscore
42 high 13 high 57
43 high 16 low 35
44 high 10 high 52
45 high 15 medium 44
summary(Moore)
partner.status conformity fcategory fscore
high:23 Min. : 4.00 high :15 Min. :15.00
low :22 1st Qu.: 8.00 low :15 1st Qu.:35.00
Median :12.00 medium:15 Median :43.00
Mean :12.13 Mean :43.11
3rd Qu.:15.00 3rd Qu.:55.00
Max. :24.00 Max. :68.00
We can run Levene’s test for the column conformity grouped by the column fcategory as follows:
leveneTest(Moore$conformity, Moore$fcategory)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.046 0.9551
42
In this example, the null hypothesis of variance equality of the variable conformity across the groups in fcategory cannot be rejected, as the p-value (here denoted as Pr(>F)) is very large.
Exercises
Calculate descriptive statistics such as the mean, median, minimum, and maximum for the four numerical variables Global_active_power, Global_reactive_power, Voltage, and Global_intensity from the electric power consumption data you imported in the previous session (Individual Household Electric Power Consumption).
Calculate the statistics with the sapply() function.
Calculate the statistics with the summary() function.
Calculate the statistics with the describe() function from the psych package.
Run the stat.desc() function from the pastecs package (round the results to one decimal place).
What is the mean measured voltage Voltage?
What is the median of the global active power Global_active_power?
How many missing values are there in total and per column?
The palmerpenguins package contains the tibble penguins with measurements of physical characteristics of three penguin species (Adélie, Chinstrap, and Gentoo). Activate the package and answer the following questions with suitable R commands:
How many rows and columns does the penguins dataset have?
The columns species, island, and sex are factors – how many levels do they each have?
Calculate descriptive statistics separately for the three penguin species! What are the mean values of the columns bill_length_mm, bill_depth_mm, and flipper_length_mm for the three species?
Are there missing values in the data? If so, how many are there in total and per column?
The dataset mtcars contains various characteristics of 32 cars.
Calculate the minimum and maximum as well as the mean and median of all columns.
Check if the data in the column mpg is normally distributed.
Check for variance homogeneity of the data in the column mpg for the three groups defined by the column cyl.
Load the dataset lecturer.dat and calculate the means of all numerical columns grouped by the column job using the by() function. Why does this calculation not work when passing the mean function as an argument to by()?
Generate a vector with 5001 standard normally distributed random numbers using the rnorm() function and check if this vector is normally distributed. Use the Shapiro-Wilk test for this purpose. How do you interpret the result?