6 – Descriptive statistics

Introduction

After importing a file correctly (i.e., when the data is available in a data frame or tibble and all columns have a suitable data type), you can start with the statistical analysis. The first step is usually to create an overview of the available data using descriptive statistics.

As an example, we import the file lecturer.csv from the previous chapter:

library(readr)
(df = read_csv("lecturer.csv"))

# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13

Next, we convert the columns birth_date and job to date and factor types, respectively:

df$birth_date = as.Date(df$birth_date, format="%m/%d/%Y")
df$job = factor(df$job, levels=c(1, 2), labels=c("Lecturer", "Student"))
df

# A tibble: 10 × 7
   name   birth_date job      friends alcohol income neurotic
   <chr>  <date>     <fct>      <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    1977-07-03 Lecturer       5      10  20000       10
 2 Martin 1969-05-24 Lecturer       2      15  40000       17
 3 Andy   1973-06-21 Lecturer       0      20  35000       14
 4 Paul   1970-07-16 Lecturer       4       5  22000       13
 5 Graham 1949-10-10 Lecturer       1      30  50000       21
 6 Carina 1983-11-05 Student       10      25   5000        7
 7 Karina 1987-10-08 Student       12      20    100       13
 8 Doug   1989-01-23 Student       15      16   3000        9
 9 Mark   1973-05-20 Student       12      17  10000       14
10 Zoe    1984-11-12 Student       17      18     10       13

We no longer need the name column for our subsequent considerations, so we remove it:

df$name = NULL

R provides a number of functions that calculate summary statistics for a variable (or a vector). Useful functions include mean(), sd(), var(), min(), max(), median(), range(), and quantile(). The mean of a vector (and thus also a column of the data frame df) can be calculated as follows:

mean(df$friends)

[1] 7.8

For factors, we might be interested in knowing how many levels they contain. Also, for numerical or text vectors with multiple occurrences of values, it is sometimes useful to find out the unique values. For this purpose, R providers the unique() function, which returns the unique elements of a vector:

unique(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2))

[1] 1 2

unique(df$friends)

[1]  5  2  0  4  1 10 12 15 17

unique(df$job)

[1] Lecturer Student 
Levels: Lecturer Student

We would have to repeat these calculations for each column of interest (because these functions only work with vectors), which would be relatively cumbersome. Therefore, we will introduce the sapply() function, which applies a function to each column of a data frame individually. So, if you want to calculate the mean for each numerical column of df, you can do this as follows:

sapply(df[, -c(1, 2)], mean)

 friends  alcohol   income neurotic 
     7.8     17.6  18511.0     13.1

The first argument of sapply() is the data frame, and the second argument is a function that should be applied to each column. Here, we provide a subset of the data frame df that excludes the first two columns (the name and birth_date columns). The function mean() is then applied to each column of this subset, resulting in a vector with column means.

Tip

Alternatively, you can use the colMeans() to calculate column means.

We can apply the same principle to calculate other summary statistics for each column, such as the standard deviation:

sapply(df[, -c(1, 2)], sd)

     friends      alcohol       income     neurotic 
    6.142746     7.042727 18001.354548     3.984693

However, there are also functions that can deal with data frames directly to calculate multiple statistical measures for all columns. In the following paragraphs, we will take a closer look at three of these functions, namely summary(), describe(), and stat.desc(). Only summary() is included in base R; the other two functions require additional packages.

The `summary()` function

The summary() function provides a suitable summary for each column of a data frame (or tibble). Numerical columns and date columns are described with six values:

Minimum
First quartile (25th percentile)
Median (50th percentile)
Third quartile (75th percentile)
Maximum
Arithmetic mean

For factors, the summary includes the levels and the number of cases per level.

summary(df)

   birth_date               job       friends        alcohol          income         neurotic    
 Min.   :1949-10-10   Lecturer:5   Min.   : 0.0   Min.   : 5.00   Min.   :   10   Min.   : 7.00  
 1st Qu.:1971-04-01   Student :5   1st Qu.: 2.5   1st Qu.:15.25   1st Qu.: 3500   1st Qu.:10.75  
 Median :1975-06-27                Median : 7.5   Median :17.50   Median :15000   Median :13.00  
 Mean   :1975-12-17                Mean   : 7.8   Mean   :17.60   Mean   :18511   Mean   :13.10  
 3rd Qu.:1984-08-10                3rd Qu.:12.0   3rd Qu.:20.00   3rd Qu.:31750   3rd Qu.:14.00  
 Max.   :1989-01-23                Max.   :17.0   Max.   :30.00   Max.   :50000   Max.   :21.00

The summary() function also works for other column types and often provides a useful short description.

The `describe()` function

Another way to output even more statistical measures for numerical columns is provided by the describe() function from the psych package. This function cannot handle non-numeric columns properly, so you should only summarize numerical columns.

library(psych)
describe(df[, 3:6])

         vars  n    mean       sd  median  trimmed      mad min   max range  skew kurtosis      se
friends     1 10     7.8     6.14     7.5     7.62     7.41   0    17    17  0.11    -1.75    1.94
alcohol     2 10    17.6     7.04    17.5    17.62     3.71   5    30    25 -0.03    -0.75    2.23
income      3 10 18511.0 18001.35 15000.0 16887.50 19940.97  10 50000 49990  0.45    -1.48 5692.53
neurotic    4 10    13.1     3.98    13.0    12.88     2.97   7    21    14  0.36    -0.66    1.26

Note

We can use functions from packages without activating them. To do this, we have to prefix the package name followed by ::. In the previous example, we could therefore omit library(psych) and still use the function with psych::describe().

The related describeBy() function provides summary statistics for individual groups (such as for individual levels of the factor df$job):

describeBy(df[, c("friends", "alcohol", "income", "neurotic")], df$job)


 Descriptive statistics by group 
group: Lecturer
         vars n    mean       sd median trimmed      mad   min   max range skew kurtosis      se
friends     1 5     2.4     2.07      2     2.4     2.97     0     5     5 0.11    -2.03    0.93
alcohol     2 5    16.0     9.62     15    16.0     7.41     5    30    25 0.28    -1.72    4.30
income      3 5 33400.0 12561.85  35000 33400.0 19273.80 20000 50000 30000 0.10    -1.98 5617.83
neurotic    4 5    15.0     4.18     14    15.0     4.45    10    21    11 0.25    -1.72    1.87
------------------------------------------------------------------------------------------ 
group: Student
         vars n   mean      sd median trimmed     mad min   max range  skew kurtosis      se
friends     1 5   13.2    2.77     12    13.2    2.97  10    17     7  0.23    -1.89    1.24
alcohol     2 5   19.2    3.56     18    19.2    2.97  16    25     9  0.66    -1.43    1.59
income      3 5 3622.0 4135.69   3000  3622.0 4299.54  10 10000  9990  0.48    -1.64 1849.54
neurotic    4 5   11.2    3.03     13    11.2    1.48   7    14     7 -0.37    -2.01    1.36

The first argument is the data frame, and the second argument is a vector which groups the data (usually a factor).

The `stat.desc()` function

The pastecs package includes the stat.desc() function for summarizing data frames. Since the output of the function is relatively cluttered, you should also specify how many decimal places should be displayed with the round() function. If the argument norm=TRUE is set, tests for normal distribution are performed for all columns.

library(pastecs)
round(
    stat.desc(df[, c("friends", "alcohol", "income", "neurotic")], norm=TRUE),
    digits=2
)

             friends alcohol       income neurotic
nbr.val        10.00   10.00        10.00    10.00
nbr.null        1.00    0.00         0.00     0.00
nbr.na          0.00    0.00         0.00     0.00
min             0.00    5.00        10.00     7.00
max            17.00   30.00     50000.00    21.00
range          17.00   25.00     49990.00    14.00
sum            78.00  176.00    185110.00   131.00
median          7.50   17.50     15000.00    13.00
mean            7.80   17.60     18511.00    13.10
SE.mean         1.94    2.23      5692.53     1.26
CI.mean.0.95    4.39    5.04     12877.39     2.85
var            37.73   49.60 324048765.56    15.88
std.dev         6.14    7.04     18001.35     3.98
coef.var        0.79    0.40         0.97     0.30
skewness        0.11   -0.03         0.45     0.36
skew.2SE        0.08   -0.03         0.33     0.26
kurtosis       -1.75   -0.75        -1.48    -0.66
kurt.2SE       -0.66   -0.28        -0.55    -0.25
normtest.W      0.92    0.98         0.90     0.95
normtest.p      0.37    0.94         0.20     0.65

Note

R supports decimal numbers in the so-called scientific notation. This notation uses powers of ten, which can be entered with e (e can be read as “times ten to the power of”).

1e0  # 1 times 10 to the power of 0

[1] 1

-4e0  # -4 times 10 to the power of 0

[1] -4

1e1  # 1 times 10 to the power of 1

[1] 10

3.5e2  # 3.5 times 10 to the power of 2

[1] 350

1e-2  # 1 times 10 to the power of -2

[1] 0.01

1e-15  # 1 times 10 to the power of -15

[1] 1e-15

Tip

If you want to avoid the output of numbers in scientific notation as much as possible, you can set the following option at the beginning of a session:

options(scipen=100)

However, this does not guarantee that all numbers are displayed in decimal notation. Another option is to display numbers as normal decimal numbers using format(). This always results in a character vector, but does not require changing an option:

format(1e-15, scientific=FALSE)

[1] "0.000000000000001"

Grouping with `by()`

There is no direct equivalent for grouped data for the stat.desc() function (like describeBy() for describe()). However, R provides the generic by() function, which applies any given function to grouped data. The first argument is the data frame, the second argument is the grouping column (vector), and the third argument is the function to be applied to the grouped data.

by(df[, c("friends", "alcohol", "income", "neurotic")], df$job, describe)

df$job: Lecturer
         vars n    mean       sd median trimmed      mad   min   max range skew kurtosis      se
friends     1 5     2.4     2.07      2     2.4     2.97     0     5     5 0.11    -2.03    0.93
alcohol     2 5    16.0     9.62     15    16.0     7.41     5    30    25 0.28    -1.72    4.30
income      3 5 33400.0 12561.85  35000 33400.0 19273.80 20000 50000 30000 0.10    -1.98 5617.83
neurotic    4 5    15.0     4.18     14    15.0     4.45    10    21    11 0.25    -1.72    1.87
------------------------------------------------------------------------------------------ 
df$job: Student
         vars n   mean      sd median trimmed     mad min   max range  skew kurtosis      se
friends     1 5   13.2    2.77     12    13.2    2.97  10    17     7  0.23    -1.89    1.24
alcohol     2 5   19.2    3.56     18    19.2    2.97  16    25     9  0.66    -1.43    1.59
income      3 5 3622.0 4135.69   3000  3622.0 4299.54  10 10000  9990  0.48    -1.64 1849.54
neurotic    4 5   11.2    3.03     13    11.2    1.48   7    14     7 -0.37    -2.01    1.36

If you want to pass arguments to the function in the third argument itself (e.g., norm=TRUE for the stat.desc() function), you can do this with additional arguments at the very end:

by(
    df[, c("friends", "alcohol", "income", "neurotic")],
    df$job,
    stat.desc,
    norm=TRUE
)

df$job: Lecturer
                 friends    alcohol        income   neurotic
nbr.val       5.00000000  5.0000000  5.000000e+00  5.0000000
nbr.null      1.00000000  0.0000000  0.000000e+00  0.0000000
nbr.na        0.00000000  0.0000000  0.000000e+00  0.0000000
min           0.00000000  5.0000000  2.000000e+04 10.0000000
max           5.00000000 30.0000000  5.000000e+04 21.0000000
range         5.00000000 25.0000000  3.000000e+04 11.0000000
sum          12.00000000 80.0000000  1.670000e+05 75.0000000
median        2.00000000 15.0000000  3.500000e+04 14.0000000
mean          2.40000000 16.0000000  3.340000e+04 15.0000000
SE.mean       0.92736185  4.3011626  5.617829e+03  1.8708287
CI.mean.0.95  2.57476927 11.9419419  1.559759e+04  5.1942532
var           4.30000000 92.5000000  1.578000e+08 17.5000000
std.dev       2.07364414  9.6176920  1.256185e+04  4.1833001
coef.var      0.86401839  0.6011058  3.761032e-01  0.2788867
skewness      0.11304669  0.2832618  9.869949e-02  0.2458756
skew.2SE      0.06191822  0.1551489  5.405994e-02  0.1346716
kurtosis     -2.03411574 -1.7235062 -1.980205e+00 -1.7239184
kurt.2SE     -0.50852893 -0.4308766 -4.950513e-01 -0.4309796
normtest.W    0.95235149  0.9787162  9.342460e-01  0.9785712
normtest.p    0.75397300  0.9276364  6.255965e-01  0.9268345
------------------------------------------------------------------------------------------ 
df$job: Student
                friends    alcohol        income   neurotic
nbr.val       5.0000000  5.0000000  5.000000e+00  5.0000000
nbr.null      0.0000000  0.0000000  0.000000e+00  0.0000000
nbr.na        0.0000000  0.0000000  0.000000e+00  0.0000000
min          10.0000000 16.0000000  1.000000e+01  7.0000000
max          17.0000000 25.0000000  1.000000e+04 14.0000000
range         7.0000000  9.0000000  9.990000e+03  7.0000000
sum          66.0000000 96.0000000  1.811000e+04 56.0000000
median       12.0000000 18.0000000  3.000000e+03 13.0000000
mean         13.2000000 19.2000000  3.622000e+03 11.2000000
SE.mean       1.2409674  1.5937377  1.849536e+03  1.3564660
CI.mean.0.95  3.4454778  4.4249254  5.135136e+03  3.7661534
var           7.7000000 12.7000000  1.710392e+07  9.2000000
std.dev       2.7748874  3.5637059  4.135689e+03  3.0331502
coef.var      0.2102187  0.1856097  1.141825e+00  0.2708170
skewness      0.2291423  0.6649718  4.835220e-01 -0.3663862
skew.2SE      0.1255064  0.3642200  2.648359e-01 -0.2006780
kurtosis     -1.8935200 -1.4346010 -1.644573e+00 -2.0145180
kurt.2SE     -0.4733800 -0.3586503 -4.111433e-01 -0.5036295
normtest.W    0.9385501  0.8852008  8.943871e-01  0.8576350
normtest.p    0.6557061  0.3335463  3.796449e-01  0.2198809

by(df[, -c(1, 2)], df$job, summary)

df$job: Lecturer
    friends       alcohol       income         neurotic 
 Min.   :0.0   Min.   : 5   Min.   :20000   Min.   :10  
 1st Qu.:1.0   1st Qu.:10   1st Qu.:22000   1st Qu.:13  
 Median :2.0   Median :15   Median :35000   Median :14  
 Mean   :2.4   Mean   :16   Mean   :33400   Mean   :15  
 3rd Qu.:4.0   3rd Qu.:20   3rd Qu.:40000   3rd Qu.:17  
 Max.   :5.0   Max.   :30   Max.   :50000   Max.   :21  
------------------------------------------------------------------------------------------ 
df$job: Student
    friends        alcohol         income         neurotic   
 Min.   :10.0   Min.   :16.0   Min.   :   10   Min.   : 7.0  
 1st Qu.:12.0   1st Qu.:17.0   1st Qu.:  100   1st Qu.: 9.0  
 Median :12.0   Median :18.0   Median : 3000   Median :13.0  
 Mean   :13.2   Mean   :19.2   Mean   : 3622   Mean   :11.2  
 3rd Qu.:15.0   3rd Qu.:20.0   3rd Qu.: 5000   3rd Qu.:13.0  
 Max.   :17.0   Max.   :25.0   Max.   :10000   Max.   :14.0

by(df$friends, df$job, mean)

df$job: Lecturer
[1] 2.4
------------------------------------------------------------------------------------------ 
df$job: Student
[1] 13.2

Test for normality

The stat.desc() function already provides the result of the Shapiro-Wilk test for normality (the entries normtest.W and normtest.p contain the the test statistic and the significance, respectively). If normtest.p is significant (e.g., less than 0.05), the null hypothesis of normality can be rejected. You can also call the Shapiro-Wilk test directly with the shapiro.test() function:

shapiro.test(df$income)


    Shapiro-Wilk normality test

data:  df$income
W = 0.89721, p-value = 0.2041

Using the by() function, you can also apply the test separately to different groups. For example, to test the normality of the income column grouped by the job column:

by(df$income, df$job, shapiro.test)

df$job: Lecturer

    Shapiro-Wilk normality test

data:  dd[x, ]
W = 0.93425, p-value = 0.6256

------------------------------------------------------------------------------------------ 
df$job: Student

    Shapiro-Wilk normality test

data:  dd[x, ]
W = 0.89439, p-value = 0.3796

The Kolmogorov-Smirnov test tests if given data follows any given distribution, including the normal distribution. However, the Shapiro-Wilk test is preferable for testing normality, as it is specifically tailored to the normal distribution and therefore has more statistical power.

ks.test(df$income, "pnorm", mean(df$income), sd(df$income))


    Exact one-sample Kolmogorov-Smirnov test

data:  df$income
D = 0.18182, p-value = 0.8389
alternative hypothesis: two-sided

Since the sample size in our example is very small, no reasonable statements can be made about the distribution of the data anyway.

Test for homogeneity of variances

Levene’s test assesses the equality of variances (homogeneity of variance or homoscedasticity) of two or more groups. It tests the null hypothesis that the variances are equal in all groups. You can run the test with the leveneTest() function from the car package. To illustrate this, we look at the example data Moore, which is automatically loaded with the car package.

library(car)
head(Moore, 4)

  partner.status conformity fcategory fscore
1            low          8       low     37
2            low          4      high     57
3            low          8      high     65
4            low          7       low     20

tail(Moore, 4)

   partner.status conformity fcategory fscore
42           high         13      high     57
43           high         16       low     35
44           high         10      high     52
45           high         15    medium     44

summary(Moore)

 partner.status   conformity     fcategory      fscore     
 high:23        Min.   : 4.00   high  :15   Min.   :15.00  
 low :22        1st Qu.: 8.00   low   :15   1st Qu.:35.00  
                Median :12.00   medium:15   Median :43.00  
                Mean   :12.13               Mean   :43.11  
                3rd Qu.:15.00               3rd Qu.:55.00  
                Max.   :24.00               Max.   :68.00

We can run Levene’s test for the column conformity grouped by the column fcategory as follows:

leveneTest(Moore$conformity, Moore$fcategory)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2   0.046 0.9551
      42

In this example, the null hypothesis of variance equality of the variable conformity across the groups in fcategory cannot be rejected, as the p-value (here denoted as Pr(>F)) is very large.

Exercises

Calculate descriptive statistics such as the mean, median, minimum, and maximum for the four numerical variables Global_active_power, Global_reactive_power, Voltage, and Global_intensity from the electric power consumption data you imported in the previous session (Individual Household Electric Power Consumption).
- Calculate the statistics with the sapply() function.
- Calculate the statistics with the summary() function.
- Calculate the statistics with the describe() function from the psych package.
- Run the stat.desc() function from the pastecs package (round the results to one decimal place).
- What is the mean measured voltage Voltage?
- What is the median of the global active power Global_active_power?
- How many missing values are there in total and per column?
The palmerpenguins package contains the tibble penguins with measurements of physical characteristics of three penguin species (Adélie, Chinstrap, and Gentoo). Activate the package and answer the following questions with suitable R commands:
- How many rows and columns does the penguins dataset have?
- The columns species, island, and sex are factors – how many levels do they each have?
- Calculate descriptive statistics separately for the three penguin species! What are the mean values of the columns bill_length_mm, bill_depth_mm, and flipper_length_mm for the three species?
- Are there missing values in the data? If so, how many are there in total and per column?
The dataset mtcars contains various characteristics of 32 cars.
- Calculate the minimum and maximum as well as the mean and median of all columns.
- Check if the data in the column mpg is normally distributed.
- Check for variance homogeneity of the data in the column mpg for the three groups defined by the column cyl.
Load the dataset lecturer.dat and calculate the means of all numerical columns grouped by the column job using the by() function. Why does this calculation not work when passing the mean function as an argument to by()?
Generate a vector with 5001 standard normally distributed random numbers using the rnorm() function and check if this vector is normally distributed. Use the Shapiro-Wilk test for this purpose. How do you interpret the result?

Introduction

The summary() function

The describe() function

The stat.desc() function

Grouping with by()

Test for normality

Test for homogeneity of variances

Exercises

The `summary()` function

The `describe()` function

The `stat.desc()` function

Grouping with `by()`