Introduction
Why do we need another package on top of NumPy? Could we not just use NumPy arrays for our data analysis tasks? Unfortunately, in many cases this is not going to work. NumPy arrays are homogeneous arrays, which means that all array items must have the same type. However, real-world tabular data generally consists of variables (columns) that have different types. For example, whereas some columns might contain integers, others might hold floating point numbers, text (strings), dates, or categorical data. Such a diverse collection of variables cannot be represented by a NumPy array.¹
Pandas² offers a flexible data type for tabular data that builds upon NumPy arrays but can accommodate heterogeneous column types: the so-called data frame.
By convention, we import pandas as follows:

import pandas as pd
Before we discuss data frames, we will introduce series as essential building blocks.
Series
A pandas series (pd.Series
) is a sequence-like homogeneous object, similar to a one-dimensional NumPy array. In fact, series are based on NumPy arrays under the hood, so they share important properties. For example, just like a NumPy array, a series has a specific data type (which means that all elements have the same type).
We can construct a series from any iterable, such as a standard list. Here’s an example:
pd.Series([1, 2, 3, 4, 5])
0 1
1 2
2 3
3 4
4 5
dtype: int64
The output shows two columns: the values are in the right column, whereas the left column shows the index (which by default starts with zero). The bottom row displays the data type (int64
in this case).
We can enforce a specific data type with the dtype
argument:
pd.Series([1, 2, 3, 4, 5], dtype="float64")
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64
Pandas data types are based on NumPy array scalars. The most common sized numerical types are:
- Integers: int8, int16, int32, and int64
- Unsigned integers: uint8, uint16, uint32, and uint64
- Floating point numbers: float16, float32, float64, and float128
- Complex numbers: complex64, complex128, and complex256
The shorthand notations int
, uint
, float
, and complex
map to int64
, uint64
, float64
, and complex128
on most machines, respectively. The numbers indicate the bits used to store one element of a particular type (for example, an int16
value requires 16 bits, or 2 bytes, of memory). Type specifications should be passed as a string when used as the dtype
argument:
pd.Series([1, 2, 3], dtype="int")
0 1
1 2
2 3
dtype: int64
pd.Series([1, 2, 3], dtype="uint8")
0 1
1 2
2 3
dtype: uint8
pd.Series([1, 2, 3], dtype="float32")
0 1.0
1 2.0
2 3.0
dtype: float32
pd.Series([1, 2, 3], dtype="complex128")
0 1.0+0.0j
1 2.0+0.0j
2 3.0+0.0j
dtype: complex128
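As a quick check of these sizes, we can inspect the nbytes attribute of a series, which reports how many bytes its values occupy (the variable name s is only used for illustration):

s = pd.Series([1, 2, 3], dtype="int16")
s.nbytes  # 3 elements times 2 bytes each = 6 bytes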
Strings are stored as object
series by default (object
is the base class in Python that all other data types inherit from – so everything is an object
, including strings):
"a", "b", "c"]) pd.Series([
0 a
1 b
2 c
dtype: object
Finally, boolean series are also supported:
pd.Series([True, True, False])
0 True
1 True
2 False
dtype: bool
Besides these basic data types, pandas also features so-called extension data types. Currently, the most important extension data types are:
- category for categorical data
- Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, and UInt64 for integer types with support for missing values (note the upper case initial letters)
- string for string data
- boolean for boolean data with support for missing values (note that this is different from the basic bool data type)
"a", "b", "c"], dtype="category") pd.Series([
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
pd.Series([1, 2, 3], dtype="Int64")
0 1
1 2
2 3
dtype: Int64
"a", "b", "c"], dtype="string") pd.Series([
0 a
1 b
2 c
dtype: string
pd.Series([True, True, False], dtype="boolean")
0 True
1 True
2 False
dtype: boolean
In general, it is recommended to use these extension data types: they perform as well as the core types, but offer better support for missing values and additional functionality.
Data frames
Most real-world data can be represented as a table consisting of rows and columns. Pandas represents tabular data with a data frame, which is available as the pd.DataFrame
type. It is basically a collection of columns, which correspond to individual pd.Series
objects.
Creating data frames
From arrays
Let’s start with a simple example, which creates a data frame from a two-dimensional NumPy array:
import numpy as np
data = np.arange(1, 21).reshape((4, 5))
pd.DataFrame(data)
0 1 2 3 4
0 1 2 3 4 5
1 6 7 8 9 10
2 11 12 13 14 15
3 16 17 18 19 20
The resulting data frame does not look very different from a NumPy array, except for row and column labels. However, keep in mind that each column could hold values with a different data type (because each column is actually a pd.Series
object).
We can supply custom column labels when creating a data frame:
=["A", "B", "C", "D", "E"]) pd.DataFrame(data, columns
A B C D E
0 1 2 3 4 5
1 6 7 8 9 10
2 11 12 13 14 15
3 16 17 18 19 20
Similarly, we could also set custom row labels with the index
argument. However, this is rarely needed since consecutively numbered rows starting at zero are fine for most use cases.
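For completeness, here is a brief sketch of what custom row labels would look like (the labels are made up for illustration):

pd.DataFrame(data, index=["r1", "r2", "r3", "r4"], columns=["A", "B", "C", "D", "E"])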
Column-wise
Another way to generate a data frame is to pass a dictionary, where each key corresponds to a column label and the corresponding value is a sequence-like object containing the column elements:
"A": [1., 2, 3, 4], "B": ["w", "x", "y", "z"], "C": np.arange(4)}) pd.DataFrame({
A B C
0 1.0 w 0
1 2.0 x 1
2 3.0 y 2
3 4.0 z 3
In this example, each column has a different data type. The first column contains floating point numbers, the second column contains strings, and the third column holds integers.
The first element in the list [1., 2, 3, 4]
is a float. Since all elements within a column must have the same type, all other values in that column are coerced to float as well.
Row-wise
Sometimes, we might want to create a data frame row by row. We can use pd.DataFrame.from_dict()
in combination with orient="index"
for this purpose:
1: [1, "a", 3, 5.], 2: [2, "b", 4, 6.]}, orient="index") pd.DataFrame.from_dict({
0 1 2 3
1 1 a 3 5.0
2 2 b 4 6.0
The dictionary keys are used for the row labels this time.
Viewing data frames
Every data frame has useful attributes and methods that we can use to inspect it in more detail. Let’s start with one of the example data frames we saw previously:
= pd.DataFrame({"A": [1., 2, 3, 4], "B": ["w", "x", "y", "z"], "C": np.arange(4)})
df df
A B C
0 1.0 w 0
1 2.0 x 1
2 3.0 y 2
3 4.0 z 3
We can access its row and column names with the index
and columns
attributes, respectively:
df.index
RangeIndex(start=0, stop=4, step=1)
df.columns
Index(['A', 'B', 'C'], dtype='object')
One of the most useful methods is info()
, which displays essential information such as the data type and number of non-missing values for each column:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null float64
1 B 4 non-null object
2 C 4 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 228.0+ bytes
In this example, we see that df
consists of four rows and three columns. For each column, the label, number of non-missing (non-null) values, and data type are listed. Column A contains float64
values (double precision floating point numbers), column B contains strings (available as object
), and column C contains int64
integers.
Obviously, typing df
in an interactive interpreter prints the data frame:
df
A B C
0 1.0 w 0
1 2.0 x 1
2 3.0 y 2
3 4.0 z 3
When the data frame has many rows and/or columns, the output only shows the first and last few rows and columns:
import string  # needed to create a list of the 26 letters

df = pd.DataFrame(
    np.random.randint(-100, 100, (1000, 26)),
    columns=list(string.ascii_uppercase)  # custom column labels
)
df
A B C D E F G H I ... R S T U V W X Y Z
0 65 -42 -13 52 -94 -11 22 92 20 ... 7 -43 27 58 2 -72 -11 52 54
1 -52 63 -38 78 -3 84 51 50 -80 ... 96 -93 -16 13 18 54 -100 75 -90
2 -58 -30 14 -94 -93 11 -7 8 18 ... -5 82 -15 65 -76 -26 3 -65 67
3 -80 43 79 46 -86 -81 -57 -53 -39 ... 54 47 63 68 12 22 -83 79 83
4 -44 -59 -56 -40 41 44 96 -62 -65 ... -63 74 -72 -33 -72 97 -30 -99 -53
.. .. .. .. .. .. .. .. .. .. ... .. .. .. .. .. .. ... .. ..
995 -63 59 90 33 35 66 -11 13 -60 ... 38 -89 11 -81 -28 -72 -99 92 56
996 28 -47 35 -47 -54 -84 -44 79 80 ... 69 72 -42 -83 82 -16 -97 96 7
997 -58 77 18 13 -3 -95 4 4 -7 ... 44 -83 -74 8 66 -14 -96 54 -7
998 2 96 -40 -95 47 1 -91 92 10 ... 0 29 99 88 -67 14 -37 -38 83
999 89 26 37 39 0 -86 64 -11 34 ... 28 53 -16 2 55 -86 -53 69 64
[1000 rows x 26 columns]
We can change this behavior by setting the following two options to appropriate values (None
means unlimited, whereas 0
uses the defaults):
"display.max_rows", None) # print all rows
pd.set_option("display.max_columns", None) # print all columns pd.set_option(
This changes the display options globally, so from now on all data frames will use these display settings. However, we often want to change these settings only for a particular data frame. We can use a context manager for this purpose:
with pd.option_context("display.max_rows", 4, "display.max_columns", 4):
    print(df)
A B ... Y Z
0 65 -42 ... 52 54
1 -52 63 ... 75 -90
.. .. .. ... .. ..
998 2 96 ... -38 83
999 89 26 ... 69 64
[1000 rows x 26 columns]
The modified display options are only valid within the context manager (the indented block of code).
Another way to inspect the first or last rows of a data frame is available with the head()
and tail()
methods, respectively:
df.head()  # first 5 rows
A B C D E F G H I ... R S T U V W X Y Z
0 65 -42 -13 52 -94 -11 22 92 20 ... 7 -43 27 58 2 -72 -11 52 54
1 -52 63 -38 78 -3 84 51 50 -80 ... 96 -93 -16 13 18 54 -100 75 -90
2 -58 -30 14 -94 -93 11 -7 8 18 ... -5 82 -15 65 -76 -26 3 -65 67
3 -80 43 79 46 -86 -81 -57 -53 -39 ... 54 47 63 68 12 22 -83 79 83
4 -44 -59 -56 -40 41 44 96 -62 -65 ... -63 74 -72 -33 -72 97 -30 -99 -53
[5 rows x 26 columns]
df.tail()  # last 5 rows
A B C D E F G H I ... R S T U V W X Y Z
995 -63 59 90 33 35 66 -11 13 -60 ... 38 -89 11 -81 -28 -72 -99 92 56
996 28 -47 35 -47 -54 -84 -44 79 80 ... 69 72 -42 -83 82 -16 -97 96 7
997 -58 77 18 13 -3 -95 4 4 -7 ... 44 -83 -74 8 66 -14 -96 54 -7
998 2 96 -40 -95 47 1 -91 92 10 ... 0 29 99 88 -67 14 -37 -38 83
999 89 26 37 39 0 -86 64 -11 34 ... 28 53 -16 2 55 -86 -53 69 64
[5 rows x 26 columns]
df.head(10)  # first 10 rows
A B C D E F G H I ... R S T U V W X Y Z
0 65 -42 -13 52 -94 -11 22 92 20 ... 7 -43 27 58 2 -72 -11 52 54
1 -52 63 -38 78 -3 84 51 50 -80 ... 96 -93 -16 13 18 54 -100 75 -90
2 -58 -30 14 -94 -93 11 -7 8 18 ... -5 82 -15 65 -76 -26 3 -65 67
3 -80 43 79 46 -86 -81 -57 -53 -39 ... 54 47 63 68 12 22 -83 79 83
4 -44 -59 -56 -40 41 44 96 -62 -65 ... -63 74 -72 -33 -72 97 -30 -99 -53
5 83 -20 25 50 -18 -98 -58 -50 10 ... -48 41 91 1 24 28 98 59 -35
6 67 -24 -7 17 -33 -66 97 -69 92 ... 80 -58 -2 3 -54 70 9 -58 -72
7 44 -33 -33 18 -60 -84 86 75 -19 ... -44 58 59 35 -60 -73 42 89 -92
8 -78 88 -75 -79 -56 82 6 -57 82 ... -91 -10 -82 -75 -14 65 -49 -56 4
9 -54 1 98 -21 55 71 95 -78 -84 ... -79 -92 36 -43 65 -72 -28 55 31
[10 rows x 26 columns]
Finally, we can generate basic summary statistics for each column, which is especially useful for numeric columns:
df.describe()
A B ... Y Z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean -2.331000 0.919000 ... 0.422000 -1.573000
std 59.563624 58.540904 ... 58.125575 57.518754
min -100.000000 -100.000000 ... -100.000000 -100.000000
25% -54.250000 -47.250000 ... -49.000000 -51.000000
50% -5.500000 -1.000000 ... 3.000000 -3.500000
75% 50.000000 54.000000 ... 52.000000 48.000000
max 99.000000 99.000000 ... 99.000000 99.000000
[8 rows x 26 columns]
We can round the values in this table to fit more columns on the screen:
df.describe().round(2)  # rounded to two decimal places
A B C D ... W X Y Z
count 1000.00 1000.00 1000.00 1000.00 ... 1000.00 1000.00 1000.00 1000.00
mean -2.33 0.92 0.90 -2.12 ... -3.07 -3.25 0.42 -1.57
std 59.56 58.54 57.15 58.54 ... 59.13 57.61 58.13 57.52
min -100.00 -100.00 -100.00 -100.00 ... -100.00 -100.00 -100.00 -100.00
25% -54.25 -47.25 -50.00 -53.00 ... -57.00 -52.25 -49.00 -51.00
50% -5.50 -1.00 2.00 -1.00 ... -4.00 -6.00 3.00 -3.50
75% 50.00 54.00 50.00 49.00 ... 50.00 48.00 52.00 48.00
max 99.00 99.00 99.00 99.00 ... 99.00 99.00 99.00 99.00
[8 rows x 26 columns]
Missing data
Almost any real-world data contains missing values. Pandas encodes missing numeric values as NaN
(“Not a Number”), a sentinel value defined in the IEEE 754 floating point standard. Therefore, this missing value only exists for floating point data, and integer series are automatically converted to float if they contain missing values. Datetime types have a special NaT
(“Not a Time”) value to represent missing dates. Strings and boolean types can use None
for missing values. In summary, missing data is represented with various different values depending on the underlying data type.
Recently, pandas has started to unify missing data support with a native NA
(“Not Available”) value (available as pd.NA
). Future versions of pandas will likely support NA
for all data types, but right now we have to live with the variety of different missing value representations (which of course are all handled correctly by pandas).
To illustrate these ideas, let’s create a data frame with some missing values:
= pd.DataFrame({"A": [1, None, 3], "B": [1., 2., None], "C": [True, False, None]})
df df
A B C
0 1.0 1.0 True
1 NaN 2.0 False
2 3.0 NaN None
df.dtypes
A float64
B float64
C object
dtype: object
Notice that the first column is coerced to float64
, because it contains a missing value (otherwise, it would be int64
). In both numerical columns, missing values are represented as NaN
, whereas the last column is interpreted as a string (object
) column because of the None
element.
These types are probably not what we expected, so it might be a good idea to use extension data types discussed earlier. Most extension data types support the native NA
value, which is displayed as <NA>
:
df = pd.DataFrame(
    {
        "A": pd.Series([1, pd.NA, 2], dtype="Int64"),
        "B": pd.Series([True, False, pd.NA], dtype="boolean"),
        "C": pd.Series([pd.NA, "b", "c"], dtype="string"),
        "D": [0.0, 0.5, 1.5],
        "E": [10, 11, 12]
    },
    index=[2, 0, 1]
)
df
A B C D E
2 2 <NA> c 0.0 10
0 1 True <NA> 0.5 11
1 <NA> False b 1.5 12
df.dtypes
A Int64
B boolean
C string[python]
D float64
E int64
dtype: object
It is easy to check which values are missing in a data frame with the isna()
method:
df.isna()
A B C D E
2 False True False False False
0 False False True False False
1 True False False False False
This works for all kinds of missing value representations, including NaN
, None
, and NA
.
Indexing
Basic indexing
Pandas data frames can be indexed to create subsets of the underlying data. Using square bracket notation, we can select single columns by label:
"B"] df[
2 <NA>
0 True
1 False
Name: B, dtype: boolean
As we already know, this returns a pandas series.
Instead of using square bracket notation, we can also index columns with dot notation, for example:
# same as df["B"] df.B
2 <NA>
0 True
1 False
Name: B, dtype: boolean
However, keep in mind that this only works for column labels that are also valid Python names. Square bracket notation always works.
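For example, a column whose label contains a space (the data frame and label below are made up for illustration) can only be selected with square brackets:

measurements = pd.DataFrame({"body mass": [65, 83], "height": [166, 177]})
measurements["body mass"]  # works
# measurements.body mass   # not valid Python syntax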
We can select multiple columns by specifying a list of column labels within the square brackets:
"A", "C"]] df[[
A C
2 2 c
0 1 <NA>
1 <NA> b
This returns a new data frame with only the specified columns.
The same notation can also be used for filtering rows. Instead of column labels, we have to provide a boolean list (or series), for example as generated by a comparison:
"A"] > 1] df[df[
A B C D E
2 2 <NA> c 0.0 10
df[[True, True, False]]  # rows 0 and 1
A B C D E
2 2 <NA> c 0.0 10
0 1 True <NA> 0.5 11
Indexing with loc
and iloc
More sophisticated indexing is available via the loc
and iloc
accessors. Whereas loc
uses row and column labels, iloc
works with integer indices instead. Both variants support simultaneous row and column indexing (by passing a tuple with the desired row and column labels or indices, respectively).
Here are some examples showing loc
in action (note that slices include the end point). First, we can grab a single column by its label:
"B"] df.loc[:,
2 <NA>
0 True
1 False
Name: B, dtype: boolean
The same works for getting a single row with the index 1
(note that this refers to the index labeled 1
and not position 1
):
df.loc[1, :]
A <NA>
B False
C b
D 1.5
E 12
Name: 1, dtype: object
We can even specify a range of columns (or rows). In contrast to almost everything else in Python, the end points of slices in loc are always inclusive. This example gets columns B
through D
(inclusive):
"B":"D"] df.loc[:,
B C D
2 <NA> c 0.0
0 True <NA> 0.5
1 False b 1.5
Combining row and column slices is also possible (again, note that we are referring to index and column labels here):
0:1, "B":"D"] df.loc[
B C D
0 True <NA> 0.5
1 False b 1.5
In contrast, iloc
always refers to row and column positions instead of labels. Slices do not include the end point. Here are some examples:
df.iloc[:, 1]  # column 1
2 <NA>
0 True
1 False
Name: B, dtype: boolean
df.iloc[0, :]  # row 0
A 2
B <NA>
C c
D 0.0
E 10
Name: 2, dtype: object
df.iloc[:, 1:3]  # columns 1 and 2
B C
2 <NA> c
0 True <NA>
1 False b
df.iloc[:1, 1:3]  # row 0 and columns 1 and 2
B C
2 <NA> c
Importing data
Pandas supports importing data from a multitude of different formats. In this section, we will showcase how to import some common data formats. In almost all cases, data from a successfully imported file will be available as a data frame.
Text files
Text files are a popular option for storing small to medium-sized data sets. Because text files are simple, they can be opened with any text editor and almost any tool you might use for data analysis. However, importing text files might involve a lot of parameter tweaking, because there is no universal standard on how data should be formatted inside the file (for example, text files can use arbitrary column separators, header rows, comments, decimal number formats, and so on). Moreover, text files store data inefficiently, which makes them a bad choice for very large data sets.
The function pd.read_csv()
is the main workhorse. It can import almost any kind of text data and features a large number of parameters that can be tuned to the idiosyncrasies of a given file. Let’s start with a simple text file named data.csv
(located in the working directory) with the following contents:
name,id,height,weight,gender,bp
Bob,4,177,83,m,93.2
Sue,2,179,,f,98.8
Alice,1,166,65,f,87.1
Hank,3,187,90,m,91
Tom,5,182,77,m,101.7
Mara,6,171,69,f,88.6
This looks pretty straightforward. The first row is a header with the column labels. The six columns are separated by ,
(commas), and there are six rows of data (not including the header). Note that the weight
value in the second row is missing.
Importing this file is as simple as:
= pd.read_csv("data.csv")
df df
name id height weight gender bp
0 Bob 4 177 83.0 m 93.2
1 Sue 2 179 NaN f 98.8
2 Alice 1 166 65.0 f 87.1
3 Hank 3 187 90.0 m 91.0
4 Tom 5 182 77.0 m 101.7
5 Mara 6 171 69.0 f 88.6
Let’s take a look at the inferred column data types:
df.dtypes
name object
id int64
height int64
weight float64
gender object
bp float64
dtype: object
Pandas uses the NumPy-based data types by default, so columns name
and gender
(which contain strings) are available as object
columns. Columns id
and height
contain integers (int64
), and column bp
holds floating point numbers (float64
). Although column weight
originally contains integers, pandas converts it to float64
due to the presence of a missing value.
If we are not happy with one of the automatically inferred data types, we can always change individual columns after importing with the astype()
method. For example, the gender
column could be represented with a categorical data type. We could also use the Int64
data type for the weight
column, which supports missing values. Finally, we could convert the id
column from object
to the more modern string
data type.
We do not have to perform any of these conversions, because the automatically inferred data types are fine in most cases (and in particular, weight
is actually a continuous variable, so it is a good idea to use float64
for it).
"gender"] = df["gender"].astype("category")
df["weight"] = df["weight"].astype("Int64")
df["id"] = df["id"].astype("string")
df[ df.dtypes
name object
id string[python]
height int64
weight Int64
gender category
bp float64
dtype: object
df
name id height weight gender bp
0 Bob 4 177 83 m 93.2
1 Sue 2 179 <NA> f 98.8
2 Alice 1 166 65 f 87.1
3 Hank 3 187 90 m 91.0
4 Tom 5 182 77 m 101.7
5 Mara 6 171 69 f 88.6
Alternatively, we can specify the desired data types when importing the file with the dtype
argument. This is more efficient, especially if the data frame is large:
df = pd.read_csv(
    "data.csv",
    dtype={"gender": "category", "weight": "Int64", "id": "string"}
)
df.dtypes
name object
id string[python]
height int64
weight Int64
gender category
bp float64
dtype: object
Let’s now take a look at another text file data.txt
with the following contents:
Bob;4;177;83;m;93,2
Sue;2;179;;f;98,8
Alice;1;166;65;f;87,1
Hank;3;187;90;m;91
Tom;5;182;77;m;101,7
Mara;6;171;69;f;88,6
This file differs from the previous one in three ways:
- There is no header row.
- Columns are separated by semi-colons (;).
- Decimal numbers use a comma (,) instead of a dot (.) as the decimal mark (which is common in German-speaking regions).
In order to import this file, we need to tweak some parameters, but let’s first see what happens if we just read the file as before:
"data.txt") pd.read_csv(
Bob;4;177;83;m;93 2
0 Sue;2;179;;f;98 8.0
1 Alice;1;166;65;f;87 1.0
2 Hank;3;187;90;m;91 NaN
3 Tom;5;182;77;m;101 7.0
4 Mara;6;171;69;f;88 6.0
As expected, this does not work, because the resulting data frame consists of only two columns, and the column labels contain actual data. We can use the following arguments to adapt to the three changes we identified previously:
- header=None specifies that there is no header row in the data file.
- sep=";" sets the column separator.
- decimal="," sets the decimal marker.
With these additional arguments, the resulting data frame looks much better:
"data.txt", header=None, sep=";", decimal=",") pd.read_csv(
0 1 2 3 4 5
0 Bob 4 177 83.0 m 93.2
1 Sue 2 179 NaN f 98.8
2 Alice 1 166 65.0 f 87.1
3 Hank 3 187 90.0 m 91.0
4 Tom 5 182 77.0 m 101.7
5 Mara 6 171 69.0 f 88.6
The only thing missing are proper column names, which we can provide with the names
argument:
pd.read_csv("data.txt",
=None,
header=";",
sep=",",
decimal=["name", "id", "height", "weight", "gender", "bp"]
names )
name id height weight gender bp
0 Bob 4 177 83.0 m 93.2
1 Sue 2 179 NaN f 98.8
2 Alice 1 166 65.0 f 87.1
3 Hank 3 187 90.0 m 91.0
4 Tom 5 182 77.0 m 101.7
5 Mara 6 171 69.0 f 88.6
Of course, we could also specify custom column types as we did in our previous example. The pd.read_csv()
function has many additional parameters that we do not have time to discuss. Instead, I encourage you to consult the documentation whenever you have problems importing a specific file. Chances are that many issues can be solved by setting a parameter to a suitable value.
A quick way to create a toy data frame is to use pd.read_csv()
in combination with a string representing the contents of a fictitious CSV file. Since the function always expects a file name, we use io.StringIO()
to imitate this behavior (we can read from the given StringIO
object as if it was a file):
from io import StringIO

pd.read_csv(StringIO("A,B,C\n1.,2,x\n4.,5,y\n7.,8,z"))
A B C
0 1.0 2 x
1 4.0 5 y
2 7.0 8 z
Other useful functions for reading text data include pd.read_clipboard()
, pd.read_fwf()
, pd.read_json()
, and pd.read_xml()
. Consult the pandas documentation to learn more.
Excel files
Like it or not, Excel is still a popular tool for analyzing (small) data sets. It is also common as a data storage format (.xlsx
). Pandas can import Excel sheets with pd.read_excel()
.
To import Excel files with pandas, you need to install the following packages depending on the file format:
- openpyxl for modern Excel files (.xlsx)
- xlrd for older Excel files (.xls)
- odfpy for OpenDocument files (.ods)
It is generally a good idea to install all three dependencies if you are planning to work with Excel files.
Let’s import data from an Excel file named data.xlsx
(located in the working directory), which contains the same table we saw in the previous section:
"data.xlsx") pd.read_excel(
name id height weight gender bp
0 Bob 4 177 83.0 m 93.2
1 Sue 2 179 NaN f 98.8
2 Alice 1 166 65.0 f 87.1
3 Hank 3 187 90.0 m 91.0
4 Tom 5 182 77.0 m 101.7
5 Mara 6 171 69.0 f 88.6
We can use many of the arguments we already know from pd.read_csv()
(such as header
and dtype
). In addition, Excel files can contain multiple sheets. To specify which sheet you want to import, use the sheet_name
argument. For more information on all available parameters, consult the pd.read_excel()
reference.
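For example, assuming the workbook had a second sheet named "2014" (a made-up sheet name for illustration), we could import it like this:

pd.read_excel("data.xlsx", sheet_name="2014")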
SPSS/Stata/SAS files
Similar to Excel files, pandas can import data stored in SPSS, Stata, and SAS files with dedicated reader functions pd.read_spss()
, pd.read_stata()
, and pd.read_sas()
.
Reading SPSS, Stata, and SAS files requires the pyreadstat
package.
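These readers work just like the other import functions; the file names in the following sketch are placeholders:

pd.read_spss("data.sav")      # SPSS
pd.read_stata("data.dta")     # Stata
pd.read_sas("data.sas7bdat")  # SAS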
Data wrangling
Data wrangling can be broken down into the following five basic tasks (as defined in R for Data Science):
- Filtering observations (rows)
- Arranging (reordering) rows
- Selecting variables (columns)
- Creating new variables (columns) based on existing ones
- Summarizing data, typically in combination with grouping data
We will illustrate each operation using the flights
data frame from the nycflights13 data set. This data frame is available as a zipped CSV file flights.zip
in the working directory, and we can import it as follows:
flights = pd.read_csv(
    "flights.zip",
    dtype={
        "dep_time": "Int64",
        "dep_delay": "Int64",
        "arr_time": "Int64",
        "arr_delay": "Int64",
        "air_time": "Int64",
        "carrier": "string",
        "tailnum": "string",
        "origin": "string",
        "dest": "string"
    }
)
flights.drop(columns=["year", "hour", "minute", "time_hour"], inplace=True)
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 month 336776 non-null int64
1 day 336776 non-null int64
2 dep_time 328521 non-null Int64
3 sched_dep_time 336776 non-null int64
4 dep_delay 328521 non-null Int64
5 arr_time 328063 non-null Int64
6 sched_arr_time 336776 non-null int64
7 arr_delay 327346 non-null Int64
8 carrier 336776 non-null string
9 flight 336776 non-null int64
10 tailnum 334264 non-null string
11 origin 336776 non-null string
12 dest 336776 non-null string
13 air_time 327346 non-null Int64
14 distance 336776 non-null int64
dtypes: Int64(5), int64(6), string(4)
memory usage: 40.1 MB
Notice that in order to get better support for missing values, we specify the data types for integer and string columns manually as one of the extension types Int64
and string
, respectively. We also drop the year
, hour
, minute
, and time_hour
columns, because we won’t need them in our analyses (and they contain only redundant information anyway).
The nycflights13
data set contains data on all 336,776 flights departing from New York City airports (JFK, LGA, and EWR) in 2013. Let’s see what we can find out about these flights.
Filtering observations
First, we might want to know how many flights departed on New Year’s Day 2013. Each flight is represented by a row, so we need to filter rows with flights that departed on January 1 (columns month
and day
contain the dates).
The query()
method filters rows according to a specified condition based on values in one or more columns. The query must be passed as a single string, which may contain column names without having to use any special indexing syntax. The query returns a new data frame containing only the filtered rows, which in our example consists of 842 rows:
"month == 1 and day == 1") flights.query(
month day dep_time sched_dep_time ... origin dest air_time distance
0 1 1 517 515 ... EWR IAH 227 1400
1 1 1 533 529 ... LGA IAH 227 1416
2 1 1 542 540 ... JFK MIA 160 1089
3 1 1 544 545 ... JFK BQN 183 1576
4 1 1 554 600 ... LGA ATL 116 762
.. ... ... ... ... ... ... ... ... ...
837 1 1 2356 2359 ... JFK BQN 186 1576
838 1 1 <NA> 1630 ... EWR RDU <NA> 416
839 1 1 <NA> 1935 ... LGA DFW <NA> 1389
840 1 1 <NA> 1500 ... LGA MIA <NA> 1096
841 1 1 <NA> 600 ... JFK FLL <NA> 1069
[842 rows x 15 columns]
Notice that we combined two conditions with and
; we can also use or
, for example to filter flights that departed in November or December:
"month == 11 or month == 12") flights.query(
month day dep_time sched_dep_time ... origin dest air_time distance
55893 11 1 5 2359 ... JFK PSE 205 1617
55894 11 1 35 2250 ... JFK SYR 36 209
55895 11 1 455 500 ... EWR CLT 88 529
55896 11 1 539 545 ... LGA IAH 229 1416
55897 11 1 542 545 ... JFK MIA 147 1089
... ... ... ... ... ... ... ... ... ...
111291 12 31 <NA> 705 ... EWR DEN <NA> 1605
111292 12 31 <NA> 825 ... JFK CLT <NA> 541
111293 12 31 <NA> 1615 ... LGA RDU <NA> 431
111294 12 31 <NA> 600 ... EWR ORD <NA> 719
111295 12 31 <NA> 830 ... JFK LAX <NA> 2475
[55403 rows x 15 columns]
A common mistake is to use the query "month == 11 or 12"
, which does not work because the or
operator requires two complete conditions on both sides.
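One way to avoid spelling out both conditions is the in operator, which query() also understands; the following sketch produces the same result as the query above:

flights.query("month in [11, 12]")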
The air_time
column contains the flight time in minutes. Are there any flights with a duration of less than 25 minutes?
"air_time < 25") flights.query(
month day dep_time sched_dep_time ... origin dest air_time distance
826 1 1 2302 2200 ... EWR BDL 24 116
1281 1 2 1335 1322 ... EWR BDL 24 116
3953 1 5 1323 1325 ... EWR BDL 23 116
4202 1 5 1822 1829 ... EWR BDL 24 116
5130 1 6 2125 2129 ... EWR PHL 22 80
... ... ... ... ... ... ... ... ... ...
330634 9 24 1147 1153 ... EWR BDL 23 116
331117 9 24 1939 1945 ... JFK PHL 24 94
333416 9 27 844 845 ... JFK PHL 23 94
334279 9 28 741 738 ... EWR BDL 23 116
336708 9 30 2032 1945 ... JFK PHL 24 94
[235 rows x 15 columns]
What about flights that made up time? In other words, how many flights had a smaller arrival delay than their corresponding departure delay?
"arr_delay < dep_delay") flights.query(
month day dep_time sched_dep_time ... origin dest air_time distance
3 1 1 544 545 ... JFK BQN 183 1576
4 1 1 554 600 ... LGA ATL 116 762
7 1 1 557 600 ... LGA IAD 53 229
8 1 1 557 600 ... JFK MCO 140 944
11 1 1 558 600 ... JFK TPA 158 1005
... ... ... ... ... ... ... ... ... ...
336765 9 30 2240 2245 ... JFK SYR 41 209
336766 9 30 2240 2250 ... JFK BUF 52 301
336767 9 30 2241 2246 ... JFK ROC 47 264
336768 9 30 2307 2255 ... JFK BOS 33 187
336769 9 30 2349 2359 ... JFK PSE 196 1617
[221565 rows x 15 columns]
Instead of query()
, we can also filter rows with indexing. For example, flights with an air time of less than 25 minutes can be obtained with:
"air_time"] < 25] flights[flights[
month day dep_time sched_dep_time ... origin dest air_time distance
826 1 1 2302 2200 ... EWR BDL 24 116
1281 1 2 1335 1322 ... EWR BDL 24 116
3953 1 5 1323 1325 ... EWR BDL 23 116
4202 1 5 1822 1829 ... EWR BDL 24 116
5130 1 6 2125 2129 ... EWR PHL 22 80
... ... ... ... ... ... ... ... ... ...
330634 9 24 1147 1153 ... EWR BDL 23 116
331117 9 24 1939 1945 ... JFK PHL 24 94
333416 9 27 844 845 ... JFK PHL 23 94
334279 9 28 741 738 ... EWR BDL 23 116
336708 9 30 2032 1945 ... JFK PHL 24 94
[235 rows x 15 columns]
However, in most cases this is more verbose than query()
, especially when combining several conditions.
Filtering rows implicitly drops missing values that are present in the column(s) of interest. To explicitly drop missing values, we can use the dropna()
method. The following command drops all rows containing a missing value in any column:
flights.dropna()
month day dep_time sched_dep_time ... origin dest air_time distance
0 1 1 517 515 ... EWR IAH 227 1400
1 1 1 533 529 ... LGA IAH 227 1416
2 1 1 542 540 ... JFK MIA 160 1089
3 1 1 544 545 ... JFK BQN 183 1576
4 1 1 554 600 ... LGA ATL 116 762
... ... ... ... ... ... ... ... ... ...
336765 9 30 2240 2245 ... JFK SYR 41 209
336766 9 30 2240 2250 ... JFK BUF 52 301
336767 9 30 2241 2246 ... JFK ROC 47 264
336768 9 30 2307 2255 ... JFK BOS 33 187
336769 9 30 2349 2359 ... JFK PSE 196 1617
[327346 rows x 15 columns]
If we only want to drop rows with missing values in one (or more) specific columns, we can use the subset
argument. Here’s how to drop all rows that have missing values in the tailnum
column:
="tailnum") flights.dropna(subset
month day dep_time sched_dep_time ... origin dest air_time distance
0 1 1 517 515 ... EWR IAH 227 1400
1 1 1 533 529 ... LGA IAH 227 1416
2 1 1 542 540 ... JFK MIA 160 1089
3 1 1 544 545 ... JFK BQN 183 1576
4 1 1 554 600 ... LGA ATL 116 762
... ... ... ... ... ... ... ... ... ...
336769 9 30 2349 2359 ... JFK PSE 196 1617
336770 9 30 <NA> 1842 ... LGA BNA <NA> 764
336773 9 30 <NA> 1210 ... LGA BNA <NA> 764
336774 9 30 <NA> 1159 ... LGA CLE <NA> 419
336775 9 30 <NA> 840 ... LGA RDU <NA> 431
[334264 rows x 15 columns]
The dropna()
method does not modify the original data frame by default (instead, it returns a new data frame). This can be changed by setting inplace=True
.
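In other words, the following two variants are equivalent (shown only as a sketch, because we want to keep flights unchanged for the remaining examples; the name cleaned is made up):

cleaned = flights.dropna(subset="tailnum")        # returns a new data frame
# flights.dropna(subset="tailnum", inplace=True)  # would modify flights directly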
Arranging rows
Another frequently used operation is sorting rows in a data frame according to values in other columns. The sort_values()
method accomplishes this task, and by default it arranges rows in ascending order. For example, we can use this method to sort all flights by scheduled departure time within the entire year as follows:
"month", "day", "sched_dep_time"]) flights.sort_values([
month day dep_time sched_dep_time ... origin dest air_time distance
0 1 1 517 515 ... EWR IAH 227 1400
1 1 1 533 529 ... LGA IAH 227 1416
2 1 1 542 540 ... JFK MIA 160 1089
3 1 1 544 545 ... JFK BQN 183 1576
5 1 1 554 558 ... EWR ORD 150 719
... ... ... ... ... ... ... ... ... ...
111276 12 31 2328 2330 ... EWR SJU 198 1608
110520 12 31 13 2359 ... JFK BQN 189 1576
110521 12 31 18 2359 ... JFK SJU 192 1598
111278 12 31 2355 2359 ... JFK SJU 195 1598
111279 12 31 2356 2359 ... JFK PSE 200 1617
[336776 rows x 15 columns]
Which flights had the largest departure delay? To answer this question, we can sort by the dep_delay
column in descending order (so that flights with the largest delays are on top):
"dep_delay", ascending=False) flights.sort_values(
month day dep_time sched_dep_time ... origin dest air_time distance
7072 1 9 641 900 ... JFK HNL 640 4983
235778 6 15 1432 1935 ... JFK CMH 74 483
8239 1 10 1121 1635 ... EWR ORD 111 719
327043 9 20 1139 1845 ... JFK SFO 354 2586
270376 7 22 845 1600 ... JFK CVG 96 589
... ... ... ... ... ... ... ... ... ...
336771 9 30 <NA> 1455 ... JFK DCA <NA> 213
336772 9 30 <NA> 2200 ... LGA SYR <NA> 198
336773 9 30 <NA> 1210 ... LGA BNA <NA> 764
336774 9 30 <NA> 1159 ... LGA CLE <NA> 419
336775 9 30 <NA> 840 ... LGA RDU <NA> 431
[336776 rows x 15 columns]
Missing values will always be sorted at the end (for both ascending and descending order).
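If we preferred missing values at the top instead, sort_values() also accepts an na_position argument; a brief sketch:

flights.sort_values("dep_delay", na_position="first")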
Unfortunately, the dep_delay
column is not visible in the output, but we can select (and therefore show) this column with indexing:
"dep_delay", ascending=False)["dep_delay"] flights.sort_values(
7072 1301
235778 1137
8239 1126
327043 1014
270376 1005
...
336771 <NA>
336772 <NA>
336773 <NA>
336774 <NA>
336775 <NA>
Name: dep_delay, Length: 336776, dtype: Int64
1301 minutes is almost 22 hours! That’s a long delay for sure!
Selecting variables
Next, we will learn how to select specific columns with the filter()
method. Let’s start with a basic example, where we pick two columns passed as a list of column labels:
filter(["month", "day"]) flights.
month day
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
... ... ...
336771 9 30
336772 9 30
336773 9 30
336774 9 30
336775 9 30
[336776 rows x 2 columns]
We can also pick columns with regular expressions (using the regex
argument). For example, we can pick all columns containing delay
as follows:
filter(regex="delay") flights.
dep_delay arr_delay
0 2 11
1 4 20
2 2 33
3 -1 -18
4 -6 -25
... ... ...
336771 <NA> <NA>
336772 <NA> <NA>
336773 <NA> <NA>
336774 <NA> <NA>
336775 <NA> <NA>
[336776 rows x 2 columns]
Or what about columns that contain the string arr_
?
filter(regex="arr_") flights.
arr_time sched_arr_time arr_delay
0 830 819 11
1 850 830 20
2 923 850 33
3 1004 1022 -18
4 812 837 -25
... ... ... ...
336771 <NA> 1634 <NA>
336772 <NA> 2312 <NA>
336773 <NA> 1330 <NA>
336774 <NA> 1344 <NA>
336775 <NA> 1020 <NA>
[336776 rows x 3 columns]
The next example selects all columns starting with the letter “a”:
filter(regex="^a") flights.
arr_time arr_delay air_time
0 830 11 227
1 850 20 227
2 923 33 160
3 1004 -18 183
4 812 -25 116
... ... ... ...
336771 <NA> <NA> <NA>
336772 <NA> <NA> <NA>
336773 <NA> <NA> <NA>
336774 <NA> <NA> <NA>
336775 <NA> <NA> <NA>
[336776 rows x 3 columns]
Finally, we pick all columns ending with the word “time”:
filter(regex="time$") flights.
dep_time sched_dep_time arr_time sched_arr_time air_time
0 517 515 830 819 227
1 533 529 850 830 227
2 542 540 923 850 160
3 544 545 1004 1022 183
4 554 600 812 837 116
... ... ... ... ... ...
336771 <NA> 1455 <NA> 1634 <NA>
336772 <NA> 2200 <NA> 2312 <NA>
336773 <NA> 1210 <NA> 1330 <NA>
336774 <NA> 1159 <NA> 1344 <NA>
336775 <NA> 840 <NA> 1020 <NA>
[336776 rows x 5 columns]
Regular expressions are a powerful way to describe text patterns. We do not have time to go into more details, but it pays off to learn at least the basic concepts (for example with this tutorial).
Creating new variables
First, let’s create a smaller data frame with only a few columns. This will make it easier to see the new columns that we will add in this section.
= flights.filter(["dep_delay", "arr_delay", "distance", "air_time"]) df
Now imagine we wanted to add a new column containing the gained time of each flight, which can be computed as the difference between arrival delay and departure delay. If the gain is negative, the airplane made up at least some of its departure delay, and therefore gained time.
Pandas offers the assign()
method to compute new columns. Let’s add the previously discussed gain
column. In addition, let’s also compute the average speed of each flight (in miles per hour) using the existing air_time
and distance
columns:
df.assign(
    gain=df["arr_delay"] - df["dep_delay"],
    hours=df["air_time"] / 60,
    speed=lambda x: x["distance"] / x["hours"]
)
dep_delay arr_delay distance air_time gain hours speed
0 2 11 1400 227 9 3.783333 370.044053
1 4 20 1416 227 16 3.783333 374.273128
2 2 33 1089 160 31 2.666667 408.375
3 -1 -18 1576 183 -17 3.05 516.721311
4 -6 -25 762 116 -19 1.933333 394.137931
... ... ... ... ... ... ... ...
336771 <NA> <NA> 213 <NA> <NA> <NA> <NA>
336772 <NA> <NA> 198 <NA> <NA> <NA> <NA>
336773 <NA> <NA> 764 <NA> <NA> <NA> <NA>
336774 <NA> <NA> 419 <NA> <NA> <NA> <NA>
336775 <NA> <NA> 431 <NA> <NA> <NA> <NA>
[336776 rows x 7 columns]
Notice that we can use columns that we just added to create another new variable like speed
, where we use the hours
column from the previous line (but in the same method call). We have to use a lambda function though, because df["hours"]
does not exist in the original data frame.
Summarizing data
Summarizing data can yield valuable insights, especially when performed on grouped data. In most cases, our goal is to summarize or aggregate values from a single column. Each column is represented by a pd.Series
, and we can use the agg()
method to summarize its values with a given aggregation function. An aggregation function can be any function which returns exactly one value (a scalar) such as the mean, median, standard deviation, and so on.
Here’s an example which summarizes the dep_delay
column by its mean:
"dep_delay"].agg("mean") flights[
np.float64(12.639070257304708)
We could also select the column with the filter()
method we saw before:
filter(["dep_delay"]).agg("mean") flights.
dep_delay 12.63907
dtype: Float64
Missing values are automatically removed before agg()
calculates the summary statistic.
The first example flights["dep_delay"].agg("mean")
returns a scalar, whereas the second example flights.filter(["dep_delay"]).agg("mean")
returns the value inside a pd.Series
. The reason for this difference is that the filter()
method always returns a data frame, even if it consists of only one column. Aggregating columns of a data frame results in a series (of aggregated values). Indexing with square bracket notation returns a pd.Series
if we are asking for only one column. Aggregating a series results in a value.
We can even compute several summary statistics simultaneously by passing a list of functions:
filter(["dep_delay"]).agg(["mean", "std", "min", "max"]) flights.
dep_delay
mean 12.639070
std 40.210061
min -43.000000
max 1301.000000
The most important aggregation functions are available as strings (as in the previous example), but in general we can pass an arbitrary function which takes a sequence-like object and returns a single value. For example, we could write the previous mean aggregation as:
"dep_delay"].agg(lambda x: sum(x.dropna()) / len(x.dropna())) flights[
np.float64(12.639070257304708)
For even moderately complex functions, lambda functions tend to get unreadable, so you should prefer a regular (named) function instead.
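For example, the lambda function above could be replaced by a small named function that computes the same mean after dropping missing values (the function name is arbitrary):

def mean_without_missing(x):
    x = x.dropna()  # remove missing values first
    return sum(x) / len(x)

flights["dep_delay"].agg(mean_without_missing)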
Things get more interesting when we group the data first and aggregate within each group individually. The groupby()
data frame method splits the data into groups based on one or more columns. For example, we can group by month and day to calculate the mean departure delay on each single day:
"month", "day"])["dep_delay"].agg("mean") flights.groupby([
month day
1 1 11.548926
2 13.858824
3 10.987832
4 8.951595
5 5.732218
...
12 27 10.93763
28 7.98155
29 22.309551
30 10.698113
31 6.996053
Name: dep_delay, Length: 365, dtype: Float64
Similarly, here’s the average departure delay per month:
"month")["dep_delay"].agg("mean") flights.groupby(
month
1 10.036665
2 10.816843
3 13.227076
4 13.938038
5 12.986859
6 20.846332
7 21.727787
8 12.61104
9 6.722476
10 6.243988
11 5.435362
12 16.576688
Name: dep_delay, dtype: Float64
Notice how we select the dep_delay
column of the grouped data frame with square bracket indexing. Another option is to specify the columns that should be aggregated in the argument of the agg()
method as follows:
"month").agg({"dep_delay": "mean"}) flights.groupby(
dep_delay
month
1 10.036665
2 10.816843
3 13.227076
4 13.938038
5 12.986859
6 20.846332
7 21.727787
8 12.61104
9 6.722476
10 6.243988
11 5.435362
12 16.576688
Do you see the difference in the outputs of the previous two examples? The first output is a pd.Series
, whereas the second output is a pd.DataFrame
with one column.
The count()
method counts the number of non-missing values in a column:
"month").agg({"dep_delay": "count"}) flights.groupby(
dep_delay
month
1 26483
2 23690
3 27973
4 27662
5 28233
6 27234
7 28485
8 28841
9 27122
10 28653
11 27035
12 27110
In contrast, the number of missing values in that column can be obtained with:
"month")["dep_delay"].agg(lambda x: x.isna().sum()) flights.groupby(
month
1 521
2 1261
3 861
4 668
5 563
6 1009
7 940
8 486
9 452
10 236
11 233
12 1025
Name: dep_delay, dtype: Int64
The value_counts()
method determines the number of occurrences for each unique value in a column. For example, an individual airplane is identified by its tail number. If we want to know how many unique airplanes there are in the data set, and how many flights each individual airplane had, we can use the following command:
"tailnum"].value_counts() flights[
tailnum
N725MQ 575
N722MQ 513
N723MQ 507
N711MQ 486
N713MQ 483
...
N7ASAA 1
N932DN 1
N318AS 1
N651UA 1
N557AS 1
Name: count, Length: 4043, dtype: Int64
This output shows that there are 4043 unique airplanes, each of which is listed with its corresponding number of flights. The ten airplanes with the most flights are:
"tailnum"].value_counts().head(10) flights[
tailnum
N725MQ 575
N722MQ 513
N723MQ 507
N711MQ 486
N713MQ 483
N258JB 427
N298JB 407
N353JB 404
N351JB 402
N735MQ 396
Name: count, dtype: Int64
How many airplanes have just one flight?
"tailnum"].value_counts() == 1).sum() (flights[
np.int64(171)
Here, we exploited the fact that summing a boolean series counts True
as 1 and False
as 0.
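As a minimal illustration of this behavior:

pd.Series([True, False, True]).sum()  # counts the True values, so the result is 2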
Pipelines
Almost all interesting data wrangling tasks require a combination of the five previously mentioned basic operations. You might have already noticed this in some examples, where we first picked a subset of columns and then performed some operations (like adding new variables). Usually, this requires multiple steps involving temporary names, for example:
= flights.filter(["dep_delay", "arr_delay", "distance", "air_time"])
df1 = df1.assign(
df2 =df1["arr_delay"] - df1["dep_delay"],
gain=df1["air_time"] / 60,
hours=lambda x: x["distance"] / x["hours"]
speed
)"speed") df2.sort_values(
dep_delay arr_delay distance air_time gain hours speed
24098 52 103 96 75 51 1.25 76.8
248839 -5 86 199 141 91 2.35 84.680851
305879 -8 25 94 61 33 1.016667 92.459016
25455 42 81 94 59 39 0.983333 95.59322
80009 -4 29 96 60 33 1.0 96.0
... ... ... ... ... ... ... ...
336771 <NA> <NA> 213 <NA> <NA> <NA> <NA>
336772 <NA> <NA> 198 <NA> <NA> <NA> <NA>
336773 <NA> <NA> 764 <NA> <NA> <NA> <NA>
336774 <NA> <NA> 419 <NA> <NA> <NA> <NA>
336775 <NA> <NA> 431 <NA> <NA> <NA> <NA>
[336776 rows x 7 columns]
The problem here is that we do not really need the names df1
and df2
other than to perform the desired operations sequentially. Instead, we can use a so-called pipeline to chain two or more operations without the need to create temporary names:
(flights
    .filter(["dep_delay", "arr_delay", "distance", "air_time"])
    .assign(
        gain=flights["arr_delay"] - flights["dep_delay"],
        hours=flights["air_time"] / 60,
        speed=lambda x: x["distance"] / x["hours"]
    )
    .sort_values("speed", ascending=False)
)
dep_delay arr_delay distance air_time gain hours speed
216447 9 -14 762 65 -23 1.083333 703.384615
251999 45 26 1008 93 -19 1.55 650.322581
205388 15 -1 594 55 -16 0.916667 648.0
157516 4 2 748 70 -2 1.166667 641.142857
10223 -1 -28 1035 105 -27 1.75 591.428571
... ... ... ... ... ... ... ...
336771 <NA> <NA> 213 <NA> <NA> <NA> <NA>
336772 <NA> <NA> 198 <NA> <NA> <NA> <NA>
336773 <NA> <NA> 764 <NA> <NA> <NA> <NA>
336774 <NA> <NA> 419 <NA> <NA> <NA> <NA>
336775 <NA> <NA> 431 <NA> <NA> <NA> <NA>
[336776 rows x 7 columns]
Breaking up a pipeline into multiple rows (as shown in the previous example) creates a nicely formatted representation of the sequence of steps. However, we need to enclose the entire pipeline in parentheses.
Tidy data
There are many ways to organize data into rows and columns, but the tidy data format has some compelling advantages. A table is tidy when
- each variable corresponds to exactly one column,
- each observation corresponds to exactly one row,
- and each value corresponds to exactly one cell.
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham
Consider the following table (taken from the tidyr
R package):
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
This table is tidy because it meets all three criteria. It might be difficult to see why this is the case, so let’s look at other tables (containing the same data) that are not tidy. The following table combines two variables into one column, which makes the table twice as long:
| country | year | type | count |
|---|---|---|---|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |
For the same reason, the next table is also not tidy. It contains a combination of two variables in one column. In addition, this column is not numeric, which makes it difficult to work with the actual data.
| country | year | rate |
|---|---|---|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |
In pandas, working with tidy data facilitates many operations, so whenever possible this should be the preferred table format. However, depending on the task, this might not always be possible or practical, so it is completely fine if another data organization works better.
Pandas makes it pretty straightforward to convert any given table to a tidy format (or vice versa) using the pivot()
method. Let’s start with the data from the second (messy) table above and try to make it tidy:
messy1 = pd.read_csv(StringIO("""country,year,type,count
Afghanistan,1999,cases,745
Afghanistan,1999,population,19987071
Afghanistan,2000,cases,2666
Afghanistan,2000,population,20595360
Brazil,1999,cases,37737
Brazil,1999,population,172006362
Brazil,2000,cases,80488
Brazil,2000,population,174504898
China,1999,cases,212258
China,1999,population,1272915272
China,2000,cases,213766
China,2000,population,1280428583"""))
messy1
country year type count
0 Afghanistan 1999 cases 745
1 Afghanistan 1999 population 19987071
2 Afghanistan 2000 cases 2666
3 Afghanistan 2000 population 20595360
4 Brazil 1999 cases 37737
5 Brazil 1999 population 172006362
6 Brazil 2000 cases 80488
7 Brazil 2000 population 174504898
8 China 1999 cases 212258
9 China 1999 population 1272915272
10 China 2000 cases 213766
11 China 2000 population 1280428583
The index
argument specifies which columns should be moved to new rows (the index), whereas the columns
argument specifies which columns should be split up into new columns. Finally, the (optional) values
argument lists the columns containing the values of the new data frame:
=["country", "year"], columns="type", values="count") messy1.pivot(index
type cases population
country year
Afghanistan 1999 745 19987071
2000 2666 20595360
Brazil 1999 37737 172006362
2000 80488 174504898
China 1999 212258 1272915272
2000 213766 1280428583
This result is a data frame consisting of two columns cases
and population
and an index consisting of country
and year
(a so-called hierarchical index or multi-index). We can convert a multi-index into separate columns using the reset_index()
method (which resets the index to the default integer series starting at zero):
(messy1=["country", "year"], columns="type", values="count")
.pivot(index
.reset_index() )
type country year cases population
0 Afghanistan 1999 745 19987071
1 Afghanistan 2000 2666 20595360
2 Brazil 1999 37737 172006362
3 Brazil 2000 80488 174504898
4 China 1999 212258 1272915272
5 China 2000 213766 1280428583
The type
in the output is the name of the columns (which is usually empty). If this bothers you, the column name attribute can be set to None
(but we cannot do this inside a pipeline):
tidy1 = (messy1
    .pivot(index=["country", "year"], columns="type", values="count")
    .reset_index()
)
tidy1.columns.name = None
tidy1
country year cases population
0 Afghanistan 1999 745 19987071
1 Afghanistan 2000 2666 20595360
2 Brazil 1999 37737 172006362
3 Brazil 2000 80488 174504898
4 China 1999 212258 1272915272
5 China 2000 213766 1280428583
Let’s try to tidy the data from the third table:
messy2 = pd.read_csv(StringIO("""country,year,rate
Afghanistan,1999,745/19987071
Afghanistan,2000,2666/20595360
Brazil,1999,37737/172006362
Brazil,2000,80488/174504898
China,1999,212258/1272915272
China,2000,213766/1280428583"""))
messy2
country year rate
0 Afghanistan 1999 745/19987071
1 Afghanistan 2000 2666/20595360
2 Brazil 1999 37737/172006362
3 Brazil 2000 80488/174504898
4 China 1999 212258/1272915272
5 China 2000 213766/1280428583
The problem here is that two variables are combined into one column rate
, which is of type object
:
messy2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 6 non-null object
1 year 6 non-null int64
2 rate 6 non-null object
dtypes: int64(1), object(2)
memory usage: 276.0+ bytes
String columns offer a lot of useful methods such as slicing or splitting using the str
method namespace. In our example, we would like to split the column on the /
character, resulting in two new columns:
"rate"].str.split("/", expand=True) messy2[
0 1
0 745 19987071
1 2666 20595360
2 37737 172006362
3 80488 174504898
4 212258 1272915272
5 213766 1280428583
Note that this operation creates a new data frame because we set expand=True
(by default, expand=False
creates a series). We can add this resulting data frame to our tidy result:
= messy2.loc[:, ["country", "year"]]
tidy2 "count", "population"]] = messy2["rate"].str.split("/", expand=True) tidy2[[
Now we are almost done: the data frame is tidy, but the column types of count
and population
are still not numeric (they are object
just like the original columns). We need to explicitly convert them, for example with pd.to_numeric()
:
"count"] = pd.to_numeric(tidy2["count"])
tidy2["population"] = pd.to_numeric(tidy2["population"])
tidy2[ tidy2
country year count population
0 Afghanistan 1999 745 19987071
1 Afghanistan 2000 2666 20595360
2 Brazil 1999 37737 172006362
3 Brazil 2000 80488 174504898
4 China 1999 212258 1272915272
5 China 2000 213766 1280428583
COVID-19 example
Let’s apply some of the things we learned in this chapter to a real-world data set: historic COVID-19 cases in Austria. The data is not available on any of the official government websites anymore, but we can use a snapshot from June 28 2023 (renamed to covid19.csv
) and put it in our working directory.
Using pd.read_csv()
, we can then import this file as follows:
covid19 = pd.read_csv(
    "covid19.csv",
    sep=";",
    decimal=",",
    parse_dates=["Time"],
    dayfirst=True
)
The file uses semicolons (;
) to separate columns and commas (,
) for decimal numbers, which we need to specify in order to correctly import the data. Passing parse_dates=["Time"]
together with dayfirst=True
makes sure that the Time
column has the suitable datetime data type. If you are wondering where this information is available – the only way to find out for sure is to open the file in a text editor and inspect the values!
If we do not specify decimal=","
, pandas expects a dot (.
) as the decimal separator. If we then import a column with a different decimal separator, it will be interpreted as type object
. Therefore, it is important to carefully check all column types after importing (and verify that numeric columns have numeric types).
There are 12170 rows and 12 columns with no missing values, and all column data types have been set or inferred correctly (the first column is a datetime, and all remaining columns are numeric except for Bundesland
, which contains the names of the Austrian provinces):
covid19.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12170 entries, 0 to 12169
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 12170 non-null datetime64[ns]
1 Bundesland 12170 non-null object
2 BundeslandID 12170 non-null int64
3 AnzEinwohner 12170 non-null int64
4 AnzahlFaelle 12170 non-null int64
5 AnzahlFaelleSum 12170 non-null int64
6 AnzahlFaelle7Tage 12170 non-null int64
7 SiebenTageInzidenzFaelle 12170 non-null float64
8 AnzahlTotTaeglich 12170 non-null int64
9 AnzahlTotSum 12170 non-null int64
10 AnzahlGeheiltTaeglich 12170 non-null int64
11 AnzahlGeheiltSum 12170 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(9), object(1)
memory usage: 1.1+ MB
There are many interesting questions we could try to answer with this data set. Let’s highlight a few simple ones that each introduce a new feature that we have not seen before.
First, because COVID-19 infections heavily depend on the seasons, we could be interested in average daily new cases per month. A datetime series exposes specialized methods in its dt
method namespace. Here, we use dt.month
to extract the month of the year (1–12) and add this information as a new column month
. Then we group by month
and compute the average in each group:
(covid19=covid19["Time"].dt.month)
.assign(month"month")
.groupby("AnzahlFaelle": "mean"})
.agg({ )
AnzahlFaelle
month
1 1589.455914
2 2287.820455
3 2211.537097
4 658.375000
5 241.375806
6 337.039655
7 710.111828
8 440.939785
9 600.555556
10 990.653763
11 1398.960000
12 705.918280
Similarly, we know that new cases also depend on the day of the week, because fewer tests are administered on the weekend. Therefore, we can compute average cases aggregated per weekday (0–6 correspond to Monday–Sunday) as follows:
(covid19=covid19["Time"].dt.weekday)
.assign(weekday"weekday")
.groupby("AnzahlFaelle": "mean"})
.agg({ )
AnzahlFaelle
weekday
0 945.517241
1 1292.931792
2 1122.196552
3 1072.447126
4 999.032184
5 838.965517
6 725.971264
Let’s combine these two separate analyses and calculate the average number of cases on each weekday per month by grouping by both columns:
(covid19=covid19["Time"].dt.month)
.assign(month=covid19["Time"].dt.weekday)
.assign(weekday"month", "weekday"])
.groupby(["AnzahlFaelle": "mean"})
.agg({ )
AnzahlFaelle
month weekday
1 0 1619.042857
1 1835.153846
2 1733.100000
3 1839.933333
4 1556.461538
... ...
12 2 809.985714
3 735.520000
4 738.914286
5 614.707692
6 494.066667
[84 rows x 1 columns]
Finally, we can also create plots directly from data frames and/or single columns. That way, we can recreate the famous time course of daily COVID-19 infections over the whole time period. We will plot the daily values for Austria with time on the x-axis and new cases on the y-axis:
(covid19"Bundesland == 'Österreich'")
.query("Time")
.set_index("AnzahlFaelle"]
[="Daily new COVID-19 cases in Austria")
.plot(title )
Note that when we plot a single column, the index will be displayed on the x-axis. That’s why we explicitly set the index to the Time
column (which means that this column gets “promoted” to the index). After that, we pick the column of interest and call its plot()
method to generate the visualization.
Footnotes
1. NumPy supports arrays with different data types through structured arrays, but pandas data frames are usually much easier to work with in a typical data analysis pipeline.
2. Note that pandas is spelled with all lower-case letters. We will capitalize the first letter only at the beginning of a sentence.