Pandas

Data wrangling essentials

Author: Clemens Brunner

Published: June 30, 2025

Introduction

Why do we need another package on top of NumPy? Could we not just use NumPy arrays for our data analysis tasks? Unfortunately, in many cases this is not going to work. NumPy arrays are homogeneous, which means that all array items must have the same type. However, real-world tabular data generally consists of variables (columns) that have different types. For example, whereas some columns might contain integers, others might hold floating point numbers, text (strings), dates, or categorical data. Such a diverse collection of variables cannot be represented by a NumPy array.

Pandas offers a flexible data type for tabular data that builds upon NumPy arrays but can accommodate heterogeneous column types: the so-called data frame.

By convention, we import pandas as follows:

import pandas as pd

Before we discuss data frames, we will introduce series as essential building blocks.

Series

A pandas series (pd.Series) is a sequence-like homogeneous object, similar to a one-dimensional NumPy array. In fact, series are based on NumPy arrays under the hood, so they share important properties. For example, just like a NumPy array, a series has a specific data type (which means that all elements have the same type).

We can construct a series from any iterable, such as a standard list. Here’s an example:

pd.Series([1, 2, 3, 4, 5])
0    1
1    2
2    3
3    4
4    5
dtype: int64

The output shows two columns: the values are in the right column, whereas the left column shows the index (which by default starts with zero). The bottom row displays the data type (int64 in this case).

We can enforce a specific data type with the dtype argument:

pd.Series([1, 2, 3, 4, 5], dtype="float64")
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

Pandas data types are based on NumPy array scalars. The most common sized numerical types are:

  • Integers: int8, int16, int32, and int64
  • Unsigned integers: uint8, uint16, uint32, and uint64
  • Floating point numbers: float16, float32, float64, and float128
  • Complex numbers: complex64, complex128, and complex256

The shorthand notations int, uint, float, and complex map to int64, uint64, float64, and complex128 on most machines, respectively. The numbers indicate the bits used to store one element of a particular type (for example, an int16 value requires 16 bits, or 2 bytes, of memory). Type specifications should be passed as a string when used as the dtype argument:

pd.Series([1, 2, 3], dtype="int")
0    1
1    2
2    3
dtype: int64
pd.Series([1, 2, 3], dtype="uint8")
0    1
1    2
2    3
dtype: uint8
pd.Series([1, 2, 3], dtype="float32")
0    1.0
1    2.0
2    3.0
dtype: float32
pd.Series([1, 2, 3], dtype="complex128")
0    1.0+0.0j
1    2.0+0.0j
2    3.0+0.0j
dtype: complex128

Strings are stored as object series by default (object is the base class in Python that all other data types inherit from – so everything is an object, including strings):

pd.Series(["a", "b", "c"])
0    a
1    b
2    c
dtype: object

Finally, boolean series are also supported:

pd.Series([True, True, False])
0     True
1     True
2    False
dtype: bool

Besides these basic data types, pandas also features so-called extension data types. Currently, the most important extension data types are:

  • category for categorical data
  • Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, and UInt64 for integer types with support for missing values (note the upper case initial letters)
  • string for string data
  • boolean for boolean data with support for missing values (note that this is different from the basic bool data type)

pd.Series(["a", "b", "c"], dtype="category")
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
pd.Series([1, 2, 3], dtype="Int64")
0    1
1    2
2    3
dtype: Int64
pd.Series(["a", "b", "c"], dtype="string")
0    a
1    b
2    c
dtype: string
pd.Series([True, True, False], dtype="boolean")
0     True
1     True
2    False
dtype: boolean

In general, it is recommended to use these extension data types, because they perform as well as the core types but offer better support for missing values and additional functionality.
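
To illustrate the difference, a plain integer series cannot store a missing value (it is silently converted to float64 with NaN), whereas the Int64 extension type keeps integers and uses the native missing value (more on missing data below):

pd.Series([1, None, 3])
0    1.0
1    NaN
2    3.0
dtype: float64
pd.Series([1, None, 3], dtype="Int64")
0       1
1    <NA>
2       3
dtype: Int64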

Data frames

Most real-world data can be represented as a table consisting of rows and columns. Pandas represents tabular data with a data frame, which is available as the pd.DataFrame type. It is basically a collection of columns, which correspond to individual pd.Series objects.

Creating data frames

From arrays

Let’s start with a simple example, which creates a data frame from a two-dimensional NumPy array:

import numpy as np

data = np.arange(1, 21).reshape((4, 5))
pd.DataFrame(data)
    0   1   2   3   4
0   1   2   3   4   5
1   6   7   8   9  10
2  11  12  13  14  15
3  16  17  18  19  20

The resulting data frame does not look very different from a NumPy array, except for row and column labels. However, keep in mind that each column could hold values with a different data type (because each column is actually a pd.Series object).

We can supply custom column labels when creating a data frame:

pd.DataFrame(data, columns=["A", "B", "C", "D", "E"])
    A   B   C   D   E
0   1   2   3   4   5
1   6   7   8   9  10
2  11  12  13  14  15
3  16  17  18  19  20

Similarly, we could also set custom row labels with the index argument. However, this is rarely needed since consecutively numbered rows starting at zero are fine for most use cases.
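
For completeness, here is a quick sketch with made-up row labels passed via the index argument:

pd.DataFrame(data, index=["a", "b", "c", "d"], columns=["A", "B", "C", "D", "E"])
    A   B   C   D   E
a   1   2   3   4   5
b   6   7   8   9  10
c  11  12  13  14  15
d  16  17  18  19  20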

Column-wise

Another way to generate a data frame is to pass a dictionary, where each key corresponds to a column label and the corresponding value is a sequence-like object containing the column elements:

pd.DataFrame({"A": [1., 2, 3, 4], "B": ["w", "x", "y", "z"], "C": np.arange(4)})
     A  B  C
0  1.0  w  0
1  2.0  x  1
2  3.0  y  2
3  4.0  z  3

In this example, each column has a different data type. The first column contains floating point numbers, the second column contains strings, and the third column holds integers.

Note

The first element in the list [1., 2, 3, 4] is a float. Since all elements within a column must have the same type, all other values in that column are coerced to float as well.

Row-wise

Sometimes, we might want to create a data frame row by row. We can use pd.DataFrame.from_dict() in combination with orient="index" for this purpose:

pd.DataFrame.from_dict({1: [1, "a", 3, 5.], 2: [2, "b", 4, 6.]}, orient="index")
   0  1  2    3
1  1  a  3  5.0
2  2  b  4  6.0

The dictionary keys are used for the row labels this time.
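
If we also want custom column labels, pd.DataFrame.from_dict() accepts a columns argument when orient="index" is set (a small sketch with made-up labels):

pd.DataFrame.from_dict(
    {1: [1, "a", 3, 5.], 2: [2, "b", 4, 6.]},
    orient="index",
    columns=["A", "B", "C", "D"]
)
   A  B  C    D
1  1  a  3  5.0
2  2  b  4  6.0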

Viewing data frames

Every data frame has useful attributes and methods that we can use to inspect it in more detail. Let’s start with one of the example data frames we saw previously:

df = pd.DataFrame({"A": [1., 2, 3, 4], "B": ["w", "x", "y", "z"], "C": np.arange(4)})
df
     A  B  C
0  1.0  w  0
1  2.0  x  1
2  3.0  y  2
3  4.0  z  3

We can access its row and column names with the index and columns attributes, respectively:

df.index
RangeIndex(start=0, stop=4, step=1)
df.columns
Index(['A', 'B', 'C'], dtype='object')

One of the most useful methods is info(), which displays essential information such as the data type and number of non-missing values for each column:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      float64
 1   B       4 non-null      object 
 2   C       4 non-null      int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 228.0+ bytes

In this example, we see that df consists of four rows and three columns. For each column, the label, number of non-missing (non-null) values, and data type are listed. Column A contains float64 values (double precision floating point numbers), column B contains strings (available as object), and column C contains int64 integers.
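
The number of rows and columns is also directly available via the shape attribute:

df.shape
(4, 3)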

Obviously, typing df in an interactive interpreter prints the data frame:

df
     A  B  C
0  1.0  w  0
1  2.0  x  1
2  3.0  y  2
3  4.0  z  3

When the data frame has many rows and/or columns, the output only shows the first and last few rows and columns:

import string  # needed to create a list of the 26 letters

df = pd.DataFrame(
    np.random.randint(-100, 100, (1000, 26)),
    columns=list(string.ascii_uppercase)  # custom column labels
)
df
      A   B   C   D   E   F   G   H   I  ...   R   S   T   U   V   W    X   Y   Z
0    65 -42 -13  52 -94 -11  22  92  20  ...   7 -43  27  58   2 -72  -11  52  54
1   -52  63 -38  78  -3  84  51  50 -80  ...  96 -93 -16  13  18  54 -100  75 -90
2   -58 -30  14 -94 -93  11  -7   8  18  ...  -5  82 -15  65 -76 -26    3 -65  67
3   -80  43  79  46 -86 -81 -57 -53 -39  ...  54  47  63  68  12  22  -83  79  83
4   -44 -59 -56 -40  41  44  96 -62 -65  ... -63  74 -72 -33 -72  97  -30 -99 -53
..   ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ..  ..  ..  ..  ..  ..  ...  ..  ..
995 -63  59  90  33  35  66 -11  13 -60  ...  38 -89  11 -81 -28 -72  -99  92  56
996  28 -47  35 -47 -54 -84 -44  79  80  ...  69  72 -42 -83  82 -16  -97  96   7
997 -58  77  18  13  -3 -95   4   4  -7  ...  44 -83 -74   8  66 -14  -96  54  -7
998   2  96 -40 -95  47   1 -91  92  10  ...   0  29  99  88 -67  14  -37 -38  83
999  89  26  37  39   0 -86  64 -11  34  ...  28  53 -16   2  55 -86  -53  69  64

[1000 rows x 26 columns]

We can change this behavior by setting the following two options to appropriate values (None means unlimited):

pd.set_option("display.max_rows", None)  # print all rows
pd.set_option("display.max_columns", None)  # print all columns

This changes the display options globally, so from now on all data frames will use these display settings. However, we often want to change these settings only for a particular data frame. We can use a context manager for this purpose:

with pd.option_context("display.max_rows", 4, "display.max_columns", 4):
    df
      A   B  ...   Y   Z
0    65 -42  ...  52  54
1   -52  63  ...  75 -90
..   ..  ..  ...  ..  ..
998   2  96  ... -38  83
999  89  26  ...  69  64

[1000 rows x 26 columns]

The modified display options are only valid within the context manager (the indented block of code).
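
To restore the default display settings at any point, we can reset the options with pd.reset_option():

pd.reset_option("display.max_rows")
pd.reset_option("display.max_columns")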

Another way to inspect the first or last rows of a data frame is available with the head() and tail() methods, respectively:

df.head()  # first 5 rows
    A   B   C   D   E   F   G   H   I  ...   R   S   T   U   V   W    X   Y   Z
0  65 -42 -13  52 -94 -11  22  92  20  ...   7 -43  27  58   2 -72  -11  52  54
1 -52  63 -38  78  -3  84  51  50 -80  ...  96 -93 -16  13  18  54 -100  75 -90
2 -58 -30  14 -94 -93  11  -7   8  18  ...  -5  82 -15  65 -76 -26    3 -65  67
3 -80  43  79  46 -86 -81 -57 -53 -39  ...  54  47  63  68  12  22  -83  79  83
4 -44 -59 -56 -40  41  44  96 -62 -65  ... -63  74 -72 -33 -72  97  -30 -99 -53

[5 rows x 26 columns]
df.tail()  # last 5 rows
      A   B   C   D   E   F   G   H   I  ...   R   S   T   U   V   W   X   Y   Z
995 -63  59  90  33  35  66 -11  13 -60  ...  38 -89  11 -81 -28 -72 -99  92  56
996  28 -47  35 -47 -54 -84 -44  79  80  ...  69  72 -42 -83  82 -16 -97  96   7
997 -58  77  18  13  -3 -95   4   4  -7  ...  44 -83 -74   8  66 -14 -96  54  -7
998   2  96 -40 -95  47   1 -91  92  10  ...   0  29  99  88 -67  14 -37 -38  83
999  89  26  37  39   0 -86  64 -11  34  ...  28  53 -16   2  55 -86 -53  69  64

[5 rows x 26 columns]
df.head(10)  # first 10 rows
    A   B   C   D   E   F   G   H   I  ...   R   S   T   U   V   W    X   Y   Z
0  65 -42 -13  52 -94 -11  22  92  20  ...   7 -43  27  58   2 -72  -11  52  54
1 -52  63 -38  78  -3  84  51  50 -80  ...  96 -93 -16  13  18  54 -100  75 -90
2 -58 -30  14 -94 -93  11  -7   8  18  ...  -5  82 -15  65 -76 -26    3 -65  67
3 -80  43  79  46 -86 -81 -57 -53 -39  ...  54  47  63  68  12  22  -83  79  83
4 -44 -59 -56 -40  41  44  96 -62 -65  ... -63  74 -72 -33 -72  97  -30 -99 -53
5  83 -20  25  50 -18 -98 -58 -50  10  ... -48  41  91   1  24  28   98  59 -35
6  67 -24  -7  17 -33 -66  97 -69  92  ...  80 -58  -2   3 -54  70    9 -58 -72
7  44 -33 -33  18 -60 -84  86  75 -19  ... -44  58  59  35 -60 -73   42  89 -92
8 -78  88 -75 -79 -56  82   6 -57  82  ... -91 -10 -82 -75 -14  65  -49 -56   4
9 -54   1  98 -21  55  71  95 -78 -84  ... -79 -92  36 -43  65 -72  -28  55  31

[10 rows x 26 columns]

Finally, we can generate basic summary statistics for each column, which is especially useful for numeric columns:

df.describe()
                 A            B  ...            Y            Z
count  1000.000000  1000.000000  ...  1000.000000  1000.000000
mean     -2.331000     0.919000  ...     0.422000    -1.573000
std      59.563624    58.540904  ...    58.125575    57.518754
min    -100.000000  -100.000000  ...  -100.000000  -100.000000
25%     -54.250000   -47.250000  ...   -49.000000   -51.000000
50%      -5.500000    -1.000000  ...     3.000000    -3.500000
75%      50.000000    54.000000  ...    52.000000    48.000000
max      99.000000    99.000000  ...    99.000000    99.000000

[8 rows x 26 columns]

We can round the values in this table to fit more columns on the screen:

df.describe().round(2)  # rounded to two decimal places
             A        B        C        D  ...        W        X        Y        Z
count  1000.00  1000.00  1000.00  1000.00  ...  1000.00  1000.00  1000.00  1000.00
mean     -2.33     0.92     0.90    -2.12  ...    -3.07    -3.25     0.42    -1.57
std      59.56    58.54    57.15    58.54  ...    59.13    57.61    58.13    57.52
min    -100.00  -100.00  -100.00  -100.00  ...  -100.00  -100.00  -100.00  -100.00
25%     -54.25   -47.25   -50.00   -53.00  ...   -57.00   -52.25   -49.00   -51.00
50%      -5.50    -1.00     2.00    -1.00  ...    -4.00    -6.00     3.00    -3.50
75%      50.00    54.00    50.00    49.00  ...    50.00    48.00    52.00    48.00
max      99.00    99.00    99.00    99.00  ...    99.00    99.00    99.00    99.00

[8 rows x 26 columns]

Missing data

Almost all real-world data contains missing values. Pandas encodes missing numeric values as NaN (“Not a Number”), a sentinel value defined in the IEEE 754 floating point standard. Therefore, this missing value only exists for floating point data, and integer series are automatically converted to float if they contain missing values. Datetime types have a special NaT (“Not a Time”) value to represent missing dates. Strings and boolean types can use None for missing values. In summary, missing data is represented by different sentinel values depending on the underlying data type.

Recently, pandas has started to unify missing data support with a native NA (“Not Available”) value (available as pd.NA). Future versions of pandas will likely support NA for all data types, but right now we have to live with the variety of different missing value representations (which of course are all handled correctly by pandas).

To illustrate these ideas, let’s create a data frame with some missing values:

df = pd.DataFrame({"A": [1, None, 3], "B": [1., 2., None], "C": [True, False, None]})
df
     A    B      C
0  1.0  1.0   True
1  NaN  2.0  False
2  3.0  NaN   None
df.dtypes
A    float64
B    float64
C     object
dtype: object

Notice that the first column is coerced to float64, because it contains a missing value (otherwise, it would be int64). In both numerical columns, missing values are represented as NaN, whereas the last column is interpreted as a string (object) column because of the None element.

These types are probably not what we expected, so it might be a good idea to use the extension data types discussed earlier. Most extension data types support the native NA value, which is displayed as <NA>:

df = pd.DataFrame(
    {
        "A": pd.Series([1, pd.NA, 2], dtype="Int64"),
        "B": pd.Series([True, False, pd.NA], dtype="boolean"),
        "C": pd.Series([pd.NA, "b", "c"], dtype="string"),
        "D": [0.0, 0.5, 1.5],
        "E": [10, 11, 12]
    },
    index=[2, 0, 1]
)
df
      A      B     C    D   E
2     2   <NA>     c  0.0  10
0     1   True  <NA>  0.5  11
1  <NA>  False     b  1.5  12
df.dtypes
A             Int64
B           boolean
C    string[python]
D           float64
E             int64
dtype: object
Tip

It is easy to check which values are missing in a data frame with the isna() method:

df.isna()
       A      B      C      D      E
2  False   True  False  False  False
0  False  False   True  False  False
1   True  False  False  False  False

This works for all kinds of missing value representations, including NaN, None, and NA.

Indexing

Basic indexing

Pandas data frames can be indexed to create subsets of the underlying data. Using square bracket notation, we can select single columns by label:

df["B"]
2     <NA>
0     True
1    False
Name: B, dtype: boolean

As we already know, this returns a pandas series.

Tip

Instead of using square bracket notation, we can also index columns with dot notation, for example:

df.B  # same as df["B"]
2     <NA>
0     True
1    False
Name: B, dtype: boolean

However, keep in mind that this only works for column labels that are also valid Python names. Square bracket notation always works.
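
For example, a (made-up) column label containing a space is not a valid Python name, so it can only be accessed with square brackets:

tmp = pd.DataFrame({"my col": [1, 2, 3]})
tmp["my col"]  # tmp.my col would be a syntax error
0    1
1    2
2    3
Name: my col, dtype: int64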

We can select multiple columns by specifying a list of column labels within the square brackets:

df[["A", "C"]]
      A     C
2     2     c
0     1  <NA>
1  <NA>     b

This returns a new data frame with only the specified columns.

The same notation can also be used for filtering rows. Instead of column labels, we have to provide a boolean list (or series), for example as generated by a comparison:

df[df["A"] > 1]
   A     B  C    D   E
2  2  <NA>  c  0.0  10
df[[True, True, False]]  # first two rows
   A     B     C    D   E
2  2  <NA>     c  0.0  10
0  1  True  <NA>  0.5  11

Indexing with loc and iloc

More sophisticated indexing is available via the loc and iloc accessors. Whereas loc uses row and column labels, iloc works with integer indices instead. Both variants support simultaneous row and column indexing (by passing a tuple with the desired row and column labels or indices, respectively).

Here are some examples showing loc in action (note that slices include the end point). First, we can grab a single column by its label:

df.loc[:, "B"]
2     <NA>
0     True
1    False
Name: B, dtype: boolean

The same works for getting a single row, in this case the row labeled 1 (note that this refers to the label 1 and not to position 1):

df.loc[1, :]
A     <NA>
B    False
C        b
D      1.5
E       12
Name: 1, dtype: object

We can even specify a range of columns (or rows). In contrast to almost everything else in Python, the end point of a slice in loc is always inclusive. This example gets columns B through D (inclusive):

df.loc[:, "B":"D"]
       B     C    D
2   <NA>     c  0.0
0   True  <NA>  0.5
1  False     b  1.5

Combining row and column slices is also possible (again, note that we are referring to index and column labels here):

df.loc[0:1, "B":"D"]
       B     C    D
0   True  <NA>  0.5
1  False     b  1.5

In contrast, iloc always refers to row and column positions instead of labels. Slices do not include the end point. Here are some examples:

df.iloc[:, 1]  # column 1
2     <NA>
0     True
1    False
Name: B, dtype: boolean
df.iloc[0, :]  # row 0
A       2
B    <NA>
C       c
D     0.0
E      10
Name: 2, dtype: object
df.iloc[:, 1:3]  # columns 1 and 2
       B     C
2   <NA>     c
0   True  <NA>
1  False     b
df.iloc[:1, 1:3]  # row 0 and columns 1 and 2
      B  C
2  <NA>  c

Importing data

Pandas supports importing data from a multitude of different formats. In this section, we will showcase how to import some common data formats. In almost all cases, data from a successfully imported file will be available as a data frame.

Text files

Text files are a popular option for storing small to medium-sized data sets. Because text files are simple, they can be opened with any text editor and almost any tool you might use for data analysis. However, importing text files might involve a lot of parameter tweaking, because there is no universal standard on how data should be formatted inside the file (for example, text files can use arbitrary column separators, header rows, comments, decimal number formats, and so on). Moreover, text files store data inefficiently, which makes them a bad choice for very large data sets.

The function pd.read_csv() is the main workhorse. It can import almost any kind of text data and features a large number of parameters that can be tuned to the idiosyncrasies of a given file. Let’s start with a simple text file named data.csv (located in the working directory) with the following contents:

name,id,height,weight,gender,bp
Bob,4,177,83,m,93.2
Sue,2,179,,f,98.8
Alice,1,166,65,f,87.1
Hank,3,187,90,m,91
Tom,5,182,77,m,101.7
Mara,6,171,69,f,88.6

This looks pretty straightforward. The first row is a header with the column labels. The six columns are separated by , (commas), and there are six rows of data (not including the header). Note that the weight value in the second row is missing.

Importing this file is as simple as:

df = pd.read_csv("data.csv")
df
    name  id  height  weight gender     bp
0    Bob   4     177    83.0      m   93.2
1    Sue   2     179     NaN      f   98.8
2  Alice   1     166    65.0      f   87.1
3   Hank   3     187    90.0      m   91.0
4    Tom   5     182    77.0      m  101.7
5   Mara   6     171    69.0      f   88.6

Let’s take a look at the inferred column data types:

df.dtypes
name       object
id          int64
height      int64
weight    float64
gender     object
bp        float64
dtype: object

Pandas uses the NumPy-based data types by default, so columns name and gender (which contain strings) are available as object columns. Columns id and height contain integers (int64), and column bp holds floating point numbers (float64). Although column weight originally contains integers, pandas converts it to float64 due to the presence of a missing value.

If we are not happy with one of the automatically inferred data types, we can always change individual columns after importing with the astype() method. For example, the gender column could be represented with a categorical data type. We could also use the Int64 data type for the weight column, which supports missing values. Finally, we could convert the id column to the string data type (IDs are labels rather than numbers, so arithmetic on them makes little sense).

Note

We do not have to perform any of these conversions, because the automatically inferred data types are fine in most cases (and in particular, weight is actually a continuous variable, so it is a good idea to use float64 for it).

df["gender"] = df["gender"].astype("category")
df["weight"] = df["weight"].astype("Int64")
df["id"] = df["id"].astype("string")
df.dtypes
name              object
id        string[python]
height             int64
weight             Int64
gender          category
bp               float64
dtype: object
df
    name id  height  weight gender     bp
0    Bob  4     177      83      m   93.2
1    Sue  2     179    <NA>      f   98.8
2  Alice  1     166      65      f   87.1
3   Hank  3     187      90      m   91.0
4    Tom  5     182      77      m  101.7
5   Mara  6     171      69      f   88.6

Alternatively, we can specify the desired data types when importing the file with the dtype argument. This is more efficient, especially if the data frame is large:

df = pd.read_csv(
    "data.csv",
    dtype={"gender": "category", "weight": "Int64", "id": "string"}
)
df.dtypes
name              object
id        string[python]
height             int64
weight             Int64
gender          category
bp               float64
dtype: object

Let’s now take a look at another text file data.txt with the following contents:

Bob;4;177;83;m;93,2
Sue;2;179;;f;98,8
Alice;1;166;65;f;87,1
Hank;3;187;90;m;91
Tom;5;182;77;m;101,7
Mara;6;171;69;f;88,6

This file differs from the previous one in three ways:

  1. There is no header row.
  2. Columns are separated by semi-colons (;).
  3. Decimal numbers use a comma (,) instead of a dot (.) as the decimal mark (which is common in German-speaking regions).

In order to import this file, we need to tweak some parameters, but let’s first see what happens if we just read the file as before:

pd.read_csv("data.txt")
     Bob;4;177;83;m;93    2
0      Sue;2;179;;f;98  8.0
1  Alice;1;166;65;f;87  1.0
2   Hank;3;187;90;m;91  NaN
3   Tom;5;182;77;m;101  7.0
4   Mara;6;171;69;f;88  6.0

As expected, this does not work, because the resulting data frame consists of only two columns, and the column labels contain actual data. We can use the following arguments to adapt to the three changes we identified previously:

  1. header=None specifies that there is no header row in the data file.
  2. sep=";" sets the column separator.
  3. decimal="," sets the decimal marker.

With these additional arguments, the resulting data frame looks much better:

pd.read_csv("data.txt", header=None, sep=";", decimal=",")
       0  1    2     3  4      5
0    Bob  4  177  83.0  m   93.2
1    Sue  2  179   NaN  f   98.8
2  Alice  1  166  65.0  f   87.1
3   Hank  3  187  90.0  m   91.0
4    Tom  5  182  77.0  m  101.7
5   Mara  6  171  69.0  f   88.6

The only thing still missing is proper column names, which we can provide with the names argument:

pd.read_csv(
    "data.txt",
    header=None,
    sep=";",
    decimal=",",
    names=["name", "id", "height", "weight", "gender", "bp"]
)
    name  id  height  weight gender     bp
0    Bob   4     177    83.0      m   93.2
1    Sue   2     179     NaN      f   98.8
2  Alice   1     166    65.0      f   87.1
3   Hank   3     187    90.0      m   91.0
4    Tom   5     182    77.0      m  101.7
5   Mara   6     171    69.0      f   88.6

Of course, we could also specify custom column types as we did in our previous example. The pd.read_csv() function has many additional parameters that we do not have time to discuss. Instead, I encourage you to consult the documentation whenever you have problems importing a specific file. Chances are that many issues can be solved by setting a parameter to a suitable value.

Tip

A quick way to create a toy data frame is to use pd.read_csv() in combination with a string representing the contents of a fictitious CSV file. Since the function expects a file name or file-like object, we use io.StringIO() to wrap the string (we can then read from the given StringIO object as if it were a file):

from io import StringIO

pd.read_csv(StringIO("A,B,C\n1.,2,x\n4.,5,y\n7.,8,z"))
     A  B  C
0  1.0  2  x
1  4.0  5  y
2  7.0  8  z

Other useful functions for reading text data include pd.read_clipboard(), pd.read_fwf(), pd.read_json(), and pd.read_xml(). Consult the pandas documentation to learn more.

Excel files

Like it or not, Excel is still a popular tool for analyzing (small) data sets. It is also common as a data storage format (.xlsx). Pandas can import Excel sheets with pd.read_excel().

Important

To import Excel files with pandas, you need to install the following packages depending on the file format:

  • openpyxl for modern Excel files (.xlsx)
  • xlrd for older Excel files (.xls)
  • odfpy for OpenDocument files (.ods)

It is generally a good idea to install all three dependencies if you are planning to work with Excel files.

Let’s import data from an Excel file named data.xlsx (located in the working directory), which contains the same table we saw in the previous section:

pd.read_excel("data.xlsx")
    name  id  height  weight gender     bp
0    Bob   4     177    83.0      m   93.2
1    Sue   2     179     NaN      f   98.8
2  Alice   1     166    65.0      f   87.1
3   Hank   3     187    90.0      m   91.0
4    Tom   5     182    77.0      m  101.7
5   Mara   6     171    69.0      f   88.6

We can use many of the arguments we already know from pd.read_csv() (such as header and dtype). In addition, Excel files can contain multiple sheets. To specify which sheet you want to import, use the sheet_name argument. For more information on all available parameters, consult the pd.read_excel() reference.
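
For example, assuming the workbook contained an additional sheet named "Sheet2" (a hypothetical name), we could import only that sheet as follows:

pd.read_excel("data.xlsx", sheet_name="Sheet2")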

SPSS/Stata/SAS files

Similar to Excel files, pandas can import data stored in SPSS, Stata, and SAS files with dedicated reader functions pd.read_spss(), pd.read_stata(), and pd.read_sas().
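
Usage mirrors the other reader functions; for example, a hypothetical SPSS file data.sav in the working directory could be imported as follows:

df = pd.read_spss("data.sav")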

Important

Reading SPSS, Stata, and SAS files requires the pyreadstat package.

Data wrangling

Data wrangling can be broken down into the following five basic tasks (as defined in R for Data Science):

  • Filtering observations (rows)
  • Arranging (reordering) rows
  • Selecting variables (columns)
  • Creating new variables (columns) based on existing ones
  • Summarizing data, typically in combination with grouping data

We will illustrate each operation using the flights data frame from the nycflights13 data set. This data frame is available as a zipped CSV file flights.zip in the working directory, and we can import it as follows:

flights = pd.read_csv(
    "flights.zip",
    dtype={
        "dep_time": "Int64",
        "dep_delay": "Int64",
        "arr_time": "Int64",
        "arr_delay": "Int64",
        "air_time": "Int64",
        "carrier": "string",
        "tailnum": "string",
        "origin": "string",
        "dest": "string"
    }
)
flights.drop(columns=["year", "hour", "minute", "time_hour"], inplace=True)
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 15 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   month           336776 non-null  int64 
 1   day             336776 non-null  int64 
 2   dep_time        328521 non-null  Int64 
 3   sched_dep_time  336776 non-null  int64 
 4   dep_delay       328521 non-null  Int64 
 5   arr_time        328063 non-null  Int64 
 6   sched_arr_time  336776 non-null  int64 
 7   arr_delay       327346 non-null  Int64 
 8   carrier         336776 non-null  string
 9   flight          336776 non-null  int64 
 10  tailnum         334264 non-null  string
 11  origin          336776 non-null  string
 12  dest            336776 non-null  string
 13  air_time        327346 non-null  Int64 
 14  distance        336776 non-null  int64 
dtypes: Int64(5), int64(6), string(4)
memory usage: 40.1 MB

Notice that in order to get better support for missing values, we manually specify the extension types Int64 and string for the integer and string columns, respectively. We also drop the year, hour, minute, and time_hour columns, because we won’t need them in our analyses (and they contain only redundant information anyway).

The nycflights13 data set contains data on all 336,776 flights departing from New York City airports (JFK, LGA, and EWR) in 2013. Let’s see what we can find out about these flights.

Filtering observations

First, we might want to know how many flights departed on New Year’s Day 2013. Each flight is represented by a row, so we need to filter rows with flights that departed on January 1 (columns month and day contain the dates).

The query() method filters rows according to a specified condition based on values in one or more columns. The query must be passed as a single string, which may contain column names without having to use any special indexing syntax. The query returns a new data frame containing only the filtered rows, which in our example consists of 842 rows:

flights.query("month == 1 and day == 1")
     month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
0        1    1       517             515  ...     EWR   IAH       227      1400
1        1    1       533             529  ...     LGA   IAH       227      1416
2        1    1       542             540  ...     JFK   MIA       160      1089
3        1    1       544             545  ...     JFK   BQN       183      1576
4        1    1       554             600  ...     LGA   ATL       116       762
..     ...  ...       ...             ...  ...     ...   ...       ...       ...
837      1    1      2356            2359  ...     JFK   BQN       186      1576
838      1    1      <NA>            1630  ...     EWR   RDU      <NA>       416
839      1    1      <NA>            1935  ...     LGA   DFW      <NA>      1389
840      1    1      <NA>            1500  ...     LGA   MIA      <NA>      1096
841      1    1      <NA>             600  ...     JFK   FLL      <NA>      1069

[842 rows x 15 columns]

Notice that we combined two conditions with and; we can also use or, for example to filter flights that departed in November or December:

flights.query("month == 11 or month == 12")
        month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
55893      11    1         5            2359  ...     JFK   PSE       205      1617
55894      11    1        35            2250  ...     JFK   SYR        36       209
55895      11    1       455             500  ...     EWR   CLT        88       529
55896      11    1       539             545  ...     LGA   IAH       229      1416
55897      11    1       542             545  ...     JFK   MIA       147      1089
...       ...  ...       ...             ...  ...     ...   ...       ...       ...
111291     12   31      <NA>             705  ...     EWR   DEN      <NA>      1605
111292     12   31      <NA>             825  ...     JFK   CLT      <NA>       541
111293     12   31      <NA>            1615  ...     LGA   RDU      <NA>       431
111294     12   31      <NA>             600  ...     EWR   ORD      <NA>       719
111295     12   31      <NA>             830  ...     JFK   LAX      <NA>      2475

[55403 rows x 15 columns]
Important

A common mistake is to use the query "month == 11 or 12", which does not work because the or operator requires a complete condition on each side.
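
As an aside, query() also supports the in operator, which expresses such conditions more compactly (this yields the same result as the or query above):

flights.query("month in [11, 12]")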

The air_time column contains the flight time in minutes. Are there any flights with a duration of less than 25 minutes?

flights.query("air_time < 25")
        month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
826         1    1      2302            2200  ...     EWR   BDL        24       116
1281        1    2      1335            1322  ...     EWR   BDL        24       116
3953        1    5      1323            1325  ...     EWR   BDL        23       116
4202        1    5      1822            1829  ...     EWR   BDL        24       116
5130        1    6      2125            2129  ...     EWR   PHL        22        80
...       ...  ...       ...             ...  ...     ...   ...       ...       ...
330634      9   24      1147            1153  ...     EWR   BDL        23       116
331117      9   24      1939            1945  ...     JFK   PHL        24        94
333416      9   27       844             845  ...     JFK   PHL        23        94
334279      9   28       741             738  ...     EWR   BDL        23       116
336708      9   30      2032            1945  ...     JFK   PHL        24        94

[235 rows x 15 columns]

What about flights that made up time? In other words, how many flights had a smaller arrival delay than their corresponding departure delay?

flights.query("arr_delay < dep_delay")
        month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
3           1    1       544             545  ...     JFK   BQN       183      1576
4           1    1       554             600  ...     LGA   ATL       116       762
7           1    1       557             600  ...     LGA   IAD        53       229
8           1    1       557             600  ...     JFK   MCO       140       944
11          1    1       558             600  ...     JFK   TPA       158      1005
...       ...  ...       ...             ...  ...     ...   ...       ...       ...
336765      9   30      2240            2245  ...     JFK   SYR        41       209
336766      9   30      2240            2250  ...     JFK   BUF        52       301
336767      9   30      2241            2246  ...     JFK   ROC        47       264
336768      9   30      2307            2255  ...     JFK   BOS        33       187
336769      9   30      2349            2359  ...     JFK   PSE       196      1617

[221565 rows x 15 columns]
Note

Instead of query(), we can also filter rows with indexing. For example, flights with an air time of less than 25 minutes can be obtained with:

flights[flights["air_time"] < 25]
        month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
826         1    1      2302            2200  ...     EWR   BDL        24       116
1281        1    2      1335            1322  ...     EWR   BDL        24       116
3953        1    5      1323            1325  ...     EWR   BDL        23       116
4202        1    5      1822            1829  ...     EWR   BDL        24       116
5130        1    6      2125            2129  ...     EWR   PHL        22        80
...       ...  ...       ...             ...  ...     ...   ...       ...       ...
330634      9   24      1147            1153  ...     EWR   BDL        23       116
331117      9   24      1939            1945  ...     JFK   PHL        24        94
333416      9   27       844             845  ...     JFK   PHL        23        94
334279      9   28       741             738  ...     EWR   BDL        23       116
336708      9   30      2032            1945  ...     JFK   PHL        24        94

[235 rows x 15 columns]

However, in most cases this is more verbose than query(), especially when combining several conditions.

Filtering rows implicitly drops missing values that are present in the column(s) of interest. To explicitly drop missing values, we can use the dropna() method. The following command drops all rows containing a missing value in any column:

flights.dropna()
        month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
0           1    1       517             515  ...     EWR   IAH       227      1400
1           1    1       533             529  ...     LGA   IAH       227      1416
2           1    1       542             540  ...     JFK   MIA       160      1089
3           1    1       544             545  ...     JFK   BQN       183      1576
4           1    1       554             600  ...     LGA   ATL       116       762
...       ...  ...       ...             ...  ...     ...   ...       ...       ...
336765      9   30      2240            2245  ...     JFK   SYR        41       209
336766      9   30      2240            2250  ...     JFK   BUF        52       301
336767      9   30      2241            2246  ...     JFK   ROC        47       264
336768      9   30      2307            2255  ...     JFK   BOS        33       187
336769      9   30      2349            2359  ...     JFK   PSE       196      1617

[327346 rows x 15 columns]

If we only want to drop rows with missing values in one (or more) specific columns, we can use the subset argument. Here’s how to drop all rows that have missing values in the tailnum column:

flights.dropna(subset="tailnum")
        month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
0           1    1       517             515  ...     EWR   IAH       227      1400
1           1    1       533             529  ...     LGA   IAH       227      1416
2           1    1       542             540  ...     JFK   MIA       160      1089
3           1    1       544             545  ...     JFK   BQN       183      1576
4           1    1       554             600  ...     LGA   ATL       116       762
...       ...  ...       ...             ...  ...     ...   ...       ...       ...
336769      9   30      2349            2359  ...     JFK   PSE       196      1617
336770      9   30      <NA>            1842  ...     LGA   BNA      <NA>       764
336773      9   30      <NA>            1210  ...     LGA   BNA      <NA>       764
336774      9   30      <NA>            1159  ...     LGA   CLE      <NA>       419
336775      9   30      <NA>             840  ...     LGA   RDU      <NA>       431

[334264 rows x 15 columns]
Note

The dropna() method does not modify the original data frame by default (instead, it returns a new data frame). This can be changed by setting inplace=True.

Arranging rows

Another frequently used operation is sorting rows in a data frame according to values in one or more columns. The sort_values() method accomplishes this task, and by default it arranges rows in ascending order. For example, we can sort all flights chronologically by month, day, and scheduled departure time as follows:

flights.sort_values(["month", "day", "sched_dep_time"])
        month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
0           1    1       517             515  ...     EWR   IAH       227      1400
1           1    1       533             529  ...     LGA   IAH       227      1416
2           1    1       542             540  ...     JFK   MIA       160      1089
3           1    1       544             545  ...     JFK   BQN       183      1576
5           1    1       554             558  ...     EWR   ORD       150       719
...       ...  ...       ...             ...  ...     ...   ...       ...       ...
111276     12   31      2328            2330  ...     EWR   SJU       198      1608
110520     12   31        13            2359  ...     JFK   BQN       189      1576
110521     12   31        18            2359  ...     JFK   SJU       192      1598
111278     12   31      2355            2359  ...     JFK   SJU       195      1598
111279     12   31      2356            2359  ...     JFK   PSE       200      1617

[336776 rows x 15 columns]

Which flights had the largest departure delay? To answer this question, we can sort by the dep_delay column in descending order (so that flights with the largest delays are on top):

flights.sort_values("dep_delay", ascending=False)
        month  day  dep_time  sched_dep_time  ...  origin  dest  air_time  distance
7072        1    9       641             900  ...     JFK   HNL       640      4983
235778      6   15      1432            1935  ...     JFK   CMH        74       483
8239        1   10      1121            1635  ...     EWR   ORD       111       719
327043      9   20      1139            1845  ...     JFK   SFO       354      2586
270376      7   22       845            1600  ...     JFK   CVG        96       589
...       ...  ...       ...             ...  ...     ...   ...       ...       ...
336771      9   30      <NA>            1455  ...     JFK   DCA      <NA>       213
336772      9   30      <NA>            2200  ...     LGA   SYR      <NA>       198
336773      9   30      <NA>            1210  ...     LGA   BNA      <NA>       764
336774      9   30      <NA>            1159  ...     LGA   CLE      <NA>       419
336775      9   30      <NA>             840  ...     LGA   RDU      <NA>       431

[336776 rows x 15 columns]
Note

Missing values are sorted to the end by default (for both ascending and descending order); this can be changed with the na_position argument, as shown below.
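
If needed, we can move missing values to the top instead:

flights.sort_values("dep_delay", ascending=False, na_position="first")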

Unfortunately, the dep_delay column is not visible in the output, but we can select (and therefore show) this column with indexing:

flights.sort_values("dep_delay", ascending=False)["dep_delay"]
7072      1301
235778    1137
8239      1126
327043    1014
270376    1005
          ... 
336771    <NA>
336772    <NA>
336773    <NA>
336774    <NA>
336775    <NA>
Name: dep_delay, Length: 336776, dtype: Int64

1301 minutes is almost 22 hours! That’s a long delay for sure!

Selecting variables

Next, we will learn how to select specific columns with the filter() method. Let’s start with a basic example, where we pick two columns passed as a list of column labels:

flights.filter(["month", "day"])
        month  day
0           1    1
1           1    1
2           1    1
3           1    1
4           1    1
...       ...  ...
336771      9   30
336772      9   30
336773      9   30
336774      9   30
336775      9   30

[336776 rows x 2 columns]

We can also pick columns with regular expressions (using the regex argument). For example, we can pick all columns containing delay as follows:

flights.filter(regex="delay")
        dep_delay  arr_delay
0               2         11
1               4         20
2               2         33
3              -1        -18
4              -6        -25
...           ...        ...
336771       <NA>       <NA>
336772       <NA>       <NA>
336773       <NA>       <NA>
336774       <NA>       <NA>
336775       <NA>       <NA>

[336776 rows x 2 columns]

Or what about columns that contain the string arr_?

flights.filter(regex="arr_")
        arr_time  sched_arr_time  arr_delay
0            830             819         11
1            850             830         20
2            923             850         33
3           1004            1022        -18
4            812             837        -25
...          ...             ...        ...
336771      <NA>            1634       <NA>
336772      <NA>            2312       <NA>
336773      <NA>            1330       <NA>
336774      <NA>            1344       <NA>
336775      <NA>            1020       <NA>

[336776 rows x 3 columns]

The next example selects all columns starting with the letter “a”:

flights.filter(regex="^a")
        arr_time  arr_delay  air_time
0            830         11       227
1            850         20       227
2            923         33       160
3           1004        -18       183
4            812        -25       116
...          ...        ...       ...
336771      <NA>       <NA>      <NA>
336772      <NA>       <NA>      <NA>
336773      <NA>       <NA>      <NA>
336774      <NA>       <NA>      <NA>
336775      <NA>       <NA>      <NA>

[336776 rows x 3 columns]

Finally, we pick all columns ending with the word “time”:

flights.filter(regex="time$")
        dep_time  sched_dep_time  arr_time  sched_arr_time  air_time
0            517             515       830             819       227
1            533             529       850             830       227
2            542             540       923             850       160
3            544             545      1004            1022       183
4            554             600       812             837       116
...          ...             ...       ...             ...       ...
336771      <NA>            1455      <NA>            1634      <NA>
336772      <NA>            2200      <NA>            2312      <NA>
336773      <NA>            1210      <NA>            1330      <NA>
336774      <NA>            1159      <NA>            1344      <NA>
336775      <NA>             840      <NA>            1020      <NA>

[336776 rows x 5 columns]

Regular expressions are a powerful way to describe text patterns. We do not have time to go into more details, but it pays off to learn at least the basic concepts (for example with this tutorial).

Creating new variables

First, let’s create a smaller data frame with only a few columns. This will make it easier to see the new columns that we will add in this section.

df = flights.filter(["dep_delay", "arr_delay", "distance", "air_time"])

Now imagine we wanted to add a new column containing the gained time of each flight, which can be computed as the difference between arrival delay and departure delay. If the gain is negative, the airplane made up at least some of its departure delay, and therefore gained time.

Pandas offers the assign() method to compute new columns. Let’s add the previously discussed gain column. In addition, let’s also compute the average speed of each flight (in miles per hour) using the existing air_time and distance columns:

df.assign(
    gain=df["arr_delay"] - df["dep_delay"],
    hours=df["air_time"] / 60,
    speed=lambda x: x["distance"] / x["hours"]
)
        dep_delay  arr_delay  distance  air_time  gain     hours       speed
0               2         11      1400       227     9  3.783333  370.044053
1               4         20      1416       227    16  3.783333  374.273128
2               2         33      1089       160    31  2.666667     408.375
3              -1        -18      1576       183   -17      3.05  516.721311
4              -6        -25       762       116   -19  1.933333  394.137931
...           ...        ...       ...       ...   ...       ...         ...
336771       <NA>       <NA>       213      <NA>  <NA>      <NA>        <NA>
336772       <NA>       <NA>       198      <NA>  <NA>      <NA>        <NA>
336773       <NA>       <NA>       764      <NA>  <NA>      <NA>        <NA>
336774       <NA>       <NA>       419      <NA>  <NA>      <NA>        <NA>
336775       <NA>       <NA>       431      <NA>  <NA>      <NA>        <NA>

[336776 rows x 7 columns]

Notice that we can use columns that we just added to create another new variable like speed, where we use the hours column from the previous line (but in the same method call). We have to use a lambda function though, because df["hours"] does not exist in the original data frame.

Summarizing data

Summarizing data can yield valuable insights, especially when performed on grouped data. In most cases, our goal is to summarize or aggregate values from a single column. Each column is represented by a pd.Series, and we can use the agg() method to summarize its values with a given aggregation function. An aggregation function can be any function which returns exactly one value (a scalar) such as the mean, median, standard deviation, and so on.

Here’s an example which summarizes the dep_delay column by its mean:

flights["dep_delay"].agg("mean")
np.float64(12.639070257304708)

We could also select the column with the filter() method we saw before:

flights.filter(["dep_delay"]).agg("mean")
dep_delay    12.63907
dtype: Float64

Missing values are automatically removed before agg() calculates the summary statistic.

Note

The first example flights["dep_delay"].agg("mean") returns a scalar, whereas the second example flights.filter(["dep_delay"]).agg("mean") returns the value inside a pd.Series. The reason for this difference is that filter() always returns a data frame, even if it consists of only one column, and aggregating the columns of a data frame results in a series (of aggregated values). In contrast, square bracket indexing with a single label returns a pd.Series, and aggregating a series results in a single value.

We can even compute several summary statistics simultaneously by passing a list of functions:

flights.filter(["dep_delay"]).agg(["mean", "std", "min", "max"])
        dep_delay
mean    12.639070
std     40.210061
min    -43.000000
max   1301.000000

The most important aggregation functions are available as strings (as in the previous example), but in general we can pass an arbitrary function which takes a sequence-like object and returns a single value. For example, we could write the previous mean aggregation as:

flights["dep_delay"].agg(lambda x: sum(x.dropna()) / len(x.dropna()))
np.float64(12.639070257304708)
Note

For even moderately complex functions, lambda functions tend to get unreadable, so you should prefer a regular (named) function instead.
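
For example, the aggregation from above could be written with a named function like this (a sketch equivalent to the lambda):

def mean_without_missing(x):
    x = x.dropna()  # ignore missing values
    return sum(x) / len(x)

flights["dep_delay"].agg(mean_without_missing)
np.float64(12.639070257304708)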

Things get more interesting when we group the data first and aggregate within each group individually. The groupby() data frame method splits the data into groups based on one or more columns. For example, we can group by month and day to calculate the mean departure delay on each single day:

flights.groupby(["month", "day"])["dep_delay"].agg("mean")
month  day
1      1      11.548926
       2      13.858824
       3      10.987832
       4       8.951595
       5       5.732218
                ...    
12     27      10.93763
       28       7.98155
       29     22.309551
       30     10.698113
       31      6.996053
Name: dep_delay, Length: 365, dtype: Float64

Similarly, here’s the average departure delay per month:

flights.groupby("month")["dep_delay"].agg("mean")
month
1     10.036665
2     10.816843
3     13.227076
4     13.938038
5     12.986859
6     20.846332
7     21.727787
8      12.61104
9      6.722476
10     6.243988
11     5.435362
12    16.576688
Name: dep_delay, dtype: Float64

Notice how we select the dep_delay column of the grouped data frame with square bracket indexing. Another option is to specify the columns that should be aggregated in the argument of the agg() method as follows:

flights.groupby("month").agg({"dep_delay": "mean"})
       dep_delay
month           
1      10.036665
2      10.816843
3      13.227076
4      13.938038
5      12.986859
6      20.846332
7      21.727787
8       12.61104
9       6.722476
10      6.243988
11      5.435362
12     16.576688
Note

Do you see the difference in the outputs of the previous two examples? The first output is a pd.Series, whereas the second output is a pd.DataFrame with one column.

The "count" aggregation (also available as the count() method) counts the number of non-missing values in a column:

flights.groupby("month").agg({"dep_delay": "count"})
       dep_delay
month           
1          26483
2          23690
3          27973
4          27662
5          28233
6          27234
7          28485
8          28841
9          27122
10         28653
11         27035
12         27110

In contrast, the number of missing values in that column can be obtained with:

flights.groupby("month")["dep_delay"].agg(lambda x: x.isna().sum())
month
1      521
2     1261
3      861
4      668
5      563
6     1009
7      940
8      486
9      452
10     236
11     233
12    1025
Name: dep_delay, dtype: Int64

The value_counts() method determines the number of occurrences for each unique value in a column. For example, an individual airplane is identified by its tail number. If we want to know how many unique airplanes there are in the data set, and how many flights each individual airplane had, we can use the following command:

flights["tailnum"].value_counts()
tailnum
N725MQ    575
N722MQ    513
N723MQ    507
N711MQ    486
N713MQ    483
         ... 
N7ASAA      1
N932DN      1
N318AS      1
N651UA      1
N557AS      1
Name: count, Length: 4043, dtype: Int64

This output shows that there are 4043 unique airplanes, each of which is listed with its corresponding number of flights. The ten airplanes with the most flights are:

flights["tailnum"].value_counts().head(10)
tailnum
N725MQ    575
N722MQ    513
N723MQ    507
N711MQ    486
N713MQ    483
N258JB    427
N298JB    407
N353JB    404
N351JB    402
N735MQ    396
Name: count, dtype: Int64

How many airplanes have just one flight?

(flights["tailnum"].value_counts() == 1).sum()
np.int64(171)

Here, we exploited the fact that summing a boolean series counts True as 1 and False as 0.

Pipelines

Almost all interesting data wrangling tasks require a combination of the five previously mentioned basic operations. You might have already noticed this in some examples, where we first picked a subset of columns and then performed some operations (like adding new variables). Usually, this requires multiple steps involving temporary names, for example:

df1 = flights.filter(["dep_delay", "arr_delay", "distance", "air_time"])
df2 = df1.assign(
    gain=df1["arr_delay"] - df1["dep_delay"],
    hours=df1["air_time"] / 60,
    speed=lambda x: x["distance"] / x["hours"]
)
df2.sort_values("speed")
        dep_delay  arr_delay  distance  air_time  gain     hours      speed
24098          52        103        96        75    51      1.25       76.8
248839         -5         86       199       141    91      2.35  84.680851
305879         -8         25        94        61    33  1.016667  92.459016
25455          42         81        94        59    39  0.983333   95.59322
80009          -4         29        96        60    33       1.0       96.0
...           ...        ...       ...       ...   ...       ...        ...
336771       <NA>       <NA>       213      <NA>  <NA>      <NA>       <NA>
336772       <NA>       <NA>       198      <NA>  <NA>      <NA>       <NA>
336773       <NA>       <NA>       764      <NA>  <NA>      <NA>       <NA>
336774       <NA>       <NA>       419      <NA>  <NA>      <NA>       <NA>
336775       <NA>       <NA>       431      <NA>  <NA>      <NA>       <NA>

[336776 rows x 7 columns]

The problem here is that we do not really need the names df1 and df2 other than to perform the desired operations sequentially. Instead, we can use a so-called pipeline to chain two or more operations without the need to create temporary names:

(flights
    .filter(["dep_delay", "arr_delay", "distance", "air_time"])
    .assign(
        gain=flights["arr_delay"] - flights["dep_delay"],
        hours=flights["air_time"] / 60,
        speed=lambda x: x["distance"] / x["hours"]
    )
    .sort_values("speed", ascending=False)
)
        dep_delay  arr_delay  distance  air_time  gain     hours       speed
216447          9        -14       762        65   -23  1.083333  703.384615
251999         45         26      1008        93   -19      1.55  650.322581
205388         15         -1       594        55   -16  0.916667       648.0
157516          4          2       748        70    -2  1.166667  641.142857
10223          -1        -28      1035       105   -27      1.75  591.428571
...           ...        ...       ...       ...   ...       ...         ...
336771       <NA>       <NA>       213      <NA>  <NA>      <NA>        <NA>
336772       <NA>       <NA>       198      <NA>  <NA>      <NA>        <NA>
336773       <NA>       <NA>       764      <NA>  <NA>      <NA>        <NA>
336774       <NA>       <NA>       419      <NA>  <NA>      <NA>        <NA>
336775       <NA>       <NA>       431      <NA>  <NA>      <NA>        <NA>

[336776 rows x 7 columns]
Important

Breaking up a pipeline into multiple lines (as shown in the previous example) creates a nicely formatted representation of the sequence of steps. However, we need to enclose the entire pipeline in parentheses; otherwise, Python would treat each line as a separate (incomplete) statement.
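
Explicit backslash line continuations would work as well, but they are fragile (not even a comment may follow the backslash) and generally considered less readable; a minimal sketch:

flights \
    .filter(["dep_delay", "arr_delay", "distance", "air_time"]) \
    .sort_values("distance")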

Tidy data

There are many ways to organize data into rows and columns, but the tidy data format has some compelling advantages. A table is tidy when

  • each variable corresponds to exactly one column,
  • each observation corresponds to exactly one row,
  • and each value corresponds to exactly one cell.

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

Consider the following table (taken from the tidyr R package):

Table 1: A tidy table.
country      year   cases  population
Afghanistan  1999     745    19987071
Afghanistan  2000    2666    20595360
Brazil       1999   37737   172006362
Brazil       2000   80488   174504898
China        1999  212258  1272915272
China        2000  213766  1280428583

This table is tidy because it meets all three criteria. It might be difficult to see why this is the case, so let’s look at other tables (containing the same data) that are not tidy. The following table combines two variables into one column, which makes the table twice as long:

Table 2: Variables “cases” and “population” are not in two separate columns.
country      year  type             count
Afghanistan  1999  cases              745
Afghanistan  1999  population    19987071
Afghanistan  2000  cases             2666
Afghanistan  2000  population    20595360
Brazil       1999  cases            37737
Brazil       1999  population   172006362
Brazil       2000  cases            80488
Brazil       2000  population   174504898
China        1999  cases           212258
China        1999  population  1272915272
China        2000  cases           213766
China        2000  population  1280428583

For the same reason, the next table is also not tidy. It contains a combination of two variables in one column. In addition, this column is not numeric, which makes it difficult to work with the actual data.

Table 3: Variables “cases” and “population” are combined in one column.
country      year  rate
Afghanistan  1999  745/19987071
Afghanistan  2000  2666/20595360
Brazil       1999  37737/172006362
Brazil       2000  80488/174504898
China        1999  212258/1272915272
China        2000  213766/1280428583

In pandas, working with tidy data facilitates many operations, so whenever possible this should be the preferred table format. However, depending on the task, tidy data might not always be practical, so it is completely fine to use a different organization when it works better.

Pandas makes it pretty straightforward to convert a given table to a tidy format using the pivot() method (the reverse operation, from wide to long, is available as melt()). Let’s start with the data from Table 2 and try to make it tidy:

messy1 = pd.read_csv(StringIO("""country,year,type,count
Afghanistan,1999,cases,745
Afghanistan,1999,population,19987071
Afghanistan,2000,cases,2666
Afghanistan,2000,population,20595360
Brazil,1999,cases,37737
Brazil,1999,population,172006362
Brazil,2000,cases,80488
Brazil,2000,population,174504898
China,1999,cases,212258
China,1999,population,1272915272
China,2000,cases,213766
China,2000,population,1280428583"""))
messy1
        country  year        type       count
0   Afghanistan  1999       cases         745
1   Afghanistan  1999  population    19987071
2   Afghanistan  2000       cases        2666
3   Afghanistan  2000  population    20595360
4        Brazil  1999       cases       37737
5        Brazil  1999  population   172006362
6        Brazil  2000       cases       80488
7        Brazil  2000  population   174504898
8         China  1999       cases      212258
9         China  1999  population  1272915272
10        China  2000       cases      213766
11        China  2000  population  1280428583

The index argument specifies which columns should become the new (row) index, the columns argument specifies which column’s unique values should be spread into new columns, and the (optional) values argument specifies which column provides the values of the new data frame:

messy1.pivot(index=["country", "year"], columns="type", values="count")
type               cases  population
country     year                    
Afghanistan 1999     745    19987071
            2000    2666    20595360
Brazil      1999   37737   172006362
            2000   80488   174504898
China       1999  212258  1272915272
            2000  213766  1280428583

This result is a data frame consisting of two columns cases and population and an index consisting of country and year (a so-called hierarchical index or multi-index). We can convert a multi-index into separate columns using the reset_index() method (which resets the index to the default integer series starting at zero):

(messy1
    .pivot(index=["country", "year"], columns="type", values="count")
    .reset_index()
)
type      country  year   cases  population
0     Afghanistan  1999     745    19987071
1     Afghanistan  2000    2666    20595360
2          Brazil  1999   37737   172006362
3          Brazil  2000   80488   174504898
4           China  1999  212258  1272915272
5           China  2000  213766  1280428583

The word type in the output is the name of the columns object, left over from the original type column (this name is usually empty). If this bothers you, the column name attribute can be set to None (but this attribute assignment cannot happen inside a pipeline):

tidy1 = (messy1
    .pivot(index=["country", "year"], columns="type", values="count")
    .reset_index()
)
tidy1.columns.name = None
tidy1
       country  year   cases  population
0  Afghanistan  1999     745    19987071
1  Afghanistan  2000    2666    20595360
2       Brazil  1999   37737   172006362
3       Brazil  2000   80488   174504898
4        China  1999  212258  1272915272
5        China  2000  213766  1280428583
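
As an aside, the rename_axis() method can remove the columns name without leaving the pipeline; a sketch equivalent to the previous two cells:

(messy1
    .pivot(index=["country", "year"], columns="type", values="count")
    .reset_index()
    .rename_axis(columns=None)  # removes the leftover "type" columns name
)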

Let’s try to tidy the data from Table 3:

messy2 = pd.read_csv(StringIO("""country,year,rate
Afghanistan,1999,745/19987071
Afghanistan,2000,2666/20595360
Brazil,1999,37737/172006362
Brazil,2000,80488/174504898
China,1999,212258/1272915272
China,2000,213766/1280428583"""))
messy2
       country  year               rate
0  Afghanistan  1999       745/19987071
1  Afghanistan  2000      2666/20595360
2       Brazil  1999    37737/172006362
3       Brazil  2000    80488/174504898
4        China  1999  212258/1272915272
5        China  2000  213766/1280428583

The problem here is that two variables are combined into one column rate, which is of type object:

messy2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   country  6 non-null      object
 1   year     6 non-null      int64 
 2   rate     6 non-null      object
dtypes: int64(1), object(2)
memory usage: 276.0+ bytes

String columns offer many useful methods, such as slicing and splitting, through the str method namespace. In our example, we would like to split the column on the / character, resulting in two new columns:

messy2["rate"].str.split("/", expand=True)
        0           1
0     745    19987071
1    2666    20595360
2   37737   172006362
3   80488   174504898
4  212258  1272915272
5  213766  1280428583

Note that this operation creates a new data frame because we set expand=True (with the default expand=False, the result would be a single series containing lists). We can add the resulting data frame to our tidy result:

tidy2 = messy2.loc[:, ["country", "year"]]  # keep the columns that are already tidy
tidy2[["count", "population"]] = messy2["rate"].str.split("/", expand=True)

Now we are almost done: the data frame is tidy, but the count and population columns are still not numeric (they are of type object, just like the original rate column). We need to convert them explicitly, for example with pd.to_numeric():

tidy2["count"] = pd.to_numeric(tidy2["count"])
tidy2["population"] = pd.to_numeric(tidy2["population"])
tidy2
       country  year   count  population
0  Afghanistan  1999     745    19987071
1  Afghanistan  2000    2666    20595360
2       Brazil  1999   37737   172006362
3       Brazil  2000   80488   174504898
4        China  1999  212258  1272915272
5        China  2000  213766  1280428583
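
Alternatively, both conversions fit into a single statement, for example using astype() with a dictionary mapping column names to types (a sketch):

tidy2 = tidy2.astype({"count": "int64", "population": "int64"})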

COVID-19 example

Let’s apply some of the things we learned in this chapter to a real-world data set: historic COVID-19 cases in Austria. The data is not available on any of the official government websites anymore, but we can use a snapshot from June 28, 2023 (renamed to covid19.csv), which we put in our working directory.

Using pd.read_csv(), we can then import this file as follows:

covid19 = pd.read_csv(
    "covid19.csv",
    sep=";",
    decimal=",",
    parse_dates=["Time"],
    dayfirst=True
)

The file uses semicolons (;) to separate columns and commas (,) as decimal separators, which we need to specify in order to import the data correctly. Passing parse_dates=["Time"] together with dayfirst=True makes sure that the Time column is parsed into the datetime data type (with the day preceding the month). If you are wondering where all this information comes from: the only way to find out for sure is to open the file in a text editor and inspect the values!

Note

If we do not specify decimal=",", pandas expects a dot (.) as the decimal separator. If we then import a column with a different decimal separator, it will be interpreted as type object. Therefore, it is important to carefully check all column types after importing (and verify that numeric columns have numeric types).
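
A quick sanity check after importing is to inspect the dtypes attribute (or call info(), as shown below):

covid19.dtypes  # numeric columns showing up as object indicate an import problem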

There are 12170 rows and 12 columns with no missing values, and all column data types have been set or inferred correctly (the first column is a datetime, and all remaining columns are numeric except for Bundesland, which contains the names of the Austrian provinces):

covid19.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12170 entries, 0 to 12169
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Time                      12170 non-null  datetime64[ns]
 1   Bundesland                12170 non-null  object        
 2   BundeslandID              12170 non-null  int64         
 3   AnzEinwohner              12170 non-null  int64         
 4   AnzahlFaelle              12170 non-null  int64         
 5   AnzahlFaelleSum           12170 non-null  int64         
 6   AnzahlFaelle7Tage         12170 non-null  int64         
 7   SiebenTageInzidenzFaelle  12170 non-null  float64       
 8   AnzahlTotTaeglich         12170 non-null  int64         
 9   AnzahlTotSum              12170 non-null  int64         
 10  AnzahlGeheiltTaeglich     12170 non-null  int64         
 11  AnzahlGeheiltSum          12170 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(9), object(1)
memory usage: 1.1+ MB

There are many interesting questions we could try to answer with this data set. Let’s highlight a few simple ones that each introduce a new feature that we have not seen before.

First, because COVID-19 infections heavily depend on the seasons, we could be interested in average daily new cases per month. A datetime series exposes specialized methods in its dt method namespace. Here, we use dt.month to extract the month of the year (1–12) and add this information as a new column month. Then we group by month and compute the average in each group:

(covid19
    .assign(month=covid19["Time"].dt.month)
    .groupby("month")
    .agg({"AnzahlFaelle": "mean"})
)
       AnzahlFaelle
month              
1       1589.455914
2       2287.820455
3       2211.537097
4        658.375000
5        241.375806
6        337.039655
7        710.111828
8        440.939785
9        600.555556
10       990.653763
11      1398.960000
12       705.918280
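
As an aside, the dictionary form of agg() keeps the original column name; so-called named aggregation lets us choose a custom name instead (a sketch of the same computation):

(covid19
    .assign(month=covid19["Time"].dt.month)
    .groupby("month")
    .agg(mean_cases=("AnzahlFaelle", "mean"))
)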

Similarly, we know that new cases also depend on the day of the week, because fewer tests are administered on the weekend. Therefore, we can compute average cases aggregated per weekday (0–6 correspond to Monday–Sunday) as follows:

(covid19
    .assign(weekday=covid19["Time"].dt.weekday)
    .groupby("weekday")
    .agg({"AnzahlFaelle": "mean"})
)
         AnzahlFaelle
weekday              
0          945.517241
1         1292.931792
2         1122.196552
3         1072.447126
4          999.032184
5          838.965517
6          725.971264
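
If the numeric weekday codes are hard to read, dt.day_name() produces labels such as "Monday" instead (note that grouping then sorts the labels alphabetically rather than chronologically); a sketch:

(covid19
    .assign(weekday=covid19["Time"].dt.day_name())
    .groupby("weekday")
    .agg({"AnzahlFaelle": "mean"})
)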

Let’s combine these two separate analyses and calculate the average number of cases on each weekday per month by grouping by both columns:

(covid19
    .assign(month=covid19["Time"].dt.month)
    .assign(weekday=covid19["Time"].dt.weekday)
    .groupby(["month", "weekday"])
    .agg({"AnzahlFaelle": "mean"})
)
               AnzahlFaelle
month weekday              
1     0         1619.042857
      1         1835.153846
      2         1733.100000
      3         1839.933333
      4         1556.461538
...                     ...
12    2          809.985714
      3          735.520000
      4          738.914286
      5          614.707692
      6          494.066667

[84 rows x 1 columns]
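
If a 12 × 7 table (months as rows, weekdays as columns) is more convenient, unstack() can pivot the innermost index level into columns; a sketch:

(covid19
    .assign(month=covid19["Time"].dt.month, weekday=covid19["Time"].dt.weekday)
    .groupby(["month", "weekday"])
    .agg({"AnzahlFaelle": "mean"})
    .unstack()
)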

Finally, we can also create plots directly from data frames and/or single columns. That way, we can recreate the famous time course of daily COVID-19 infections over the whole time period. We will plot the daily values for Austria with time on the x-axis and new cases on the y-axis:

(covid19
    .query("Bundesland == 'Österreich'")
    .set_index("Time")
    ["AnzahlFaelle"]
    .plot(title="Daily new COVID-19 cases in Austria")
)

Note that when we plot a single column, the index will be displayed on the x-axis. That’s why we explicitly set the index to the Time column (which means that this column gets “promoted” to the index). After that, we pick the column of interest and call its plot() method to generate the visualization.
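
Under the hood, plot() uses matplotlib and returns an Axes object, so the figure can be customized or saved with standard matplotlib tools; a sketch with a hypothetical file name:

ax = (covid19
    .query("Bundesland == 'Österreich'")
    .set_index("Time")
    ["AnzahlFaelle"]
    .plot(title="Daily new COVID-19 cases in Austria")
)
ax.figure.savefig("covid19_cases.png")  # hypothetical output file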


© Clemens Brunner (CC BY-NC-SA 4.0)

Footnotes

  1. NumPy supports arrays with different data types through structured arrays, but pandas data frames are usually much easier to work with in a typical data analysis pipeline.

  2. Note that pandas is spelled with all lower-case letters. We will capitalize the first letter only at the beginning of a sentence.