Pandas basics

Pandas is a Python library for working with tabular data. The main objects it provides are Series and DataFrame.

[10]:
import pandas as pd

Pandas Series

Pandas Series is essentially a one-dimensional array, equipped with an index which labels its entries. We can create a Series object, for example, by converting a list:

[11]:
diameters = pd.Series([4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528])
diameters
[11]:
0      4879
1     12104
2     12756
3      6792
4    142984
5    120536
6     51118
7     49528
dtype: int64

By default entries of a Series are indexed by consecutive integers, but we can specify a more meaningful index. The numbers in the above Series give diameters (in kilometers) of planets of the Solar System, so it is sensible to use names of the planet as index values:

[12]:
diameters = pd.Series([4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528],
                      index=["Mercury", "Venus", "Earth", "Mars", "Jupyter", "Saturn",
                             "Uranus", "Neptune"])
diameters
[12]:
Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
dtype: int64

We can use index values to select Series entries:

[13]:
diameters["Earth"]
[13]:
12756

Slicing using index works as well:

[14]:
diameters["Mercury":"Mars"]
[14]:
Mercury     4879
Venus      12104
Earth      12756
Mars        6792
dtype: int64

Note. In pandas slicing with index values works a bit differently that slicing in Python in general: the code diameters["Mercury":"Mars"] selects all entries from "Mercury" up to and including "Mars".

We can also select Series values by specifying a list of index labels:

[15]:
diameters[["Earth", "Jupyter", "Neptune"]]
[15]:
Earth       12756
Jupyter    142984
Neptune     49528
dtype: int64

Indexing can be used to modify Series values or to add new values:

[16]:
diameters["Pluto"] = 2370
diameters
[16]:
Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64

Pandas DataFrame

Pandas DataFrame is a two-dimensional array equipped with one index labeling its rows, and another labeling its columns. There are several ways of creating a DataFrame. One of them is to use a dictionary of lists. Each list gives values of a column of the DataFrame, and dictionary keys give column labels:

[17]:
d = {"diameter" : [4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528, 2370],
     "avg_temp" : [167, 464, 15, -65, -110, -140, -195, -200, -225],
     "gravity" : [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0, 0.7]}

planets = pd.DataFrame(d)
planets
[17]:
diameter avg_temp gravity
0 4879 167 3.7
1 12104 464 8.9
2 12756 15 9.8
3 6792 -65 3.7
4 142984 -110 23.1
5 120536 -140 9.0
6 51118 -195 8.7
7 49528 -200 11.0
8 2370 -225 0.7

The head and tail methods lets us select the first snd last few rows of a DataFrame:

[18]:
# get the first 3 rows
planets.head(3)
[18]:
diameter avg_temp gravity
0 4879 167 3.7
1 12104 464 8.9
2 12756 15 9.8
[19]:
# get the last 2 rows
planets.tail(2)
[19]:
diameter avg_temp gravity
7 49528 -200 11.0
8 2370 -225 0.7

Index of DataFrame columns can be accessed using the columns property:

[20]:
planets.columns
[20]:
Index(['diameter', 'avg_temp', 'gravity'], dtype='object')

Index of rows can be obtained using the index property:

[21]:
planets.index
[21]:
RangeIndex(start=0, stop=9, step=1)

Since we have not specified an index for rows, by default it consists of consecutive integers. We can change it by modifying the index:

[22]:
planets.index = ["Mercury", "Venus", "Earth", "Mars", "Jupyter", "Saturn",
                 "Uranus", "Neptune", "Pluto"]
planets
[22]:
diameter avg_temp gravity
Mercury 4879 167 3.7
Venus 12104 464 8.9
Earth 12756 15 9.8
Mars 6792 -65 3.7
Jupyter 142984 -110 23.1
Saturn 120536 -140 9.0
Uranus 51118 -195 8.7
Neptune 49528 -200 11.0
Pluto 2370 -225 0.7
[23]:
planets.index
[23]:
Index(['Mercury', 'Venus', 'Earth', 'Mars', 'Jupyter', 'Saturn', 'Uranus',
       'Neptune', 'Pluto'],
      dtype='object')

Selecting data in a DataFrame

Basic selections

Columns of a DataFrame can be selected by specifying a column name. This returns a Series with column data:

[24]:
planets["gravity"]
[24]:
Mercury     3.7
Venus       8.9
Earth       9.8
Mars        3.7
Jupyter    23.1
Saturn      9.0
Uranus      8.7
Neptune    11.0
Pluto       0.7
Name: gravity, dtype: float64

Specifying a list of column names gives a DataFrame consisting of the selected columns:

[25]:
planets[["gravity", "diameter"]]
[25]:
gravity diameter
Mercury 3.7 4879
Venus 8.9 12104
Earth 9.8 12756
Mars 3.7 6792
Jupyter 23.1 142984
Saturn 9.0 120536
Uranus 8.7 51118
Neptune 11.0 49528
Pluto 0.7 2370

Slicing selects rows of a DataFrame:

[26]:
planets["Earth":"Saturn"]
[26]:
diameter avg_temp gravity
Earth 12756 15 9.8
Mars 6792 -65 3.7
Jupyter 142984 -110 23.1
Saturn 120536 -140 9.0

Selections with loc and iloc

To get more flexibility with DataFrame selections we can use the loc indexer. It lets us select rows and columns using row and column labels, lists of labels, and slices:

[27]:
# select an entry in a given row and column
planets.loc["Earth", "gravity"]
[27]:
9.8
[28]:
# select one row and a list of columns
planets.loc["Earth", ["diameter", "gravity"]]
[28]:
diameter    12756.0
gravity         9.8
Name: Earth, dtype: float64
[29]:
# select a slice of rows and a list of columns
planets.loc["Earth":"Saturn", ["gravity", "avg_temp"]]
[29]:
gravity avg_temp
Earth 9.8 15
Mars 3.7 -65
Jupyter 23.1 -110
Saturn 9.0 -140

The iloc indexer works similarly, but it uses row and column numbers instead of their labels:

[30]:
# select the first 3 rows
planets.iloc[:3]
[30]:
diameter avg_temp gravity
Mercury 4879 167 3.7
Venus 12104 464 8.9
Earth 12756 15 9.8
[31]:
# select all rows and the first 2 columns
planets.iloc[: , :2]
[31]:
diameter avg_temp
Mercury 4879 167
Venus 12104 464
Earth 12756 15
Mars 6792 -65
Jupyter 142984 -110
Saturn 120536 -140
Uranus 51118 -195
Neptune 49528 -200
Pluto 2370 -225
[32]:
# select using a list of row numbers and a slice of column numbers
planets.iloc[[0, 4, 5], 1:2]
[32]:
avg_temp
Mercury 167
Jupyter -110
Saturn -140

loc snd iloc work with pandas Series too:

[33]:
diameters
[33]:
Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64
[34]:
# loc with pandas Series
diameters.loc["Jupyter":"Pluto"]
[34]:
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64
[35]:
# iloc with pandas Series
diameters.iloc[[0, 1, 5]]
[35]:
Mercury      4879
Venus       12104
Saturn     120536
dtype: int64

Boolean indexing

Similarly as with numpy arrays, we can use Boolean indexing to select parts of pandas Series and DataFrames.

[36]:
planets
[36]:
diameter avg_temp gravity
Mercury 4879 167 3.7
Venus 12104 464 8.9
Earth 12756 15 9.8
Mars 6792 -65 3.7
Jupyter 142984 -110 23.1
Saturn 120536 -140 9.0
Uranus 51118 -195 8.7
Neptune 49528 -200 11.0
Pluto 2370 -225 0.7
[37]:
# check which planets have diameter greater than 100,000 km
planets["diameter"] > 100000
[37]:
Mercury    False
Venus      False
Earth      False
Mars       False
Jupyter     True
Saturn      True
Uranus     False
Neptune    False
Pluto      False
Name: diameter, dtype: bool
[38]:
# select all planets with diameter greater than 100,000 km
planets[planets["diameter"] > 100000]
[38]:
diameter avg_temp gravity
Jupyter 142984 -110 23.1
Saturn 120536 -140 9.0

Notice that if the Boolean array used for selection is one-dimensional, then is selects rows of the DataFrame. Other options (using a two-dimensional Boolean array, using a one-dimensional array to select columns) are also possible.

Logical conditions can be combined the same way as for numpy arrays:

[39]:
# select planets based on their average temperatue and gravity
planets[(planets["avg_temp"] > 0) & (planets["gravity"] > 5) ]
[39]:
diameter avg_temp gravity
Venus 12104 464 8.9
Earth 12756 15 9.8

Sorting

Values of pandas Series can be sorted using the sort_values method:

[40]:
diameters
[40]:
Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64
[41]:
# sort values in ascending order
diameters.sort_values()
[41]:
Pluto        2370
Mercury      4879
Mars         6792
Venus       12104
Earth       12756
Neptune     49528
Uranus      51118
Saturn     120536
Jupyter    142984
dtype: int64

By default values are sorted in the ascending order, but we can change it:

[42]:
# sort values in descending order
diameters.sort_values(ascending=False)
[42]:
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Earth       12756
Venus       12104
Mars         6792
Mercury      4879
Pluto        2370
dtype: int64

sort_values works with DataFrames as well, but in this case we need to specify which column (or row) should be used for sorting:

[43]:
# sort using the "diameter" column
planets.sort_values(by="diameter")
[43]:
diameter avg_temp gravity
Pluto 2370 -225 0.7
Mercury 4879 167 3.7
Mars 6792 -65 3.7
Venus 12104 464 8.9
Earth 12756 15 9.8
Neptune 49528 -200 11.0
Uranus 51118 -195 8.7
Saturn 120536 -140 9.0
Jupyter 142984 -110 23.1
[44]:
# sort using the "gravity" column in descending order
planets.sort_values(by="gravity", ascending=False)
[44]:
diameter avg_temp gravity
Jupyter 142984 -110 23.1
Neptune 49528 -200 11.0
Earth 12756 15 9.8
Saturn 120536 -140 9.0
Venus 12104 464 8.9
Uranus 51118 -195 8.7
Mercury 4879 167 3.7
Mars 6792 -65 3.7
Pluto 2370 -225 0.7

The sort_values method accepts axis argument. The default value is axis=0 which sorts columns. By changing it to axis=1 we can sort rows:

[45]:
# sort values in the "Mercury" row
planets.sort_values(by="Mercury", axis=1)
[45]:
gravity avg_temp diameter
Mercury 3.7 167 4879
Venus 8.9 464 12104
Earth 9.8 15 12756
Mars 3.7 -65 6792
Jupyter 23.1 -110 142984
Saturn 9.0 -140 120536
Uranus 8.7 -195 51118
Neptune 11.0 -200 49528
Pluto 0.7 -225 2370

The mathod sort_index allows us to sort a DataFrame according to the index of its rows or columns:

[46]:
# sort row index
planets.sort_index()
[46]:
diameter avg_temp gravity
Earth 12756 15 9.8
Jupyter 142984 -110 23.1
Mars 6792 -65 3.7
Mercury 4879 167 3.7
Neptune 49528 -200 11.0
Pluto 2370 -225 0.7
Saturn 120536 -140 9.0
Uranus 51118 -195 8.7
Venus 12104 464 8.9
[47]:
# sort column index
planets.sort_index(axis=1)
[47]:
avg_temp diameter gravity
Mercury 167 4879 3.7
Venus 464 12104 8.9
Earth 15 12756 9.8
Mars -65 6792 3.7
Jupyter -110 142984 23.1
Saturn -140 120536 9.0
Uranus -195 51118 8.7
Neptune -200 49528 11.0
Pluto -225 2370 0.7

Modifying DataFrames and Series

Adding rows and columns

We can add a column to a DataFrame by specifying its name and values:

[48]:
planets["moons"] = [0, 0, 1, 2, 79, 82, 27, 14, 5]
planets
[48]:
diameter avg_temp gravity moons
Mercury 4879 167 3.7 0
Venus 12104 464 8.9 0
Earth 12756 15 9.8 1
Mars 6792 -65 3.7 2
Jupyter 142984 -110 23.1 79
Saturn 120536 -140 9.0 82
Uranus 51118 -195 8.7 27
Neptune 49528 -200 11.0 14
Pluto 2370 -225 0.7 5

Rows can be added in a similar way:

[49]:
planets.loc["Moon"] = [3475, -20, 1.6, 0]
planets
[49]:
diameter avg_temp gravity moons
Mercury 4879.0 167.0 3.7 0.0
Venus 12104.0 464.0 8.9 0.0
Earth 12756.0 15.0 9.8 1.0
Mars 6792.0 -65.0 3.7 2.0
Jupyter 142984.0 -110.0 23.1 79.0
Saturn 120536.0 -140.0 9.0 82.0
Uranus 51118.0 -195.0 8.7 27.0
Neptune 49528.0 -200.0 11.0 14.0
Pluto 2370.0 -225.0 0.7 5.0
Moon 3475.0 -20.0 1.6 0.0

New columns can be created based on the values of existing columns. For example, in the planets DataFrame the column avg_temp gives temperatures in degrees Celsius. We can add a new column with temperatures in degrees Fahrenheit:

[50]:
planets["avg_temp_F"] = planets["avg_temp"]*1.8 + 32
planets
[50]:
diameter avg_temp gravity moons avg_temp_F
Mercury 4879.0 167.0 3.7 0.0 332.6
Venus 12104.0 464.0 8.9 0.0 867.2
Earth 12756.0 15.0 9.8 1.0 59.0
Mars 6792.0 -65.0 3.7 2.0 -85.0
Jupyter 142984.0 -110.0 23.1 79.0 -166.0
Saturn 120536.0 -140.0 9.0 82.0 -220.0
Uranus 51118.0 -195.0 8.7 27.0 -319.0
Neptune 49528.0 -200.0 11.0 14.0 -328.0
Pluto 2370.0 -225.0 0.7 5.0 -373.0
Moon 3475.0 -20.0 1.6 0.0 -4.0

By default a new column is added as the last column of a DataFrame. However, we can rearrange columns, by selecting them in any given order:

[51]:
# make a copy of the DataFrame with rearranged columns
planets = planets[['diameter', 'avg_temp', 'avg_temp_F', 'gravity', 'moons', ]].copy()
planets
[51]:
diameter avg_temp avg_temp_F gravity moons
Mercury 4879.0 167.0 332.6 3.7 0.0
Venus 12104.0 464.0 867.2 8.9 0.0
Earth 12756.0 15.0 59.0 9.8 1.0
Mars 6792.0 -65.0 -85.0 3.7 2.0
Jupyter 142984.0 -110.0 -166.0 23.1 79.0
Saturn 120536.0 -140.0 -220.0 9.0 82.0
Uranus 51118.0 -195.0 -319.0 8.7 27.0
Neptune 49528.0 -200.0 -328.0 11.0 14.0
Pluto 2370.0 -225.0 -373.0 0.7 5.0
Moon 3475.0 -20.0 -4.0 1.6 0.0

Using numpy functions

Numpy functions can be be applied directly to pandas Series:

[52]:
diameters
[52]:
Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64
[53]:
import numpy as np

# take base 10 logarithm of each element of the Series
np.log10(diameters)
[53]:
Mercury    3.688331
Venus      4.082929
Earth      4.105715
Mars       3.831998
Jupyter    5.155287
Saturn     5.081117
Uranus     4.708574
Neptune    4.694851
Pluto      3.374748
dtype: float64

Since each row and each column of a DataFrame is a Series, we can use numpy functions to modify DataFrame rows and columns, or to create new rows and columns:

[54]:
# create a new column with logarithms of diameters of planets
planets["log10_diameter"] = np.log10(planets["diameter"])
planets
[54]:
diameter avg_temp avg_temp_F gravity moons log10_diameter
Mercury 4879.0 167.0 332.6 3.7 0.0 3.688331
Venus 12104.0 464.0 867.2 8.9 0.0 4.082929
Earth 12756.0 15.0 59.0 9.8 1.0 4.105715
Mars 6792.0 -65.0 -85.0 3.7 2.0 3.831998
Jupyter 142984.0 -110.0 -166.0 23.1 79.0 5.155287
Saturn 120536.0 -140.0 -220.0 9.0 82.0 5.081117
Uranus 51118.0 -195.0 -319.0 8.7 27.0 4.708574
Neptune 49528.0 -200.0 -328.0 11.0 14.0 4.694851
Pluto 2370.0 -225.0 -373.0 0.7 5.0 3.374748
Moon 3475.0 -20.0 -4.0 1.6 0.0 3.540955

Series.map()

More complex computations can be accomplished using the map method of pandas Series. This method takes as its argument a function, applies this function to each value of the Series, and produces a new Series consisting of return values of the function:

[55]:
def get_size(d):
    if d > 10**5:
        return "big"
    elif d > 10**4:
        return "medium"
    else:
        return "small"

diameters.map(get_size)
[55]:
Mercury     small
Venus      medium
Earth      medium
Mars        small
Jupyter       big
Saturn        big
Uranus     medium
Neptune    medium
Pluto       small
dtype: object

Again, we can use this with any column or row of a DataFrame:

[56]:
# add new column with values of the size function applied
# to the diameter column
planets["planet_size"] = planets["diameter"].map(get_size)
planets
[56]:
diameter avg_temp avg_temp_F gravity moons log10_diameter planet_size
Mercury 4879.0 167.0 332.6 3.7 0.0 3.688331 small
Venus 12104.0 464.0 867.2 8.9 0.0 4.082929 medium
Earth 12756.0 15.0 59.0 9.8 1.0 4.105715 medium
Mars 6792.0 -65.0 -85.0 3.7 2.0 3.831998 small
Jupyter 142984.0 -110.0 -166.0 23.1 79.0 5.155287 big
Saturn 120536.0 -140.0 -220.0 9.0 82.0 5.081117 big
Uranus 51118.0 -195.0 -319.0 8.7 27.0 4.708574 medium
Neptune 49528.0 -200.0 -328.0 11.0 14.0 4.694851 medium
Pluto 2370.0 -225.0 -373.0 0.7 5.0 3.374748 small
Moon 3475.0 -20.0 -4.0 1.6 0.0 3.540955 small

Aggregations

Similarly as numpy arrays, pandas Series and DataFrames have several methods that let us summarize their data. We will use the following DataFrame with exam scores of some students to illustrate this:

[46]:
scores_arr = np.random.randint(0,100, (4,3))

# create a dataframe
scores = pd.DataFrame(scores_arr,                               # data
                      index=["John", "Alice", "Bob", "Emily"],  # row names
                      columns=["exam_1", "exam_2", "exam_3"])   # column names
scores
[46]:
exam_1 exam_2 exam_3
John 9 69 10
Alice 59 53 99
Bob 15 48 16
Emily 59 22 27
[47]:
# compute maximum of each column
scores.max()
[47]:
exam_1    59
exam_2    69
exam_3    99
dtype: int64
[48]:
# compute minimum of each column
scores.min()
[48]:
exam_1     9
exam_2    22
exam_3    10
dtype: int64
[49]:
# compute average value of each column
scores.mean()
[49]:
exam_1    35.5
exam_2    48.0
exam_3    38.0
dtype: float64

By default, aggregations are computed for each column, but we can change it using the axis argument.

[50]:
# compute sum of each row
scores.sum(axis=1)
[50]:
John      88
Alice    211
Bob       79
Emily    108
dtype: int64

DataFrame.agg()

To aggregate data in a custom way we can use the DataFrame agg method. This method takes as an argument a function, passes each column (or row) of the DataFrame to this function, and returns a Series (or DataFrame) consisting of return values.

For example, lets say that we want to assign grades to students based on their total exam scores: “Pass” for a score above 200, and “Fail” otherwise. We can do it as follows:

[51]:
def assign_grades(scores):
    total = scores.sum()
    if total > 200:
        return "Pass"
    else:
        return "Fail"

scores.agg(assign_grades, axis=1)
[51]:
John     Fail
Alice    Pass
Bob      Fail
Emily    Fail
dtype: object

The agg method has several additional options. For example, we can use it to specify a different aggregation function for each column.

[52]:
# compute minumum of exam_1, maximum of exam_2 and mean of exam_3
scores.agg({"exam_1": "min", "exam_2" : "max", "exam_3" : "mean"})
[52]:
exam_1     9.0
exam_2    69.0
exam_3    38.0
dtype: float64

Notice that the standard aggregation functions (mean, min, max, sum etc.) can be specified by giving their names as strings.

We can also use agg with a list of functions to compute several aggregations at once:

[53]:
# compute mean, sum, and the value of assign_grades for each row
scores.agg(["mean", "sum", assign_grades], axis=1)
[53]:
mean sum assign_grades
John 29.3333 88 Fail
Alice 70.3333 211 Pass
Bob 26.3333 79 Fail
Emily 36 108 Fail