Pandas basics¶

Pandas is a Python library for working with tabular data. The main objects it provides are Series and DataFrame.

[10]:

import pandas as pd

Pandas Series¶

Pandas Series is essentially a one-dimensional array, equipped with an index which labels its entries. We can create a Series object, for example, by converting a list:

[11]:

diameters = pd.Series([4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528])
diameters

[11]:

0      4879
1     12104
2     12756
3      6792
4    142984
5    120536
6     51118
7     49528
dtype: int64

By default entries of a Series are indexed by consecutive integers, but we can specify a more meaningful index. The numbers in the above Series give diameters (in kilometers) of planets of the Solar System, so it is sensible to use names of the planet as index values:

[12]:

diameters = pd.Series([4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528],
                      index=["Mercury", "Venus", "Earth", "Mars", "Jupyter", "Saturn",
                             "Uranus", "Neptune"])
diameters

[12]:

Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
dtype: int64

We can use index values to select Series entries:

[13]:

diameters["Earth"]

[13]:

Slicing using index works as well:

[14]:

diameters["Mercury":"Mars"]

[14]:

Mercury     4879
Venus      12104
Earth      12756
Mars        6792
dtype: int64

Note. In pandas slicing with index values works a bit differently that slicing in Python in general: the code diameters["Mercury":"Mars"] selects all entries from "Mercury" up to and including "Mars".

We can also select Series values by specifying a list of index labels:

[15]:

diameters[["Earth", "Jupyter", "Neptune"]]

[15]:

Earth       12756
Jupyter    142984
Neptune     49528
dtype: int64

Indexing can be used to modify Series values or to add new values:

[16]:

diameters["Pluto"] = 2370
diameters

[16]:

Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64

Pandas DataFrame¶

Pandas DataFrame is a two-dimensional array equipped with one index labeling its rows, and another labeling its columns. There are several ways of creating a DataFrame. One of them is to use a dictionary of lists. Each list gives values of a column of the DataFrame, and dictionary keys give column labels:

[17]:

d = {"diameter" : [4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528, 2370],
     "avg_temp" : [167, 464, 15, -65, -110, -140, -195, -200, -225],
     "gravity" : [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0, 0.7]}

planets = pd.DataFrame(d)
planets

[17]:

	diameter	avg_temp	gravity
0	4879	167	3.7
1	12104	464	8.9
2	12756	15	9.8
3	6792	-65	3.7
4	142984	-110	23.1
5	120536	-140	9.0
6	51118	-195	8.7
7	49528	-200	11.0
8	2370	-225	0.7

The head and tail methods lets us select the first snd last few rows of a DataFrame:

[18]:

# get the first 3 rows
planets.head(3)

[18]:

	diameter	avg_temp	gravity
0	4879	167	3.7
1	12104	464	8.9
2	12756	15	9.8

[19]:

# get the last 2 rows
planets.tail(2)

[19]:

	diameter	avg_temp	gravity
7	49528	-200	11.0
8	2370	-225	0.7

Index of DataFrame columns can be accessed using the columns property:

[20]:

planets.columns

[20]:

Index(['diameter', 'avg_temp', 'gravity'], dtype='object')

Index of rows can be obtained using the index property:

[21]:

planets.index

[21]:

RangeIndex(start=0, stop=9, step=1)

Since we have not specified an index for rows, by default it consists of consecutive integers. We can change it by modifying the index:

[22]:

planets.index = ["Mercury", "Venus", "Earth", "Mars", "Jupyter", "Saturn",
                 "Uranus", "Neptune", "Pluto"]
planets

[22]:

	diameter	avg_temp	gravity
Mercury	4879	167	3.7
Venus	12104	464	8.9
Earth	12756	15	9.8
Mars	6792	-65	3.7
Jupyter	142984	-110	23.1
Saturn	120536	-140	9.0
Uranus	51118	-195	8.7
Neptune	49528	-200	11.0
Pluto	2370	-225	0.7

[23]:

planets.index

[23]:

Index(['Mercury', 'Venus', 'Earth', 'Mars', 'Jupyter', 'Saturn', 'Uranus',
       'Neptune', 'Pluto'],
      dtype='object')

Selecting data in a DataFrame¶

Basic selections¶

Columns of a DataFrame can be selected by specifying a column name. This returns a Series with column data:

[24]:

planets["gravity"]

[24]:

Mercury     3.7
Venus       8.9
Earth       9.8
Mars        3.7
Jupyter    23.1
Saturn      9.0
Uranus      8.7
Neptune    11.0
Pluto       0.7
Name: gravity, dtype: float64

Specifying a list of column names gives a DataFrame consisting of the selected columns:

[25]:

planets[["gravity", "diameter"]]

[25]:

	gravity	diameter
Mercury	3.7	4879
Venus	8.9	12104
Earth	9.8	12756
Mars	3.7	6792
Jupyter	23.1	142984
Saturn	9.0	120536
Uranus	8.7	51118
Neptune	11.0	49528
Pluto	0.7	2370

Slicing selects rows of a DataFrame:

[26]:

planets["Earth":"Saturn"]

[26]:

	diameter	avg_temp	gravity
Earth	12756	15	9.8
Mars	6792	-65	3.7
Jupyter	142984	-110	23.1
Saturn	120536	-140	9.0

Selections with `loc` and `iloc`¶

To get more flexibility with DataFrame selections we can use the loc indexer. It lets us select rows and columns using row and column labels, lists of labels, and slices:

[27]:

# select an entry in a given row and column
planets.loc["Earth", "gravity"]

[27]:

9.8

[28]:

# select one row and a list of columns
planets.loc["Earth", ["diameter", "gravity"]]

[28]:

diameter    12756.0
gravity         9.8
Name: Earth, dtype: float64

[29]:

# select a slice of rows and a list of columns
planets.loc["Earth":"Saturn", ["gravity", "avg_temp"]]

[29]:

	gravity	avg_temp
Earth	9.8	15
Mars	3.7	-65
Jupyter	23.1	-110
Saturn	9.0	-140

The iloc indexer works similarly, but it uses row and column numbers instead of their labels:

[30]:

# select the first 3 rows
planets.iloc[:3]

[30]:

	diameter	avg_temp	gravity
Mercury	4879	167	3.7
Venus	12104	464	8.9
Earth	12756	15	9.8

[31]:

# select all rows and the first 2 columns
planets.iloc[: , :2]

[31]:

	diameter	avg_temp
Mercury	4879	167
Venus	12104	464
Earth	12756	15
Mars	6792	-65
Jupyter	142984	-110
Saturn	120536	-140
Uranus	51118	-195
Neptune	49528	-200
Pluto	2370	-225

[32]:

# select using a list of row numbers and a slice of column numbers
planets.iloc[[0, 4, 5], 1:2]

[32]:

	avg_temp
Mercury	167
Jupyter	-110
Saturn	-140

loc snd iloc work with pandas Series too:

[33]:

diameters

[33]:

Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64

[34]:

# loc with pandas Series
diameters.loc["Jupyter":"Pluto"]

[34]:

Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64

[35]:

# iloc with pandas Series
diameters.iloc[[0, 1, 5]]

[35]:

Mercury      4879
Venus       12104
Saturn     120536
dtype: int64

Boolean indexing¶

Similarly as with numpy arrays, we can use Boolean indexing to select parts of pandas Series and DataFrames.

[36]:

planets

[36]:

	diameter	avg_temp	gravity
Mercury	4879	167	3.7
Venus	12104	464	8.9
Earth	12756	15	9.8
Mars	6792	-65	3.7
Jupyter	142984	-110	23.1
Saturn	120536	-140	9.0
Uranus	51118	-195	8.7
Neptune	49528	-200	11.0
Pluto	2370	-225	0.7

[37]:

# check which planets have diameter greater than 100,000 km
planets["diameter"] > 100000

[37]:

Mercury    False
Venus      False
Earth      False
Mars       False
Jupyter     True
Saturn      True
Uranus     False
Neptune    False
Pluto      False
Name: diameter, dtype: bool

[38]:

# select all planets with diameter greater than 100,000 km
planets[planets["diameter"] > 100000]

[38]:

	diameter	avg_temp	gravity
Jupyter	142984	-110	23.1
Saturn	120536	-140	9.0

Notice that if the Boolean array used for selection is one-dimensional, then is selects rows of the DataFrame. Other options (using a two-dimensional Boolean array, using a one-dimensional array to select columns) are also possible.

Logical conditions can be combined the same way as for numpy arrays:

[39]:

# select planets based on their average temperatue and gravity
planets[(planets["avg_temp"] > 0) & (planets["gravity"] > 5) ]

[39]:

	diameter	avg_temp	gravity
Venus	12104	464	8.9
Earth	12756	15	9.8

Sorting¶

Values of pandas Series can be sorted using the sort_values method:

[40]:

diameters

[40]:

Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64

[41]:

# sort values in ascending order
diameters.sort_values()

[41]:

Pluto        2370
Mercury      4879
Mars         6792
Venus       12104
Earth       12756
Neptune     49528
Uranus      51118
Saturn     120536
Jupyter    142984
dtype: int64

By default values are sorted in the ascending order, but we can change it:

[42]:

# sort values in descending order
diameters.sort_values(ascending=False)

[42]:

Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Earth       12756
Venus       12104
Mars         6792
Mercury      4879
Pluto        2370
dtype: int64

sort_values works with DataFrames as well, but in this case we need to specify which column (or row) should be used for sorting:

[43]:

# sort using the "diameter" column
planets.sort_values(by="diameter")

[43]:

	diameter	avg_temp	gravity
Pluto	2370	-225	0.7
Mercury	4879	167	3.7
Mars	6792	-65	3.7
Venus	12104	464	8.9
Earth	12756	15	9.8
Neptune	49528	-200	11.0
Uranus	51118	-195	8.7
Saturn	120536	-140	9.0
Jupyter	142984	-110	23.1

[44]:

# sort using the "gravity" column in descending order
planets.sort_values(by="gravity", ascending=False)

[44]:

	diameter	avg_temp	gravity
Jupyter	142984	-110	23.1
Neptune	49528	-200	11.0
Earth	12756	15	9.8
Saturn	120536	-140	9.0
Venus	12104	464	8.9
Uranus	51118	-195	8.7
Mercury	4879	167	3.7
Mars	6792	-65	3.7
Pluto	2370	-225	0.7

The sort_values method accepts axis argument. The default value is axis=0 which sorts columns. By changing it to axis=1 we can sort rows:

[45]:

# sort values in the "Mercury" row
planets.sort_values(by="Mercury", axis=1)

[45]:

	gravity	avg_temp	diameter
Mercury	3.7	167	4879
Venus	8.9	464	12104
Earth	9.8	15	12756
Mars	3.7	-65	6792
Jupyter	23.1	-110	142984
Saturn	9.0	-140	120536
Uranus	8.7	-195	51118
Neptune	11.0	-200	49528
Pluto	0.7	-225	2370

The mathod sort_index allows us to sort a DataFrame according to the index of its rows or columns:

[46]:

# sort row index
planets.sort_index()

[46]:

	diameter	avg_temp	gravity
Earth	12756	15	9.8
Jupyter	142984	-110	23.1
Mars	6792	-65	3.7
Mercury	4879	167	3.7
Neptune	49528	-200	11.0
Pluto	2370	-225	0.7
Saturn	120536	-140	9.0
Uranus	51118	-195	8.7
Venus	12104	464	8.9

[47]:

# sort column index
planets.sort_index(axis=1)

[47]:

	avg_temp	diameter	gravity
Mercury	167	4879	3.7
Venus	464	12104	8.9
Earth	15	12756	9.8
Mars	-65	6792	3.7
Jupyter	-110	142984	23.1
Saturn	-140	120536	9.0
Uranus	-195	51118	8.7
Neptune	-200	49528	11.0
Pluto	-225	2370	0.7

Modifying DataFrames and Series¶

Adding rows and columns¶

We can add a column to a DataFrame by specifying its name and values:

[48]:

planets["moons"] = [0, 0, 1, 2, 79, 82, 27, 14, 5]
planets

[48]:

	diameter	avg_temp	gravity	moons
Mercury	4879	167	3.7	0
Venus	12104	464	8.9	0
Earth	12756	15	9.8	1
Mars	6792	-65	3.7	2
Jupyter	142984	-110	23.1	79
Saturn	120536	-140	9.0	82
Uranus	51118	-195	8.7	27
Neptune	49528	-200	11.0	14
Pluto	2370	-225	0.7	5

Rows can be added in a similar way:

[49]:

planets.loc["Moon"] = [3475, -20, 1.6, 0]
planets

[49]:

	diameter	avg_temp	gravity	moons
Mercury	4879.0	167.0	3.7	0.0
Venus	12104.0	464.0	8.9	0.0
Earth	12756.0	15.0	9.8	1.0
Mars	6792.0	-65.0	3.7	2.0
Jupyter	142984.0	-110.0	23.1	79.0
Saturn	120536.0	-140.0	9.0	82.0
Uranus	51118.0	-195.0	8.7	27.0
Neptune	49528.0	-200.0	11.0	14.0
Pluto	2370.0	-225.0	0.7	5.0
Moon	3475.0	-20.0	1.6	0.0

New columns can be created based on the values of existing columns. For example, in the planets DataFrame the column avg_temp gives temperatures in degrees Celsius. We can add a new column with temperatures in degrees Fahrenheit:

[50]:

planets["avg_temp_F"] = planets["avg_temp"]*1.8 + 32
planets

[50]:

	diameter	avg_temp	gravity	moons	avg_temp_F
Mercury	4879.0	167.0	3.7	0.0	332.6
Venus	12104.0	464.0	8.9	0.0	867.2
Earth	12756.0	15.0	9.8	1.0	59.0
Mars	6792.0	-65.0	3.7	2.0	-85.0
Jupyter	142984.0	-110.0	23.1	79.0	-166.0
Saturn	120536.0	-140.0	9.0	82.0	-220.0
Uranus	51118.0	-195.0	8.7	27.0	-319.0
Neptune	49528.0	-200.0	11.0	14.0	-328.0
Pluto	2370.0	-225.0	0.7	5.0	-373.0
Moon	3475.0	-20.0	1.6	0.0	-4.0

By default a new column is added as the last column of a DataFrame. However, we can rearrange columns, by selecting them in any given order:

[51]:

# make a copy of the DataFrame with rearranged columns
planets = planets[['diameter', 'avg_temp', 'avg_temp_F', 'gravity', 'moons', ]].copy()
planets

[51]:

	diameter	avg_temp	avg_temp_F	gravity	moons
Mercury	4879.0	167.0	332.6	3.7	0.0
Venus	12104.0	464.0	867.2	8.9	0.0
Earth	12756.0	15.0	59.0	9.8	1.0
Mars	6792.0	-65.0	-85.0	3.7	2.0
Jupyter	142984.0	-110.0	-166.0	23.1	79.0
Saturn	120536.0	-140.0	-220.0	9.0	82.0
Uranus	51118.0	-195.0	-319.0	8.7	27.0
Neptune	49528.0	-200.0	-328.0	11.0	14.0
Pluto	2370.0	-225.0	-373.0	0.7	5.0
Moon	3475.0	-20.0	-4.0	1.6	0.0

Using numpy functions¶

Numpy functions can be be applied directly to pandas Series:

[52]:

diameters

[52]:

Mercury      4879
Venus       12104
Earth       12756
Mars         6792
Jupyter    142984
Saturn     120536
Uranus      51118
Neptune     49528
Pluto        2370
dtype: int64

[53]:

import numpy as np

# take base 10 logarithm of each element of the Series
np.log10(diameters)

[53]:

Mercury    3.688331
Venus      4.082929
Earth      4.105715
Mars       3.831998
Jupyter    5.155287
Saturn     5.081117
Uranus     4.708574
Neptune    4.694851
Pluto      3.374748
dtype: float64

Since each row and each column of a DataFrame is a Series, we can use numpy functions to modify DataFrame rows and columns, or to create new rows and columns:

[54]:

# create a new column with logarithms of diameters of planets
planets["log10_diameter"] = np.log10(planets["diameter"])
planets

[54]:

	diameter	avg_temp	avg_temp_F	gravity	moons	log10_diameter
Mercury	4879.0	167.0	332.6	3.7	0.0	3.688331
Venus	12104.0	464.0	867.2	8.9	0.0	4.082929
Earth	12756.0	15.0	59.0	9.8	1.0	4.105715
Mars	6792.0	-65.0	-85.0	3.7	2.0	3.831998
Jupyter	142984.0	-110.0	-166.0	23.1	79.0	5.155287
Saturn	120536.0	-140.0	-220.0	9.0	82.0	5.081117
Uranus	51118.0	-195.0	-319.0	8.7	27.0	4.708574
Neptune	49528.0	-200.0	-328.0	11.0	14.0	4.694851
Pluto	2370.0	-225.0	-373.0	0.7	5.0	3.374748
Moon	3475.0	-20.0	-4.0	1.6	0.0	3.540955

Series.map()¶

More complex computations can be accomplished using the map method of pandas Series. This method takes as its argument a function, applies this function to each value of the Series, and produces a new Series consisting of return values of the function:

[55]:

def get_size(d):
    if d > 10**5:
        return "big"
    elif d > 10**4:
        return "medium"
    else:
        return "small"

diameters.map(get_size)

[55]:

Mercury     small
Venus      medium
Earth      medium
Mars        small
Jupyter       big
Saturn        big
Uranus     medium
Neptune    medium
Pluto       small
dtype: object

Again, we can use this with any column or row of a DataFrame:

[56]:

# add new column with values of the size function applied
# to the diameter column
planets["planet_size"] = planets["diameter"].map(get_size)
planets

[56]:

	diameter	avg_temp	avg_temp_F	gravity	moons	log10_diameter	planet_size
Mercury	4879.0	167.0	332.6	3.7	0.0	3.688331	small
Venus	12104.0	464.0	867.2	8.9	0.0	4.082929	medium
Earth	12756.0	15.0	59.0	9.8	1.0	4.105715	medium
Mars	6792.0	-65.0	-85.0	3.7	2.0	3.831998	small
Jupyter	142984.0	-110.0	-166.0	23.1	79.0	5.155287	big
Saturn	120536.0	-140.0	-220.0	9.0	82.0	5.081117	big
Uranus	51118.0	-195.0	-319.0	8.7	27.0	4.708574	medium
Neptune	49528.0	-200.0	-328.0	11.0	14.0	4.694851	medium
Pluto	2370.0	-225.0	-373.0	0.7	5.0	3.374748	small
Moon	3475.0	-20.0	-4.0	1.6	0.0	3.540955	small

Aggregations¶

Similarly as numpy arrays, pandas Series and DataFrames have several methods that let us summarize their data. We will use the following DataFrame with exam scores of some students to illustrate this:

[46]:

scores_arr = np.random.randint(0,100, (4,3))

# create a dataframe
scores = pd.DataFrame(scores_arr,                               # data
                      index=["John", "Alice", "Bob", "Emily"],  # row names
                      columns=["exam_1", "exam_2", "exam_3"])   # column names
scores

[46]:

	exam_1	exam_2	exam_3
John	9	69	10
Alice	59	53	99
Bob	15	48	16
Emily	59	22	27

[47]:

# compute maximum of each column
scores.max()

[47]:

exam_1    59
exam_2    69
exam_3    99
dtype: int64

[48]:

# compute minimum of each column
scores.min()

[48]:

exam_1     9
exam_2    22
exam_3    10
dtype: int64

[49]:

# compute average value of each column
scores.mean()

[49]:

exam_1    35.5
exam_2    48.0
exam_3    38.0
dtype: float64

By default, aggregations are computed for each column, but we can change it using the axis argument.

[50]:

# compute sum of each row
scores.sum(axis=1)

[50]:

John      88
Alice    211
Bob       79
Emily    108
dtype: int64

DataFrame.agg()¶

To aggregate data in a custom way we can use the DataFrame agg method. This method takes as an argument a function, passes each column (or row) of the DataFrame to this function, and returns a Series (or DataFrame) consisting of return values.

For example, lets say that we want to assign grades to students based on their total exam scores: “Pass” for a score above 200, and “Fail” otherwise. We can do it as follows:

[51]:

def assign_grades(scores):
    total = scores.sum()
    if total > 200:
        return "Pass"
    else:
        return "Fail"

scores.agg(assign_grades, axis=1)

[51]:

John     Fail
Alice    Pass
Bob      Fail
Emily    Fail
dtype: object

The agg method has several additional options. For example, we can use it to specify a different aggregation function for each column.

[52]:

# compute minumum of exam_1, maximum of exam_2 and mean of exam_3
scores.agg({"exam_1": "min", "exam_2" : "max", "exam_3" : "mean"})

[52]:

exam_1     9.0
exam_2    69.0
exam_3    38.0
dtype: float64

Notice that the standard aggregation functions (mean, min, max, sum etc.) can be specified by giving their names as strings.

We can also use agg with a list of functions to compute several aggregations at once:

[53]:

# compute mean, sum, and the value of assign_grades for each row
scores.agg(["mean", "sum", assign_grades], axis=1)

[53]:

	mean	sum	assign_grades
John	29.3333	88	Fail
Alice	70.3333	211	Pass
Bob	26.3333	79	Fail
Emily	36	108	Fail