Pandas basics¶
Pandas is a Python library for working with tabular data. The main objects it provides are Series and DataFrame.
[10]:
import pandas as pd
Pandas Series¶
Pandas Series is essentially a one-dimensional array, equipped with an index which labels its entries. We can create a Series object, for example, by converting a list:
[11]:
diameters = pd.Series([4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528])
diameters
[11]:
0 4879
1 12104
2 12756
3 6792
4 142984
5 120536
6 51118
7 49528
dtype: int64
By default entries of a Series are indexed by consecutive integers, but we can specify a more meaningful index. The numbers in the above Series give diameters (in kilometers) of planets of the Solar System, so it is sensible to use names of the planet as index values:
[12]:
diameters = pd.Series([4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528],
index=["Mercury", "Venus", "Earth", "Mars", "Jupyter", "Saturn",
"Uranus", "Neptune"])
diameters
[12]:
Mercury 4879
Venus 12104
Earth 12756
Mars 6792
Jupyter 142984
Saturn 120536
Uranus 51118
Neptune 49528
dtype: int64
We can use index values to select Series entries:
[13]:
diameters["Earth"]
[13]:
12756
Slicing using index works as well:
[14]:
diameters["Mercury":"Mars"]
[14]:
Mercury 4879
Venus 12104
Earth 12756
Mars 6792
dtype: int64
Note. In pandas slicing with index values works a bit differently that slicing in Python in general: the code diameters["Mercury":"Mars"]
selects all entries from "Mercury"
up to and including "Mars"
.
We can also select Series values by specifying a list of index labels:
[15]:
diameters[["Earth", "Jupyter", "Neptune"]]
[15]:
Earth 12756
Jupyter 142984
Neptune 49528
dtype: int64
Indexing can be used to modify Series values or to add new values:
[16]:
diameters["Pluto"] = 2370
diameters
[16]:
Mercury 4879
Venus 12104
Earth 12756
Mars 6792
Jupyter 142984
Saturn 120536
Uranus 51118
Neptune 49528
Pluto 2370
dtype: int64
Pandas DataFrame¶
Pandas DataFrame is a two-dimensional array equipped with one index labeling its rows, and another labeling its columns. There are several ways of creating a DataFrame. One of them is to use a dictionary of lists. Each list gives values of a column of the DataFrame, and dictionary keys give column labels:
[17]:
d = {"diameter" : [4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528, 2370],
"avg_temp" : [167, 464, 15, -65, -110, -140, -195, -200, -225],
"gravity" : [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0, 0.7]}
planets = pd.DataFrame(d)
planets
[17]:
diameter | avg_temp | gravity | |
---|---|---|---|
0 | 4879 | 167 | 3.7 |
1 | 12104 | 464 | 8.9 |
2 | 12756 | 15 | 9.8 |
3 | 6792 | -65 | 3.7 |
4 | 142984 | -110 | 23.1 |
5 | 120536 | -140 | 9.0 |
6 | 51118 | -195 | 8.7 |
7 | 49528 | -200 | 11.0 |
8 | 2370 | -225 | 0.7 |
The head
and tail
methods lets us select the first snd last few rows of a DataFrame:
[18]:
# get the first 3 rows
planets.head(3)
[18]:
diameter | avg_temp | gravity | |
---|---|---|---|
0 | 4879 | 167 | 3.7 |
1 | 12104 | 464 | 8.9 |
2 | 12756 | 15 | 9.8 |
[19]:
# get the last 2 rows
planets.tail(2)
[19]:
diameter | avg_temp | gravity | |
---|---|---|---|
7 | 49528 | -200 | 11.0 |
8 | 2370 | -225 | 0.7 |
Index of DataFrame columns can be accessed using the columns
property:
[20]:
planets.columns
[20]:
Index(['diameter', 'avg_temp', 'gravity'], dtype='object')
Index of rows can be obtained using the index
property:
[21]:
planets.index
[21]:
RangeIndex(start=0, stop=9, step=1)
Since we have not specified an index for rows, by default it consists of consecutive integers. We can change it by modifying the index:
[22]:
planets.index = ["Mercury", "Venus", "Earth", "Mars", "Jupyter", "Saturn",
"Uranus", "Neptune", "Pluto"]
planets
[22]:
diameter | avg_temp | gravity | |
---|---|---|---|
Mercury | 4879 | 167 | 3.7 |
Venus | 12104 | 464 | 8.9 |
Earth | 12756 | 15 | 9.8 |
Mars | 6792 | -65 | 3.7 |
Jupyter | 142984 | -110 | 23.1 |
Saturn | 120536 | -140 | 9.0 |
Uranus | 51118 | -195 | 8.7 |
Neptune | 49528 | -200 | 11.0 |
Pluto | 2370 | -225 | 0.7 |
[23]:
planets.index
[23]:
Index(['Mercury', 'Venus', 'Earth', 'Mars', 'Jupyter', 'Saturn', 'Uranus',
'Neptune', 'Pluto'],
dtype='object')
Selecting data in a DataFrame¶
Basic selections¶
Columns of a DataFrame can be selected by specifying a column name. This returns a Series with column data:
[24]:
planets["gravity"]
[24]:
Mercury 3.7
Venus 8.9
Earth 9.8
Mars 3.7
Jupyter 23.1
Saturn 9.0
Uranus 8.7
Neptune 11.0
Pluto 0.7
Name: gravity, dtype: float64
Specifying a list of column names gives a DataFrame consisting of the selected columns:
[25]:
planets[["gravity", "diameter"]]
[25]:
gravity | diameter | |
---|---|---|
Mercury | 3.7 | 4879 |
Venus | 8.9 | 12104 |
Earth | 9.8 | 12756 |
Mars | 3.7 | 6792 |
Jupyter | 23.1 | 142984 |
Saturn | 9.0 | 120536 |
Uranus | 8.7 | 51118 |
Neptune | 11.0 | 49528 |
Pluto | 0.7 | 2370 |
Slicing selects rows of a DataFrame:
[26]:
planets["Earth":"Saturn"]
[26]:
diameter | avg_temp | gravity | |
---|---|---|---|
Earth | 12756 | 15 | 9.8 |
Mars | 6792 | -65 | 3.7 |
Jupyter | 142984 | -110 | 23.1 |
Saturn | 120536 | -140 | 9.0 |
Selections with loc
and iloc
¶
To get more flexibility with DataFrame selections we can use the loc
indexer. It lets us select rows and columns using row and column labels, lists of labels, and slices:
[27]:
# select an entry in a given row and column
planets.loc["Earth", "gravity"]
[27]:
9.8
[28]:
# select one row and a list of columns
planets.loc["Earth", ["diameter", "gravity"]]
[28]:
diameter 12756.0
gravity 9.8
Name: Earth, dtype: float64
[29]:
# select a slice of rows and a list of columns
planets.loc["Earth":"Saturn", ["gravity", "avg_temp"]]
[29]:
gravity | avg_temp | |
---|---|---|
Earth | 9.8 | 15 |
Mars | 3.7 | -65 |
Jupyter | 23.1 | -110 |
Saturn | 9.0 | -140 |
The iloc
indexer works similarly, but it uses row and column numbers instead of their labels:
[30]:
# select the first 3 rows
planets.iloc[:3]
[30]:
diameter | avg_temp | gravity | |
---|---|---|---|
Mercury | 4879 | 167 | 3.7 |
Venus | 12104 | 464 | 8.9 |
Earth | 12756 | 15 | 9.8 |
[31]:
# select all rows and the first 2 columns
planets.iloc[: , :2]
[31]:
diameter | avg_temp | |
---|---|---|
Mercury | 4879 | 167 |
Venus | 12104 | 464 |
Earth | 12756 | 15 |
Mars | 6792 | -65 |
Jupyter | 142984 | -110 |
Saturn | 120536 | -140 |
Uranus | 51118 | -195 |
Neptune | 49528 | -200 |
Pluto | 2370 | -225 |
[32]:
# select using a list of row numbers and a slice of column numbers
planets.iloc[[0, 4, 5], 1:2]
[32]:
avg_temp | |
---|---|
Mercury | 167 |
Jupyter | -110 |
Saturn | -140 |
loc
snd iloc
work with pandas Series too:
[33]:
diameters
[33]:
Mercury 4879
Venus 12104
Earth 12756
Mars 6792
Jupyter 142984
Saturn 120536
Uranus 51118
Neptune 49528
Pluto 2370
dtype: int64
[34]:
# loc with pandas Series
diameters.loc["Jupyter":"Pluto"]
[34]:
Jupyter 142984
Saturn 120536
Uranus 51118
Neptune 49528
Pluto 2370
dtype: int64
[35]:
# iloc with pandas Series
diameters.iloc[[0, 1, 5]]
[35]:
Mercury 4879
Venus 12104
Saturn 120536
dtype: int64
Boolean indexing¶
Similarly as with numpy arrays, we can use Boolean indexing to select parts of pandas Series and DataFrames.
[36]:
planets
[36]:
diameter | avg_temp | gravity | |
---|---|---|---|
Mercury | 4879 | 167 | 3.7 |
Venus | 12104 | 464 | 8.9 |
Earth | 12756 | 15 | 9.8 |
Mars | 6792 | -65 | 3.7 |
Jupyter | 142984 | -110 | 23.1 |
Saturn | 120536 | -140 | 9.0 |
Uranus | 51118 | -195 | 8.7 |
Neptune | 49528 | -200 | 11.0 |
Pluto | 2370 | -225 | 0.7 |
[37]:
# check which planets have diameter greater than 100,000 km
planets["diameter"] > 100000
[37]:
Mercury False
Venus False
Earth False
Mars False
Jupyter True
Saturn True
Uranus False
Neptune False
Pluto False
Name: diameter, dtype: bool
[38]:
# select all planets with diameter greater than 100,000 km
planets[planets["diameter"] > 100000]
[38]:
diameter | avg_temp | gravity | |
---|---|---|---|
Jupyter | 142984 | -110 | 23.1 |
Saturn | 120536 | -140 | 9.0 |
Notice that if the Boolean array used for selection is one-dimensional, then is selects rows of the DataFrame. Other options (using a two-dimensional Boolean array, using a one-dimensional array to select columns) are also possible.
Logical conditions can be combined the same way as for numpy arrays:
[39]:
# select planets based on their average temperatue and gravity
planets[(planets["avg_temp"] > 0) & (planets["gravity"] > 5) ]
[39]:
diameter | avg_temp | gravity | |
---|---|---|---|
Venus | 12104 | 464 | 8.9 |
Earth | 12756 | 15 | 9.8 |
Sorting¶
Values of pandas Series can be sorted using the sort_values
method:
[40]:
diameters
[40]:
Mercury 4879
Venus 12104
Earth 12756
Mars 6792
Jupyter 142984
Saturn 120536
Uranus 51118
Neptune 49528
Pluto 2370
dtype: int64
[41]:
# sort values in ascending order
diameters.sort_values()
[41]:
Pluto 2370
Mercury 4879
Mars 6792
Venus 12104
Earth 12756
Neptune 49528
Uranus 51118
Saturn 120536
Jupyter 142984
dtype: int64
By default values are sorted in the ascending order, but we can change it:
[42]:
# sort values in descending order
diameters.sort_values(ascending=False)
[42]:
Jupyter 142984
Saturn 120536
Uranus 51118
Neptune 49528
Earth 12756
Venus 12104
Mars 6792
Mercury 4879
Pluto 2370
dtype: int64
sort_values
works with DataFrames as well, but in this case we need to specify which column (or row) should be used for sorting:
[43]:
# sort using the "diameter" column
planets.sort_values(by="diameter")
[43]:
diameter | avg_temp | gravity | |
---|---|---|---|
Pluto | 2370 | -225 | 0.7 |
Mercury | 4879 | 167 | 3.7 |
Mars | 6792 | -65 | 3.7 |
Venus | 12104 | 464 | 8.9 |
Earth | 12756 | 15 | 9.8 |
Neptune | 49528 | -200 | 11.0 |
Uranus | 51118 | -195 | 8.7 |
Saturn | 120536 | -140 | 9.0 |
Jupyter | 142984 | -110 | 23.1 |
[44]:
# sort using the "gravity" column in descending order
planets.sort_values(by="gravity", ascending=False)
[44]:
diameter | avg_temp | gravity | |
---|---|---|---|
Jupyter | 142984 | -110 | 23.1 |
Neptune | 49528 | -200 | 11.0 |
Earth | 12756 | 15 | 9.8 |
Saturn | 120536 | -140 | 9.0 |
Venus | 12104 | 464 | 8.9 |
Uranus | 51118 | -195 | 8.7 |
Mercury | 4879 | 167 | 3.7 |
Mars | 6792 | -65 | 3.7 |
Pluto | 2370 | -225 | 0.7 |
The sort_values
method accepts axis
argument. The default value is axis=0
which sorts columns. By changing it to axis=1
we can sort rows:
[45]:
# sort values in the "Mercury" row
planets.sort_values(by="Mercury", axis=1)
[45]:
gravity | avg_temp | diameter | |
---|---|---|---|
Mercury | 3.7 | 167 | 4879 |
Venus | 8.9 | 464 | 12104 |
Earth | 9.8 | 15 | 12756 |
Mars | 3.7 | -65 | 6792 |
Jupyter | 23.1 | -110 | 142984 |
Saturn | 9.0 | -140 | 120536 |
Uranus | 8.7 | -195 | 51118 |
Neptune | 11.0 | -200 | 49528 |
Pluto | 0.7 | -225 | 2370 |
The mathod sort_index
allows us to sort a DataFrame according to the index of its rows or columns:
[46]:
# sort row index
planets.sort_index()
[46]:
diameter | avg_temp | gravity | |
---|---|---|---|
Earth | 12756 | 15 | 9.8 |
Jupyter | 142984 | -110 | 23.1 |
Mars | 6792 | -65 | 3.7 |
Mercury | 4879 | 167 | 3.7 |
Neptune | 49528 | -200 | 11.0 |
Pluto | 2370 | -225 | 0.7 |
Saturn | 120536 | -140 | 9.0 |
Uranus | 51118 | -195 | 8.7 |
Venus | 12104 | 464 | 8.9 |
[47]:
# sort column index
planets.sort_index(axis=1)
[47]:
avg_temp | diameter | gravity | |
---|---|---|---|
Mercury | 167 | 4879 | 3.7 |
Venus | 464 | 12104 | 8.9 |
Earth | 15 | 12756 | 9.8 |
Mars | -65 | 6792 | 3.7 |
Jupyter | -110 | 142984 | 23.1 |
Saturn | -140 | 120536 | 9.0 |
Uranus | -195 | 51118 | 8.7 |
Neptune | -200 | 49528 | 11.0 |
Pluto | -225 | 2370 | 0.7 |
Modifying DataFrames and Series¶
Adding rows and columns¶
We can add a column to a DataFrame by specifying its name and values:
[48]:
planets["moons"] = [0, 0, 1, 2, 79, 82, 27, 14, 5]
planets
[48]:
diameter | avg_temp | gravity | moons | |
---|---|---|---|---|
Mercury | 4879 | 167 | 3.7 | 0 |
Venus | 12104 | 464 | 8.9 | 0 |
Earth | 12756 | 15 | 9.8 | 1 |
Mars | 6792 | -65 | 3.7 | 2 |
Jupyter | 142984 | -110 | 23.1 | 79 |
Saturn | 120536 | -140 | 9.0 | 82 |
Uranus | 51118 | -195 | 8.7 | 27 |
Neptune | 49528 | -200 | 11.0 | 14 |
Pluto | 2370 | -225 | 0.7 | 5 |
Rows can be added in a similar way:
[49]:
planets.loc["Moon"] = [3475, -20, 1.6, 0]
planets
[49]:
diameter | avg_temp | gravity | moons | |
---|---|---|---|---|
Mercury | 4879.0 | 167.0 | 3.7 | 0.0 |
Venus | 12104.0 | 464.0 | 8.9 | 0.0 |
Earth | 12756.0 | 15.0 | 9.8 | 1.0 |
Mars | 6792.0 | -65.0 | 3.7 | 2.0 |
Jupyter | 142984.0 | -110.0 | 23.1 | 79.0 |
Saturn | 120536.0 | -140.0 | 9.0 | 82.0 |
Uranus | 51118.0 | -195.0 | 8.7 | 27.0 |
Neptune | 49528.0 | -200.0 | 11.0 | 14.0 |
Pluto | 2370.0 | -225.0 | 0.7 | 5.0 |
Moon | 3475.0 | -20.0 | 1.6 | 0.0 |
New columns can be created based on the values of existing columns. For example, in the planets
DataFrame the column avg_temp
gives temperatures in degrees Celsius. We can add a new column with temperatures in degrees Fahrenheit:
[50]:
planets["avg_temp_F"] = planets["avg_temp"]*1.8 + 32
planets
[50]:
diameter | avg_temp | gravity | moons | avg_temp_F | |
---|---|---|---|---|---|
Mercury | 4879.0 | 167.0 | 3.7 | 0.0 | 332.6 |
Venus | 12104.0 | 464.0 | 8.9 | 0.0 | 867.2 |
Earth | 12756.0 | 15.0 | 9.8 | 1.0 | 59.0 |
Mars | 6792.0 | -65.0 | 3.7 | 2.0 | -85.0 |
Jupyter | 142984.0 | -110.0 | 23.1 | 79.0 | -166.0 |
Saturn | 120536.0 | -140.0 | 9.0 | 82.0 | -220.0 |
Uranus | 51118.0 | -195.0 | 8.7 | 27.0 | -319.0 |
Neptune | 49528.0 | -200.0 | 11.0 | 14.0 | -328.0 |
Pluto | 2370.0 | -225.0 | 0.7 | 5.0 | -373.0 |
Moon | 3475.0 | -20.0 | 1.6 | 0.0 | -4.0 |
By default a new column is added as the last column of a DataFrame. However, we can rearrange columns, by selecting them in any given order:
[51]:
# make a copy of the DataFrame with rearranged columns
planets = planets[['diameter', 'avg_temp', 'avg_temp_F', 'gravity', 'moons', ]].copy()
planets
[51]:
diameter | avg_temp | avg_temp_F | gravity | moons | |
---|---|---|---|---|---|
Mercury | 4879.0 | 167.0 | 332.6 | 3.7 | 0.0 |
Venus | 12104.0 | 464.0 | 867.2 | 8.9 | 0.0 |
Earth | 12756.0 | 15.0 | 59.0 | 9.8 | 1.0 |
Mars | 6792.0 | -65.0 | -85.0 | 3.7 | 2.0 |
Jupyter | 142984.0 | -110.0 | -166.0 | 23.1 | 79.0 |
Saturn | 120536.0 | -140.0 | -220.0 | 9.0 | 82.0 |
Uranus | 51118.0 | -195.0 | -319.0 | 8.7 | 27.0 |
Neptune | 49528.0 | -200.0 | -328.0 | 11.0 | 14.0 |
Pluto | 2370.0 | -225.0 | -373.0 | 0.7 | 5.0 |
Moon | 3475.0 | -20.0 | -4.0 | 1.6 | 0.0 |
Using numpy functions¶
Numpy functions can be be applied directly to pandas Series:
[52]:
diameters
[52]:
Mercury 4879
Venus 12104
Earth 12756
Mars 6792
Jupyter 142984
Saturn 120536
Uranus 51118
Neptune 49528
Pluto 2370
dtype: int64
[53]:
import numpy as np
# take base 10 logarithm of each element of the Series
np.log10(diameters)
[53]:
Mercury 3.688331
Venus 4.082929
Earth 4.105715
Mars 3.831998
Jupyter 5.155287
Saturn 5.081117
Uranus 4.708574
Neptune 4.694851
Pluto 3.374748
dtype: float64
Since each row and each column of a DataFrame is a Series, we can use numpy functions to modify DataFrame rows and columns, or to create new rows and columns:
[54]:
# create a new column with logarithms of diameters of planets
planets["log10_diameter"] = np.log10(planets["diameter"])
planets
[54]:
diameter | avg_temp | avg_temp_F | gravity | moons | log10_diameter | |
---|---|---|---|---|---|---|
Mercury | 4879.0 | 167.0 | 332.6 | 3.7 | 0.0 | 3.688331 |
Venus | 12104.0 | 464.0 | 867.2 | 8.9 | 0.0 | 4.082929 |
Earth | 12756.0 | 15.0 | 59.0 | 9.8 | 1.0 | 4.105715 |
Mars | 6792.0 | -65.0 | -85.0 | 3.7 | 2.0 | 3.831998 |
Jupyter | 142984.0 | -110.0 | -166.0 | 23.1 | 79.0 | 5.155287 |
Saturn | 120536.0 | -140.0 | -220.0 | 9.0 | 82.0 | 5.081117 |
Uranus | 51118.0 | -195.0 | -319.0 | 8.7 | 27.0 | 4.708574 |
Neptune | 49528.0 | -200.0 | -328.0 | 11.0 | 14.0 | 4.694851 |
Pluto | 2370.0 | -225.0 | -373.0 | 0.7 | 5.0 | 3.374748 |
Moon | 3475.0 | -20.0 | -4.0 | 1.6 | 0.0 | 3.540955 |
Series.map()¶
More complex computations can be accomplished using the map
method of pandas Series. This method takes as its argument a function, applies this function to each value of the Series, and produces a new Series consisting of return values of the function:
[55]:
def get_size(d):
if d > 10**5:
return "big"
elif d > 10**4:
return "medium"
else:
return "small"
diameters.map(get_size)
[55]:
Mercury small
Venus medium
Earth medium
Mars small
Jupyter big
Saturn big
Uranus medium
Neptune medium
Pluto small
dtype: object
Again, we can use this with any column or row of a DataFrame:
[56]:
# add new column with values of the size function applied
# to the diameter column
planets["planet_size"] = planets["diameter"].map(get_size)
planets
[56]:
diameter | avg_temp | avg_temp_F | gravity | moons | log10_diameter | planet_size | |
---|---|---|---|---|---|---|---|
Mercury | 4879.0 | 167.0 | 332.6 | 3.7 | 0.0 | 3.688331 | small |
Venus | 12104.0 | 464.0 | 867.2 | 8.9 | 0.0 | 4.082929 | medium |
Earth | 12756.0 | 15.0 | 59.0 | 9.8 | 1.0 | 4.105715 | medium |
Mars | 6792.0 | -65.0 | -85.0 | 3.7 | 2.0 | 3.831998 | small |
Jupyter | 142984.0 | -110.0 | -166.0 | 23.1 | 79.0 | 5.155287 | big |
Saturn | 120536.0 | -140.0 | -220.0 | 9.0 | 82.0 | 5.081117 | big |
Uranus | 51118.0 | -195.0 | -319.0 | 8.7 | 27.0 | 4.708574 | medium |
Neptune | 49528.0 | -200.0 | -328.0 | 11.0 | 14.0 | 4.694851 | medium |
Pluto | 2370.0 | -225.0 | -373.0 | 0.7 | 5.0 | 3.374748 | small |
Moon | 3475.0 | -20.0 | -4.0 | 1.6 | 0.0 | 3.540955 | small |
Aggregations¶
Similarly as numpy arrays, pandas Series and DataFrames have several methods that let us summarize their data. We will use the following DataFrame with exam scores of some students to illustrate this:
[46]:
scores_arr = np.random.randint(0,100, (4,3))
# create a dataframe
scores = pd.DataFrame(scores_arr, # data
index=["John", "Alice", "Bob", "Emily"], # row names
columns=["exam_1", "exam_2", "exam_3"]) # column names
scores
[46]:
exam_1 | exam_2 | exam_3 | |
---|---|---|---|
John | 9 | 69 | 10 |
Alice | 59 | 53 | 99 |
Bob | 15 | 48 | 16 |
Emily | 59 | 22 | 27 |
[47]:
# compute maximum of each column
scores.max()
[47]:
exam_1 59
exam_2 69
exam_3 99
dtype: int64
[48]:
# compute minimum of each column
scores.min()
[48]:
exam_1 9
exam_2 22
exam_3 10
dtype: int64
[49]:
# compute average value of each column
scores.mean()
[49]:
exam_1 35.5
exam_2 48.0
exam_3 38.0
dtype: float64
By default, aggregations are computed for each column, but we can change it using the axis
argument.
[50]:
# compute sum of each row
scores.sum(axis=1)
[50]:
John 88
Alice 211
Bob 79
Emily 108
dtype: int64
DataFrame.agg()¶
To aggregate data in a custom way we can use the DataFrame agg
method. This method takes as an argument a function, passes each column (or row) of the DataFrame to this function, and returns a Series (or DataFrame) consisting of return values.
For example, lets say that we want to assign grades to students based on their total exam scores: “Pass” for a score above 200, and “Fail” otherwise. We can do it as follows:
[51]:
def assign_grades(scores):
total = scores.sum()
if total > 200:
return "Pass"
else:
return "Fail"
scores.agg(assign_grades, axis=1)
[51]:
John Fail
Alice Pass
Bob Fail
Emily Fail
dtype: object
The agg
method has several additional options. For example, we can use it to specify a different aggregation function for each column.
[52]:
# compute minumum of exam_1, maximum of exam_2 and mean of exam_3
scores.agg({"exam_1": "min", "exam_2" : "max", "exam_3" : "mean"})
[52]:
exam_1 9.0
exam_2 69.0
exam_3 38.0
dtype: float64
Notice that the standard aggregation functions (mean
, min
, max
, sum
etc.) can be specified by giving their names as strings.
We can also use agg
with a list of functions to compute several aggregations at once:
[53]:
# compute mean, sum, and the value of assign_grades for each row
scores.agg(["mean", "sum", assign_grades], axis=1)
[53]:
mean | sum | assign_grades | |
---|---|---|---|
John | 29.3333 | 88 | Fail |
Alice | 70.3333 | 211 | Pass |
Bob | 26.3333 | 79 | Fail |
Emily | 36 | 108 | Fail |