GroupBy¶
A common pattern in data analysis is to split data into groups based on some criteria (geographical area, age, type of business, etc.), make some computations for each group, and then assemble the results in order to compare them. We will show how to perform such tasks with pandas using the DataFrame groupby()
method.
As an example we will use “tips” data set provided by the seaborn module. It contains data on restaurant visits: the total bill amount, tip amount, sex of the person paying the bill, whether the visiting group included smokers, day and time of the visit, and the size of the group:
[1]:
import pandas as pd
import seaborn as sns
tips = sns.load_dataset("tips")
tips.head()
[1]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
The GroupBy object¶
The unique()
method of pandas Series returns values of a Series, without repetitions. Using it we can check that the “sex” column has only two unique values: “Female” and “Male”:
[2]:
tips["sex"].unique()
[2]:
[Female, Male]
Categories (2, object): [Female, Male]
Lets say that we want to group rows the DataFrame into parts, depending on the value of the “sex” column. This can be done using groupby()
with the column name as an argument:
[3]:
grouped = tips.groupby("sex")
grouped
[3]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a26d11690>
grouby()
produces a GroupBy object which provides tools for manipulating grouped data. For example, we can use it to get a DataFrame with all rows which have “Female” in the “sex” column:
[4]:
grouped.get_group("Female")
[4]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
11 | 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
14 | 14.83 | 3.02 | Female | No | Sun | Dinner | 2 |
16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
... | ... | ... | ... | ... | ... | ... | ... |
226 | 10.09 | 2.00 | Female | Yes | Fri | Lunch | 2 |
229 | 22.12 | 2.88 | Female | Yes | Sat | Dinner | 2 |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
87 rows × 7 columns
Iteration over the grouped
object produces tuples consisting of a value of the “sex” column and a DataFrame with all rows containing this value:
[5]:
for k, df in grouped:
print(f"\nvalue: {k}\n")
display(df.head(3))
value: Male
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
value: Female
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
11 | 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
Simple aggregations¶
The usual way of using GroupBy objects it to make some computations for each group, and then combine the results:
[6]:
# get the sum of rows in each group
grouped.sum()
[6]:
total_bill | tip | size | |
---|---|---|---|
sex | |||
Male | 3256.82 | 485.07 | 413 |
Female | 1570.95 | 246.51 | 214 |
[106]:
# get the average value of rows in each group
grouped.mean()
[106]:
total_bill | tip | size | |
---|---|---|---|
sex | |||
Male | 20.744076 | 3.089618 | 2.630573 |
Female | 18.056897 | 2.833448 | 2.459770 |
Notice that in the above examples columns “smoker”, “day”, and “time” were skipped. These columns contain non-numeric data and the sum()
and mean()
operations are not defined for them.
If we want to aggregate data only in some columns, we can select them before making computations:
[8]:
# get the maximum value of the "total_bill" column in each group
grouped["total_bill"].max()
[8]:
sex
Male 50.81
Female 44.30
Name: total_bill, dtype: float64
[9]:
# get the minimum value of the "total_bill" and "tip" columns in each group
grouped[["total_bill", "tip"]].min()
[9]:
total_bill | tip | |
---|---|---|
sex | ||
Male | 7.25 | 1.0 |
Female | 3.07 | 1.0 |
GroupBy.agg¶
To aggregate data in a custom way, we can use the agg()
method. This method takes as its argument a function. This function is applied to every column of each group to compute aggregations.
For example, let say that we want to calculate how many males and females left tips above $5.00. This could be done as follows:
[26]:
def big_tip(tip_values, min_val=5):
n = 0
for t in tip_values:
if t > min_val:
n += 1
return n
grouped["tip"].agg(big_tip)
[26]:
sex
Male 14.0
Female 4.0
Name: tip, dtype: float64
In the above example we can avoid custom aggregation, by selecting rows with large tips before grouping:
[107]:
# selects rows with tips above $5.00
big_tippers = tips[tips["tip"] > 5]
# group by "sex"
grouped_bt = big_tippers.groupby("sex")
# count the number of values in the "tip" column for each group
grouped_bt["tip"].count()
[107]:
sex
Male 14
Female 4
Name: tip, dtype: int64
GroupBy.transform¶
transform()
is another method of GroupBy objects. It also takes as its argument a function and applies this function to columns of each group. In this case though, the function must return either a Series of the same length as the original column or a scalar value.
In the example below, we use transform to express values of the “total_bill” and “tip” columns as fractions of mean values of these columns in each group.
[65]:
def mean_fraction(col):
# if the name of a column is "total_bill" or "tips"
# divide its values by the mean of the column
if col.name in ["total_bill", "tip"]:
return col / col.mean()
# leave all other columns unchanged
else:
return col
grouped.transform(mean_fraction).head()
[65]:
total_bill | tip | smoker | day | time | size | |
---|---|---|---|---|---|---|
0 | 0.940915 | 0.356456 | No | Sun | Dinner | 2 |
1 | 0.498456 | 0.537283 | No | Sun | Dinner | 3 |
2 | 1.012819 | 1.132826 | No | Sun | Dinner | 3 |
3 | 1.141531 | 1.071330 | No | Sun | Dinner | 2 |
4 | 1.361807 | 1.274066 | No | Sun | Dinner | 4 |
If the function passed to transform()
returns a single value for a column then all values of that column belonging to a given group will be replaced by the return value of the function.
Below we use it to replace values of the “tip” column by group averages:
[61]:
def tips_mean(col):
# if name of a column is "tip"
# return the mean value of the column
if col.name == "tip":
return col.mean()
# leave all other columns unchanged
else:
return col
df = grouped.transform(tips_mean)
df.head()
[61]:
total_bill | tip | smoker | day | time | size | |
---|---|---|---|---|---|---|
0 | 16.99 | 2.833448 | No | Sun | Dinner | 2 |
1 | 10.34 | 3.089618 | No | Sun | Dinner | 3 |
2 | 21.01 | 3.089618 | No | Sun | Dinner | 3 |
3 | 23.68 | 3.089618 | No | Sun | Dinner | 2 |
4 | 24.59 | 2.833448 | No | Sun | Dinner | 4 |
The transformed DataFrame does not include the “sex” column, since this column was used for grouping. However, since rows are in the same order as in the original tips
DataFrame, we can easily add this column:
[62]:
df["sex"] = tips["sex"]
df.head()
[62]:
total_bill | tip | smoker | day | time | size | sex | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 2.833448 | No | Sun | Dinner | 2 | Female |
1 | 10.34 | 3.089618 | No | Sun | Dinner | 3 | Male |
2 | 21.01 | 3.089618 | No | Sun | Dinner | 3 | Male |
3 | 23.68 | 3.089618 | No | Sun | Dinner | 2 | Male |
4 | 24.59 | 2.833448 | No | Sun | Dinner | 4 | Female |
GroupBy.apply¶
The apply()
method is more flexible than agg()
and transform()
but usually also slower. It takes as an argument a function. This function is applied to the whole DataFrame for each group. The return values are then assembled together.
For example, lets say that for each “Female”/“Male” group we want to find 3 rows of data with the largest tip amounts. We can do it as follows:
[111]:
def large_tips(df):
# sort the DataFrame by "tip" column in descending order
df = df.sort_values(by="tip", ascending=False)
# return the first 3 rows of the sorted DaraFrame
return df.head(3)
grouped.apply(large_tips)
[111]:
total_bill | tip | smoker | day | time | size | ||
---|---|---|---|---|---|---|---|
sex | |||||||
Male | 170 | 50.81 | 10.00 | Yes | Sat | Dinner | 3 |
212 | 48.33 | 9.00 | No | Sat | Dinner | 4 | |
23 | 39.42 | 7.58 | No | Sat | Dinner | 4 | |
Female | 214 | 28.17 | 6.50 | Yes | Sat | Dinner | 3 |
52 | 34.81 | 5.20 | No | Sun | Dinner | 4 | |
85 | 34.83 | 5.17 | No | Thur | Lunch | 4 |
The same result can obtained a bit simpler using DataFrame nlargest()
method:
[114]:
def large_tips(df):
# sort DataFrame columns by values of the "tip" column
# in descending order and return the first 3 rows
return df.nlargest(n=3, columns="tip")
grouped.apply(large_tips)
[114]:
total_bill | tip | smoker | day | time | size | ||
---|---|---|---|---|---|---|---|
sex | |||||||
Male | 170 | 50.81 | 10.00 | Yes | Sat | Dinner | 3 |
212 | 48.33 | 9.00 | No | Sat | Dinner | 4 | |
23 | 39.42 | 7.58 | No | Sat | Dinner | 4 | |
Female | 214 | 28.17 | 6.50 | Yes | Sat | Dinner | 3 |
52 | 34.81 | 5.20 | No | Sun | Dinner | 4 | |
85 | 34.83 | 5.17 | No | Thur | Lunch | 4 |
groupby options¶
In the examples above we grouped DataFrame rows based on values of a single column. Grouping can be also performed based on values of several columns. For example, we can split the tips
DataFrame into groups using values of both “day” and “time” columns, and then compute maximal values of columns for each group:
[103]:
grouped2 = tips.groupby(["day", "time"])
grouped2.max()
[103]:
total_bill | tip | size | ||
---|---|---|---|---|
day | time | |||
Thur | Lunch | 43.11 | 6.70 | 6.0 |
Dinner | 18.78 | 3.00 | 2.0 | |
Fri | Lunch | 16.27 | 3.48 | 3.0 |
Dinner | 40.17 | 4.73 | 4.0 | |
Sat | Lunch | NaN | NaN | NaN |
Dinner | 50.81 | 10.00 | 5.0 | |
Sun | Lunch | NaN | NaN | NaN |
Dinner | 48.17 | 6.50 | 6.0 |
The argument of groupby()
need not be a column of the DataFrame. Instead, we can use an array of length equal to the number of rows of the DataFrame. Grouping will be performed based on the values of the array. Below we use it to split the tips
DataFrame into two groups, depending on whether the value of the “size” columns is below 3 or not:
[101]:
# create a numpy array with values "small" and "large"
# depending on the value of the "size" column
group_size = np.select([tips["size"] < 3, tips["size"] >= 3], ["small", "large"])
group_size[:5]
[101]:
array(['small', 'large', 'large', 'small', 'large'], dtype='<U21')
[112]:
# group tips DataFrame based on values of the group_size array
# and compute average values for each group
tips.groupby(large_group).mean()
[112]:
total_bill | tip | size | |
---|---|---|---|
large | 26.582262 | 3.845119 | 3.702381 |
small | 16.217875 | 2.553688 | 1.975000 |
Similar result as above can be obtained by converting the “size” column into an index and then passing to the groupby()
method a function which converts index values to “large” and “small” values. Groups are created based on return values of the function. In this case the aggregated DataFrame does not have “size” column, since this columns is used to create groups:
[113]:
def f(x):
if x < 3:
return "small"
else:
return "large"
tips.set_index("size").groupby(f).mean()
[113]:
total_bill | tip | |
---|---|---|
large | 26.582262 | 3.845119 |
small | 16.217875 | 2.553688 |