GroupBy

A common pattern in data analysis is to split data into groups based on some criteria (geographical area, age, type of business, etc.), make some computations for each group, and then assemble the results in order to compare them. We will show how to perform such tasks with pandas using the DataFrame groupby() method.

As an example we will use “tips” data set provided by the seaborn module. It contains data on restaurant visits: the total bill amount, tip amount, sex of the person paying the bill, whether the visiting group included smokers, day and time of the visit, and the size of the group:

[1]:
import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")
tips.head()
[1]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

The GroupBy object

The unique() method of pandas Series returns values of a Series, without repetitions. Using it we can check that the “sex” column has only two unique values: “Female” and “Male”:

[2]:
tips["sex"].unique()
[2]:
[Female, Male]
Categories (2, object): [Female, Male]

Lets say that we want to group rows the DataFrame into parts, depending on the value of the “sex” column. This can be done using groupby() with the column name as an argument:

[3]:
grouped = tips.groupby("sex")
grouped
[3]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a26d11690>

grouby() produces a GroupBy object which provides tools for manipulating grouped data. For example, we can use it to get a DataFrame with all rows which have “Female” in the “sex” column:

[4]:
grouped.get_group("Female")
[4]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
11 35.26 5.00 Female No Sun Dinner 4
14 14.83 3.02 Female No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3
... ... ... ... ... ... ... ...
226 10.09 2.00 Female Yes Fri Lunch 2
229 22.12 2.88 Female Yes Sat Dinner 2
238 35.83 4.67 Female No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

87 rows × 7 columns

Iteration over the grouped object produces tuples consisting of a value of the “sex” column and a DataFrame with all rows containing this value:

[5]:
for k, df in grouped:
    print(f"\nvalue: {k}\n")
    display(df.head(3))

value: Male

total_bill tip sex smoker day time size
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2

value: Female

total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
11 35.26 5.00 Female No Sun Dinner 4

Simple aggregations

The usual way of using GroupBy objects it to make some computations for each group, and then combine the results:

[6]:
# get the sum of rows in each group
grouped.sum()
[6]:
total_bill tip size
sex
Male 3256.82 485.07 413
Female 1570.95 246.51 214
[106]:
# get the average value of rows in each group
grouped.mean()
[106]:
total_bill tip size
sex
Male 20.744076 3.089618 2.630573
Female 18.056897 2.833448 2.459770

Notice that in the above examples columns “smoker”, “day”, and “time” were skipped. These columns contain non-numeric data and the sum() and mean() operations are not defined for them.

If we want to aggregate data only in some columns, we can select them before making computations:

[8]:
# get the maximum value of the "total_bill" column in each group
grouped["total_bill"].max()
[8]:
sex
Male      50.81
Female    44.30
Name: total_bill, dtype: float64
[9]:
# get the minimum value of the "total_bill" and "tip" columns in each group
grouped[["total_bill", "tip"]].min()
[9]:
total_bill tip
sex
Male 7.25 1.0
Female 3.07 1.0

GroupBy.agg

To aggregate data in a custom way, we can use the agg() method. This method takes as its argument a function. This function is applied to every column of each group to compute aggregations.

For example, let say that we want to calculate how many males and females left tips above $5.00. This could be done as follows:

[26]:
def big_tip(tip_values, min_val=5):

    n = 0
    for t in tip_values:
        if t > min_val:
            n += 1
    return n


grouped["tip"].agg(big_tip)
[26]:
sex
Male      14.0
Female     4.0
Name: tip, dtype: float64

In the above example we can avoid custom aggregation, by selecting rows with large tips before grouping:

[107]:
# selects rows with tips above $5.00
big_tippers = tips[tips["tip"] > 5]

# group by "sex"
grouped_bt = big_tippers.groupby("sex")

# count the number of values in the "tip" column for each group
grouped_bt["tip"].count()
[107]:
sex
Male      14
Female     4
Name: tip, dtype: int64

GroupBy.transform

transform() is another method of GroupBy objects. It also takes as its argument a function and applies this function to columns of each group. In this case though, the function must return either a Series of the same length as the original column or a scalar value.

In the example below, we use transform to express values of the “total_bill” and “tip” columns as fractions of mean values of these columns in each group.

[65]:
def mean_fraction(col):

    # if the name of a column is "total_bill" or "tips"
    # divide its values by the mean of the column
    if col.name in ["total_bill", "tip"]:
        return col / col.mean()
    # leave all other columns unchanged
    else:
        return col


grouped.transform(mean_fraction).head()
[65]:
total_bill tip smoker day time size
0 0.940915 0.356456 No Sun Dinner 2
1 0.498456 0.537283 No Sun Dinner 3
2 1.012819 1.132826 No Sun Dinner 3
3 1.141531 1.071330 No Sun Dinner 2
4 1.361807 1.274066 No Sun Dinner 4

If the function passed to transform() returns a single value for a column then all values of that column belonging to a given group will be replaced by the return value of the function.

Below we use it to replace values of the “tip” column by group averages:

[61]:
def tips_mean(col):

    # if name of a column is "tip"
    # return the mean value of the column
    if col.name == "tip":
        return col.mean()
    # leave all other columns unchanged
    else:
        return col


df = grouped.transform(tips_mean)
df.head()
[61]:
total_bill tip smoker day time size
0 16.99 2.833448 No Sun Dinner 2
1 10.34 3.089618 No Sun Dinner 3
2 21.01 3.089618 No Sun Dinner 3
3 23.68 3.089618 No Sun Dinner 2
4 24.59 2.833448 No Sun Dinner 4

The transformed DataFrame does not include the “sex” column, since this column was used for grouping. However, since rows are in the same order as in the original tips DataFrame, we can easily add this column:

[62]:
df["sex"] = tips["sex"]
df.head()
[62]:
total_bill tip smoker day time size sex
0 16.99 2.833448 No Sun Dinner 2 Female
1 10.34 3.089618 No Sun Dinner 3 Male
2 21.01 3.089618 No Sun Dinner 3 Male
3 23.68 3.089618 No Sun Dinner 2 Male
4 24.59 2.833448 No Sun Dinner 4 Female

GroupBy.apply

The apply() method is more flexible than agg() and transform() but usually also slower. It takes as an argument a function. This function is applied to the whole DataFrame for each group. The return values are then assembled together.

For example, lets say that for each “Female”/“Male” group we want to find 3 rows of data with the largest tip amounts. We can do it as follows:

[111]:
def large_tips(df):

    # sort the DataFrame by "tip" column in descending order
    df = df.sort_values(by="tip", ascending=False)
    # return the first 3 rows of the sorted DaraFrame
    return df.head(3)


grouped.apply(large_tips)
[111]:
total_bill tip smoker day time size
sex
Male 170 50.81 10.00 Yes Sat Dinner 3
212 48.33 9.00 No Sat Dinner 4
23 39.42 7.58 No Sat Dinner 4
Female 214 28.17 6.50 Yes Sat Dinner 3
52 34.81 5.20 No Sun Dinner 4
85 34.83 5.17 No Thur Lunch 4

The same result can obtained a bit simpler using DataFrame nlargest() method:

[114]:
def large_tips(df):

    # sort DataFrame columns by values of the "tip" column
    # in descending order and return the first 3 rows
    return df.nlargest(n=3, columns="tip")


grouped.apply(large_tips)
[114]:
total_bill tip smoker day time size
sex
Male 170 50.81 10.00 Yes Sat Dinner 3
212 48.33 9.00 No Sat Dinner 4
23 39.42 7.58 No Sat Dinner 4
Female 214 28.17 6.50 Yes Sat Dinner 3
52 34.81 5.20 No Sun Dinner 4
85 34.83 5.17 No Thur Lunch 4

groupby options

In the examples above we grouped DataFrame rows based on values of a single column. Grouping can be also performed based on values of several columns. For example, we can split the tips DataFrame into groups using values of both “day” and “time” columns, and then compute maximal values of columns for each group:

[103]:
grouped2 = tips.groupby(["day", "time"])
grouped2.max()
[103]:
total_bill tip size
day time
Thur Lunch 43.11 6.70 6.0
Dinner 18.78 3.00 2.0
Fri Lunch 16.27 3.48 3.0
Dinner 40.17 4.73 4.0
Sat Lunch NaN NaN NaN
Dinner 50.81 10.00 5.0
Sun Lunch NaN NaN NaN
Dinner 48.17 6.50 6.0

The argument of groupby() need not be a column of the DataFrame. Instead, we can use an array of length equal to the number of rows of the DataFrame. Grouping will be performed based on the values of the array. Below we use it to split the tips DataFrame into two groups, depending on whether the value of the “size” columns is below 3 or not:

[101]:
# create a numpy array with values "small" and "large"
# depending on the value of the "size" column
group_size = np.select([tips["size"] < 3, tips["size"] >= 3], ["small", "large"])
group_size[:5]
[101]:
array(['small', 'large', 'large', 'small', 'large'], dtype='<U21')
[112]:
# group tips DataFrame based on values of the group_size array
# and compute average values for each group
tips.groupby(large_group).mean()
[112]:
total_bill tip size
large 26.582262 3.845119 3.702381
small 16.217875 2.553688 1.975000

Similar result as above can be obtained by converting the “size” column into an index and then passing to the groupby() method a function which converts index values to “large” and “small” values. Groups are created based on return values of the function. In this case the aggregated DataFrame does not have “size” column, since this columns is used to create groups:

[113]:
def f(x):
    if x < 3:
        return "small"
    else:
        return "large"

tips.set_index("size").groupby(f).mean()
[113]:
total_bill tip
large 26.582262 3.845119
small 16.217875 2.553688