GroupBy¶

A common pattern in data analysis is to split data into groups based on some criteria (geographical area, age, type of business, etc.), make some computations for each group, and then assemble the results in order to compare them. We will show how to perform such tasks with pandas using the DataFrame groupby() method.

As an example we will use “tips” data set provided by the seaborn module. It contains data on restaurant visits: the total bill amount, tip amount, sex of the person paying the bill, whether the visiting group included smokers, day and time of the visit, and the size of the group:

[1]:

import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")
tips.head()

[1]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

The GroupBy object¶

The unique() method of pandas Series returns values of a Series, without repetitions. Using it we can check that the “sex” column has only two unique values: “Female” and “Male”:

[2]:

tips["sex"].unique()

[2]:

[Female, Male]
Categories (2, object): [Female, Male]

Lets say that we want to group rows the DataFrame into parts, depending on the value of the “sex” column. This can be done using groupby() with the column name as an argument:

[3]:

grouped = tips.groupby("sex")
grouped

[3]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a26d11690>

grouby() produces a GroupBy object which provides tools for manipulating grouped data. For example, we can use it to get a DataFrame with all rows which have “Female” in the “sex” column:

[4]:

grouped.get_group("Female")

[4]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
11	35.26	5.00	Female	No	Sun	Dinner	4
14	14.83	3.02	Female	No	Sun	Dinner	2
16	10.33	1.67	Female	No	Sun	Dinner	3
...	...	...	...	...	...	...	...
226	10.09	2.00	Female	Yes	Fri	Lunch	2
229	22.12	2.88	Female	Yes	Sat	Dinner	2
238	35.83	4.67	Female	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

87 rows × 7 columns

Iteration over the grouped object produces tuples consisting of a value of the “sex” column and a DataFrame with all rows containing this value:

[5]:

for k, df in grouped:
    print(f"\nvalue: {k}\n")
    display(df.head(3))


value: Male

	total_bill	tip	sex	smoker	day	time	size
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2


value: Female

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
11	35.26	5.00	Female	No	Sun	Dinner	4

Simple aggregations¶

The usual way of using GroupBy objects it to make some computations for each group, and then combine the results:

[6]:

# get the sum of rows in each group
grouped.sum()

[6]:

	total_bill	tip	size
sex
Male	3256.82	485.07	413
Female	1570.95	246.51	214

[106]:

# get the average value of rows in each group
grouped.mean()

[106]:

	total_bill	tip	size
sex
Male	20.744076	3.089618	2.630573
Female	18.056897	2.833448	2.459770

Notice that in the above examples columns “smoker”, “day”, and “time” were skipped. These columns contain non-numeric data and the sum() and mean() operations are not defined for them.

If we want to aggregate data only in some columns, we can select them before making computations:

[8]:

# get the maximum value of the "total_bill" column in each group
grouped["total_bill"].max()

[8]:

sex
Male      50.81
Female    44.30
Name: total_bill, dtype: float64

[9]:

# get the minimum value of the "total_bill" and "tip" columns in each group
grouped[["total_bill", "tip"]].min()

[9]:

	total_bill	tip
sex
Male	7.25	1.0
Female	3.07	1.0

GroupBy.agg¶

To aggregate data in a custom way, we can use the agg() method. This method takes as its argument a function. This function is applied to every column of each group to compute aggregations.

For example, let say that we want to calculate how many males and females left tips above $5.00. This could be done as follows:

[26]:

def big_tip(tip_values, min_val=5):

    n = 0
    for t in tip_values:
        if t > min_val:
            n += 1
    return n


grouped["tip"].agg(big_tip)

[26]:

sex
Male      14.0
Female     4.0
Name: tip, dtype: float64

In the above example we can avoid custom aggregation, by selecting rows with large tips before grouping:

[107]:

# selects rows with tips above $5.00
big_tippers = tips[tips["tip"] > 5]

# group by "sex"
grouped_bt = big_tippers.groupby("sex")

# count the number of values in the "tip" column for each group
grouped_bt["tip"].count()

[107]:

sex
Male      14
Female     4
Name: tip, dtype: int64

GroupBy.transform¶

transform() is another method of GroupBy objects. It also takes as its argument a function and applies this function to columns of each group. In this case though, the function must return either a Series of the same length as the original column or a scalar value.

In the example below, we use transform to express values of the “total_bill” and “tip” columns as fractions of mean values of these columns in each group.

[65]:

def mean_fraction(col):

    # if the name of a column is "total_bill" or "tips"
    # divide its values by the mean of the column
    if col.name in ["total_bill", "tip"]:
        return col / col.mean()
    # leave all other columns unchanged
    else:
        return col


grouped.transform(mean_fraction).head()

[65]:

	total_bill	tip	smoker	day	time	size
0	0.940915	0.356456	No	Sun	Dinner	2
1	0.498456	0.537283	No	Sun	Dinner	3
2	1.012819	1.132826	No	Sun	Dinner	3
3	1.141531	1.071330	No	Sun	Dinner	2
4	1.361807	1.274066	No	Sun	Dinner	4

If the function passed to transform() returns a single value for a column then all values of that column belonging to a given group will be replaced by the return value of the function.

Below we use it to replace values of the “tip” column by group averages:

[61]:

def tips_mean(col):

    # if name of a column is "tip"
    # return the mean value of the column
    if col.name == "tip":
        return col.mean()
    # leave all other columns unchanged
    else:
        return col


df = grouped.transform(tips_mean)
df.head()

[61]:

	total_bill	tip	smoker	day	time	size
0	16.99	2.833448	No	Sun	Dinner	2
1	10.34	3.089618	No	Sun	Dinner	3
2	21.01	3.089618	No	Sun	Dinner	3
3	23.68	3.089618	No	Sun	Dinner	2
4	24.59	2.833448	No	Sun	Dinner	4

The transformed DataFrame does not include the “sex” column, since this column was used for grouping. However, since rows are in the same order as in the original tips DataFrame, we can easily add this column:

[62]:

df["sex"] = tips["sex"]
df.head()

[62]:

	total_bill	tip	smoker	day	time	size	sex
0	16.99	2.833448	No	Sun	Dinner	2	Female
1	10.34	3.089618	No	Sun	Dinner	3	Male
2	21.01	3.089618	No	Sun	Dinner	3	Male
3	23.68	3.089618	No	Sun	Dinner	2	Male
4	24.59	2.833448	No	Sun	Dinner	4	Female

GroupBy.apply¶

The apply() method is more flexible than agg() and transform() but usually also slower. It takes as an argument a function. This function is applied to the whole DataFrame for each group. The return values are then assembled together.

For example, lets say that for each “Female”/“Male” group we want to find 3 rows of data with the largest tip amounts. We can do it as follows:

[111]:

def large_tips(df):

    # sort the DataFrame by "tip" column in descending order
    df = df.sort_values(by="tip", ascending=False)
    # return the first 3 rows of the sorted DaraFrame
    return df.head(3)


grouped.apply(large_tips)

[111]:

		total_bill	tip	smoker	day	time	size
sex
Male	170	50.81	10.00	Yes	Sat	Dinner	3
	212	48.33	9.00	No	Sat	Dinner	4
	23	39.42	7.58	No	Sat	Dinner	4
Female	214	28.17	6.50	Yes	Sat	Dinner	3
	52	34.81	5.20	No	Sun	Dinner	4
	85	34.83	5.17	No	Thur	Lunch	4

The same result can obtained a bit simpler using DataFrame nlargest() method:

[114]:

def large_tips(df):

    # sort DataFrame columns by values of the "tip" column
    # in descending order and return the first 3 rows
    return df.nlargest(n=3, columns="tip")


grouped.apply(large_tips)

[114]:

		total_bill	tip	smoker	day	time	size
sex
Male	170	50.81	10.00	Yes	Sat	Dinner	3
	212	48.33	9.00	No	Sat	Dinner	4
	23	39.42	7.58	No	Sat	Dinner	4
Female	214	28.17	6.50	Yes	Sat	Dinner	3
	52	34.81	5.20	No	Sun	Dinner	4
	85	34.83	5.17	No	Thur	Lunch	4

groupby options¶

In the examples above we grouped DataFrame rows based on values of a single column. Grouping can be also performed based on values of several columns. For example, we can split the tips DataFrame into groups using values of both “day” and “time” columns, and then compute maximal values of columns for each group:

[103]:

grouped2 = tips.groupby(["day", "time"])
grouped2.max()

[103]:

		total_bill	tip	size
day	time
Thur	Lunch	43.11	6.70	6.0
Thur	Dinner	18.78	3.00	2.0
Fri	Lunch	16.27	3.48	3.0
Fri	Dinner	40.17	4.73	4.0
Sat	Lunch	NaN	NaN	NaN
Sat	Dinner	50.81	10.00	5.0
Sun	Lunch	NaN	NaN	NaN
Sun	Dinner	48.17	6.50	6.0

The argument of groupby() need not be a column of the DataFrame. Instead, we can use an array of length equal to the number of rows of the DataFrame. Grouping will be performed based on the values of the array. Below we use it to split the tips DataFrame into two groups, depending on whether the value of the “size” columns is below 3 or not:

[101]:

# create a numpy array with values "small" and "large"
# depending on the value of the "size" column
group_size = np.select([tips["size"] < 3, tips["size"] >= 3], ["small", "large"])
group_size[:5]

[101]:

array(['small', 'large', 'large', 'small', 'large'], dtype='<U21')

[112]:

# group tips DataFrame based on values of the group_size array
# and compute average values for each group
tips.groupby(large_group).mean()

[112]:

	total_bill	tip	size
large	26.582262	3.845119	3.702381
small	16.217875	2.553688	1.975000

Similar result as above can be obtained by converting the “size” column into an index and then passing to the groupby() method a function which converts index values to “large” and “small” values. Groups are created based on return values of the function. In this case the aggregated DataFrame does not have “size” column, since this columns is used to create groups:

[113]:

def f(x):
    if x < 3:
        return "small"
    else:
        return "large"

tips.set_index("size").groupby(f).mean()

[113]:

	total_bill	tip
large	26.582262	3.845119
small	16.217875	2.553688