Reshaping DataFrames¶

Pandas provide various tools for reshaping DataFrames. This helps us bring data to a form most convenient for our needs. Below we show a few ways how to accomplish it.

As an illustration we will use sample data provided by the seaborm module:

[125]:

import pandas as pd
import seaborn as sns

df = sns.load_dataset("flights")
df.head(5)

[125]:

	year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121

Transpose¶

Transpose of a DataFrame produces a new DataFrame whose rows are columns of the original DataFrame:

[126]:

df1 = df.head(5)
df1

[126]:

	year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121

[127]:

# tranpose of df1; each column of df1 becomes a row
df1.T

[127]:

	0	1	2	3	4
year	1949	1949	1949	1949	1949
month	January	February	March	April	May
passengers	112	118	132	129	121

Setting and reseting an index¶

The set_index() method lets us select a column to serve as an index of the DataFrame:

[121]:

# convert the "year" column into DataFrame index
df2 = df.set_index("year")
df2.head(5)

[121]:

	month	passengers
year
1949	January	112
1949	February	118
1949	March	132
1949	April	129
1949	May	121

Notice that values of the newly created index repeat several times. If we use the loc indexer with an index value, we will obtain all rows corresponding to this value:

[86]:

df2.loc[1950]

[86]:

	month	passengers
year
1950	January	115
1950	February	126
1950	March	141
1950	April	135
1950	May	125
1950	June	149
1950	July	170
1950	August	170
1950	September	158
1950	October	133
1950	November	114
1950	December	140

The reset_index() method has the opposite effect to set_index(): it converts the index into a column of the DataFrame:

[87]:

df2.reset_index()

[87]:

	year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121
...	...	...	...
139	1960	August	606
140	1960	September	508
141	1960	October	461
142	1960	November	390
143	1960	December	432

144 rows × 3 columns

MultiIndex¶

In our sample DataFrame each value of the passengers column is uniquely identified by a combination of a year and a month. For this reason it would convenient to set the index using not just the year column (as we did above) but both year and month columns. This can be done by passing a list of columns to the set_index() method:

[101]:

# create a DataFrame with index consising of "year" and "month" columns
df3 = df.set_index(["year", "month"])
df3

[101]:

		passengers
year	month
1949	January	112
	February	118
	March	132
	April	129
	May	121
...	...	...
1960	August	606
	September	508
	October	461
	November	390
	December	432

144 rows × 1 columns

The new DataFrame is equipped with a MultiIndex consisting of (year, month) tuples:

[90]:

df3.index

[90]:

MultiIndex([(1949,   'January'),
            (1949,  'February'),
            (1949,     'March'),
            (1949,     'April'),
            (1949,       'May'),
            (1949,      'June'),
            (1949,      'July'),
            (1949,    'August'),
            (1949, 'September'),
            (1949,   'October'),
            ...
            (1960,     'March'),
            (1960,     'April'),
            (1960,       'May'),
            (1960,      'June'),
            (1960,      'July'),
            (1960,    'August'),
            (1960, 'September'),
            (1960,   'October'),
            (1960,  'November'),
            (1960,  'December')],
           names=['year', 'month'], length=144)

To select a row of the DataFrame we can use loc with one of these tuples:

[102]:

df3.loc[(1949, "May")]

[102]:

passengers    121
Name: (1949, May), dtype: int64

MultiIndex can be used in several way to select data. For example, MultiIndex slicing works:

[103]:

# select all rows with MultiIndex values between (1950, "November") and (1951, "March")
df3.loc[(1950, "November"):(1951, "March")]

[103]:

		passengers
year	month
1950	November	114
1950	December	140
1951	January	145
	February	150
	March	178

Stack and unstack¶

Lets bring up again the DataFrame with the (year, month) MultiIndex:

[104]:

df3

[104]:

		passengers
year	month
1949	January	112
	February	118
	March	132
	April	129
	May	121
...	...	...
1960	August	606
	September	508
	October	461
	November	390
	December	432

144 rows × 1 columns

It may be convenient to rearrange this DataFrame, so that years label rows, months label columns, and a value in a given row and column is the number of passangers for the corresponding year and month. Such rearrangement can be accomplished using the unstack() method. Thie method converts the lowest level of the MultiIndex of rows into the lowest level of the MultiIndex of columns:

[89]:

df4 = df3.unstack()
df4

[89]:

	passengers
month	January	February	March	April	May	June	July	August	September	October	November	December
year
1949	112	118	132	129	121	135	148	148	136	119	104	118
1950	115	126	141	135	125	149	170	170	158	133	114	140
1951	145	150	178	163	172	178	199	199	184	162	146	166
1952	171	180	193	181	183	218	230	242	209	191	172	194
1953	196	196	236	235	229	243	264	272	237	211	180	201
1954	204	188	235	227	234	264	302	293	259	229	203	229
1955	242	233	267	269	270	315	364	347	312	274	237	278
1956	284	277	317	313	318	374	413	405	355	306	271	306
1957	315	301	356	348	355	422	465	467	404	347	305	336
1958	340	318	362	348	363	435	491	505	404	359	310	337
1959	360	342	406	396	420	472	548	559	463	407	362	405
1960	417	391	419	461	472	535	622	606	508	461	390	432

Notice that the row index now consists of a single level with year values, but the column index is a MultiIndex with two levels:

[105]:

# index of rows
df4.index

[105]:

Int64Index([1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
            1960],
           dtype='int64', name='year')

[106]:

# MultiIndex of columns
df4.columns

[106]:

MultiIndex([('passengers',   'January'),
            ('passengers',  'February'),
            ('passengers',     'March'),
            ('passengers',     'April'),
            ('passengers',       'May'),
            ('passengers',      'June'),
            ('passengers',      'July'),
            ('passengers',    'August'),
            ('passengers', 'September'),
            ('passengers',   'October'),
            ('passengers',  'November'),
            ('passengers',  'December')],
           names=[None, 'month'])

To select a column we need to specify both levels of the MultiIndex:

[124]:

df4[("passengers", "January")]

[124]:

year
1949    112
1950    115
1951    145
1952    171
1953    196
1954    204
1955    242
1956    284
1957    315
1958    340
1959    360
1960    417
Name: (passengers, January), dtype: int64

The first level of the column MultiIndex is not very useful, since it has only one value “passengers” for all columns. We can delete it using the droplevel() method

[128]:

df5 = df4.droplevel(level=0,   # level of index to drop, indexing starts with 0
                    axis = 1)  # specifies that we are modifying the column index

df5

[128]:

month	January	February	March	April	May	June	July	August	September	October	November	December
year
1949	112	118	132	129	121	135	148	148	136	119	104	118
1950	115	126	141	135	125	149	170	170	158	133	114	140
1951	145	150	178	163	172	178	199	199	184	162	146	166
1952	171	180	193	181	183	218	230	242	209	191	172	194
1953	196	196	236	235	229	243	264	272	237	211	180	201
1954	204	188	235	227	234	264	302	293	259	229	203	229
1955	242	233	267	269	270	315	364	347	312	274	237	278
1956	284	277	317	313	318	374	413	405	355	306	271	306
1957	315	301	356	348	355	422	465	467	404	347	305	336
1958	340	318	362	348	363	435	491	505	404	359	310	337
1959	360	342	406	396	420	472	548	559	463	407	362	405
1960	417	391	419	461	472	535	622	606	508	461	390	432

Now the column index has only one level, consisting of month names:

[108]:

df5.columns

[108]:

CategoricalIndex(['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November',
                  'December'],
                 categories=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', ...], ordered=False, name='month', dtype='category')

[109]:

# select a column
df5["May"]

[109]:

year
1949    121
1950    125
1951    172
1952    183
1953    229
1954    234
1955    270
1956    318
1957    355
1958    363
1959    420
1960    472
Name: May, dtype: int64

The atack() method is the inverse of unstack(): it converts the lowest level of the column index into the lowest level of the row index:

[110]:

df6 = df5.stack()
df6

[110]:

year  month
1949  January      112
      February     118
      March        132
      April        129
      May          121
                  ...
1960  August       606
      September    508
      October      461
      November     390
      December     432
Length: 144, dtype: int64

df6 is a Series with a two level MultiIndex consisting of year and row values:

[112]:

df6.index

[112]:

MultiIndex([(1949,   'January'),
            (1949,  'February'),
            (1949,     'March'),
            (1949,     'April'),
            (1949,       'May'),
            (1949,      'June'),
            (1949,      'July'),
            (1949,    'August'),
            (1949, 'September'),
            (1949,   'October'),
            ...
            (1960,     'March'),
            (1960,     'April'),
            (1960,       'May'),
            (1960,      'June'),
            (1960,      'July'),
            (1960,    'August'),
            (1960, 'September'),
            (1960,   'October'),
            (1960,  'November'),
            (1960,  'December')],
           names=['year', 'month'], length=144)

[113]:

# select a row of the series
df6[(1949, "March")]

[113]:

Using the reset_index() method we can convert the Multiindex into colums, one column for each MultiIndex level:

[114]:

df7 = df6.reset_index()
df7

[114]:

	year	month	0
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121
...	...	...	...
139	1960	August	606
140	1960	September	508
141	1960	October	461
142	1960	November	390
143	1960	December	432

144 rows × 3 columns

The DataFrame we obtained is almost the same as the DataFrame we started with. The only difference is that the passenger column is now called 0. We can use the rename method to change the name of this column. This method takes as its argument a dictionary with keys given by names of columns (or rows) we want to rename, and values given by new names:

[116]:

df8 = df7.rename({0 : "passengers"},  # replace the label 0 in the index by "passenger"
                 axis=1)              # this specifies that we are changing the column index
df8

[116]:

	year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121
...	...	...	...
139	1960	August	606
140	1960	September	508
141	1960	October	461
142	1960	November	390
143	1960	December	432

144 rows × 3 columns