Reshaping DataFrames

Pandas provide various tools for reshaping DataFrames. This helps us bring data to a form most convenient for our needs. Below we show a few ways how to accomplish it.

As an illustration we will use sample data provided by the seaborm module:

[125]:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("flights")
df.head(5)
[125]:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121

Transpose

Transpose of a DataFrame produces a new DataFrame whose rows are columns of the original DataFrame:

[126]:
df1 = df.head(5)
df1
[126]:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
[127]:
# tranpose of df1; each column of df1 becomes a row
df1.T
[127]:
0 1 2 3 4
year 1949 1949 1949 1949 1949
month January February March April May
passengers 112 118 132 129 121

Setting and reseting an index

The set_index() method lets us select a column to serve as an index of the DataFrame:

[121]:
# convert the "year" column into DataFrame index
df2 = df.set_index("year")
df2.head(5)
[121]:
month passengers
year
1949 January 112
1949 February 118
1949 March 132
1949 April 129
1949 May 121

Notice that values of the newly created index repeat several times. If we use the loc indexer with an index value, we will obtain all rows corresponding to this value:

[86]:
df2.loc[1950]
[86]:
month passengers
year
1950 January 115
1950 February 126
1950 March 141
1950 April 135
1950 May 125
1950 June 149
1950 July 170
1950 August 170
1950 September 158
1950 October 133
1950 November 114
1950 December 140

The reset_index() method has the opposite effect to set_index(): it converts the index into a column of the DataFrame:

[87]:
df2.reset_index()
[87]:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
... ... ... ...
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432

144 rows × 3 columns

MultiIndex

In our sample DataFrame each value of the passengers column is uniquely identified by a combination of a year and a month. For this reason it would convenient to set the index using not just the year column (as we did above) but both year and month columns. This can be done by passing a list of columns to the set_index() method:

[101]:
# create a DataFrame with index consising of "year" and "month" columns
df3 = df.set_index(["year", "month"])
df3
[101]:
passengers
year month
1949 January 112
February 118
March 132
April 129
May 121
... ... ...
1960 August 606
September 508
October 461
November 390
December 432

144 rows × 1 columns

The new DataFrame is equipped with a MultiIndex consisting of (year, month) tuples:

[90]:
df3.index
[90]:
MultiIndex([(1949,   'January'),
            (1949,  'February'),
            (1949,     'March'),
            (1949,     'April'),
            (1949,       'May'),
            (1949,      'June'),
            (1949,      'July'),
            (1949,    'August'),
            (1949, 'September'),
            (1949,   'October'),
            ...
            (1960,     'March'),
            (1960,     'April'),
            (1960,       'May'),
            (1960,      'June'),
            (1960,      'July'),
            (1960,    'August'),
            (1960, 'September'),
            (1960,   'October'),
            (1960,  'November'),
            (1960,  'December')],
           names=['year', 'month'], length=144)

To select a row of the DataFrame we can use loc with one of these tuples:

[102]:
df3.loc[(1949, "May")]
[102]:
passengers    121
Name: (1949, May), dtype: int64

MultiIndex can be used in several way to select data. For example, MultiIndex slicing works:

[103]:
# select all rows with MultiIndex values between (1950, "November") and (1951, "March")
df3.loc[(1950, "November"):(1951, "March")]
[103]:
passengers
year month
1950 November 114
December 140
1951 January 145
February 150
March 178

Stack and unstack

Lets bring up again the DataFrame with the (year, month) MultiIndex:

[104]:
df3
[104]:
passengers
year month
1949 January 112
February 118
March 132
April 129
May 121
... ... ...
1960 August 606
September 508
October 461
November 390
December 432

144 rows × 1 columns

It may be convenient to rearrange this DataFrame, so that years label rows, months label columns, and a value in a given row and column is the number of passangers for the corresponding year and month. Such rearrangement can be accomplished using the unstack() method. Thie method converts the lowest level of the MultiIndex of rows into the lowest level of the MultiIndex of columns:

[89]:
df4 = df3.unstack()
df4
[89]:
passengers
month January February March April May June July August September October November December
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

Notice that the row index now consists of a single level with year values, but the column index is a MultiIndex with two levels:

[105]:
# index of rows
df4.index
[105]:
Int64Index([1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
            1960],
           dtype='int64', name='year')
[106]:
# MultiIndex of columns
df4.columns
[106]:
MultiIndex([('passengers',   'January'),
            ('passengers',  'February'),
            ('passengers',     'March'),
            ('passengers',     'April'),
            ('passengers',       'May'),
            ('passengers',      'June'),
            ('passengers',      'July'),
            ('passengers',    'August'),
            ('passengers', 'September'),
            ('passengers',   'October'),
            ('passengers',  'November'),
            ('passengers',  'December')],
           names=[None, 'month'])

To select a column we need to specify both levels of the MultiIndex:

[124]:
df4[("passengers", "January")]
[124]:
year
1949    112
1950    115
1951    145
1952    171
1953    196
1954    204
1955    242
1956    284
1957    315
1958    340
1959    360
1960    417
Name: (passengers, January), dtype: int64

The first level of the column MultiIndex is not very useful, since it has only one value “passengers” for all columns. We can delete it using the droplevel() method

[128]:
df5 = df4.droplevel(level=0,   # level of index to drop, indexing starts with 0
                    axis = 1)  # specifies that we are modifying the column index

df5
[128]:
month January February March April May June July August September October November December
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

Now the column index has only one level, consisting of month names:

[108]:
df5.columns
[108]:
CategoricalIndex(['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November',
                  'December'],
                 categories=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', ...], ordered=False, name='month', dtype='category')
[109]:
# select a column
df5["May"]
[109]:
year
1949    121
1950    125
1951    172
1952    183
1953    229
1954    234
1955    270
1956    318
1957    355
1958    363
1959    420
1960    472
Name: May, dtype: int64

The atack() method is the inverse of unstack(): it converts the lowest level of the column index into the lowest level of the row index:

[110]:
df6 = df5.stack()
df6
[110]:
year  month
1949  January      112
      February     118
      March        132
      April        129
      May          121
                  ...
1960  August       606
      September    508
      October      461
      November     390
      December     432
Length: 144, dtype: int64

df6 is a Series with a two level MultiIndex consisting of year and row values:

[112]:
df6.index
[112]:
MultiIndex([(1949,   'January'),
            (1949,  'February'),
            (1949,     'March'),
            (1949,     'April'),
            (1949,       'May'),
            (1949,      'June'),
            (1949,      'July'),
            (1949,    'August'),
            (1949, 'September'),
            (1949,   'October'),
            ...
            (1960,     'March'),
            (1960,     'April'),
            (1960,       'May'),
            (1960,      'June'),
            (1960,      'July'),
            (1960,    'August'),
            (1960, 'September'),
            (1960,   'October'),
            (1960,  'November'),
            (1960,  'December')],
           names=['year', 'month'], length=144)
[113]:
# select a row of the series
df6[(1949, "March")]
[113]:
132

Using the reset_index() method we can convert the Multiindex into colums, one column for each MultiIndex level:

[114]:
df7 = df6.reset_index()
df7
[114]:
year month 0
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
... ... ... ...
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432

144 rows × 3 columns

The DataFrame we obtained is almost the same as the DataFrame we started with. The only difference is that the passenger column is now called 0. We can use the rename method to change the name of this column. This method takes as its argument a dictionary with keys given by names of columns (or rows) we want to rename, and values given by new names:

[116]:
df8 = df7.rename({0 : "passengers"},  # replace the label 0 in the index by "passenger"
                 axis=1)              # this specifies that we are changing the column index
df8
[116]:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
... ... ... ...
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432

144 rows × 3 columns