Reshaping DataFrames¶
Pandas provide various tools for reshaping DataFrames. This helps us bring data to a form most convenient for our needs. Below we show a few ways how to accomplish it.
As an illustration we will use sample data provided by the seaborm
module:
[125]:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("flights")
df.head(5)
[125]:
year | month | passengers | |
---|---|---|---|
0 | 1949 | January | 112 |
1 | 1949 | February | 118 |
2 | 1949 | March | 132 |
3 | 1949 | April | 129 |
4 | 1949 | May | 121 |
Transpose¶
Transpose of a DataFrame produces a new DataFrame whose rows are columns of the original DataFrame:
[126]:
df1 = df.head(5)
df1
[126]:
year | month | passengers | |
---|---|---|---|
0 | 1949 | January | 112 |
1 | 1949 | February | 118 |
2 | 1949 | March | 132 |
3 | 1949 | April | 129 |
4 | 1949 | May | 121 |
[127]:
# tranpose of df1; each column of df1 becomes a row
df1.T
[127]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
year | 1949 | 1949 | 1949 | 1949 | 1949 |
month | January | February | March | April | May |
passengers | 112 | 118 | 132 | 129 | 121 |
Setting and reseting an index¶
The set_index()
method lets us select a column to serve as an index of the DataFrame:
[121]:
# convert the "year" column into DataFrame index
df2 = df.set_index("year")
df2.head(5)
[121]:
month | passengers | |
---|---|---|
year | ||
1949 | January | 112 |
1949 | February | 118 |
1949 | March | 132 |
1949 | April | 129 |
1949 | May | 121 |
Notice that values of the newly created index repeat several times. If we use the loc
indexer with an index value, we will obtain all rows corresponding to this value:
[86]:
df2.loc[1950]
[86]:
month | passengers | |
---|---|---|
year | ||
1950 | January | 115 |
1950 | February | 126 |
1950 | March | 141 |
1950 | April | 135 |
1950 | May | 125 |
1950 | June | 149 |
1950 | July | 170 |
1950 | August | 170 |
1950 | September | 158 |
1950 | October | 133 |
1950 | November | 114 |
1950 | December | 140 |
The reset_index()
method has the opposite effect to set_index()
: it converts the index into a column of the DataFrame:
[87]:
df2.reset_index()
[87]:
year | month | passengers | |
---|---|---|---|
0 | 1949 | January | 112 |
1 | 1949 | February | 118 |
2 | 1949 | March | 132 |
3 | 1949 | April | 129 |
4 | 1949 | May | 121 |
... | ... | ... | ... |
139 | 1960 | August | 606 |
140 | 1960 | September | 508 |
141 | 1960 | October | 461 |
142 | 1960 | November | 390 |
143 | 1960 | December | 432 |
144 rows × 3 columns
MultiIndex¶
In our sample DataFrame each value of the passengers column is uniquely identified by a combination of a year and a month. For this reason it would convenient to set the index using not just the year
column (as we did above) but both year
and month
columns. This can be done by passing a list of columns to the set_index()
method:
[101]:
# create a DataFrame with index consising of "year" and "month" columns
df3 = df.set_index(["year", "month"])
df3
[101]:
passengers | ||
---|---|---|
year | month | |
1949 | January | 112 |
February | 118 | |
March | 132 | |
April | 129 | |
May | 121 | |
... | ... | ... |
1960 | August | 606 |
September | 508 | |
October | 461 | |
November | 390 | |
December | 432 |
144 rows × 1 columns
The new DataFrame is equipped with a MultiIndex consisting of (year, month) tuples:
[90]:
df3.index
[90]:
MultiIndex([(1949, 'January'),
(1949, 'February'),
(1949, 'March'),
(1949, 'April'),
(1949, 'May'),
(1949, 'June'),
(1949, 'July'),
(1949, 'August'),
(1949, 'September'),
(1949, 'October'),
...
(1960, 'March'),
(1960, 'April'),
(1960, 'May'),
(1960, 'June'),
(1960, 'July'),
(1960, 'August'),
(1960, 'September'),
(1960, 'October'),
(1960, 'November'),
(1960, 'December')],
names=['year', 'month'], length=144)
To select a row of the DataFrame we can use loc
with one of these tuples:
[102]:
df3.loc[(1949, "May")]
[102]:
passengers 121
Name: (1949, May), dtype: int64
MultiIndex can be used in several way to select data. For example, MultiIndex slicing works:
[103]:
# select all rows with MultiIndex values between (1950, "November") and (1951, "March")
df3.loc[(1950, "November"):(1951, "March")]
[103]:
passengers | ||
---|---|---|
year | month | |
1950 | November | 114 |
December | 140 | |
1951 | January | 145 |
February | 150 | |
March | 178 |
Stack and unstack¶
Lets bring up again the DataFrame with the (year, month) MultiIndex:
[104]:
df3
[104]:
passengers | ||
---|---|---|
year | month | |
1949 | January | 112 |
February | 118 | |
March | 132 | |
April | 129 | |
May | 121 | |
... | ... | ... |
1960 | August | 606 |
September | 508 | |
October | 461 | |
November | 390 | |
December | 432 |
144 rows × 1 columns
It may be convenient to rearrange this DataFrame, so that years label rows, months label columns, and a value in a given row and column is the number of passangers for the corresponding year and month. Such rearrangement can be accomplished using the unstack()
method. Thie method converts the lowest level of the MultiIndex of rows into the lowest level of the MultiIndex of columns:
[89]:
df4 = df3.unstack()
df4
[89]:
passengers | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
month | January | February | March | April | May | June | July | August | September | October | November | December |
year | ||||||||||||
1949 | 112 | 118 | 132 | 129 | 121 | 135 | 148 | 148 | 136 | 119 | 104 | 118 |
1950 | 115 | 126 | 141 | 135 | 125 | 149 | 170 | 170 | 158 | 133 | 114 | 140 |
1951 | 145 | 150 | 178 | 163 | 172 | 178 | 199 | 199 | 184 | 162 | 146 | 166 |
1952 | 171 | 180 | 193 | 181 | 183 | 218 | 230 | 242 | 209 | 191 | 172 | 194 |
1953 | 196 | 196 | 236 | 235 | 229 | 243 | 264 | 272 | 237 | 211 | 180 | 201 |
1954 | 204 | 188 | 235 | 227 | 234 | 264 | 302 | 293 | 259 | 229 | 203 | 229 |
1955 | 242 | 233 | 267 | 269 | 270 | 315 | 364 | 347 | 312 | 274 | 237 | 278 |
1956 | 284 | 277 | 317 | 313 | 318 | 374 | 413 | 405 | 355 | 306 | 271 | 306 |
1957 | 315 | 301 | 356 | 348 | 355 | 422 | 465 | 467 | 404 | 347 | 305 | 336 |
1958 | 340 | 318 | 362 | 348 | 363 | 435 | 491 | 505 | 404 | 359 | 310 | 337 |
1959 | 360 | 342 | 406 | 396 | 420 | 472 | 548 | 559 | 463 | 407 | 362 | 405 |
1960 | 417 | 391 | 419 | 461 | 472 | 535 | 622 | 606 | 508 | 461 | 390 | 432 |
Notice that the row index now consists of a single level with year values, but the column index is a MultiIndex with two levels:
[105]:
# index of rows
df4.index
[105]:
Int64Index([1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
1960],
dtype='int64', name='year')
[106]:
# MultiIndex of columns
df4.columns
[106]:
MultiIndex([('passengers', 'January'),
('passengers', 'February'),
('passengers', 'March'),
('passengers', 'April'),
('passengers', 'May'),
('passengers', 'June'),
('passengers', 'July'),
('passengers', 'August'),
('passengers', 'September'),
('passengers', 'October'),
('passengers', 'November'),
('passengers', 'December')],
names=[None, 'month'])
To select a column we need to specify both levels of the MultiIndex:
[124]:
df4[("passengers", "January")]
[124]:
year
1949 112
1950 115
1951 145
1952 171
1953 196
1954 204
1955 242
1956 284
1957 315
1958 340
1959 360
1960 417
Name: (passengers, January), dtype: int64
The first level of the column MultiIndex is not very useful, since it has only one value “passengers” for all columns. We can delete it using the droplevel()
method
[128]:
df5 = df4.droplevel(level=0, # level of index to drop, indexing starts with 0
axis = 1) # specifies that we are modifying the column index
df5
[128]:
month | January | February | March | April | May | June | July | August | September | October | November | December |
---|---|---|---|---|---|---|---|---|---|---|---|---|
year | ||||||||||||
1949 | 112 | 118 | 132 | 129 | 121 | 135 | 148 | 148 | 136 | 119 | 104 | 118 |
1950 | 115 | 126 | 141 | 135 | 125 | 149 | 170 | 170 | 158 | 133 | 114 | 140 |
1951 | 145 | 150 | 178 | 163 | 172 | 178 | 199 | 199 | 184 | 162 | 146 | 166 |
1952 | 171 | 180 | 193 | 181 | 183 | 218 | 230 | 242 | 209 | 191 | 172 | 194 |
1953 | 196 | 196 | 236 | 235 | 229 | 243 | 264 | 272 | 237 | 211 | 180 | 201 |
1954 | 204 | 188 | 235 | 227 | 234 | 264 | 302 | 293 | 259 | 229 | 203 | 229 |
1955 | 242 | 233 | 267 | 269 | 270 | 315 | 364 | 347 | 312 | 274 | 237 | 278 |
1956 | 284 | 277 | 317 | 313 | 318 | 374 | 413 | 405 | 355 | 306 | 271 | 306 |
1957 | 315 | 301 | 356 | 348 | 355 | 422 | 465 | 467 | 404 | 347 | 305 | 336 |
1958 | 340 | 318 | 362 | 348 | 363 | 435 | 491 | 505 | 404 | 359 | 310 | 337 |
1959 | 360 | 342 | 406 | 396 | 420 | 472 | 548 | 559 | 463 | 407 | 362 | 405 |
1960 | 417 | 391 | 419 | 461 | 472 | 535 | 622 | 606 | 508 | 461 | 390 | 432 |
Now the column index has only one level, consisting of month names:
[108]:
df5.columns
[108]:
CategoricalIndex(['January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November',
'December'],
categories=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', ...], ordered=False, name='month', dtype='category')
[109]:
# select a column
df5["May"]
[109]:
year
1949 121
1950 125
1951 172
1952 183
1953 229
1954 234
1955 270
1956 318
1957 355
1958 363
1959 420
1960 472
Name: May, dtype: int64
The atack()
method is the inverse of unstack()
: it converts the lowest level of the column index into the lowest level of the row index:
[110]:
df6 = df5.stack()
df6
[110]:
year month
1949 January 112
February 118
March 132
April 129
May 121
...
1960 August 606
September 508
October 461
November 390
December 432
Length: 144, dtype: int64
df6
is a Series with a two level MultiIndex consisting of year and row values:
[112]:
df6.index
[112]:
MultiIndex([(1949, 'January'),
(1949, 'February'),
(1949, 'March'),
(1949, 'April'),
(1949, 'May'),
(1949, 'June'),
(1949, 'July'),
(1949, 'August'),
(1949, 'September'),
(1949, 'October'),
...
(1960, 'March'),
(1960, 'April'),
(1960, 'May'),
(1960, 'June'),
(1960, 'July'),
(1960, 'August'),
(1960, 'September'),
(1960, 'October'),
(1960, 'November'),
(1960, 'December')],
names=['year', 'month'], length=144)
[113]:
# select a row of the series
df6[(1949, "March")]
[113]:
132
Using the reset_index()
method we can convert the Multiindex into colums, one column for each MultiIndex level:
[114]:
df7 = df6.reset_index()
df7
[114]:
year | month | 0 | |
---|---|---|---|
0 | 1949 | January | 112 |
1 | 1949 | February | 118 |
2 | 1949 | March | 132 |
3 | 1949 | April | 129 |
4 | 1949 | May | 121 |
... | ... | ... | ... |
139 | 1960 | August | 606 |
140 | 1960 | September | 508 |
141 | 1960 | October | 461 |
142 | 1960 | November | 390 |
143 | 1960 | December | 432 |
144 rows × 3 columns
The DataFrame we obtained is almost the same as the DataFrame we started with. The only difference is that the passenger column is now called 0. We can use the rename
method to change the name of this column. This method takes as its argument a dictionary with keys given by names of columns (or rows) we want to rename, and values given by new names:
[116]:
df8 = df7.rename({0 : "passengers"}, # replace the label 0 in the index by "passenger"
axis=1) # this specifies that we are changing the column index
df8
[116]:
year | month | passengers | |
---|---|---|---|
0 | 1949 | January | 112 |
1 | 1949 | February | 118 |
2 | 1949 | March | 132 |
3 | 1949 | April | 129 |
4 | 1949 | May | 121 |
... | ... | ... | ... |
139 | 1960 | August | 606 |
140 | 1960 | September | 508 |
141 | 1960 | October | 461 |
142 | 1960 | November | 390 |
143 | 1960 | December | 432 |
144 rows × 3 columns