Combining DataFrames¶
It often happens that data we are working with is split into several DataFrames or Series. Pandas provides tools for combining such data in various ways. We show how to use them below.
pd.concat¶
Lets say that we have two DataFrames with exam scores in some college course. Each DataFrame contains scores of students enrolled in a different section of the course:
[1]:
import pandas as pd
import numpy as np
names = ["Ava", "Benjamin", "Charlotte", "Daniel",
"Emma", "Fredric", "Gianna", ]
scores1 = np.random.randint(0, 100, 12).reshape(4, 3)
scores2 = np.random.randint(0, 100, 9).reshape(3, 3)
columns = ["problem_1", "problem_2", "problem_3"]
sec1 = pd.DataFrame(scores1, index = names[:4], columns = columns)
sec2 = pd.DataFrame(scores2, index = names[4:7], columns = columns)
print("\nSection 1:")
display(sec1)
print("\nSection 2:")
display(sec2)
Section 1:
problem_1 | problem_2 | problem_3 | |
---|---|---|---|
Ava | 2 | 42 | 34 |
Benjamin | 28 | 4 | 95 |
Charlotte | 42 | 21 | 66 |
Daniel | 44 | 41 | 59 |
Section 2:
problem_1 | problem_2 | problem_3 | |
---|---|---|---|
Emma | 10 | 26 | 96 |
Fredric | 42 | 85 | 31 |
Gianna | 29 | 57 | 84 |
To combine these two DataFrames into one, we can use pandas concat()
function with the list of DataFrames as the argument:
[2]:
pd.concat([sec1, sec2])
[2]:
problem_1 | problem_2 | problem_3 | |
---|---|---|---|
Ava | 2 | 42 | 34 |
Benjamin | 28 | 4 | 95 |
Charlotte | 42 | 21 | 66 |
Daniel | 44 | 41 | 59 |
Emma | 10 | 26 | 96 |
Fredric | 42 | 85 | 31 |
Gianna | 29 | 57 | 84 |
In the combined DataFrame it can be useful to include labels indicating which rows came from which source DataFrame. We can do it by passing a list of labels to the concat()
function. The resulting DataFrame will be equipped with a Multiindex containing the labels:
[3]:
pd.concat([sec1, sec2], # list of DataFrames
keys=["section_1", "section_2"]) # list of labels
[3]:
problem_1 | problem_2 | problem_3 | ||
---|---|---|---|---|
section_1 | Ava | 2 | 42 | 34 |
Benjamin | 28 | 4 | 95 | |
Charlotte | 42 | 21 | 66 | |
Daniel | 44 | 41 | 59 | |
section_2 | Emma | 10 | 26 | 96 |
Fredric | 42 | 85 | 31 | |
Gianna | 29 | 57 | 84 |
So far we combined DataFrames which had the same columns, but this is not required for concatenation. Assume for example, that the last exam problem was different in each course section, and the score tables look as follows:
[4]:
sec1.rename({"problem_3" : "problem_3a"}, axis=1, inplace=True)
sec2.rename({"problem_3" : "problem_3b"}, axis=1, inplace=True)
print("\nSection 1:")
display(sec1)
print("\nSection 2:")
display(sec2)
Section 1:
problem_1 | problem_2 | problem_3a | |
---|---|---|---|
Ava | 2 | 42 | 34 |
Benjamin | 28 | 4 | 95 |
Charlotte | 42 | 21 | 66 |
Daniel | 44 | 41 | 59 |
Section 2:
problem_1 | problem_2 | problem_3b | |
---|---|---|---|
Emma | 10 | 26 | 96 |
Fredric | 42 | 85 | 31 |
Gianna | 29 | 57 | 84 |
Concatenation can be still performed, but the resulting DataFrame will contain NaN
values in the spots where there is no meaningful data:
[5]:
pd.concat([sec1, sec2], keys=["section_1", "section_2"])
[5]:
problem_1 | problem_2 | problem_3a | problem_3b | ||
---|---|---|---|---|---|
section_1 | Ava | 2 | 42 | 34.0 | NaN |
Benjamin | 28 | 4 | 95.0 | NaN | |
Charlotte | 42 | 21 | 66.0 | NaN | |
Daniel | 44 | 41 | 59.0 | NaN | |
section_2 | Emma | 10 | 26 | NaN | 96.0 |
Fredric | 42 | 85 | NaN | 31.0 | |
Gianna | 29 | 57 | NaN | 84.0 |
By default, the concatenated DataFrame contains all columns which appear in at least one of the source DataFrames. To change this, we can use concat()
with the argument join="inner"
. This will produce a DataFrame containing only columns that exist in each source DataFrame:
[6]:
pd.concat([sec1, sec2], keys=["section_1", "section_2"], join="inner")
[6]:
problem_1 | problem_2 | ||
---|---|---|---|
section_1 | Ava | 2 | 42 |
Benjamin | 28 | 4 | |
Charlotte | 42 | 21 | |
Daniel | 44 | 41 | |
section_2 | Emma | 10 | 26 |
Fredric | 42 | 85 | |
Gianna | 29 | 57 |
In all examples above we concatenated DataFrames along their 0-axes, i.e. along their rows. We can also use concat()
with axis=1
argument to combine DataFrames along their columns. For instance, assume that an exam was given in two parts, and we have separate DataFrames with scores from part 1 and part 2:
[7]:
scores1 = np.random.randint(0, 100, 8).reshape(4, 2)
scores2 = np.random.randint(0, 100, 9).reshape(3, 3)
part1 = pd.DataFrame(scores1, index = names[:4], columns = ["problem_1", "problem_2"])
part2 = pd.DataFrame(scores2, index = names[:3], columns = ["problem_3", "problem_4", "problem_5"])
print("\npart 1:")
display(part1)
print("\npart 2:")
display(part2)
part 1:
problem_1 | problem_2 | |
---|---|---|
Ava | 10 | 5 |
Benjamin | 63 | 77 |
Charlotte | 18 | 24 |
Daniel | 29 | 77 |
part 2:
problem_3 | problem_4 | problem_5 | |
---|---|---|---|
Ava | 90 | 57 | 30 |
Benjamin | 4 | 90 | 7 |
Charlotte | 67 | 16 | 34 |
We can combine these DataFrames as follows:
[8]:
pd.concat([part1, part2], axis=1)
[8]:
problem_1 | problem_2 | problem_3 | problem_4 | problem_5 | |
---|---|---|---|---|---|
Ava | 10 | 5 | 90.0 | 57.0 | 30.0 |
Benjamin | 63 | 77 | 4.0 | 90.0 | 7.0 |
Charlotte | 18 | 24 | 67.0 | 16.0 | 34.0 |
Daniel | 29 | 77 | NaN | NaN | NaN |
pd.merge¶
Pandas merge()
function is another tool for combining two DataFrames. To see how it works, lets assume that we have two DataFrames, one of which gives names of instructors teaching math courses, and the second which lists names and office numbers of instructors:
[25]:
names = ["Ava", "Benjamin", "Charlotte", "Daniel", "Emma", "Fredric"]
courses = ["MTH 141", "MTH 142", "MTH 241", "MTH 306", "MTH 309", "MTH 311"]
rooms = ["NSC 216", "Capen 110", "Park 440"]
courses = pd.DataFrame({"course" : courses,
"instructor": np.random.choice(names[1:], len(courses))})
instructors = pd.DataFrame({"name" : names[:-2],
"office" : np.random.randint(100,150, len(names[:-2]))},
dtype="object")
print("\ncourses:")
display(courses)
print("\ninstructors:")
display(instructors)
courses:
course | instructor | |
---|---|---|
0 | MTH 141 | Benjamin |
1 | MTH 142 | Benjamin |
2 | MTH 241 | Daniel |
3 | MTH 306 | Fredric |
4 | MTH 309 | Charlotte |
5 | MTH 311 | Emma |
instructors:
name | office | |
---|---|---|
0 | Ava | 146 |
1 | Benjamin | 124 |
2 | Charlotte | 140 |
3 | Daniel | 133 |
Notice that not all instructors listed in the first table appear in the second table, and vice versa.
Inner join¶
Lets say that we want to create a DataFrame with course name, instructor name, and instructor office number. Using the merge()
function this can be done as follows:
[26]:
inner_merge = pd.merge(courses, instructors,
how="inner",
left_on="instructor",
right_on="name")
inner_merge
[26]:
course | instructor | name | office | |
---|---|---|---|---|
0 | MTH 141 | Benjamin | Benjamin | 124 |
1 | MTH 142 | Benjamin | Benjamin | 124 |
2 | MTH 241 | Daniel | Daniel | 133 |
3 | MTH 309 | Charlotte | Charlotte | 140 |
The first two arguments are DataFrames being merged. We will call the first of them the left DataFrame, and the second the right DataFrame.
The arguments
left_on="instructor"
,right_on="name"
mean that we are combining rows by matching values of the “instructor” column of the left DataFrame and the “name” column of the right DataFrame.The argument
how="inner"
indicates that we are computing an inner join of the DataFrames: rows will be combined only if their values in the “instructor” and “name” columns can be matched.
The merged DataFrame has “instructor” and “name” columns which are identical. We can drop one of them:
[27]:
inner_merge.drop("instructor", # name of the column to drop
axis=1) # indicates that we are dropping a column, not a row
[27]:
course | name | office | |
---|---|---|---|
0 | MTH 141 | Benjamin | 124 |
1 | MTH 142 | Benjamin | 124 |
2 | MTH 241 | Daniel | 133 |
3 | MTH 309 | Charlotte | 140 |
The resulting table lists only these instructors, whose names appear both in the courses DataFrame and in the instructors DataFrame.
Left join¶
Changing the value of the how
argument changes the way DataFrames are merged. Setting how="left"
specifies that we want to get the left join: every row of the left DataFrame will be combined either with a row of the right DataFrame, if it can be matched, or with NaN
values if a match does not exist:
[28]:
left_merge = pd.merge(courses, instructors,
how="left",
left_on="instructor",
right_on="name")
left_merge
[28]:
course | instructor | name | office | |
---|---|---|---|---|
0 | MTH 141 | Benjamin | Benjamin | 124 |
1 | MTH 142 | Benjamin | Benjamin | 124 |
2 | MTH 241 | Daniel | Daniel | 133 |
3 | MTH 306 | Fredric | NaN | NaN |
4 | MTH 309 | Charlotte | Charlotte | 140 |
5 | MTH 311 | Emma | NaN | NaN |
As a result we obtain a table where all courses and their instructors are listed, and NaN
values indicate for which instructors there is no information about their office numbers.
Right join¶
The right join (how="right"
) is symmetrical to the left join: every row of the right DataFrame will be matched with either a row of the left DataFrame, if a match exists, or with NaN
values otherwise:
[29]:
right_merge = pd.merge(courses, instructors,
how="right",
left_on="instructor",
right_on="name")
right_merge
[29]:
course | instructor | name | office | |
---|---|---|---|---|
0 | MTH 141 | Benjamin | Benjamin | 124 |
1 | MTH 142 | Benjamin | Benjamin | 124 |
2 | MTH 241 | Daniel | Daniel | 133 |
3 | MTH 309 | Charlotte | Charlotte | 140 |
4 | NaN | NaN | Ava | 146 |
This time we obtain a DataFrame with names of all instructors which have an office number listed, with NaN
values indicating instructors who do not teach any course.
Outer join¶
The last option is the outer join (how="outer"
). In this case, rows will be matched when a match exists. If a row of either DataFrame does not have a match, it will be matched with NaN
values:
[30]:
outer_merge = pd.merge(courses, instructors,
how="outer",
left_on="instructor",
right_on="name")
outer_merge
[30]:
course | instructor | name | office | |
---|---|---|---|---|
0 | MTH 141 | Benjamin | Benjamin | 124 |
1 | MTH 142 | Benjamin | Benjamin | 124 |
2 | MTH 241 | Daniel | Daniel | 133 |
3 | MTH 306 | Fredric | NaN | NaN |
4 | MTH 309 | Charlotte | Charlotte | 140 |
5 | MTH 311 | Emma | NaN | NaN |
6 | NaN | NaN | Ava | 146 |
In the table above the “instructor” column has NaN
values for instructors who do not teach, and the “name” column has such values for instructors which do not have assigned offices. We can create a Series that replaces NaN
values in the first column by the corresponding values of the second column as follows:
[43]:
# if a value in the "instructor" column is NaN
# replace it by the corresponding value of the "name" column
outer_merge["instructor"].fillna(outer_merge["name"])
[43]:
0 Benjamin
1 Benjamin
2 Daniel
3 Fredric
4 Charlotte
5 Emma
6 Ava
Name: instructor, dtype: object
We can use this to eliminate NaN
values in the “instructor” column:
[44]:
outer_merge["instructor"] = outer_merge["instructor"].combine_first(outer_merge["name"])
outer_merge
[44]:
course | instructor | name | office | |
---|---|---|---|---|
0 | MTH 141 | Benjamin | Benjamin | 124 |
1 | MTH 142 | Benjamin | Benjamin | 124 |
2 | MTH 241 | Daniel | Daniel | 133 |
3 | MTH 306 | Fredric | NaN | NaN |
4 | MTH 309 | Charlotte | Charlotte | 140 |
5 | MTH 311 | Emma | NaN | NaN |
6 | NaN | Ava | Ava | 146 |
The “name” column is no longer needed, so we can drop it:
[45]:
outer_merge.drop("name", axis=1)
[45]:
course | instructor | office | |
---|---|---|---|
0 | MTH 141 | Benjamin | 124 |
1 | MTH 142 | Benjamin | 124 |
2 | MTH 241 | Daniel | 133 |
3 | MTH 306 | Fredric | NaN |
4 | MTH 309 | Charlotte | 140 |
5 | MTH 311 | Emma | NaN |
6 | NaN | Ava | 146 |
The remaining NaN
values indicate that an instructor either does not teach a course or does not have an assigned office.