Combining DataFrames¶

It often happens that data we are working with is split into several DataFrames or Series. Pandas provides tools for combining such data in various ways. We show how to use them below.

pd.concat¶

Lets say that we have two DataFrames with exam scores in some college course. Each DataFrame contains scores of students enrolled in a different section of the course:

[1]:

import pandas as pd
import numpy as np

names = ["Ava", "Benjamin", "Charlotte", "Daniel",
         "Emma", "Fredric", "Gianna", ]

scores1 = np.random.randint(0, 100, 12).reshape(4, 3)
scores2 = np.random.randint(0, 100, 9).reshape(3, 3)

columns  = ["problem_1", "problem_2", "problem_3"]
sec1 = pd.DataFrame(scores1, index = names[:4], columns = columns)
sec2 = pd.DataFrame(scores2, index = names[4:7], columns = columns)

print("\nSection 1:")
display(sec1)
print("\nSection 2:")
display(sec2)


Section 1:

	problem_1	problem_2	problem_3
Ava	2	42	34
Benjamin	28	4	95
Charlotte	42	21	66
Daniel	44	41	59


Section 2:

	problem_1	problem_2	problem_3
Emma	10	26	96
Fredric	42	85	31
Gianna	29	57	84

To combine these two DataFrames into one, we can use pandas concat() function with the list of DataFrames as the argument:

[2]:

pd.concat([sec1, sec2])

[2]:

	problem_1	problem_2	problem_3
Ava	2	42	34
Benjamin	28	4	95
Charlotte	42	21	66
Daniel	44	41	59
Emma	10	26	96
Fredric	42	85	31
Gianna	29	57	84

In the combined DataFrame it can be useful to include labels indicating which rows came from which source DataFrame. We can do it by passing a list of labels to the concat() function. The resulting DataFrame will be equipped with a Multiindex containing the labels:

[3]:

pd.concat([sec1, sec2],                     # list of DataFrames
          keys=["section_1", "section_2"])  # list of labels

[3]:

		problem_1	problem_2	problem_3
section_1	Ava	2	42	34
	Benjamin	28	4	95
	Charlotte	42	21	66
	Daniel	44	41	59
section_2	Emma	10	26	96
	Fredric	42	85	31
	Gianna	29	57	84

So far we combined DataFrames which had the same columns, but this is not required for concatenation. Assume for example, that the last exam problem was different in each course section, and the score tables look as follows:

[4]:

sec1.rename({"problem_3" : "problem_3a"}, axis=1, inplace=True)
sec2.rename({"problem_3" : "problem_3b"}, axis=1, inplace=True)

print("\nSection 1:")
display(sec1)
print("\nSection 2:")
display(sec2)


Section 1:

	problem_1	problem_2	problem_3a
Ava	2	42	34
Benjamin	28	4	95
Charlotte	42	21	66
Daniel	44	41	59


Section 2:

	problem_1	problem_2	problem_3b
Emma	10	26	96
Fredric	42	85	31
Gianna	29	57	84

Concatenation can be still performed, but the resulting DataFrame will contain NaN values in the spots where there is no meaningful data:

[5]:

pd.concat([sec1, sec2], keys=["section_1", "section_2"])

[5]:

		problem_1	problem_2	problem_3a	problem_3b
section_1	Ava	2	42	34.0	NaN
	Benjamin	28	4	95.0	NaN
	Charlotte	42	21	66.0	NaN
	Daniel	44	41	59.0	NaN
section_2	Emma	10	26	NaN	96.0
	Fredric	42	85	NaN	31.0
	Gianna	29	57	NaN	84.0

By default, the concatenated DataFrame contains all columns which appear in at least one of the source DataFrames. To change this, we can use concat() with the argument join="inner". This will produce a DataFrame containing only columns that exist in each source DataFrame:

[6]:

pd.concat([sec1, sec2], keys=["section_1", "section_2"], join="inner")

[6]:

		problem_1	problem_2
section_1	Ava	2	42
	Benjamin	28	4
	Charlotte	42	21
	Daniel	44	41
section_2	Emma	10	26
	Fredric	42	85
	Gianna	29	57

In all examples above we concatenated DataFrames along their 0-axes, i.e. along their rows. We can also use concat() with axis=1 argument to combine DataFrames along their columns. For instance, assume that an exam was given in two parts, and we have separate DataFrames with scores from part 1 and part 2:

[7]:

scores1 = np.random.randint(0, 100, 8).reshape(4, 2)
scores2 = np.random.randint(0, 100, 9).reshape(3, 3)

part1 = pd.DataFrame(scores1, index = names[:4], columns = ["problem_1", "problem_2"])
part2 = pd.DataFrame(scores2, index = names[:3], columns = ["problem_3", "problem_4", "problem_5"])

print("\npart 1:")
display(part1)
print("\npart 2:")
display(part2)


part 1:

	problem_1	problem_2
Ava	10	5
Benjamin	63	77
Charlotte	18	24
Daniel	29	77


part 2:

	problem_3	problem_4	problem_5
Ava	90	57	30
Benjamin	4	90	7
Charlotte	67	16	34

We can combine these DataFrames as follows:

[8]:

pd.concat([part1, part2], axis=1)

[8]:

	problem_1	problem_2	problem_3	problem_4	problem_5
Ava	10	5	90.0	57.0	30.0
Benjamin	63	77	4.0	90.0	7.0
Charlotte	18	24	67.0	16.0	34.0
Daniel	29	77	NaN	NaN	NaN

pd.merge¶

Pandas merge() function is another tool for combining two DataFrames. To see how it works, lets assume that we have two DataFrames, one of which gives names of instructors teaching math courses, and the second which lists names and office numbers of instructors:

[25]:

names = ["Ava", "Benjamin", "Charlotte", "Daniel", "Emma", "Fredric"]
courses = ["MTH 141", "MTH 142", "MTH 241", "MTH 306", "MTH 309", "MTH 311"]
rooms = ["NSC 216", "Capen 110", "Park 440"]

courses = pd.DataFrame({"course" : courses,
                            "instructor": np.random.choice(names[1:], len(courses))})

instructors = pd.DataFrame({"name" : names[:-2],
                          "office" : np.random.randint(100,150, len(names[:-2]))},
                          dtype="object")


print("\ncourses:")
display(courses)
print("\ninstructors:")
display(instructors)


courses:

	course	instructor
0	MTH 141	Benjamin
1	MTH 142	Benjamin
2	MTH 241	Daniel
3	MTH 306	Fredric
4	MTH 309	Charlotte
5	MTH 311	Emma


instructors:

	name	office
0	Ava	146
1	Benjamin	124
2	Charlotte	140
3	Daniel	133

Notice that not all instructors listed in the first table appear in the second table, and vice versa.

Inner join¶

Lets say that we want to create a DataFrame with course name, instructor name, and instructor office number. Using the merge() function this can be done as follows:

[26]:

inner_merge = pd.merge(courses, instructors,
                       how="inner",
                       left_on="instructor",
                       right_on="name")
inner_merge

[26]:

	course	instructor	name	office
0	MTH 141	Benjamin	Benjamin	124
1	MTH 142	Benjamin	Benjamin	124
2	MTH 241	Daniel	Daniel	133
3	MTH 309	Charlotte	Charlotte	140

The first two arguments are DataFrames being merged. We will call the first of them the left DataFrame, and the second the right DataFrame.
The arguments left_on="instructor", right_on="name" mean that we are combining rows by matching values of the “instructor” column of the left DataFrame and the “name” column of the right DataFrame.
The argument how="inner" indicates that we are computing an inner join of the DataFrames: rows will be combined only if their values in the “instructor” and “name” columns can be matched.

The merged DataFrame has “instructor” and “name” columns which are identical. We can drop one of them:

[27]:

inner_merge.drop("instructor",  # name of the column to drop
                 axis=1)        # indicates that we are dropping a column, not a row

[27]:

	course	name	office
0	MTH 141	Benjamin	124
1	MTH 142	Benjamin	124
2	MTH 241	Daniel	133
3	MTH 309	Charlotte	140

The resulting table lists only these instructors, whose names appear both in the courses DataFrame and in the instructors DataFrame.

Left join¶

Changing the value of the how argument changes the way DataFrames are merged. Setting how="left" specifies that we want to get the left join: every row of the left DataFrame will be combined either with a row of the right DataFrame, if it can be matched, or with NaN values if a match does not exist:

[28]:

left_merge = pd.merge(courses, instructors,
                      how="left",
                      left_on="instructor",
                      right_on="name")
left_merge

[28]:

	course	instructor	name	office
0	MTH 141	Benjamin	Benjamin	124
1	MTH 142	Benjamin	Benjamin	124
2	MTH 241	Daniel	Daniel	133
3	MTH 306	Fredric	NaN	NaN
4	MTH 309	Charlotte	Charlotte	140
5	MTH 311	Emma	NaN	NaN

As a result we obtain a table where all courses and their instructors are listed, and NaN values indicate for which instructors there is no information about their office numbers.

Right join¶

The right join (how="right") is symmetrical to the left join: every row of the right DataFrame will be matched with either a row of the left DataFrame, if a match exists, or with NaN values otherwise:

[29]:

right_merge = pd.merge(courses, instructors,
                       how="right",
                       left_on="instructor",
                       right_on="name")
right_merge

[29]:

	course	instructor	name	office
0	MTH 141	Benjamin	Benjamin	124
1	MTH 142	Benjamin	Benjamin	124
2	MTH 241	Daniel	Daniel	133
3	MTH 309	Charlotte	Charlotte	140
4	NaN	NaN	Ava	146

This time we obtain a DataFrame with names of all instructors which have an office number listed, with NaN values indicating instructors who do not teach any course.

Outer join¶

The last option is the outer join (how="outer"). In this case, rows will be matched when a match exists. If a row of either DataFrame does not have a match, it will be matched with NaN values:

[30]:

outer_merge = pd.merge(courses, instructors,
                       how="outer",
                       left_on="instructor",
                       right_on="name")
outer_merge

[30]:

	course	instructor	name	office
0	MTH 141	Benjamin	Benjamin	124
1	MTH 142	Benjamin	Benjamin	124
2	MTH 241	Daniel	Daniel	133
3	MTH 306	Fredric	NaN	NaN
4	MTH 309	Charlotte	Charlotte	140
5	MTH 311	Emma	NaN	NaN
6	NaN	NaN	Ava	146

In the table above the “instructor” column has NaN values for instructors who do not teach, and the “name” column has such values for instructors which do not have assigned offices. We can create a Series that replaces NaN values in the first column by the corresponding values of the second column as follows:

[43]:

# if a value in the "instructor" column is NaN
# replace it by the corresponding value of the "name" column
outer_merge["instructor"].fillna(outer_merge["name"])

[43]:

0     Benjamin
1     Benjamin
2       Daniel
3      Fredric
4    Charlotte
5         Emma
6          Ava
Name: instructor, dtype: object

We can use this to eliminate NaN values in the “instructor” column:

[44]:

outer_merge["instructor"] = outer_merge["instructor"].combine_first(outer_merge["name"])
outer_merge

[44]:

	course	instructor	name	office
0	MTH 141	Benjamin	Benjamin	124
1	MTH 142	Benjamin	Benjamin	124
2	MTH 241	Daniel	Daniel	133
3	MTH 306	Fredric	NaN	NaN
4	MTH 309	Charlotte	Charlotte	140
5	MTH 311	Emma	NaN	NaN
6	NaN	Ava	Ava	146

The “name” column is no longer needed, so we can drop it:

[45]:

outer_merge.drop("name", axis=1)

[45]:

	course	instructor	office
0	MTH 141	Benjamin	124
1	MTH 142	Benjamin	124
2	MTH 241	Daniel	133
3	MTH 306	Fredric	NaN
4	MTH 309	Charlotte	140
5	MTH 311	Emma	NaN
6	NaN	Ava	146

The remaining NaN values indicate that an instructor either does not teach a course or does not have an assigned office.