Combining DataFrames

It often happens that data we are working with is split into several DataFrames or Series. Pandas provides tools for combining such data in various ways. We show how to use them below.

pd.concat

Lets say that we have two DataFrames with exam scores in some college course. Each DataFrame contains scores of students enrolled in a different section of the course:

[1]:
import pandas as pd
import numpy as np

names = ["Ava", "Benjamin", "Charlotte", "Daniel",
         "Emma", "Fredric", "Gianna", ]

scores1 = np.random.randint(0, 100, 12).reshape(4, 3)
scores2 = np.random.randint(0, 100, 9).reshape(3, 3)

columns  = ["problem_1", "problem_2", "problem_3"]
sec1 = pd.DataFrame(scores1, index = names[:4], columns = columns)
sec2 = pd.DataFrame(scores2, index = names[4:7], columns = columns)

print("\nSection 1:")
display(sec1)
print("\nSection 2:")
display(sec2)

Section 1:
problem_1 problem_2 problem_3
Ava 2 42 34
Benjamin 28 4 95
Charlotte 42 21 66
Daniel 44 41 59

Section 2:
problem_1 problem_2 problem_3
Emma 10 26 96
Fredric 42 85 31
Gianna 29 57 84

To combine these two DataFrames into one, we can use pandas concat() function with the list of DataFrames as the argument:

[2]:
pd.concat([sec1, sec2])
[2]:
problem_1 problem_2 problem_3
Ava 2 42 34
Benjamin 28 4 95
Charlotte 42 21 66
Daniel 44 41 59
Emma 10 26 96
Fredric 42 85 31
Gianna 29 57 84

In the combined DataFrame it can be useful to include labels indicating which rows came from which source DataFrame. We can do it by passing a list of labels to the concat() function. The resulting DataFrame will be equipped with a Multiindex containing the labels:

[3]:
pd.concat([sec1, sec2],                     # list of DataFrames
          keys=["section_1", "section_2"])  # list of labels
[3]:
problem_1 problem_2 problem_3
section_1 Ava 2 42 34
Benjamin 28 4 95
Charlotte 42 21 66
Daniel 44 41 59
section_2 Emma 10 26 96
Fredric 42 85 31
Gianna 29 57 84

So far we combined DataFrames which had the same columns, but this is not required for concatenation. Assume for example, that the last exam problem was different in each course section, and the score tables look as follows:

[4]:
sec1.rename({"problem_3" : "problem_3a"}, axis=1, inplace=True)
sec2.rename({"problem_3" : "problem_3b"}, axis=1, inplace=True)

print("\nSection 1:")
display(sec1)
print("\nSection 2:")
display(sec2)

Section 1:
problem_1 problem_2 problem_3a
Ava 2 42 34
Benjamin 28 4 95
Charlotte 42 21 66
Daniel 44 41 59

Section 2:
problem_1 problem_2 problem_3b
Emma 10 26 96
Fredric 42 85 31
Gianna 29 57 84

Concatenation can be still performed, but the resulting DataFrame will contain NaN values in the spots where there is no meaningful data:

[5]:
pd.concat([sec1, sec2], keys=["section_1", "section_2"])
[5]:
problem_1 problem_2 problem_3a problem_3b
section_1 Ava 2 42 34.0 NaN
Benjamin 28 4 95.0 NaN
Charlotte 42 21 66.0 NaN
Daniel 44 41 59.0 NaN
section_2 Emma 10 26 NaN 96.0
Fredric 42 85 NaN 31.0
Gianna 29 57 NaN 84.0

By default, the concatenated DataFrame contains all columns which appear in at least one of the source DataFrames. To change this, we can use concat() with the argument join="inner". This will produce a DataFrame containing only columns that exist in each source DataFrame:

[6]:
pd.concat([sec1, sec2], keys=["section_1", "section_2"], join="inner")
[6]:
problem_1 problem_2
section_1 Ava 2 42
Benjamin 28 4
Charlotte 42 21
Daniel 44 41
section_2 Emma 10 26
Fredric 42 85
Gianna 29 57

In all examples above we concatenated DataFrames along their 0-axes, i.e. along their rows. We can also use concat() with axis=1 argument to combine DataFrames along their columns. For instance, assume that an exam was given in two parts, and we have separate DataFrames with scores from part 1 and part 2:

[7]:
scores1 = np.random.randint(0, 100, 8).reshape(4, 2)
scores2 = np.random.randint(0, 100, 9).reshape(3, 3)

part1 = pd.DataFrame(scores1, index = names[:4], columns = ["problem_1", "problem_2"])
part2 = pd.DataFrame(scores2, index = names[:3], columns = ["problem_3", "problem_4", "problem_5"])

print("\npart 1:")
display(part1)
print("\npart 2:")
display(part2)

part 1:
problem_1 problem_2
Ava 10 5
Benjamin 63 77
Charlotte 18 24
Daniel 29 77

part 2:
problem_3 problem_4 problem_5
Ava 90 57 30
Benjamin 4 90 7
Charlotte 67 16 34

We can combine these DataFrames as follows:

[8]:
pd.concat([part1, part2], axis=1)
[8]:
problem_1 problem_2 problem_3 problem_4 problem_5
Ava 10 5 90.0 57.0 30.0
Benjamin 63 77 4.0 90.0 7.0
Charlotte 18 24 67.0 16.0 34.0
Daniel 29 77 NaN NaN NaN

pd.merge

Pandas merge() function is another tool for combining two DataFrames. To see how it works, lets assume that we have two DataFrames, one of which gives names of instructors teaching math courses, and the second which lists names and office numbers of instructors:

[25]:
names = ["Ava", "Benjamin", "Charlotte", "Daniel", "Emma", "Fredric"]
courses = ["MTH 141", "MTH 142", "MTH 241", "MTH 306", "MTH 309", "MTH 311"]
rooms = ["NSC 216", "Capen 110", "Park 440"]

courses = pd.DataFrame({"course" : courses,
                            "instructor": np.random.choice(names[1:], len(courses))})

instructors = pd.DataFrame({"name" : names[:-2],
                          "office" : np.random.randint(100,150, len(names[:-2]))},
                          dtype="object")


print("\ncourses:")
display(courses)
print("\ninstructors:")
display(instructors)

courses:
course instructor
0 MTH 141 Benjamin
1 MTH 142 Benjamin
2 MTH 241 Daniel
3 MTH 306 Fredric
4 MTH 309 Charlotte
5 MTH 311 Emma

instructors:
name office
0 Ava 146
1 Benjamin 124
2 Charlotte 140
3 Daniel 133

Notice that not all instructors listed in the first table appear in the second table, and vice versa.

Inner join

Lets say that we want to create a DataFrame with course name, instructor name, and instructor office number. Using the merge() function this can be done as follows:

[26]:
inner_merge = pd.merge(courses, instructors,
                       how="inner",
                       left_on="instructor",
                       right_on="name")
inner_merge
[26]:
course instructor name office
0 MTH 141 Benjamin Benjamin 124
1 MTH 142 Benjamin Benjamin 124
2 MTH 241 Daniel Daniel 133
3 MTH 309 Charlotte Charlotte 140
  • The first two arguments are DataFrames being merged. We will call the first of them the left DataFrame, and the second the right DataFrame.

  • The arguments left_on="instructor", right_on="name" mean that we are combining rows by matching values of the “instructor” column of the left DataFrame and the “name” column of the right DataFrame.

  • The argument how="inner" indicates that we are computing an inner join of the DataFrames: rows will be combined only if their values in the “instructor” and “name” columns can be matched.

The merged DataFrame has “instructor” and “name” columns which are identical. We can drop one of them:

[27]:
inner_merge.drop("instructor",  # name of the column to drop
                 axis=1)        # indicates that we are dropping a column, not a row
[27]:
course name office
0 MTH 141 Benjamin 124
1 MTH 142 Benjamin 124
2 MTH 241 Daniel 133
3 MTH 309 Charlotte 140

The resulting table lists only these instructors, whose names appear both in the courses DataFrame and in the instructors DataFrame.

Left join

Changing the value of the how argument changes the way DataFrames are merged. Setting how="left" specifies that we want to get the left join: every row of the left DataFrame will be combined either with a row of the right DataFrame, if it can be matched, or with NaN values if a match does not exist:

[28]:
left_merge = pd.merge(courses, instructors,
                      how="left",
                      left_on="instructor",
                      right_on="name")
left_merge
[28]:
course instructor name office
0 MTH 141 Benjamin Benjamin 124
1 MTH 142 Benjamin Benjamin 124
2 MTH 241 Daniel Daniel 133
3 MTH 306 Fredric NaN NaN
4 MTH 309 Charlotte Charlotte 140
5 MTH 311 Emma NaN NaN

As a result we obtain a table where all courses and their instructors are listed, and NaN values indicate for which instructors there is no information about their office numbers.

Right join

The right join (how="right") is symmetrical to the left join: every row of the right DataFrame will be matched with either a row of the left DataFrame, if a match exists, or with NaN values otherwise:

[29]:
right_merge = pd.merge(courses, instructors,
                       how="right",
                       left_on="instructor",
                       right_on="name")
right_merge
[29]:
course instructor name office
0 MTH 141 Benjamin Benjamin 124
1 MTH 142 Benjamin Benjamin 124
2 MTH 241 Daniel Daniel 133
3 MTH 309 Charlotte Charlotte 140
4 NaN NaN Ava 146

This time we obtain a DataFrame with names of all instructors which have an office number listed, with NaN values indicating instructors who do not teach any course.

Outer join

The last option is the outer join (how="outer"). In this case, rows will be matched when a match exists. If a row of either DataFrame does not have a match, it will be matched with NaN values:

[30]:
outer_merge = pd.merge(courses, instructors,
                       how="outer",
                       left_on="instructor",
                       right_on="name")
outer_merge
[30]:
course instructor name office
0 MTH 141 Benjamin Benjamin 124
1 MTH 142 Benjamin Benjamin 124
2 MTH 241 Daniel Daniel 133
3 MTH 306 Fredric NaN NaN
4 MTH 309 Charlotte Charlotte 140
5 MTH 311 Emma NaN NaN
6 NaN NaN Ava 146

In the table above the “instructor” column has NaN values for instructors who do not teach, and the “name” column has such values for instructors which do not have assigned offices. We can create a Series that replaces NaN values in the first column by the corresponding values of the second column as follows:

[43]:
# if a value in the "instructor" column is NaN
# replace it by the corresponding value of the "name" column
outer_merge["instructor"].fillna(outer_merge["name"])
[43]:
0     Benjamin
1     Benjamin
2       Daniel
3      Fredric
4    Charlotte
5         Emma
6          Ava
Name: instructor, dtype: object

We can use this to eliminate NaN values in the “instructor” column:

[44]:
outer_merge["instructor"] = outer_merge["instructor"].combine_first(outer_merge["name"])
outer_merge
[44]:
course instructor name office
0 MTH 141 Benjamin Benjamin 124
1 MTH 142 Benjamin Benjamin 124
2 MTH 241 Daniel Daniel 133
3 MTH 306 Fredric NaN NaN
4 MTH 309 Charlotte Charlotte 140
5 MTH 311 Emma NaN NaN
6 NaN Ava Ava 146

The “name” column is no longer needed, so we can drop it:

[45]:
outer_merge.drop("name", axis=1)
[45]:
course instructor office
0 MTH 141 Benjamin 124
1 MTH 142 Benjamin 124
2 MTH 241 Daniel 133
3 MTH 306 Fredric NaN
4 MTH 309 Charlotte 140
5 MTH 311 Emma NaN
6 NaN Ava 146

The remaining NaN values indicate that an instructor either does not teach a course or does not have an assigned office.