Pandas Merge Operation: What It Is and When to Use It
How to merge different datasets or DataFrames using the pandas merge() method.
In this post, you'll learn how to merge different datasets or DataFrames using the pandas merge() operation. We'll take a closer look at the parameters of the merge() function, what they mean, and how to use them. We'll also see how the merge() function differs from the pandas concat() function and the join() method.
What Is Merge in Pandas?
The pandas library has a method called merge() for combining DataFrames or named Series into a single DataFrame for better data analysis. The pandas merge operation combines two DataFrame objects based on columns or indexes, much like the join operations performed on databases. The result is a new dataset; the sources remain unchanged.
A pandas merge can be performed using the pandas merge() function or a DataFrame merge() method.
Here's the syntax for the pandas merge() function:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))
The pd.merge() function parameters are explained as follows:
left: This indicates the DataFrame or named Series to be merged on the left.
right: This indicates the DataFrame or named Series to be merged on the right.
how: This defines the type of merge to be performed. The default is inner; others include outer, cross, left, and right.
on: This specifies the column or index level names to join on. They must be present in both DataFrames. If on isn't specified and you aren't merging on indexes, the columns common to both DataFrames are used.
left_on: This specifies the column or index level names to join on in the left DataFrame.
right_on: This specifies the column or index level names to join on in the right DataFrame.
left_index: This uses the index of the left DataFrame for merging.
right_index: This uses the index of the right DataFrame for merging.
suffixes: This outlines the suffixes to be used on overlapping column names in the left and right DataFrames.
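To make the key parameters concrete, here's a minimal sketch. The staff and census frames and their column names are made up for illustration and aren't the datasets used in the rest of this post.

```python
import pandas as pd

# Toy frames for illustration only (not the article's datasets).
staff = pd.DataFrame({"ward": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
census = pd.DataFrame({"ward_id": [1, 3, 4], "pop": [100, 300, 400]})

# The key columns have different names, so use left_on/right_on instead of on.
by_name = pd.merge(staff, census, left_on="ward", right_on="ward_id")
print(by_name)

# The DataFrame method form takes the same arguments as the function form.
same = staff.merge(census, left_on="ward", right_on="ward_id")
print(by_name.equals(same))

# With on omitted, pandas joins on the columns the two frames share ("ward").
shared = pd.merge(staff, census.rename(columns={"ward_id": "ward"}))
print(shared)
```

Note that with left_on and right_on, both key columns (ward and ward_id) are kept in the result.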
Let's see how we can perform a pandas merge on two DataFrames.
Inner Join
One way of merging DataFrames with the merge() function is by using the inner join. The inner join returns only the rows whose key values appear in both DataFrames. An inner join is done by default when you don't specify how the merge should be done using the how parameter.
Let's see an example:
import pandas as pd
alderman_data = {
'ward': [1, 3, 4, 5, 6],
'alderman': ['Vicky Mintoff', 'Petrina Finney', 'Kennith Gossop', 'Northrup Jaquet', 'Kelby Thaxton'],
'address': ['7 Eggendart Pass', '2 Aberg Circle', '9 Barnett Way', '3 Anhalt Street', '86 Drewry Drive'],
'phone': ['773-450-9926', '312-915-4064', '312-144-7339', '309-237-8875', '309-486-6591'],
'state': ['Illinois', 'Illinois', 'Illinois', 'Illinois', 'Illinois']
}
alderman_df = pd.DataFrame(alderman_data)
population_data = {
'ward': [1, 2, 3, 5, 7],
'pop_2015': [25112, 27557, 27043, 26360, 27467],
'pop_2020': [36778, 43417, 54184, 37978, 55985],
'pop_change': [11666, 15860, 27141, 11618, 28518],
'city': ['Chicago', 'Peoria', 'Chicago', 'Chicago', 'Springfield'],
'state': ['Illinois', 'Illinois', 'Illinois', 'Illinois', 'Illinois'],
'zip': [60691, 61635, 60604, 60614, 62794]
}
population_df = pd.DataFrame(population_data)
print("Alderman DataFrame:")
print(alderman_df)
print("Ward Population DataFrame:")
print(population_df)
Here’s the output of the above code:
Alderman DataFrame:
ward alderman address phone state
0 1 Vicky Mintoff 7 Eggendart Pass 773-450-9926 Illinois
1 3 Petrina Finney 2 Aberg Circle 312-915-4064 Illinois
2 4 Kennith Gossop 9 Barnett Way 312-144-7339 Illinois
3 5 Northrup Jaquet 3 Anhalt Street 309-237-8875 Illinois
4 6 Kelby Thaxton 86 Drewry Drive 309-486-6591 Illinois
Ward Population DataFrame:
ward pop_2015 pop_2020 pop_change city state zip
0 1 25112 36778 11666 Chicago Illinois 60691
1 2 27557 43417 15860 Peoria Illinois 61635
2 3 27043 54184 27141 Chicago Illinois 60604
3 5 26360 37978 11618 Chicago Illinois 60614
4 7 27467 55985 28518 Springfield Illinois 62794
Let's merge the alderman_df and population_df and see the result.
merged_df = pd.merge(alderman_df, population_df, on="ward")
print(merged_df.head())
Here’s the output of the above code:
ward alderman address phone state_x pop_2015
0 1 Vicky Mintoff 7 Eggendart Pass 773-450-9926 Illinois 25112 \
1 3 Petrina Finney 2 Aberg Circle 312-915-4064 Illinois 27043
2 5 Northrup Jaquet 3 Anhalt Street 309-237-8875 Illinois 26360
pop_2020 pop_change city state_y zip
0 36778 11666 Chicago Illinois 60691
1 54184 27141 Chicago Illinois 60604
2 37978 11618 Chicago Illinois 60614
As the output shows, we merged two DataFrames, alderman_df and population_df, on the ward column that's present in both. By default, the new DataFrame has _x and _y suffixes appended to the state columns because both source datasets have a column with that name. However, you can define your own suffixes for any overlapping columns with the suffixes parameter. An inner join returns only the rows with matching key values in both DataFrames. In this case, the rows for wards 1, 3, and 5 are returned because those ward values appear in both of the original DataFrames.
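As a quick sketch of custom suffixes, the snippet below uses a trimmed-down copy of the two datasets (three rows each, fewer columns) so it can run on its own. The suffix strings are arbitrary choices for illustration.

```python
import pandas as pd

# Trimmed-down copies of the article's data so the example is self-contained.
aldermen = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
    "state": ["Illinois", "Illinois", "Illinois"],
})
population = pd.DataFrame({
    "ward": [1, 2, 3],
    "pop_2020": [36778, 43417, 54184],
    "state": ["Illinois", "Illinois", "Illinois"],
})

# Replace the default _x/_y suffixes on the overlapping "state" column.
merged = pd.merge(aldermen, population, on="ward",
                  suffixes=("_alderman", "_population"))
print(merged.columns.tolist())
print(merged)
```

Only columns that appear in both DataFrames and aren't used as keys get a suffix; ward, alderman, and pop_2020 keep their names.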
Left Join
Another way to merge DataFrames or named Series is by specifying a left join using the how parameter. The left join returns all rows from the left DataFrame and rows on the right DataFrame where the key column(s) match.
We'll still use the alderman_df and population_df for this example:
left_merge_df = pd.merge(alderman_df, population_df, on='ward', how='left')
print(left_merge_df)
Here’s the output of the above code:
ward alderman address phone state_x pop_2015
0 1 Vicky Mintoff 7 Eggendart Pass 773-450-9926 Illinois 25112.0 \
1 3 Petrina Finney 2 Aberg Circle 312-915-4064 Illinois 27043.0
2 4 Kennith Gossop 9 Barnett Way 312-144-7339 Illinois NaN
3 5 Northrup Jaquet 3 Anhalt Street 309-237-8875 Illinois 26360.0
4 6 Kelby Thaxton 86 Drewry Drive 309-486-6591 Illinois NaN
pop_2020 pop_change city state_y zip
0 36778.0 11666.0 Chicago Illinois 60691.0
1 54184.0 27141.0 Chicago Illinois 60604.0
2 NaN NaN NaN NaN NaN
3 37978.0 11618.0 Chicago Illinois 60614.0
4 NaN NaN NaN NaN NaN
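The result of the left join shows that all rows from the left DataFrame alderman_df are returned. For wards 4 and 6, which have no matching key in population_df, the columns that come from population_df are filled with NaN. Because NaN is a floating-point value, the integer population columns are displayed as floats.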
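One way to find the unmatched rows after a left join is to filter on the NaN values. The sketch below assumes a trimmed-down copy of the two datasets and also shows the integer-to-float change.

```python
import pandas as pd

# Trimmed-down copies of the article's data so the example is self-contained.
aldermen = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
})
population = pd.DataFrame({"ward": [1, 2, 3], "pop_2020": [36778, 43417, 54184]})

left_merged = pd.merge(aldermen, population, on="ward", how="left")

# Left rows with no match have NaN in the columns that come from the right.
unmatched = left_merged[left_merged["pop_2020"].isna()]
print(unmatched["ward"].tolist())  # wards with no population record

# NaN forces the integer column to a float dtype in the merged result.
print(population["pop_2020"].dtype, "->", left_merged["pop_2020"].dtype)
```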
Right Join
Performing a right join on two DataFrames is just as easy as performing a left join. Simply specify right as the value for the how parameter. All rows from the right DataFrame are returned, along with the rows from the left DataFrame that match the key column(s).
right_merge_df = pd.merge(alderman_df, population_df, on='ward', how='right')
print(right_merge_df)
Here's the output of the above code:
ward alderman address phone state_x pop_2015
0 1 Vicky Mintoff 7 Eggendart Pass 773-450-9926 Illinois 25112 \
1 2 NaN NaN NaN NaN 27557
2 3 Petrina Finney 2 Aberg Circle 312-915-4064 Illinois 27043
3 5 Northrup Jaquet 3 Anhalt Street 309-237-8875 Illinois 26360
4 7 NaN NaN NaN NaN 27467
pop_2020 pop_change city state_y zip
0 36778 11666 Chicago Illinois 60691
1 43417 15860 Peoria Illinois 61635
2 54184 27141 Chicago Illinois 60604
3 37978 11618 Chicago Illinois 60614
4 55985 28518 Springfield Illinois 62794
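A right join can be thought of as a left join with the two DataFrames swapped. The sketch below checks this on a trimmed-down copy of the data with no overlapping non-key columns; apart from column order, the two results should match.

```python
import pandas as pd

# Trimmed-down copies of the article's data so the example is self-contained.
aldermen = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
})
population = pd.DataFrame({"ward": [1, 2, 3], "pop_2020": [36778, 43417, 54184]})

right_merged = pd.merge(aldermen, population, on="ward", how="right")
swapped_left = pd.merge(population, aldermen, on="ward", how="left")

# Same rows either way; only the column order differs, so align it first.
cols = ["ward", "alderman", "pop_2020"]
print(right_merged[cols])
print(swapped_left[cols])
```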
Outer Join
Besides the inner, left, and right joins, there's the outer join. The outer join returns all rows from both datasets whether there's a match in the key column or not. In essence, the outer join will return all rows from merged datasets, with non-matching rows having NaN to indicate missing values. We'll see the result of an outer join using the alderman_df and population_df DataFrames.
outer_merge_df = pd.merge(alderman_df, population_df, on='ward', how='outer')
print(outer_merge_df)
Here's the output of the above code:
ward alderman address phone state_x pop_2015
0 1 Vicky Mintoff 7 Eggendart Pass 773-450-9926 Illinois 25112.0 \
1 3 Petrina Finney 2 Aberg Circle 312-915-4064 Illinois 27043.0
2 4 Kennith Gossop 9 Barnett Way 312-144-7339 Illinois NaN
3 5 Northrup Jaquet 3 Anhalt Street 309-237-8875 Illinois 26360.0
4 6 Kelby Thaxton 86 Drewry Drive 309-486-6591 Illinois NaN
5 2 NaN NaN NaN NaN 27557.0
6 7 NaN NaN NaN NaN 27467.0
pop_2020 pop_change city state_y zip
0 36778.0 11666.0 Chicago Illinois 60691.0
1 54184.0 27141.0 Chicago Illinois 60604.0
2 NaN NaN NaN NaN NaN
3 37978.0 11618.0 Chicago Illinois 60614.0
4 NaN NaN NaN NaN NaN
5 43417.0 15860.0 Peoria Illinois 61635.0
6 55985.0 28518.0 Springfield Illinois 62794.0
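To see where each row of an outer join came from, you can pass indicator=True, which adds a _merge column. This is a minimal sketch on a trimmed-down copy of the data. The row order of an outer join can differ between pandas versions, so the example looks rows up by ward rather than by position.

```python
import pandas as pd

# Trimmed-down copies of the article's data so the example is self-contained.
aldermen = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
})
population = pd.DataFrame({"ward": [1, 2, 3], "pop_2020": [36778, 43417, 54184]})

outer = pd.merge(aldermen, population, on="ward", how="outer", indicator=True)
print(outer)

# _merge is "both", "left_only", or "right_only" for each row.
source_by_ward = dict(zip(outer["ward"], outer["_merge"].astype(str)))
print(source_by_ward)
```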
How to Merge DataFrames on Indexes
Datasets can be merged on their indexes by setting the left_index and right_index parameters, or on differently named columns by providing arguments for the left_on and right_on parameters. There are a few rules to observe when using any of these parameters:
You can only pass the argument left_on or left_index, not both.
You can only pass the argument right_on or right_index, not both.
You can only pass the argument on or left_on and right_on, not a combination of both.
You can only pass the argument on or left_index and right_index, not a combination of both.
With the left_on and right_on parameters, you pass arguments to indicate the columns to be used for merging on the left and on the right respectively. With the left_index and right_index parameters, you set the argument to True to use that DataFrame's index as the merge key.
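The rules above can be checked with a small sketch. It assumes two tiny frames taken from the article's data. Breaking a rule raises pandas.errors.MergeError, while mixing a column on one side with the index on the other is allowed.

```python
import pandas as pd

left = pd.DataFrame({"ward": [1, 3], "alderman": ["Vicky Mintoff", "Petrina Finney"]})
right = pd.DataFrame({"ward": [1, 3], "pop_2020": [36778, 54184]})

try:
    # Breaks a rule: on is combined with left_on/right_on.
    pd.merge(left, right, on="ward", left_on="ward", right_on="ward")
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print("MergeError:", exc)

# A valid mix: a column on the left, the index on the right.
mixed = pd.merge(left, right.set_index("ward"), left_on="ward", right_index=True)
print(mixed)
```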
Let's see an example:
index_merge_df = pd.merge(alderman_df, population_df, left_index=True, right_index=True)
print(index_merge_df)
Here's the output of the above code:
ward_x alderman address phone state_x ward_y
0 1 Vicky Mintoff 7 Eggendart Pass 773-450-9926 Illinois 1 \
1 3 Petrina Finney 2 Aberg Circle 312-915-4064 Illinois 2
2 4 Kennith Gossop 9 Barnett Way 312-144-7339 Illinois 3
3 5 Northrup Jaquet 3 Anhalt Street 309-237-8875 Illinois 5
4 6 Kelby Thaxton 86 Drewry Drive 309-486-6591 Illinois 7
pop_2015 pop_2020 pop_change city state_y zip
0 25112 36778 11666 Chicago Illinois 60691
1 27557 43417 15860 Peoria Illinois 61635
2 27043 54184 27141 Chicago Illinois 60604
3 26360 37978 11618 Chicago Illinois 60614
4 27467 55985 28518 Springfield Illinois 62794
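In this example, the merge() function doesn't use the ward column at all. It matches rows on the default integer index (0 to 4) of each DataFrame. Because both DataFrames have the same index labels, every row finds a match and the new dataset has five rows. Note that ward_x and ward_y differ in most rows, which shows the rows were paired by index label rather than by ward. To merge on ward through the index, first make ward the index of both DataFrames with set_index().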
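Here's a minimal sketch of an index merge where the index actually carries the key. It assumes a trimmed-down copy of the two datasets, with ward moved into the index of each.

```python
import pandas as pd

# Trimmed-down copies of the article's data, indexed by ward.
aldermen = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
}).set_index("ward")
population = pd.DataFrame({
    "ward": [1, 2, 3],
    "pop_2020": [36778, 43417, 54184],
}).set_index("ward")

# Rows are now matched on the ward labels stored in each index.
by_ward = pd.merge(aldermen, population, left_index=True, right_index=True)
print(by_ward)
```

With the default inner join, only wards present in both indexes (1 and 3 here) appear in the result.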
Difference Between Merge and Concat
What's the difference between concat and merge? The pandas concat() function is used to combine DataFrames or named Series vertically or horizontally. By default, the concat() function joins DataFrames vertically. Alternatively, setting the axis parameter argument to 1 combines them horizontally (side by side), aligning rows on the index. Unlike the merge() function, only outer or inner joins can be performed on DataFrames when concatenating.
Here's the syntax for the pandas concat() function:
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)
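As a sketch of the axis and join parameters, the snippet below concatenates two small frames side by side. The frames are made up from the article's values and use ward numbers as index labels, which is an assumption for this illustration.

```python
import pandas as pd

# Index labels stand in for ward numbers in this illustration.
names = pd.DataFrame(
    {"alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"]},
    index=[1, 3, 4],
)
counts = pd.DataFrame({"pop_2020": [36778, 43417, 54184]}, index=[1, 2, 3])

# axis=1 places the frames side by side, aligning rows on the index.
wide_outer = pd.concat([names, counts], axis=1)                # union of labels
wide_inner = pd.concat([names, counts], axis=1, join="inner")  # shared labels only
print(wide_outer)
print(wide_inner)
```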
Let's see an example:
alderman_data2 = {
'ward': [9, 10, 11, 12, 13],
'alderman': ['Vicky Griffin', 'Blake Finney', 'Greg Theodore', 'Jacqueline Bob', 'Shelby Boston'],
'address': ['7 Haas Circle', '2 Mayer Lane', '9 Harper Drive', '3 Grover Avenue', '86 Palm Drive'],
'phone': ['312-978-5864', '217-352-4548', '312-568-4492', '815-254-3682', '309-590-9629'],
'state': ['Illinois', 'Illinois', 'Illinois', 'Illinois', 'Illinois']
}
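alderman_df2 = pd.DataFrame(alderman_data2)
# Stack the two alderman DataFrames vertically (axis=0 is the default).
concat_df = pd.concat([alderman_df, alderman_df2])
print(concat_df)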
Here's the output of the above code:
ward alderman address phone state
0 1 Vicky Mintoff 7 Eggendart Pass 773-450-9926 Illinois
1 3 Petrina Finney 2 Aberg Circle 312-915-4064 Illinois
2 4 Kennith Gossop 9 Barnett Way 312-144-7339 Illinois
3 5 Northrup Jaquet 3 Anhalt Street 309-237-8875 Illinois
4 6 Kelby Thaxton 86 Drewry Drive 309-486-6591 Illinois
0 9 Vicky Griffin 7 Haas Circle 312-978-5864 Illinois
1 10 Blake Finney 2 Mayer Lane 217-352-4548 Illinois
2 11 Greg Theodore 9 Harper Drive 312-568-4492 Illinois
3 12 Jacqueline Bob 3 Grover Avenue 815-254-3682 Illinois
4 13 Shelby Boston 86 Palm Drive 309-590-9629 Illinois
Two DataFrame objects alderman_df and alderman_df2 are combined vertically using the concat() function. The two datasets combined still retain their original indexes; however, this can be overridden by setting the ignore_index parameter argument to True.
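The effect of ignore_index can be sketched with two small frames built from the first rows of each alderman list:

```python
import pandas as pd

first = pd.DataFrame({"ward": [1, 3], "alderman": ["Vicky Mintoff", "Petrina Finney"]})
second = pd.DataFrame({"ward": [9, 10], "alderman": ["Vicky Griffin", "Blake Finney"]})

kept = pd.concat([first, second])                          # original labels kept
renumbered = pd.concat([first, second], ignore_index=True)  # fresh 0..n-1 index

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Duplicate index labels are easy to overlook, so ignore_index=True is a reasonable choice when the original labels carry no meaning.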
A common use case for the pandas concat() function is to stack similar data obtained at different points vertically. In the example above, a second part of the Alderman list was added to the previous one to elongate the list. It can also be used to give data a hierarchical index.
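For the hierarchical index mentioned above, a minimal sketch uses the keys parameter. The labels batch_1 and batch_2 are hypothetical names chosen for this example.

```python
import pandas as pd

first = pd.DataFrame({"ward": [1, 3], "alderman": ["Vicky Mintoff", "Petrina Finney"]})
second = pd.DataFrame({"ward": [9, 10], "alderman": ["Vicky Griffin", "Blake Finney"]})

# keys adds an outer index level that records which input each row came from.
labeled = pd.concat([first, second], keys=["batch_1", "batch_2"])
print(labeled)

# Select all rows that came from the second input.
second_batch = labeled.loc["batch_2"]
print(second_batch)
```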
Difference Between Merge and Join
Besides the merge() and concat() functions, pandas provides the DataFrame join() method for combining different datasets into a new one. The join() method is similar to the merge() function in terms of parameters and operations. But the biggest difference between the two is that join() always matches against the index of the other DataFrame (using either the caller's index or, with the on parameter, one of the caller's columns), while merge() can match on columns or indexes on either side. By default, the join() method performs a left join while merge() does an inner join.
A join() method syntax looks like this:
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False, validate=None)
Here's an example:
joined_df = alderman_df.join(population_df, on='ward', lsuffix='_x', rsuffix='_y')
print(joined_df)
Here's the output of the above code:
ward_x alderman address phone state_x ward_y
0 1 Vicky Mintoff 7 Eggendart Pass 773-450-9926 Illinois 2.0 \
1 3 Petrina Finney 2 Aberg Circle 312-915-4064 Illinois 5.0
2 4 Kennith Gossop 9 Barnett Way 312-144-7339 Illinois 7.0
3 5 Northrup Jaquet 3 Anhalt Street 309-237-8875 Illinois NaN
4 6 Kelby Thaxton 86 Drewry Drive 309-486-6591 Illinois NaN
pop_2015 pop_2020 pop_change city state_y zip
0 27557.0 43417.0 15860.0 Peoria Illinois 61635.0
1 26360.0 37978.0 11618.0 Chicago Illinois 60614.0
2 27467.0 55985.0 28518.0 Springfield Illinois 62794.0
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
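Notice that a left join is performed: all rows from alderman_df are returned, and NaN marks the rows with no match in the right DataFrame population_df. Also note that on='ward' compares the ward values in alderman_df with the index of population_df (0 to 4), not with its ward column. That's why ward 1 is paired with the row at index 1 (ward 2, Peoria), while wards 5 and 6 have no match.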
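To join on the actual ward values, one option is to move ward into the index of the right DataFrame first. This is a sketch on a trimmed-down copy of the data; the result should match a left merge on ward.

```python
import pandas as pd

# Trimmed-down copies of the article's data so the example is self-contained.
aldermen = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
})
population = pd.DataFrame({"ward": [1, 2, 3], "pop_2020": [36778, 43417, 54184]})

# join() matches the caller's "ward" column against the other frame's index,
# so make ward the index of population before joining.
joined = aldermen.join(population.set_index("ward"), on="ward")
print(joined)

# The equivalent merge() call needs how="left" to mirror join()'s default.
merged = pd.merge(aldermen, population, on="ward", how="left")
print(merged)
```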
Conclusion
In summary, the pandas merge() operation is the most flexible means of combining datasets. It gives you the option of joining DataFrames based on indexes or columns. The function also performs different types of joins: inner, outer, left, right, and cross. It's often the most-used method or function for combining datasets in pandas. What's the alternative to merge in pandas? In general, the concat() and join() methods are alternatives for pandas merge.
The concat() function is ideal when stacking a series of DataFrames vertically. It can also join data tables horizontally, and it's limited to inner or outer joins.
The join() method is straightforward and often used when you need to combine datasets along the indexes. It performs a left join by default and also supports outer, inner, right, and cross joins.
Depending on your need, you can pick any of these methods to combine your datasets before making an analysis.