Pandas Merge Operation: What It Is and When to Use It

How to merge different datasets or DataFrames using the pandas merge() method.

In this post, you'll learn how to merge different datasets or DataFrames using the pandas merge() operation. We'll take a closer look at the parameters of the merge() function, what they mean, and how to use them. We'll also understand how the merge() function differs from pandas concat() and join() functions.

What Is Merge in Pandas?

The pandas library provides merge() for combining DataFrames or named Series into a single DataFrame. A merge combines two DataFrame objects on columns or indexes, much like a join in a relational database. To combine more than two, you chain merge calls. The result is a new DataFrame; the source objects remain unchanged.

A pandas merge can be performed using the pandas merge() function or a DataFrame merge() method.
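As a minimal sketch of the two call styles, the snippet below merges the same pair of DataFrames once with the function and once with the method. The toy data and the names left, right, via_function, and via_method are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

# Top-level function: both DataFrames are passed as arguments.
via_function = pd.merge(left, right, on="key")

# DataFrame method: the calling DataFrame acts as the left side.
via_method = left.merge(right, on="key")

# Both calls perform the same inner merge on "key".
print(via_function.equals(via_method))
```

Which form you use is mostly a matter of style; the method form tends to read better when chaining several operations.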

Here's the syntax for the pandas merge() function:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))

The pd.merge() function parameters are explained as follows:

  • left: This indicates the DataFrame or named Series to be merged on the left.

  • right: This indicates the DataFrame or named Series to be merged on the right.

  • how: This defines the type of merge to be performed. The default is inner; others include outer, cross, left, and right.

  • on: This specifies the column or index level names to join on. These names must exist in both DataFrames. If on is None and you aren't merging on indexes, pandas defaults to the intersection of the column names in both DataFrames.

  • left_on: This specifies in the left DataFrame the column or index to join on.

  • right_on: This specifies in the right DataFrame the column or index to join on.

  • left_index: This uses the index of the left DataFrame for merging.

  • right_index: This uses the index of the right DataFrame for merging.

  • suffixes: This outlines the suffixes to be used on overlapping column names in the left and right DataFrames.
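Most of the how options are covered with examples later in this post, but cross is not, so here is a small sketch of it. The sizes and colors DataFrames are hypothetical. A cross merge takes no key: it pairs every row on the left with every row on the right.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
sizes = pd.DataFrame({"size": ["S", "M", "L"]})
colors = pd.DataFrame({"color": ["red", "blue"]})

# how="cross" builds the Cartesian product, so no on/left_on/right_on is passed.
combos = pd.merge(sizes, colors, how="cross")

print(combos)
print(len(combos))  # 3 sizes x 2 colors
```

Because the row count is the product of both inputs, a cross merge can grow very quickly on large DataFrames.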

Let's see how we can perform a pandas merge on two DataFrames.

Inner Join

One way of merging DataFrames with the merge() function is by using the inner join. An inner join returns only the rows whose key values appear in both DataFrames, based on the column used as the key for the merge. It's the default when you don't specify the how parameter.

Let's see an example:

import pandas as pd

alderman_data = {
    'ward': [1, 3, 4, 5, 6],
    'alderman': ['Vicky Mintoff', 'Petrina Finney', 'Kennith Gossop', 'Northrup Jaquet', 'Kelby Thaxton'],
    'address': ['7 Eggendart Pass', '2 Aberg Circle', '9 Barnett Way', '3 Anhalt Street', '86 Drewry Drive'],
    'phone': ['773-450-9926', '312-915-4064', '312-144-7339', '309-237-8875', '309-486-6591'],
    'state': ['Illinois', 'Illinois', 'Illinois', 'Illinois', 'Illinois']
}

alderman_df = pd.DataFrame(alderman_data)

population_data = {
    'ward': [1, 2, 3, 5, 7],
    'pop_2015': [25112, 27557, 27043, 26360, 27467],
    'pop_2020': [36778, 43417, 54184, 37978, 55985],
    'pop_change': [11666, 15860, 27141, 11618, 28518],
    'city': ['Chicago', 'Peoria', 'Chicago', 'Chicago', 'Springfield'],
    'state': ['Illinois', 'Illinois', 'Illinois', 'Illinois', 'Illinois'],
    'zip': [60691, 61635, 60604, 60614, 62794]
}

population_df = pd.DataFrame(population_data)

# Print each DataFrame under a label
print('Alderman DataFrame:')
print(alderman_df)
print()
print('Ward Population DataFrame:')
print(population_df)

Here’s the output of the above code:

Alderman DataFrame:
   ward         alderman           address         phone     state
0     1    Vicky Mintoff  7 Eggendart Pass  773-450-9926  Illinois
1     3   Petrina Finney    2 Aberg Circle  312-915-4064  Illinois
2     4   Kennith Gossop     9 Barnett Way  312-144-7339  Illinois
3     5  Northrup Jaquet   3 Anhalt Street  309-237-8875  Illinois
4     6    Kelby Thaxton   86 Drewry Drive  309-486-6591  Illinois

Ward Population DataFrame:
   ward  pop_2015  pop_2020  pop_change         city     state    zip
0     1     25112     36778       11666      Chicago  Illinois  60691
1     2     27557     43417       15860       Peoria  Illinois  61635
2     3     27043     54184       27141      Chicago  Illinois  60604
3     5     26360     37978       11618      Chicago  Illinois  60614
4     7     27467     55985       28518  Springfield  Illinois  62794

Let's merge the alderman_df and population_df and see the result.

merged_df = pd.merge(alderman_df, population_df, on="ward") 

print(merged_df.head())

Here’s the output of the above code:

   ward         alderman           address         phone   state_x  pop_2015  \
0     1    Vicky Mintoff  7 Eggendart Pass  773-450-9926  Illinois     25112
1     3   Petrina Finney    2 Aberg Circle  312-915-4064  Illinois     27043   
2     5  Northrup Jaquet   3 Anhalt Street  309-237-8875  Illinois     26360   

   pop_2020  pop_change     city   state_y    zip  
0     36778       11666  Chicago  Illinois  60691  
1     54184       27141  Chicago  Illinois  60604  
2     37978       11618  Chicago  Illinois  60614

As the output shows, we merged two DataFrames, alderman_df and population_df, on the ward column that's present in both. Because both sources also have a state column, pandas appended the default suffixes _x and _y to tell the two apart. You can define more descriptive suffixes for any overlapping columns with the suffixes parameter. An inner join returns only rows with matching key values in both DataFrames. In this case, the rows for wards 1, 3, and 5 are returned because those ward values appear in both alderman_df and population_df.
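To illustrate custom suffixes, here is a small sketch on cut-down versions of the two DataFrames. The suffix strings _alderman and _population are arbitrary choices for this example, not names from the article.

```python
import pandas as pd

# Cut-down versions of the article's data: both frames share a "state" column.
alderman = pd.DataFrame({"ward": [1, 3, 4], "state": ["Illinois"] * 3})
population = pd.DataFrame({"ward": [1, 2, 3], "state": ["Illinois"] * 3})

# suffixes takes a (left, right) pair applied only to overlapping non-key columns.
merged = pd.merge(
    alderman,
    population,
    on="ward",
    suffixes=("_alderman", "_population"),
)

print(merged.columns.tolist())
# ['ward', 'state_alderman', 'state_population']
```

The key column ward is not suffixed because it is shared by design; only the clashing state columns are renamed.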

Left Join

Another way to merge DataFrames or named Series is by specifying a left join using the how parameter. The left join returns all rows from the left DataFrame, plus the data from the right DataFrame for rows where the key column(s) match.

We'll still use the alderman_df and population_df for this example:

left_merge_df = pd.merge(alderman_df, population_df, on='ward', how='left') 

print(left_merge_df)

Here’s the output of the above code:

   ward         alderman           address         phone   state_x  pop_2015  \
0     1    Vicky Mintoff  7 Eggendart Pass  773-450-9926  Illinois   25112.0
1     3   Petrina Finney    2 Aberg Circle  312-915-4064  Illinois   27043.0   
2     4   Kennith Gossop     9 Barnett Way  312-144-7339  Illinois       NaN   
3     5  Northrup Jaquet   3 Anhalt Street  309-237-8875  Illinois   26360.0   
4     6    Kelby Thaxton   86 Drewry Drive  309-486-6591  Illinois       NaN   

   pop_2020  pop_change     city   state_y      zip  
0   36778.0     11666.0  Chicago  Illinois  60691.0  
1   54184.0     27141.0  Chicago  Illinois  60604.0  
2       NaN         NaN      NaN       NaN      NaN  
3   37978.0     11618.0  Chicago  Illinois  60614.0  
4       NaN         NaN      NaN       NaN      NaN

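The result of the left join shows that all rows from the left DataFrame, alderman_df, are returned. Wards 4 and 6 have no matching key in population_df, so the columns that come from population_df hold NaN for those rows. Because NaN is a floating-point value, the integer columns pop_2015, pop_2020, pop_change, and zip are displayed as floats.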
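One practical follow-up to a left join is finding the rows that had no match. The sketch below does this on a reduced version of the data by checking for NaN in a column that comes from the right side; the variable names are illustrative.

```python
import pandas as pd

# Reduced versions of the article's DataFrames.
alderman = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
})
population = pd.DataFrame({"ward": [1, 2, 3], "pop_2020": [36778, 43417, 54184]})

left_merged = pd.merge(alderman, population, on="ward", how="left")

# pop_2020 was int64 in population; the NaN for ward 4 turns it into float64.
print(left_merged["pop_2020"].dtype)

# Rows from the left side that found no partner on the right.
unmatched = left_merged[left_merged["pop_2020"].isna()]
print(unmatched["ward"].tolist())  # [4]
```

This check assumes the right-hand column has no missing values of its own; otherwise a genuine NaN would be indistinguishable from a failed match.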

Right Join

Performing a right join is just as easy as the left join: specify right as the value for the how parameter. All rows from the right DataFrame are returned, along with the data from the left DataFrame for rows that match on the key column(s).

right_merge_df = pd.merge(alderman_df, population_df, on='ward', how='right') 
print(right_merge_df)

Here's the output of the above code:

   ward         alderman           address         phone   state_x  pop_2015  \
0     1    Vicky Mintoff  7 Eggendart Pass  773-450-9926  Illinois     25112
1     2              NaN               NaN           NaN       NaN     27557   
2     3   Petrina Finney    2 Aberg Circle  312-915-4064  Illinois     27043   
3     5  Northrup Jaquet   3 Anhalt Street  309-237-8875  Illinois     26360   
4     7              NaN               NaN           NaN       NaN     27467   

   pop_2020  pop_change         city   state_y    zip  
0     36778       11666      Chicago  Illinois  60691  
1     43417       15860       Peoria  Illinois  61635  
2     54184       27141      Chicago  Illinois  60604  
3     37978       11618      Chicago  Illinois  60614  
4     55985       28518  Springfield  Illinois  62794

Outer Join

Besides the inner, left, and right joins, there's the outer join. The outer join returns all rows from both datasets whether there's a match in the key column or not. In essence, the outer join will return all rows from merged datasets, with non-matching rows having NaN to indicate missing values. We'll see the result of an outer join using the alderman_df and population_df DataFrames.

outer_merge_df = pd.merge(alderman_df, population_df, on='ward', how='outer') 

print(outer_merge_df)

Here's the output of the above code:

   ward         alderman           address         phone   state_x  pop_2015  \
0     1    Vicky Mintoff  7 Eggendart Pass  773-450-9926  Illinois   25112.0
1     3   Petrina Finney    2 Aberg Circle  312-915-4064  Illinois   27043.0   
2     4   Kennith Gossop     9 Barnett Way  312-144-7339  Illinois       NaN   
3     5  Northrup Jaquet   3 Anhalt Street  309-237-8875  Illinois   26360.0   
4     6    Kelby Thaxton   86 Drewry Drive  309-486-6591  Illinois       NaN   
5     2              NaN               NaN           NaN       NaN   27557.0   
6     7              NaN               NaN           NaN       NaN   27467.0   

   pop_2020  pop_change         city   state_y      zip  
0   36778.0     11666.0      Chicago  Illinois  60691.0  
1   54184.0     27141.0      Chicago  Illinois  60604.0  
2       NaN         NaN          NaN       NaN      NaN  
3   37978.0     11618.0      Chicago  Illinois  60614.0  
4       NaN         NaN          NaN       NaN      NaN  
5   43417.0     15860.0       Peoria  Illinois  61635.0  
6   55985.0     28518.0  Springfield  Illinois  62794.0
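With an outer join it can be hard to tell which side each row came from. The merge() function has an indicator parameter, not covered elsewhere in this post, that adds a _merge column for that purpose. A minimal sketch on reduced data:

```python
import pandas as pd

# Reduced versions of the article's DataFrames.
alderman = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
})
population = pd.DataFrame({"ward": [1, 2, 3], "pop_2020": [36778, 43417, 54184]})

# indicator=True adds a "_merge" column with the values
# "both", "left_only", or "right_only" for every row.
merged = pd.merge(alderman, population, on="ward", how="outer", indicator=True)

print(merged[["ward", "_merge"]])
```

Here wards 1 and 3 are marked both, ward 4 is left_only, and ward 2 is right_only. The row order of an outer merge can differ between pandas versions, so it's safer not to rely on it.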

How to Merge DataFrames on Indexes

Datasets can be merged on their indexes with the left_index and right_index parameters, and on differently named columns or index levels with left_on and right_on. There are a few rules to observe when using any of these parameters:

  • You can only pass the argument left_on or left_index, not both.

  • You can only pass the argument right_on or right_index, not both.

  • You can only pass the argument on or left_on and right_on, not a combination of both.

  • You can only pass the argument on or left_index and right_index, not a combination of both.

With the left_on and right_on parameters, you pass arguments to indicate columns to be used for merging on the left and on the right respectively. For the left_index and right_index parameters, you set the arguments to True for both.
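The left_on and right_on parameters are most useful when the key has a different name on each side. As a sketch, assume the population data called its key ward_id instead of ward (a hypothetical rename made only for this example):

```python
import pandas as pd

alderman = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
})
# Hypothetical variant of the population data with a differently named key.
population = pd.DataFrame({"ward_id": [1, 2, 3], "pop_2020": [36778, 43417, 54184]})

# Match alderman["ward"] against population["ward_id"].
merged = pd.merge(alderman, population, left_on="ward", right_on="ward_id")

print(merged)
```

Both key columns are kept in the result, so you may want to drop one of them afterward.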

Let's see an example:

index_merge_df = pd.merge(alderman_df, population_df, left_index=True, right_index=True) 

print(index_merge_df)

Here's the output of the above code:

   ward_x         alderman           address         phone   state_x  ward_y  \
0       1    Vicky Mintoff  7 Eggendart Pass  773-450-9926  Illinois       1
1       3   Petrina Finney    2 Aberg Circle  312-915-4064  Illinois       2   
2       4   Kennith Gossop     9 Barnett Way  312-144-7339  Illinois       3   
3       5  Northrup Jaquet   3 Anhalt Street  309-237-8875  Illinois       5   
4       6    Kelby Thaxton   86 Drewry Drive  309-486-6591  Illinois       7   

   pop_2015  pop_2020  pop_change         city   state_y    zip  
0     25112     36778       11666      Chicago  Illinois  60691  
1     27557     43417       15860       Peoria  Illinois  61635  
2     27043     54184       27141      Chicago  Illinois  60604  
3     26360     37978       11618      Chicago  Illinois  60614  
4     27467     55985       28518  Springfield  Illinois  62794

Because neither DataFrame was given a custom index, the merge() function matches rows on the default integer index (0 to 4) of each dataset, not on the ward column. Every index label exists on both sides, so all five rows are returned and rows are simply paired by position. That's why ward_x and ward_y differ in rows 1, 2, and 4, and why the overlapping ward and state columns received the _x and _y suffixes.
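If the goal is to match on ward values rather than on row position, one option is to make ward the index of both DataFrames before merging. This is a sketch on reduced data, and the set_index() step is an addition that the example above doesn't include.

```python
import pandas as pd

alderman = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
}).set_index("ward")  # ward becomes the index

population = pd.DataFrame({
    "ward": [1, 2, 3],
    "pop_2020": [36778, 43417, 54184],
}).set_index("ward")

# Inner merge on the index labels: only wards present in both are kept.
merged = pd.merge(alderman, population, left_index=True, right_index=True)

print(merged)
```

Only wards 1 and 3 appear in the result, which matches what the inner join on the ward column produced earlier.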

Difference Between Merge and Concat

What's the difference between concat and merge? The pandas concat() function combines DataFrames or named Series along an axis, either vertically or horizontally. By default, concat() stacks DataFrames vertically. Setting the axis parameter to 1 places them side by side instead, aligned on their indexes. Unlike the merge() function, concat() supports only outer or inner joins, which apply to the labels on the other axis.

Here's the syntax for the pandas concat() function:

pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)
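Before the vertical example, here is a brief sketch of the axis and join parameters working together. The two small DataFrames and their index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames indexed by ward number.
a = pd.DataFrame(
    {"alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"]},
    index=[1, 3, 4],
)
b = pd.DataFrame({"pop_2020": [36778, 43417, 54184]}, index=[1, 2, 3])

# axis=1 places the frames side by side, aligned on index labels.
outer = pd.concat([a, b], axis=1)                # join="outer": union of labels
inner = pd.concat([a, b], axis=1, join="inner")  # intersection of labels

print(outer.shape)  # (4, 2): labels 1, 2, 3, and 4
print(inner)
```

The outer result fills the gaps with NaN, while the inner result keeps only labels 1 and 3.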

Let's see an example:

alderman_data2 = {
    'ward': [9, 10, 11, 12, 13],
    'alderman': ['Vicky Griffin', 'Blake Finney', 'Greg Theodore', 'Jacqueline Bob', 'Shelby Boston'],
    'address': ['7 Haas Circle', '2 Mayer Lane', '9 Harper Drive', '3 Grover Avenue', '86 Palm Drive'],
    'phone': ['312-978-5864', '217-352-4548', '312-568-4492', '815-254-3682', '309-590-9629'],
    'state': ['Illinois', 'Illinois', 'Illinois', 'Illinois', 'Illinois']
}

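alderman_df2 = pd.DataFrame(alderman_data2)

# Stack the two alderman DataFrames vertically (axis=0 is the default)
concat_df = pd.concat([alderman_df, alderman_df2])

print(concat_df)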

Here's the output of the above code:

   ward         alderman           address         phone     state
0     1    Vicky Mintoff  7 Eggendart Pass  773-450-9926  Illinois
1     3   Petrina Finney    2 Aberg Circle  312-915-4064  Illinois
2     4   Kennith Gossop     9 Barnett Way  312-144-7339  Illinois
3     5  Northrup Jaquet   3 Anhalt Street  309-237-8875  Illinois
4     6    Kelby Thaxton   86 Drewry Drive  309-486-6591  Illinois
0     9    Vicky Griffin     7 Haas Circle  312-978-5864  Illinois
1    10     Blake Finney      2 Mayer Lane  217-352-4548  Illinois
2    11    Greg Theodore    9 Harper Drive  312-568-4492  Illinois
3    12   Jacqueline Bob   3 Grover Avenue  815-254-3682  Illinois
4    13    Shelby Boston     86 Palm Drive  309-590-9629  Illinois

Two DataFrame objects alderman_df and alderman_df2 are combined vertically using the concat() function. The two datasets combined still retain their original indexes; however, this can be overridden by setting the ignore_index parameter argument to True.
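A short sketch of the ignore_index option, using two-row stand-ins for the alderman DataFrames:

```python
import pandas as pd

# Two-row stand-ins for alderman_df and alderman_df2.
first = pd.DataFrame({"ward": [1, 3], "alderman": ["Vicky Mintoff", "Petrina Finney"]})
second = pd.DataFrame({"ward": [9, 10], "alderman": ["Vicky Griffin", "Blake Finney"]})

# Default: each frame keeps its own index, so labels repeat.
stacked = pd.concat([first, second])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Renumbering is usually what you want when the original index carries no meaning, since duplicate labels make label-based lookups ambiguous.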

A common use case for the pandas concat() function is stacking similar data collected at different times. In the example above, a second part of the alderman list was appended to the first to make one longer list. With the keys parameter, concat() can also give the combined data a hierarchical index.
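The hierarchical index comes from the keys parameter of concat(). In this sketch the labels part_1 and part_2 are arbitrary names chosen for the example.

```python
import pandas as pd

# Two-row stand-ins for alderman_df and alderman_df2.
first = pd.DataFrame({"ward": [1, 3], "alderman": ["Vicky Mintoff", "Petrina Finney"]})
second = pd.DataFrame({"ward": [9, 10], "alderman": ["Vicky Griffin", "Blake Finney"]})

# keys adds an outer index level that records which input each row came from.
labeled = pd.concat([first, second], keys=["part_1", "part_2"])

print(labeled)

# Select every row that originated in the second DataFrame.
print(labeled.loc["part_2"])
```

This keeps the original row labels intact while still letting you trace each row back to its source.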

Difference Between Merge and Join

Besides the merge() and concat() functions, pandas provides the DataFrame join() method for combining different datasets into a new one. The join() method is similar to the merge() function in terms of parameters and operations. But the biggest difference between the two is that join() always matches against the index of the other DataFrame (using either the caller's index or, with the on parameter, one of the caller's columns), while merge() can match on columns or indexes on both sides. By default, the join() method performs a left join while merge() does an inner join.

A join() method syntax looks like this:

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False, validate=None)

Here's an example:

joined_df = alderman_df.join(population_df, on='ward', lsuffix='_x', rsuffix='_y')

print(joined_df)

Here's the output of the above code:

   ward_x         alderman           address         phone   state_x  ward_y  \
0       1    Vicky Mintoff  7 Eggendart Pass  773-450-9926  Illinois     2.0
1       3   Petrina Finney    2 Aberg Circle  312-915-4064  Illinois     5.0   
2       4   Kennith Gossop     9 Barnett Way  312-144-7339  Illinois     7.0   
3       5  Northrup Jaquet   3 Anhalt Street  309-237-8875  Illinois     NaN   
4       6    Kelby Thaxton   86 Drewry Drive  309-486-6591  Illinois     NaN   

   pop_2015  pop_2020  pop_change         city   state_y      zip  
0   27557.0   43417.0     15860.0       Peoria  Illinois  61635.0  
1   26360.0   37978.0     11618.0      Chicago  Illinois  60614.0  
2   27467.0   55985.0     28518.0  Springfield  Illinois  62794.0  
3       NaN       NaN         NaN          NaN       NaN      NaN  
4       NaN       NaN         NaN          NaN       NaN      NaN

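Notice a left join is performed: all rows from alderman_df are returned, and NaN marks the rows with no match in the right DataFrame, population_df. Also note what is being matched. With on='ward', join() compares the ward values of alderman_df against the index of population_df (0 to 4), not against its ward column. That's why ward 1 is paired with the population row at index 1 (ward 2, Peoria).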
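To have join() match ward against ward, one approach is to move ward into the index of the right DataFrame first. This is a sketch on reduced data; the set_index() call is an addition to the example above.

```python
import pandas as pd

alderman = pd.DataFrame({
    "ward": [1, 3, 4],
    "alderman": ["Vicky Mintoff", "Petrina Finney", "Kennith Gossop"],
})
population = pd.DataFrame({"ward": [1, 2, 3], "pop_2020": [36778, 43417, 54184]})

# join() looks up alderman["ward"] in the index of the other frame,
# so ward has to be that index for the values to line up.
joined = alderman.join(population.set_index("ward"), on="ward")

print(joined)
```

Wards 1 and 3 now pick up their own population figures, and ward 4 gets NaN because it is absent from population. No suffixes are needed here because ward is no longer a column on the right side.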

Conclusion

In summary, the pandas merge() operation is the most flexible means of combining datasets. It gives you the option of joining DataFrames based on indexes or columns. The function also performs different types of joins: inner, outer, left, right, and cross. It's often the most-used method or function for combining datasets in pandas. What's the alternative to merge in pandas? In general, the concat() and join() methods are alternatives for pandas merge.

The concat() function is ideal for stacking a series of DataFrames vertically. It can also join data tables horizontally, but it's limited to inner and outer joins.

The join() method is straightforward and often used when you need to combine datasets along their indexes. It performs a left join by default and also supports right, outer, inner, and cross joins.

Depending on your need, you can pick any of these methods to combine your datasets before making an analysis.
