Adventures in Machine Learning

Mastering Data Analysis with Pandas Merge Operations

Joining Pandas DataFrames using Merge

If you have ever worked with data and tables, you understand the need to combine tables to get more insights from your data. It is often not enough to have data in a single table; sometimes, you need to combine information from different tables to have a meaningful analysis.

In Pandas, joiningDataFrames is a common task used to merge tables with different types of relationships.

This article will help you understand the different ways to join data frames using Pandas merge from an inner join to an outer join.

We will explore each join to help you choose the best option based on your use case, as well as how to combine the different types of dataframe using Pandas.

Creating DataFrames to be Joined

Before we can explore data frame join, we need first to understand how to create DataFrames to be joined. Pandas allows creating data frames in different ways, including reading data from a file, dictionary, list, or a database.

Let’s create two data frames, which we will use for our example:

data1 = {’employee_id’: [‘1’, ‘2’, ‘3’, ‘4’, ‘5’],

‘name’: [‘John’, ‘Mary’, ‘Jessica’, ‘David’, ‘Joseph’],

‘location’: [‘London’, ‘Paris’, ‘New York’, ‘Melbourne’, ‘Beijing’],

‘department’: [‘HR’, ‘Marketing’, ‘IT’, ‘Finance’, ‘Operations’]}

data2 = {’employee_id’: [‘2’, ‘4’, ‘6’, ‘8’, ’10’],

‘salary’: [‘3000’, ‘5000’, ‘6000’, ‘7000’, ‘4000’],

‘start_date’: [‘2012-04-01’, ‘2018-05-07’, ‘2020-01-01’, ‘2017-05-15’, ‘2015-02-10’]}

# Create DataFrames using the data above

df1 = pd.DataFrame(data1)

df2 = pd.DataFrame(data2)

The first DataFrame, df1, contains employee details such as employee_id, name, location, and department. The second DataFrame, df2, contains salary and start_date details.

Notice that the employee_id column exists in both DataFrames; this column will be used to merge both tables.

Merging DataFrames Using Inner Join

The most common join operation is the inner join in Pandas, which merges two data frames based on a common column. An inner join returns only the rows where the column values match in both DataFrames.

In our example, we can merge the two DataFrames using inner join using the merge function as follows:

merged_inner = pd.merge(df1, df2, on=’employee_id’, how=’inner’)

print(merged_inner)

The output will be:

employee_id name location department salary start_date

1 2 Mary Paris Marketing 3000 2012-04-01

3 4 David Melbourne Finance 5000 2018-05-07

The merged inner Dataframe contains only the rows where the employee_id exists in both DataFrames.

Merging DataFrames Using Left Join

A left join returns all the records from the left table (df1) and matching records from the right table (df2). In other words, the resulting DataFrame will have all the records from the first DataFrame and only matching records from the second DataFrame.

We can merge the two DataFrames using left join as follows:

merged_left = pd.merge(df1, df2, on=’employee_id’, how=’left’)

print(merged_left)

The output will be:

employee_id name location department salary start_date

0 1 John London HR NaN NaN

1 2 Mary Paris Marketing 3000 2012-04-01

2 3 Jessica New York IT NaN NaN

3 4 David Melbourne Finance 5000 2018-05-07

4 5 Joseph Beijing Operations NaN NaN

Notice that all the employee_id from df1 is displayed with NaN displayed in salary and start_date columns. This shows that the employee_id exists in the left table and do not have a corresponding match in the right table.

Merging DataFrames Using Right Join

A right join is similar to a left join, but it returns all the records from the right table (df2), and only matching records from the left table (df1). In other words, the resulting DataFrame will have all the records from the second DataFrame and only matching records from the first DataFrame.

We can merge the two DataFrames using right join as follows:

merged_right = pd.merge(df1, df2, on=’employee_id’, how=’right’)

print(merged_right)

The output will be:

employee_id name location department salary start_date

0 2 Mary Paris Marketing 3000 2012-04-01

1 4 David Melbourne Finance 5000 2018-05-07

2 6 NaN NaN NaN 6000 2020-01-01

3 8 NaN NaN NaN 7000 2017-05-15

4 10 NaN NaN NaN 4000 2015-02-10

Notice that employee_id 2 and 4 are included as there is a match between the two DataFrames with the rest of the rows in the merged Dataframe as null.

Merging DataFrames Using Outer Join

An outer join (also known as a full outer join) returns all the records from both data frames. It returns a DataFrame containing values from both data frames in case of a match and NaN in case of no match.

We can merge the two DataFrames using outer join as follows:

merged_outer = pd.merge(df1, df2, on=’employee_id’, how=’outer’)

print(merged_outer)

The output will be:

employee_id name location department salary start_date

0 1 John London HR NaN NaN

1 2 Mary Paris Marketing 3000 2012-04-01

2 3 Jessica New York IT NaN NaN

3 4 David Melbourne Finance 5000 2018-05-07

4 5 Joseph Beijing Operations NaN NaN

5 6 NaN NaN NaN 6000 2020-01-01

6 8 NaN NaN NaN 7000 2017-05-15

7 10 NaN NaN NaN 4000 2015-02-10

This output shows all the employee_id from both DataFrames, with non-matching columns set to NaN.

Conclusion

In conclusion, Pandas merge function is an essential tool in data manipulation. It helps a data analyst to merge data frames and eliminate redundancy, making it easier to analyze data.

In this article, we explained the different join types from inner join, left join, right join, and outer join. With this knowledge, you can select the best join type based on your use case.

To sum up, this article discussed the process of joining Pandas DataFrames using Merge. We started by creating two DataFrames, and explored different types of DataFrame joins such as inner join, left join, right join, and outer join to help you choose the best option for your use case.

Merging DataFrames using Pandas is an essential and powerful tool for data analysts and scientists. It can help discover patterns and insights that might go unnoticed, and improve business decision-making processes.

Understanding how to combine Pandas DataFrames using a merge operation is critical for data manipulation and analysis.

Popular Posts