Adventures in Machine Learning

Mastering Inner Join in Pandas: A Comprehensive Guide

Hey there, welcome to this article on performing inner join in Pandas. In this article, we will be discussing how to perform inner joins using the Pandas library and provide examples to help you understand the concepts better.

Syntax for Inner Join in Pandas

The syntax for performing inner joins in Pandas involves using the merge() function in Pandas. This function takes in two data frames as inputs.

The syntax for inner join is as follows:

pd.merge(df1, df2, on='common_column', how='inner')

Here, df1 and df2 are the two data frames that we want to merge. ‘common_column’ is the column that these two data frames have in common and on which we want to perform the join.

‘how’ parameter is set to ‘inner’ to perform an inner join.

Example of Inner Join in Pandas

Let’s consider an example to better understand how to perform inner joins. We will use two data frames that contain information about basketball teams.

Structure of Sample DataFrames

Let’s take a look at the information contained within the two data frames we will be using.

DataFrame 1:

  • basketball teams
  • team name
  • points

DataFrame 2:

  • basketball teams
  • team name
  • assists

In this example, we want to merge these two data frames based on the ‘basketball teams’ column.

Now, we are ready to write the code that will perform the inner join:

import pandas as pd

# Create DataFrame 1
df1 = pd.DataFrame({
        'basketball teams': ['LA Lakers', 'Golden State Warriors', 'LA Clippers', 'Houston Rockets'],
        'team name': ['Lakers', 'Warriors', 'Clippers', 'Rockets'],
        'points': [102, 98, 101, 105]})

# Create DataFrame 2
df2 = pd.DataFrame({
        'basketball teams': ['Golden State Warriors', 'LA Lakers', 'LA Clippers', 'Houston Rockets'],
        'team name': ['Warriors', 'Lakers', 'Clippers', 'Rockets'],
        'assists': [23, 18, 25, 21]})

# Merge the two data frames based on 'basketball teams'
result = pd.merge(df1, df2, on='basketball teams', how='inner')
print(result)

The output of this code will produce a new dataframe that combines the data from both DataFrame 1 and DataFrame 2, keeping only rows where there are corresponding entries in both data frames:

   basketball teams team name_x  points team name_y  assists
0         LA Lakers     Lakers     102      Lakers       18
1  Golden State Warriors   Warriors     98    Warriors      23
2        LA Clippers   Clippers    101    Clippers      25
3      Houston Rockets    Rockets    105     Rockets      21

As expected, the output consists of 4 rows, each representing information about the basketball teams. The column basketball teams shows the name of each team, followed by team name_x which is the name of the team in DataFrame 1, and team name_y which is the name of the team in DataFrame 2.

The respective points and assists columns show how many points and assists each team has.

Conclusion

In summary, you have learned how to perform inner joins using Pandas and merge data from two data frames, keeping only the rows where there are corresponding entries in both data frames. This can come in handy whenever you have data spread across different sources and want to combine it with ease.

Keep in mind that outer joins and left/right joins come in handy for combining data within complex scenarios.

3) Performing Inner Join on Sample DataFrames

We have already seen the syntax and example of performing an inner join on two sample data frames. Now let’s apply inner join on DataFrames 1 and 2 and see the result.

Applying Inner Join on DataFrames 1 and 2

import pandas as pd

# creating DataFrame 1
df1 = pd.DataFrame({"basketball teams": ["LA Lakers", "Golden State Warriors", "LA Clippers", "Houston Rockets"],
                   "team name": ["Lakers", "Warriors", "Clippers", "Rockets"],
                   "points": [102, 98, 101, 105]})

# creating DataFrame 2
df2 = pd.DataFrame({"basketball teams": ["Golden State Warriors", "LA Lakers", "LA Clippers", "Houston Rockets"],
                   "team name": ["Warriors", "Lakers", "Clippers", "Rockets"],
                   "assists": [23, 18, 25, 21]})

# applying inner join on DataFrames 1 and 2
result = pd.merge(df1, df2, on='basketball teams', how='inner')
print(result)

Result of Inner Join on DataFrames 1 and 2

The output of the code will produce a merged DataFrame that has combined the data based on the common column ‘basketball teams’ and kept only those rows that have a match in both the data frames.

        basketball teams  team name_x  points team name_y  assists
    0         LA Lakers       Lakers     102      Lakers       18
    1  Golden State Warriors   Warriors     98    Warriors      23
    2        LA Clippers     Clippers    101    Clippers      25
    3      Houston Rockets    Rockets    105     Rockets      21

We can observe that both data frames have a common column, ‘basketball teams,’ and thus are merged based on that column.

Duplicated columns, such as ‘team name,’ result in suffixes _x and _y, respectively. The result of the inner join shows only the rows where ‘basketball teams’ had a match in both data frames.

4) Using Merge Function to Perform Inner Join

In the previous example, we have seen the use of the merge function to perform an inner join on two data frames. Let’s understand the syntax of the pd.merge() function and use the same to perform the inner join.

Syntax for pd.merge() function

When working with Pandas, the pd.merge() function is used to combine two or more data frames into one, including the inner join operation. The syntax for the pd.merge() function is as follows:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

The parameters of the pd.merge() function provide various options to customize the merge operation.

The ‘left’ and ‘right’ parameters are the data frames that we want to merge. The ‘on’ parameter is the key column that we will use to join both data frames.

The above syntax can be further modified to incorporate an inner join. The ‘how’ parameter specifies the type of join operation to be performed, and we can pass the value of ‘inner’ to this parameter to apply inner join operation.

Let’s see an example of how to perform an inner join using the pd.merge() function. Applying Inner Join using pd.merge() function

import pandas as pd

# create data frames
df1 = pd.DataFrame({'basketball teams': ['LA Lakers', 'Golden State Warriors', 'LA Clippers', 'Houston Rockets'],
                    'team name': ['Lakers', 'Warriors', 'Clippers', 'Rockets'],
                    'points': [102, 98, 101, 105]})
df2 = pd.DataFrame({'basketball teams': ['Golden State Warriors', 'LA Lakers', 'LA Clippers', 'Houston Rockets'],
                    'team name': ['Warriors', 'Lakers', 'Clippers', 'Rockets'],
                    'assists': [23, 18, 25, 21]})

# Merge the two data frames based on the 'basketball teams' column
merged_df = pd.merge(df1, df2, on='basketball teams', how='inner')
print(merged_df)

Result of Inner Join using pd.merge() function

The output of the code will produce a merged DataFrame that has combined the data based on the common column ‘basketball teams’ and kept only those rows that have a match in both the data frames.

        basketball teams  team name_x  points team name_y  assists
    0         LA Lakers       Lakers     102      Lakers       18
    1  Golden State Warriors   Warriors     98    Warriors      23
    2        LA Clippers     Clippers    101    Clippers      25
    3      Houston Rockets    Rockets    105     Rockets      21

The result of the inner join shows only the rows where ‘basketball teams’ had a match in both data frames. In both examples, we have successfully applied inner join operation using two different approaches.

The pd.merge() function provides more flexibility when performing joins than the previous approach, as we can customize the join operation with more options provided by the function.

Conclusion

In this article, we discussed how to perform inner joins on two data frames using Pandas. We highlighted the syntax and provided examples for using merge() function and pd.merge() function to perform inner join.

The inner join operation allows us to combine and extract only the matching data from both data frames. This is useful when we need to work with data from multiple sources and merge it together to achieve better insights.

In this article, we learned about performing inner joins on two data frames using the Pandas library in Python. We saw two methods for performing inner joins: using the merge() function and the pd.merge() function.

Inner joins help us extract only the matching data from multiple sources that both share a common column. These types of joins are crucial in data analysis when we want to combine data from different sources.

The syntax and examples presented in this article provide a strong foundation for performing inner joins in Pandas. Remember to customize the join operation according to your specific needs and keep the merge_df object as the resulting merged data frame.

Popular Posts