Adventures in Machine Learning

Mastering Fuzzy Matching and DataFrame Creation in Pandas

Fuzzy Matching in Pandas

Data manipulation and analysis with Pandas is a widespread practice in data science. Real-world datasets, however, often contain strings that should refer to the same thing but do not match exactly, such as misspellings, abbreviations, or inconsistent naming. That is where fuzzy matching comes in.

Fuzzy matching is a technique that is useful when we need to match strings that might have a few differences. Instead of having exact string matches, fuzzy matching algorithms allow searching and ranking elements that have similar strings.
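To make "similar" concrete, the sketch below scores pairs of strings with SequenceMatcher.ratio() from Python's difflib module, which returns a value between 0.0 (nothing in common) and 1.0 (identical). The similarity() helper is a name invented here for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()

exact = similarity("Brooklyn Nets", "Brooklyn Nets")     # identical strings score 1.0
typo = similarity("New York Knicks", "New York Kncks")   # one missing letter: close to 1.0
unrelated = similarity("Brooklyn Nets", "LA Lakers")     # few shared characters: low score

print(exact, round(typo, 3), round(unrelated, 3))
```

A fuzzy matcher then only has to rank candidates by this score and discard those below a chosen threshold.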

In pandas, we can achieve this with the get_close_matches() function from difflib, a module in Python's standard library.

Example of Fuzzy Matching in Pandas

Let’s say we have two data frames containing basketball teams; however, the team names are slightly different in each data frame, and we would like to merge them.

Below is an example of how we can use fuzzy matching techniques to achieve this:

import pandas as pd
from difflib import get_close_matches

df1 = pd.DataFrame({"team": ["New York Knicks", "Los Angeles Lakers", "Brooklyn Nets"]})
df2 = pd.DataFrame({"team": ["LA Lakers", "New York Kncks"]})

# Create an empty list to hold the matched team names
match_list = []

# Perform fuzzy matching: find the closest df2 name for each df1 name
for team_name in df1["team"]:
    match = get_close_matches(team_name, df2["team"], n=1, cutoff=0.6)
    if len(match) > 0:
        match_list.append(match[0])
    else:
        match_list.append("No Match")

# Create a new column in df1 containing the matched team names
df1["team_match"] = match_list

Result:

                 team      team_match
0     New York Knicks  New York Kncks
1  Los Angeles Lakers       LA Lakers
2       Brooklyn Nets        No Match

As the result shows, fuzzy matching paired "New York Knicks" with the misspelled "New York Kncks" and "Los Angeles Lakers" with the abbreviated "LA Lakers", since both pairs clear the 0.6 similarity cutoff. "Brooklyn Nets" has no counterpart in df2, so it is labeled "No Match".
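The cutoff argument controls how strict the matching is, so it is worth trying a few values on your own data. The sketch below is only illustrative: best_match() is a hypothetical helper, and the two cutoff values are arbitrary choices. It shows that a looser cutoff accepts an abbreviation such as "LA Lakers", while a stricter one keeps only near-identical spellings.

```python
from difflib import get_close_matches

candidates = ["LA Lakers", "New York Kncks"]

def best_match(name, cutoff):
    """Return the closest candidate scoring at least `cutoff`, or None."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Compare a lenient and a strict cutoff on the same inputs
for cutoff in (0.6, 0.8):
    print(cutoff,
          best_match("Los Angeles Lakers", cutoff),
          best_match("New York Knicks", cutoff))
```

A low cutoff finds more matches but risks pairing unrelated names; a high cutoff is safer but misses abbreviations.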

DataFrame Creation and Viewing

Creating a DataFrame is usually the first step in a data analysis. In Pandas, we can create a DataFrame from a dictionary, a list, or a NumPy ndarray.

The DataFrame can then be inspected with methods such as .head(), .tail(), .sample(), and .info().

Example:

import pandas as pd

# Create a DataFrame from a dictionary (keys become column names)
data = {"name": ["John", "Jane", "Doe", "Bob"],
        "age": [24, 31, 25, 19],
        "salary": [56000, 72000, 90000, 47000]}

df = pd.DataFrame(data)

# View the first rows of the DataFrame (head() returns up to 5 by default)
print(df.head())

Result:

   name  age  salary
0  John   24   56000
1  Jane   31   72000
2   Doe   25   90000
3   Bob   19   47000

As the result shows, we created a DataFrame from a dictionary and displayed it with .head(). Because .head() returns up to five rows by default and this DataFrame has only four, all of its rows are shown.
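The other creation routes and viewing methods mentioned above work in much the same way. The following sketch uses made-up data and variable names (df_from_list, df_from_array) purely for illustration.

```python
import numpy as np
import pandas as pd

# From a list of rows, naming the columns explicitly
df_from_list = pd.DataFrame(
    [["John", 24], ["Jane", 31], ["Doe", 25]],
    columns=["name", "age"],
)

# From a NumPy ndarray (all values share one dtype)
df_from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

print(df_from_list.tail(2))                      # last two rows
print(df_from_list.sample(n=2, random_state=0))  # two random rows, reproducible via the seed
df_from_array.info()                             # prints column dtypes and non-null counts
```

Note that .info() prints its summary directly instead of returning a DataFrame, so it is called without print().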

Conclusion

In this article, we have discussed two fundamental topics in Pandas: fuzzy matching and DataFrame creation/viewing. We have seen how to use the get_close_matches() function from the difflib module to perform fuzzy matching on strings, and we have also seen how to create and view DataFrames in Pandas.

These are essential skills in data analysis, and mastering them can go a long way in making your data manipulation and analysis easier and more efficient.

Data merging is one of the critical tasks in data analysis because it allows us to combine multiple datasets into one, making them easier to analyze together.

In Pandas, there are several ways to merge DataFrames based on their columns, and we will discuss two of them in this article.

Merging DataFrames based on column

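Merging DataFrames based on columns is a straightforward process. We use the pd.merge() function to merge two DataFrames on a specified column. By default, pd.merge() performs an inner join, keeping only the rows whose key appears in both DataFrames; the how argument selects a left, right, or outer join instead.

The syntax for merging DataFrames based on a column is as follows:

merged_df = pd.merge(df1, df2, on="column_name")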

Example:

Let’s say we have two data frames containing basketball team data, and we want to merge them based on the team column. Below is an example of how this can be achieved:

import pandas as pd

df1 = pd.DataFrame({"team": ["Lakers", "Celtics", "Warriors"], "city": ["Los Angeles", "Boston", "San Francisco"]})
df2 = pd.DataFrame({"team": ["Lakers", "Bulls", "Heat"], "points": [80, 60, 70]})

# Merge the two data frames on the team column, keeping every row of df1
merged_df = pd.merge(df1, df2, on="team", how="left")

Result:

       team           city  points
0    Lakers    Los Angeles    80.0
1   Celtics         Boston     NaN
2  Warriors  San Francisco     NaN

As the result shows, df1 and df2 are merged on the team column with a left join (how="left"), so every team in df1 is kept. Only the Lakers appear in both data frames; the Celtics and Warriors have no row in df2, so their points are NaN. Without how="left", pd.merge() would default to an inner join and return only the Lakers row.
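The how argument of pd.merge() decides which rows survive the merge. As a small sketch reusing the two data frames above, the following compares the default inner join with an outer join; indicator=True adds a _merge column that records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"team": ["Lakers", "Celtics", "Warriors"],
                    "city": ["Los Angeles", "Boston", "San Francisco"]})
df2 = pd.DataFrame({"team": ["Lakers", "Bulls", "Heat"], "points": [80, 60, 70]})

# Default how="inner": only teams present in both data frames
inner = pd.merge(df1, df2, on="team")

# how="outer": every team from either side; _merge shows the origin of each row
outer = pd.merge(df1, df2, on="team", how="outer", indicator=True)

print(inner)
print(outer)
```

The _merge column is a quick way to audit a merge: rows marked left_only or right_only are the ones that found no partner.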

Using fuzzy matching to merge DataFrames

Sometimes, when merging DataFrames, the column’s values may not match exactly, making it challenging to merge based on the column. Fuzzy matching can be an effective solution in such cases.

The get_close_matches() function is not part of Pandas; it comes from difflib, a module in Python's standard library, and returns the closest matching strings from a list of candidates. We can use it to merge DataFrames that have slightly different string values.

Example:

Continuing with the basketball team scenario, let’s say we have two data frames containing basketball team information, but the team names are slightly different in each data frame. Below is an example of how this can be achieved:

import pandas as pd
from difflib import get_close_matches

df1 = pd.DataFrame({"team_name": ["Los Angeles Lakers", "Boston Celtics", "Golden State Warriors"]})
df2 = pd.DataFrame({"team_name": ["LA Lakers", "Bulls", "Miami Heat"], "points": [80, 60, 70]})

# Create an empty list to hold the matched team names
match_list = []

# Perform fuzzy matching: find the closest df2 name for each df1 name
for team_name in df1["team_name"]:
    match = get_close_matches(team_name, df2["team_name"], n=1, cutoff=0.6)
    if len(match) > 0:
        match_list.append(match[0])
    else:
        match_list.append("No Match")

# Create a new column in df1 containing the matched team names
df1["team_match"] = match_list

# Merge the two data frames on the matched team names
merged_df = pd.merge(df1, df2, left_on="team_match", right_on="team_name", how="left")

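Result:

             team_name_x team_match team_name_y  points
0     Los Angeles Lakers  LA Lakers   LA Lakers    80.0
1         Boston Celtics   No Match         NaN     NaN
2  Golden State Warriors   No Match         NaN     NaN

As the result shows, fuzzy matching with get_close_matches() let us merge the two data frames even though the team names differ: "Los Angeles Lakers" is paired with "LA Lakers" and picks up its points, while the two teams without a close match get NaN. Because both data frames contain a team_name column, pandas appends the suffixes _x (from df1) and _y (from df2) to tell them apart.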
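The loop and merge above can be packaged into a reusable helper. The following is a sketch, not a pandas API: fuzzy_left_merge, the temporary _match column, and the _matched suffix are names invented here. It also uses None instead of the "No Match" string, so unmatched rows show up as ordinary missing values.

```python
import pandas as pd
from difflib import get_close_matches

def fuzzy_left_merge(left, right, key, cutoff=0.6):
    """Left-merge `right` onto `left`, pairing each left key with its closest right key.

    Hypothetical helper for illustration; not part of pandas.
    """
    candidates = right[key].tolist()

    def closest(value):
        found = get_close_matches(value, candidates, n=1, cutoff=cutoff)
        return found[0] if found else None  # None becomes a missing value, not a fake name

    # Store the closest right-hand key in a temporary column, merge on it, then drop it
    left = left.assign(_match=left[key].map(closest))
    merged = left.merge(right, left_on="_match", right_on=key,
                        how="left", suffixes=("", "_matched"))
    return merged.drop(columns="_match")

df1 = pd.DataFrame({"team_name": ["Los Angeles Lakers", "Boston Celtics", "Golden State Warriors"]})
df2 = pd.DataFrame({"team_name": ["LA Lakers", "Bulls", "Miami Heat"], "points": [80, 60, 70]})

result = fuzzy_left_merge(df1, df2, "team_name")
print(result)
```

Each row is matched independently, so two different names on the left can end up paired with the same name on the right; it is worth checking for such duplicates before trusting the merged data.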

Wrapping get_close_matches() to return the closest match

The get_close_matches() function has several parameters that can be tuned to get the desired output. It returns a list of up to n matches (3 by default) whose similarity score is at least cutoff (0.6 by default), ordered from most to least similar.

We cannot change the library function itself, but we can wrap it in a small helper that returns only the single closest match.

Example:

Let’s say we have a list of words, and we want to get the closest match for a given word.

Below is an example of a wrapper around get_close_matches() that returns only the closest match:

from difflib import get_close_matches

def get_closest_match(word, possibilities):
    # Ask for a single match (n=1) and unwrap it from the returned list
    match = get_close_matches(word, possibilities, n=1, cutoff=0.6)
    if len(match) > 0:
        return match[0]
    else:
        return "No Match"

# Create a list of words
words = ["house", "horse", "mouse", "dog", "cat", "rat"]

# Get the closest match for the word "haus"
closest_match = get_closest_match("haus", words)
print(closest_match)

Result:

house

As the result shows, the wrapper returns the string “house” rather than a list of matches. This is useful when we only need the single closest match, for example when filling a DataFrame column.
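When deciding whether to trust a match, it often helps to see its score as well. The variant below is a sketch under stated assumptions: closest_with_score is a hypothetical name, and the score is recomputed with SequenceMatcher.ratio(), passing the candidate first and the query second, the same order get_close_matches() uses in CPython's difflib.

```python
from difflib import SequenceMatcher, get_close_matches

def closest_with_score(word, possibilities, cutoff=0.6):
    """Return (best_match, score), or (None, 0.0) if nothing reaches `cutoff`."""
    found = get_close_matches(word, possibilities, n=1, cutoff=cutoff)
    if not found:
        return None, 0.0
    # Recompute the similarity of the winning candidate against the query word
    score = SequenceMatcher(None, found[0], word).ratio()
    return found[0], score

words = ["house", "horse", "mouse", "dog", "cat", "rat"]

match, score = closest_with_score("haus", words)
print(match, round(score, 2))
```

Returning None for "no match" instead of a placeholder string also makes the result easier to test with a simple `if match is None` check.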

Conclusion

In conclusion, merging DataFrames is a crucial task in data analysis. We have discussed two ways to do it: merging on an exactly matching column with pd.merge(), and using fuzzy matching to merge DataFrames whose values match only imperfectly.

We have also seen how to wrap get_close_matches() in a helper that returns only the closest match and, earlier, how to create and view DataFrames in Pandas.

Together, these techniques give a data analyst or scientist a solid foundation for working with large datasets whose values do not match exactly.
