Fuzzy Matching in Pandas
Data manipulation and analysis using Pandas is a widespread practice in data science. However, working with large datasets can be particularly challenging, especially when we have issues with mismatching or imperfectly matching strings; that’s where fuzzy matching comes in.
Fuzzy matching is a technique that is useful when we need to match strings that might have a few differences. Instead of having exact string matches, fuzzy matching algorithms allow searching and ranking elements that have similar strings.
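To make "similar strings" concrete, the sketch below scores string pairs with SequenceMatcher from Python's standard difflib module, which returns a ratio between 0 and 1. The helper name similarity is made up for this illustration, and the sample strings are chosen only for demonstration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity score in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

# A one-letter typo still scores close to 1 ...
typo_score = similarity("New York Knicks", "New York Kncks")
# ... while unrelated names score much lower.
unrelated_score = similarity("New York Knicks", "Brooklyn Nets")

print(f"typo: {typo_score:.2f}, unrelated: {unrelated_score:.2f}")
```

A fuzzy matcher then only has to decide how high a score must be before two strings count as "the same".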
With Pandas, we can use the get_close_matches() function from difflib, a module in Python’s standard library, to achieve this.
Example of Fuzzy Matching in Pandas
Let’s say we have two data frames containing basketball teams; however, the team names are slightly different in each data frame, and we would like to merge them.
Below is an example of how we can use fuzzy matching techniques to achieve this:
import pandas as pd
from difflib import get_close_matches
df1 = pd.DataFrame({"team": ["New York Knicks", "Los Angeles Lakers", "Brooklyn Nets"]})
df2 = pd.DataFrame({"team": ["LA Lakers", "New York Kncks"]})
# Create an empty list to hold the matched team names
match_list = []
# For each team in df1, look up the closest name in df2
for team_name in df1['team']:
    match = get_close_matches(team_name, df2['team'], n=1, cutoff=0.6)
    if len(match) > 0:
        match_list.append(match[0])
    else:
        match_list.append("No Match")
# Create a new column in df1 containing the matched team names
df1['team_match'] = match_list
Result:
team | team_match |
---|---|
New York Knicks | New York Kncks |
Los Angeles Lakers | LA Lakers |
Brooklyn Nets | No Match |
As we can see from the result above, fuzzy matching paired the team names despite a misspelling (New York Kncks) and an abbreviation (LA Lakers), while Brooklyn Nets, which has no counterpart in df2, was labeled No Match.
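How many names get paired depends heavily on the cutoff argument. The sketch below reruns the lookup above at two thresholds; the helper name match_teams and the 0.8 value are illustrative choices, not recommendations.

```python
import pandas as pd
from difflib import get_close_matches

df1 = pd.DataFrame({"team": ["New York Knicks", "Los Angeles Lakers", "Brooklyn Nets"]})
df2 = pd.DataFrame({"team": ["LA Lakers", "New York Kncks"]})

def match_teams(names, candidates, cutoff):
    """Return the closest candidate for each name, or "No Match" below the cutoff."""
    results = []
    for name in names:
        match = get_close_matches(name, candidates, n=1, cutoff=cutoff)
        results.append(match[0] if match else "No Match")
    return results

candidates = df2["team"].tolist()
loose = match_teams(df1["team"], candidates, cutoff=0.6)
strict = match_teams(df1["team"], candidates, cutoff=0.8)

# The stricter threshold keeps the typo match but rejects the abbreviation.
print(pd.DataFrame({"team": df1["team"], "cutoff_0.6": loose, "cutoff_0.8": strict}))
```

A higher cutoff reduces false matches at the cost of missing abbreviations, so it is worth inspecting a sample of matches before settling on a value.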
DataFrame Creation and Viewing
DataFrame creation is the first step in data analysis. In Pandas, we can create a DataFrame from a dictionary, a list, or a NumPy ndarray.
The DataFrame can be viewed using methods such as .head(), .tail(), .sample(), and .info().
Example:
import pandas as pd
# Create a DataFrame from a dictionary that maps column names to values
data = {'name': ['John', 'Jane', 'Doe', 'Bob'],
        'age': [24, 31, 25, 19],
        'salary': [56000, 72000, 90000, 47000]}
df = pd.DataFrame(data)
# View the first rows of the DataFrame (head() shows up to 5 by default)
print(df.head())
Result:
name | age | salary |
---|---|---|
John | 24 | 56000 |
Jane | 31 | 72000 |
Doe | 25 | 90000 |
Bob | 19 | 47000 |
As we can see from the result above, we created a DataFrame from a dictionary and viewed it with the .head() method, which returns up to the first five rows by default. This DataFrame has only four rows, so all of them are shown.
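The other construction routes and viewing methods mentioned above work along the same lines. The following is a minimal sketch; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# From a list of rows, with column names supplied separately
rows = [["John", 24], ["Jane", 31], ["Doe", 25]]
df_from_list = pd.DataFrame(rows, columns=["name", "age"])

# From a 2-D NumPy ndarray
array = np.arange(6).reshape(3, 2)
df_from_array = pd.DataFrame(array, columns=["a", "b"])

print(df_from_list.tail(2))                      # last two rows
print(df_from_list.sample(n=1, random_state=0))  # one random row, seeded for repeatability
df_from_array.info()                             # prints column dtypes and non-null counts
```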
Conclusion
In this article, we have discussed two fundamental topics in Pandas: fuzzy matching and DataFrame creation/viewing. We have seen how to use the get_close_matches() function from the difflib module to perform fuzzy matching on strings, and we have also seen how to create and view DataFrames in Pandas.
These are essential skills in data analysis, and mastering them can go a long way in making your data manipulation and analysis easier and more efficient.
Data merging is one of the critical tasks in data analysis because it allows us to combine multiple datasets into one, making it easier to analyze them together.
In Pandas, there are several ways to merge DataFrames based on their columns, and we will discuss two of them here.
1. Merging DataFrames based on column
Merging DataFrames based on columns is a straightforward process. We use the pd.merge() function to merge two DataFrames based on a specified column.
The syntax for merging DataFrames based on a column is as follows:
merged_df = pd.merge(df1, df2, on='column_name')
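pd.merge() performs an inner join unless told otherwise, so the how argument decides which rows survive. The sketch below compares the two most common settings on made-up frames (the names left, right, and the key column are illustrative); indicator=True adds a _merge column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "d"], "right_val": [10, 40]})

# Default (how='inner'): only keys present in both frames are kept
inner = pd.merge(left, right, on="key")

# how='left': every row of `left` is kept; missing right-hand values become NaN
left_join = pd.merge(left, right, on="key", how="left", indicator=True)

print(inner)
print(left_join)
```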
Example:
Let’s say we have two data frames containing basketball team data, and we want to merge them based on the team column. Below is an example of how this can be achieved:
import pandas as pd
df1 = pd.DataFrame({"team": ["Lakers", "Celtics", "Warriors"], "city": ["Los Angeles", "Boston", "San Francisco"]})
df2 = pd.DataFrame({"team": ["Lakers", "Bulls", "Heat"], "points": [80, 60, 70]})
# Merge the two data frames on the team column; how='left' keeps every row of df1
# (the default, how='inner', would keep only teams present in both data frames)
merged_df = pd.merge(df1, df2, on='team', how='left')
Result:
team | city | points |
---|---|---|
Lakers | Los Angeles | 80.0 |
Celtics | Boston | NaN |
Warriors | San Francisco | NaN |
As we can see from the result above, we have merged df1 and df2 based on the team column, and the resulting DataFrame contains each team’s city and points. Teams in df1 that do not appear in df2 have NaN in the points column.
2. Using fuzzy matching to merge DataFrames
Sometimes, when merging DataFrames, the column’s values may not match exactly, making it challenging to merge based on the column. Fuzzy matching can be an effective solution in such cases.
The get_close_matches() function comes from difflib, a module in Python’s standard library rather than part of Pandas, and returns the closest matching strings from a list of candidates. We can use this function to merge DataFrames that have slightly different string values.
Example:
Continuing with the basketball team scenario, let’s say we have two data frames containing basketball team information, but the team names are slightly different in each data frame. Below is an example of how they can still be merged:
import pandas as pd
from difflib import get_close_matches
df1 = pd.DataFrame({"team_name": ["Los Angeles Lakers", "Boston Celtics", "Golden State Warriors"]})
df2 = pd.DataFrame({"team_name": ["LA Lakers", "Bulls", "Miami Heat"], "points": [80, 60, 70]})
# Create an empty list to hold the matched team names
match_list = []
# For each team in df1, look up the closest name in df2
for team_name in df1['team_name']:
    match = get_close_matches(team_name, df2['team_name'], n=1, cutoff=0.6)
    if len(match) > 0:
        match_list.append(match[0])
    else:
        match_list.append("No Match")
# Create a new column in df1 containing the matched team names
df1['team_match'] = match_list
# Merge the two data frames on the matched team names, keeping every row of df1
merged_df = pd.merge(df1, df2, left_on='team_match', right_on='team_name', how='left')
Result:
team_name_x | team_match | team_name_y | points |
---|---|---|---|
Los Angeles Lakers | LA Lakers | LA Lakers | 80.0 |
Boston Celtics | No Match | NaN | NaN |
Golden State Warriors | No Match | NaN | NaN |
As we can see from the result above, we have used fuzzy matching with the get_close_matches() function to merge the two data frames based on the team names.
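The match-then-merge steps above can be folded into one reusable function. The following is only a sketch: fuzzy_left_merge, the temporary _match column, and the _matched suffix are hypothetical names introduced here, and the helper assumes the key columns contain plain strings.

```python
import pandas as pd
from difflib import get_close_matches

def fuzzy_left_merge(left, right, left_on, right_on, cutoff=0.6):
    """Hypothetical helper: left-join `right` onto `left` via the closest string match."""
    candidates = right[right_on].dropna().unique().tolist()

    def closest(value):
        matches = get_close_matches(value, candidates, n=1, cutoff=cutoff)
        # Use None rather than a "No Match" sentinel so unmatched rows show up as missing
        return matches[0] if matches else None

    keyed = left.assign(_match=left[left_on].map(closest))
    merged = keyed.merge(right, left_on="_match", right_on=right_on,
                         how="left", suffixes=("", "_matched"))
    return merged.drop(columns="_match")

df1 = pd.DataFrame({"team_name": ["Los Angeles Lakers", "Boston Celtics", "Golden State Warriors"]})
df2 = pd.DataFrame({"team_name": ["LA Lakers", "Bulls", "Miami Heat"], "points": [80, 60, 70]})

result = fuzzy_left_merge(df1, df2, "team_name", "team_name")
print(result)
```

Returning None for unmatched rows lets the usual Pandas tools such as isna() and dropna() find them, which a "No Match" string would not.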
Wrapping get_close_matches() to return only the closest match
The get_close_matches() function has several parameters that can be tuned to get the desired output. By default, it returns a list of up to n (3 by default) of the closest matches whose similarity score is at least cutoff (0.6 by default), in descending order of similarity.
However, we can wrap it in a small helper function that returns only the closest match.
Example:
Let’s say we have a list of words, and we want to get the closest match for a given word.
Below is an example of a wrapper around get_close_matches() that returns only the closest match:
from difflib import get_close_matches
def get_closest_match(word, possibilities):
    # Ask for a single best match; the result is a list with zero or one element
    match = get_close_matches(word, possibilities, n=1, cutoff=0.6)
    if len(match) > 0:
        return match[0]
    else:
        return "No Match"
# Create a list of words
words = ["house", "horse", "mouse", "dog", "cat", "rat"]
# Get the closest match for the word "haus"
closest_match = get_closest_match("haus", words)
print(closest_match)
Result:
house
As we can see from the result above, our wrapper around get_close_matches() returns only the closest match, the string "house", instead of a list of closest matches. This can be useful when we only need the single best match.
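The n and cutoff parameters can also be explored directly on the same word list. This is a small sketch; the cutoff values 0.4 and 1.0 are arbitrary choices for illustration.

```python
from difflib import get_close_matches

words = ["house", "horse", "mouse", "dog", "cat", "rat"]

# Defaults (n=3, cutoff=0.6): only sufficiently similar words are returned
default_matches = get_close_matches("haus", words)

# A lower cutoff admits weaker candidates; the best match still comes first
loose_matches = get_close_matches("haus", words, n=3, cutoff=0.4)

# A cutoff of 1.0 only accepts an identical string, so the list is empty here
exact_only = get_close_matches("haus", words, cutoff=1.0)

print(default_matches, loose_matches, exact_only)
```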
Conclusion
In conclusion, merging DataFrames is a crucial task in data analysis. We have discussed two ways to merge DataFrames: merging on a shared column, and using fuzzy matching to merge DataFrames whose values match only imperfectly.
We have also seen how to wrap the get_close_matches() function to return only the closest match. These techniques are valuable for data analysts, especially when working with large datasets.
In summary, this article has covered fuzzy matching, DataFrame creation and viewing, and merging DataFrames in Pandas. Understanding these techniques creates a strong foundation for any data analyst or scientist, making it easier to work with datasets whose values match imperfectly.