Removing Duplicate Rows from a Pandas DataFrame
Duplicate rows are a common problem, especially when working with large datasets. They can skew analysis results and lead to incorrect conclusions.
In this article, we’ll discuss how to remove duplicate rows from a Pandas DataFrame.
1) Removing Duplicates Based on Latest Timestamp
In some cases, we may have duplicate rows in our data, but we only want to keep the row with the latest timestamp. Assuming the rows are sorted by timestamp in ascending order, we can accomplish this with the syntax below:
df.drop_duplicates(subset=['column_1', 'column_2'], keep='last')
The drop_duplicates() function takes two arguments here: subset and keep.
The subset argument specifies which columns to check for duplicates. In our example, we’re checking for duplicates in the columns ‘column_1’ and ‘column_2’.
The keep argument is set to ‘last’, which keeps the last occurrence of each duplicate. Since the rows are sorted by timestamp, that is the row with the latest timestamp; if we wanted to keep the row with the earliest timestamp, we would set keep to ‘first’.
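Note that keep='last' operates on row order, not on timestamp values, so rows that are out of chronological order should be sorted first. Here is a minimal sketch, assuming a hypothetical ‘timestamp’ column alongside the two key columns:
import pandas as pd

# Minimal illustration: sort by timestamp so keep='last' keeps the latest row
df = pd.DataFrame({'column_1': [1, 1], 'column_2': ['a', 'a'],
                   'timestamp': ['2021-01-02', '2021-01-01']})
df['timestamp'] = pd.to_datetime(df['timestamp'])  # parse the strings as datetimes
df = df.sort_values('timestamp')                   # order the rows chronologically
df = df.drop_duplicates(subset=['column_1', 'column_2'], keep='last')
print(df)  # keeps the 2021-01-02 row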
2) Removing Duplicate Rows Based on a Specific Column
In other situations, we may only be interested in removing duplicates based on a specific column in our dataset. We can do this by using the following syntax:
df.drop_duplicates(subset=['column_name'])
In this example, we’re only checking for duplicates based on ‘column_name’.
We’re not concerned with duplicates in any other columns.
It’s important to note that, by default, drop_duplicates() does not modify the existing DataFrame; it returns a new DataFrame with the duplicates removed. To keep the result, either assign the output of drop_duplicates() to a new variable or pass inplace=True.
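A quick sketch of both styles, using a hypothetical one-column DataFrame:
import pandas as pd

df = pd.DataFrame({'column_name': [1, 1, 2]})

# Default behaviour: returns a new DataFrame and leaves df untouched
df_unique = df.drop_duplicates(subset=['column_name'])

# Alternative: modify df itself
df.drop_duplicates(subset=['column_name'], inplace=True)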
Examples
Let’s take a look at some examples to illustrate how to use the syntax above to remove duplicate rows.
Example 1 – Removing Duplicates Based on Latest Timestamp
First, let’s create a small DataFrame with duplicate rows based on different timestamps:
import pandas as pd

data = {'id': [1, 1, 2, 3, 3],
        'value': [10, 20, 30, 40, 50],
        'timestamp': ['2021-01-01 12:00:00',
                      '2021-01-01 12:30:00',
                      '2021-01-01 13:00:00',
                      '2021-01-01 14:00:00',
                      '2021-01-01 14:30:00']}
df = pd.DataFrame(data)
Our DataFrame looks like this:
   id  value            timestamp
0   1     10  2021-01-01 12:00:00
1   1     20  2021-01-01 12:30:00
2   2     30  2021-01-01 13:00:00
3   3     40  2021-01-01 14:00:00
4   3     50  2021-01-01 14:30:00
Notice that rows 0 and 1 share the same id, as do rows 3 and 4. In each case, we only want to keep the row with the latest timestamp.
We can use the syntax from earlier to remove these duplicates:
df.drop_duplicates(subset=['id'], keep='last', inplace=True)
Our updated DataFrame looks like this:
   id  value            timestamp
1   1     20  2021-01-01 12:30:00
2   2     30  2021-01-01 13:00:00
4   3     50  2021-01-01 14:30:00
Rows 0 and 3 have been removed, and for each id we have kept only the row with the latest timestamp. Recall that keep='last' keeps the last occurrence in row order; it corresponds to the latest timestamp here because the rows are sorted chronologically.
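Note that the surviving rows keep their original index labels (1, 2, and 4). If you prefer a fresh zero-based index, you can chain reset_index(drop=True), as in this small follow-up:
df = df.reset_index(drop=True)  # renumber the rows 0, 1, 2 and discard the old index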
Example 2 – Removing Duplicate Rows Based on a Specific Column
Now let’s look at an example where we only want to remove duplicate rows based on a specific column. We’ll create another small DataFrame:
import pandas as pd

data = {'id': [1, 1, 2, 3, 3],
        'value_1': [10, 20, 30, 40, 50],
        'value_2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data)
Our DataFrame looks like this:
   id  value_1  value_2
0   1       10        5
1   1       20        6
2   2       30        7
3   3       40        8
4   3       50        9
Notice that rows 0 and 1 are duplicates based on the ‘id’ column, as are rows 3 and 4. We can use the following syntax to remove these duplicates:
df.drop_duplicates(subset=['id'], inplace=True)
Our updated DataFrame looks like this:
   id  value_1  value_2
0   1       10        5
2   2       30        7
3   3       40        8
As expected, rows 1 and 4 were removed since they were duplicates based on the ‘id’ column. Because keep defaults to ‘first’, the first occurrence of each id was kept.
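As a side note, drop_duplicates() also accepts keep=False, which drops every row involved in a duplication instead of keeping one representative. A quick sketch on the same data:
# keep=False drops all rows whose id appears more than once
df_no_dupes = pd.DataFrame(data).drop_duplicates(subset=['id'], keep=False)
print(df_no_dupes)  # only the id 2 row remains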
3) Removing Rows Based on Duplicate Values in Multiple Columns
We will be using the syntax below, which is similar to the one we used before.
df.drop_duplicates(subset=['col1', 'col2', ...], keep='last', inplace=True)
This syntax identifies and removes rows based on duplicate values in the multiple columns specified in the subset parameter.
The keep parameter controls which duplicate entries are removed and which are kept. Here we set it to ‘last’, which means only the last occurrence of each set of duplicate values in the column subset will be present in the resulting dataset.
Example 3 – Removing Duplicates in Multiple Columns
Let’s take a look at an example to see how this looks in action. Suppose we have an imaginary dataset of sales done by various employees where we’re looking to remove rows with duplicate values in multiple columns.
We will create a DataFrame with various columns and rows.
import pandas as pd

# Creating the DataFrame
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'F', 'G'],
        'State': ['California', 'Missouri', 'Missouri', 'California',
                  'California', 'New Jersey', 'New Jersey', 'Ohio'],
        'Year': [2019, 2018, 2019, 2018, 2020, 2021, 2020, 2021],
        'Amount': [1200, 2200, 1200, 5200, 3300, 1100, 4500, 1200]}
df = pd.DataFrame(data)
And that returns the following DataFrame:
  Name       State  Year  Amount
0    A  California  2019    1200
1    B    Missouri  2018    2200
2    C    Missouri  2019    1200
3    D  California  2018    5200
4    E  California  2020    3300
5    F  New Jersey  2021    1100
6    F  New Jersey  2020    4500
7    G        Ohio  2021    1200
Suppose we want to remove duplicated rows based on the Name and State columns to create a clean dataset like this:
  Name       State  Year  Amount
0    A  California  2019    1200
1    B    Missouri  2018    2200
2    C    Missouri  2019    1200
3    D  California  2018    5200
4    E  California  2020    3300
6    F  New Jersey  2020    4500
7    G        Ohio  2021    1200
Using the syntax we used earlier, we can remove the duplicated rows as shown below:
df.drop_duplicates(subset=['Name', 'State'], keep='last', inplace=True)
Here, keep is set to ‘last’, meaning that when several rows share the same values in the subset columns, only the last occurrence is kept while the rest are dropped. In our data, rows 5 and 6 have the same values in both the Name and State columns, so row 5 is dropped.
The resulting DataFrame looks like this:
  Name       State  Year  Amount
0    A  California  2019    1200
1    B    Missouri  2018    2200
2    C    Missouri  2019    1200
3    D  California  2018    5200
4    E  California  2020    3300
6    F  New Jersey  2020    4500
7    G        Ohio  2021    1200
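Before dropping anything, it can be useful to inspect which rows would be affected. The duplicated() method returns a boolean mask using the same subset and keep semantics as drop_duplicates(); here is a quick sketch that re-creates the DataFrame from data and flags the row that keep='last' would drop:
# Re-create the original DataFrame, then flag rows that keep='last' would drop
df2 = pd.DataFrame(data)
dropped = df2[df2.duplicated(subset=['Name', 'State'], keep='last')]
print(dropped)  # row 5: the earlier (F, New Jersey) entry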
Conclusion
In this article, we explored how to remove duplicate rows from a Pandas DataFrame using the drop_duplicates() function. We covered three approaches: removing duplicates while keeping the row with the latest timestamp, removing duplicates based on a specific column, and removing rows with duplicate values across multiple columns.
Removing duplicate rows matters because duplicates can lead to inaccurate analysis results and skewed conclusions. With these skills, you can confidently clean and transform large datasets with duplicate entries and improve the accuracy of your analysis. Overall, drop_duplicates() is a powerful tool that every data analyst should know.