Removing Duplicate Rows from a Pandas DataFrame
Duplicate rows are a common problem, especially in large datasets. They can skew analysis results and lead to incorrect conclusions.
In this article, we'll discuss how to remove duplicate rows from a Pandas DataFrame.
1) Removing Duplicates Based on Latest Timestamp
In some cases, we may have duplicate rows in our data, but we only want to keep the row with the latest timestamp. We can accomplish this by following the syntax below:
```
df.drop_duplicates(subset=['column_1', 'column_2'], keep='last')
```
Here, the `drop_duplicates()` function is called with two arguments: `subset` and `keep`.
The `subset` argument specifies the columns we want to check for duplicates. In our example, we're checking for duplicates in the columns 'column_1' and 'column_2'.
The `keep` argument is set to 'last', which means we keep the last of the duplicated rows, i.e. the one with the latest timestamp, provided the rows are ordered by time. If we wanted to keep the row with the earliest timestamp instead, we would set `keep` to 'first'.
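As a quick illustration (a minimal sketch with hypothetical `id` and `timestamp` columns), sorting by the timestamp column first guarantees that `keep='last'` really does keep the most recent row:
```
import pandas as pd

# Hypothetical data: two rows share id 1, recorded at different times.
df = pd.DataFrame({'id': [1, 1, 2],
                   'timestamp': pd.to_datetime(['2021-01-01 12:00',
                                                '2021-01-01 12:30',
                                                '2021-01-01 13:00'])})

# Sort by timestamp so keep='last' retains the newest row for each id.
latest = df.sort_values('timestamp').drop_duplicates(subset=['id'], keep='last')
print(latest)
```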
2) Removing Duplicate Rows Based on a Specific Column
In other situations, we may only be interested in removing duplicates based on a specific column in our dataset. We can do this by using the following syntax:
```
df.drop_duplicates(subset=['column_name'])
```
In this example, we're only checking for duplicates based on 'column_name'; we're not concerned with duplicates in any other columns.
It's important to note that, by default, `drop_duplicates()` returns a new DataFrame and leaves the original unchanged.
If you want to modify the existing DataFrame directly, pass `inplace=True`; otherwise, assign the output of `drop_duplicates()` to a new variable.
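For example (a small sketch with a made-up column name):
```
import pandas as pd

df = pd.DataFrame({'column_name': [1, 1, 2]})

# Default behaviour: drop_duplicates() returns a new DataFrame and leaves df alone.
deduped = df.drop_duplicates(subset=['column_name'])
print(len(df), len(deduped))  # 3 2

# Passing inplace=True modifies df itself and returns None.
df.drop_duplicates(subset=['column_name'], inplace=True)
print(len(df))  # 2
```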
Examples
Let's take a look at some examples to illustrate how to use the syntax above to remove duplicate rows.
Example 1 – Removing Duplicates Based on Latest Timestamp
First, let's create a small DataFrame containing duplicate ids recorded at different timestamps:
```
import pandas as pd

data = {'id': [1, 1, 2, 3, 3],
        'value': [10, 20, 30, 40, 50],
        'timestamp': ['2021-01-01 12:00:00',
                      '2021-01-01 12:30:00',
                      '2021-01-01 13:00:00',
                      '2021-01-01 14:00:00',
                      '2021-01-01 14:30:00']}
df = pd.DataFrame(data)
```
Our DataFrame looks like this:
```
   id  value            timestamp
0   1     10  2021-01-01 12:00:00
1   1     20  2021-01-01 12:30:00
2   2     30  2021-01-01 13:00:00
3   3     40  2021-01-01 14:00:00
4   3     50  2021-01-01 14:30:00
```
Notice that rows 0 and 1, as well as rows 3 and 4, are duplicates based on the 'id' column. We only want to keep the row with the latest timestamp in each case.
We can use the syntax from earlier to remove these duplicates:
```
df.drop_duplicates(subset=['id'], keep='last', inplace=True)
```
Our updated DataFrame looks like this:
```
   id  value            timestamp
1   1     20  2021-01-01 12:30:00
2   2     30  2021-01-01 13:00:00
4   3     50  2021-01-01 14:30:00
```
Rows 0 and 3 have been removed since they were duplicates, and we've kept only the row with the latest timestamp for each id.
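Also note that the surviving rows keep their original index labels (1, 2, 4). If you'd prefer a fresh 0-based index, one option (a small sketch; `ignore_index` requires pandas 1.0 or later) is:
```
# Option 1: reset the index after dropping duplicates.
df = df.reset_index(drop=True)
print(df)

# Option 2 (pandas >= 1.0): have drop_duplicates renumber the index itself.
# df = pd.DataFrame(data).drop_duplicates(subset=['id'], keep='last', ignore_index=True)
```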
Example 2 – Removing Duplicate Rows Based on a Specific Column
Now let's look at an example where we only want to remove duplicate rows based on a specific column. We'll create another small DataFrame:
```
import pandas as pd

data = {'id': [1, 1, 2, 3, 3],
        'value_1': [10, 20, 30, 40, 50],
        'value_2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data)
```
Our DataFrame looks like this:
```
   id  value_1  value_2
0   1       10        5
1   1       20        6
2   2       30        7
3   3       40        8
4   3       50        9
```
Notice that rows 0 and 1, as well as rows 3 and 4, are duplicates based on the 'id' column. We can use the following syntax to remove these duplicates:
```
df.drop_duplicates(subset=['id'], inplace=True)
```
Our updated DataFrame looks like this:
```
   id  value_1  value_2
0   1       10        5
2   2       30        7
3   3       40        8
```
As expected, rows 1 and 4 were removed, since they were duplicates based on the 'id' column and `keep` defaults to 'first'.
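If you'd rather inspect which rows would be flagged before dropping anything, the related `duplicated()` method returns a boolean mask (a quick sketch that rebuilds the DataFrame from the same data dictionary, since we dropped rows in place above):
```
# Rebuild the original DataFrame and flag duplicate ids.
df = pd.DataFrame(data)
mask = df.duplicated(subset=['id'])
print(mask)      # True for rows 1 and 4
print(df[mask])  # the rows that drop_duplicates() would remove by default
```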
So far, we've covered two ways to remove duplicate rows from a Pandas DataFrame with the `drop_duplicates()` function: based on a specific column, and based on the latest timestamp.
There are also instances when we have to remove rows with duplicate values in multiple columns, that is, rows where several columns all share the same values. This is a scenario we should be prepared for when working with large datasets, and we'll cover it next using the same `drop_duplicates()` function.
3) Removing Rows Based on Duplicate Values in Multiple Columns
We will be using the syntax below, which is similar to the one we used before.
```
df.drop_duplicates(subset=['col1', 'col2', ...], keep='last', inplace=True)
```
This syntax identifies and removes rows based on duplicate values in the columns listed in the `subset` parameter.
The `keep` parameter specifies which of the duplicate entries is kept and which are removed. Here we set it to 'last', which means only the last occurrence of each duplicated combination in the column subset will remain in the resulting DataFrame.
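Besides 'last', `keep` also accepts 'first' (the default) and `False`, which drops every occurrence of a duplicated combination. A short sketch with hypothetical columns `col1`, `col2`, and `val`:
```
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'],
                   'col2': [1, 1, 2],
                   'val': [10, 20, 30]})

print(df.drop_duplicates(subset=['col1', 'col2'], keep='first'))  # keeps the val=10 row
print(df.drop_duplicates(subset=['col1', 'col2'], keep='last'))   # keeps the val=20 row
print(df.drop_duplicates(subset=['col1', 'col2'], keep=False))    # drops both duplicated rows
```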
Example 3 – Removing Duplicates Based on Multiple Columns
Let’s take a look at an example to see how this looks in action. Suppose we have an imaginary dataset of sales done by various employees where we’re looking to remove rows with duplicate values in multiple columns.
We will create a DataFrame with several columns and rows.
```
import pandas as pd

# Creating the DataFrame
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'F', 'G'],
        'State': ['California', 'Missouri', 'Missouri', 'California',
                  'California', 'New Jersey', 'New Jersey', 'Ohio'],
        'Year': [2019, 2018, 2019, 2018, 2020, 2021, 2020, 2021],
        'Amount': [1200, 2200, 1200, 5200, 3300, 1100, 4500, 1200]}
df = pd.DataFrame(data)
```
And that returns the following DataFrame:
```
  Name       State  Year  Amount
0    A  California  2019    1200
1    B    Missouri  2018    2200
2    C    Missouri  2019    1200
3    D  California  2018    5200
4    E  California  2020    3300
5    F  New Jersey  2021    1100
6    F  New Jersey  2020    4500
7    G        Ohio  2021    1200
```
Notice that rows 5 and 6 have the same values in both the `Name` and `State` columns. Suppose we want to remove duplicated rows based on these two columns to create a clean dataset like this:
```
  Name       State  Year  Amount
0    A  California  2019    1200
1    B    Missouri  2018    2200
2    C    Missouri  2019    1200
3    D  California  2018    5200
4    E  California  2020    3300
6    F  New Jersey  2020    4500
7    G        Ohio  2021    1200
```
Using the syntax we used earlier, we can remove the duplicated rows as shown below:
```
df.drop_duplicates(subset=['Name', 'State'], keep='last', inplace=True)
```
Here, `keep` is set to 'last', meaning that when several rows share the same values in the subset, only the last occurrence is kept and the rest are dropped. In our case, row 5 is dropped because it shares its `Name` and `State` values with row 6.
The resulting DataFrame looks like this:
```
  Name       State  Year  Amount
0    A  California  2019    1200
1    B    Missouri  2018    2200
2    C    Missouri  2019    1200
3    D  California  2018    5200
4    E  California  2020    3300
6    F  New Jersey  2020    4500
7    G        Ohio  2021    1200
```
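Before dropping rows in a real dataset, it can be worth checking how many rows are actually affected. One way to do this (a sketch using `duplicated()` on the same column subset, rebuilding the DataFrame from the data dictionary above):
```
# Rebuild the sales DataFrame and count duplicated Name/State combinations.
df = pd.DataFrame(data)
n_dupes = df.duplicated(subset=['Name', 'State']).sum()
print(f"{n_dupes} duplicate row(s) would be removed")  # 1 with the data above
```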
Conclusion
In this article, we've explored how to remove duplicate rows from a Pandas DataFrame using the `drop_duplicates()` function. We covered three approaches: removing duplicates based on a specific column, keeping only the row with the latest timestamp, and removing rows with duplicate values in multiple columns. Removing duplicate rows matters because leaving them in can produce inaccurate analysis results and skewed conclusions.
With these skills, you can confidently take a large dataset with duplicate entries, clean and transform it, and improve the accuracy of your analysis. Overall, the `drop_duplicates()` function is a powerful tool that every data analyst should know.