
Mastering Data Cleaning: Removing Duplicate Rows in Pandas


Duplicate rows are a common problem, especially when working with large datasets. They can skew analysis results and lead to incorrect conclusions.

In this article, we’ll discuss how to remove duplicate rows from a Pandas DataFrame.

1) Removing Duplicates Based on Latest Timestamp

In some cases, we may have duplicate rows in our data, but we only want to keep the row with the latest timestamp. We can accomplish this by following the syntax below:

df.drop_duplicates(subset=['column_1', 'column_2'], keep='last')

In this call, the drop_duplicates() function takes two arguments – subset and keep.

The subset argument is used to specify columns that we want to check for duplicates. In our example, we’re checking for duplicates in columns ‘column_1’ and ‘column_2’.

The keep argument is set to ‘last’, which means the last occurrence of each duplicate is kept; if we wanted to keep the first occurrence instead, we would set keep to ‘first’. Note that ‘last’ refers to row order, not to timestamps, so this keeps the row with the latest timestamp only when the rows are sorted by time, as shown in the sketch below.
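To guarantee that, you can sort by the timestamp column before dropping duplicates. A minimal sketch, assuming the timestamp column is named ‘timestamp’:

df = df.sort_values('timestamp').drop_duplicates(subset=['column_1', 'column_2'], keep='last')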

2) Removing Duplicate Rows Based on a Specific Column

In other situations, we may only be interested in removing duplicates based on a specific column in our dataset. We can do this by using the following syntax:

df.drop_duplicates(subset=['column_name'])

In this example, we’re only checking for duplicates based on ‘column_name’.

We’re not concerned with duplicates in any other columns.

It’s important to note that drop_duplicates() does not modify the existing DataFrame by default – it returns a new DataFrame with the duplicates removed.

If you want to keep the result, assign the output of drop_duplicates() to a variable, or pass inplace=True to modify the DataFrame directly.
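For example (the variable name df_clean is just illustrative):

# Return a new DataFrame, leaving df unchanged
df_clean = df.drop_duplicates(subset=['column_name'])

# Modify df directly
df.drop_duplicates(subset=['column_name'], inplace=True)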

Examples

Let’s take a look at some examples to illustrate how to use the syntax above to remove duplicate rows.

Example 1 – Removing Duplicates Based on Latest Timestamp

First, let’s create a small DataFrame in which some rows share an id but have different timestamps:

import pandas as pd
data = {'id': [1, 1, 2, 3, 3], 
        'value': [10, 20, 30, 40, 50],
        'timestamp': ['2021-01-01 12:00:00', 
                      '2021-01-01 12:30:00', 
                      '2021-01-01 13:00:00', 
                      '2021-01-01 14:00:00', 
                      '2021-01-01 14:30:00']}
df = pd.DataFrame(data)

Our DataFrame looks like this:

   id  value            timestamp
0   1     10  2021-01-01 12:00:00
1   1     20  2021-01-01 12:30:00
2   2     30  2021-01-01 13:00:00
3   3     40  2021-01-01 14:00:00
4   3     50  2021-01-01 14:30:00
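Note: in a real dataset, you’d typically convert the timestamp column to a true datetime type so that sorting and comparisons behave correctly (the ISO-formatted strings used here happen to sort correctly as text):

df['timestamp'] = pd.to_datetime(df['timestamp'])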

Notice that rows 0 and 1 share the same id, as do rows 3 and 4. We only want to keep the row with the latest timestamp for each id.

We can use the syntax from earlier to remove these duplicates:

df.drop_duplicates(subset=['id'], keep='last', inplace=True)

Our updated DataFrame looks like this:

   id  value            timestamp
1   1     20  2021-01-01 12:30:00
2   2     30  2021-01-01 13:00:00
4   3     50  2021-01-01 14:30:00

Rows 0 and 3 have been removed; for each id we’ve kept only the row with the latest timestamp.
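Also note that the surviving rows keep their original index labels (1, 2, and 4). If you’d rather have a clean sequential index, you can reset it:

df = df.reset_index(drop=True)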

Example 2 – Removing Duplicate Rows Based on a Specific Column

Now let’s look at an example where we only want to remove duplicate rows based on a specific column. We’ll create another small DataFrame:

import pandas as pd
data = {'id': [1, 1, 2, 3, 3], 
        'value_1': [10, 20, 30, 40, 50],
        'value_2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data)

Our DataFrame looks like this:

   id  value_1  value_2
0   1       10        5
1   1       20        6
2   2       30        7
3   3       40        8
4   3       50        9

Notice that rows 0 and 1, as well as rows 3 and 4, are duplicates based on the ‘id’ column. We can use the following syntax to remove these duplicates:

df.drop_duplicates(subset=['id'], inplace=True)

Our updated DataFrame looks like this:

   id  value_1  value_2
0   1       10        5
2   2       30        7
3   3       40        8

As expected, rows 1 and 4 were removed since they were duplicates based on the ‘id’ column; because keep defaults to ‘first’, the first occurrence of each id is retained.
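If you want to inspect duplicates before removing them, the duplicated() method returns a boolean mask; with keep=False it flags every member of each duplicate group:

# View all rows that share an 'id' with at least one other row
print(df[df.duplicated(subset=['id'], keep=False)])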

3) Removing Rows Based on Duplicate Values in Multiple Columns

We will be using the syntax below, which is similar to the one we used before.

df.drop_duplicates(subset=['col1', 'col2', ...], keep='last', inplace=True)

This syntax identifies and removes rows with duplicate values in the columns listed in the subset parameter.

The keep parameter specifies which occurrence of a duplicate is kept and which ones are removed. Here we set it to ‘last’, which means only the last occurrence of each set of duplicate values in the column subset will be present in the resulting dataset.
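For reference, keep accepts three values; quick sketches with placeholder column names:

df.drop_duplicates(subset=['col1', 'col2'], keep='first')  # keep the first occurrence (the default)
df.drop_duplicates(subset=['col1', 'col2'], keep='last')   # keep the last occurrence
df.drop_duplicates(subset=['col1', 'col2'], keep=False)    # drop every row that has a duplicate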

Example 3 – Removing Duplicates Based on Multiple Columns

Let’s take a look at an example to see how this works in action. Suppose we have an imaginary dataset of sales made by various employees, and we want to remove rows with duplicate values in multiple columns.

We will create a DataFrame with various columns and rows.

import pandas as pd
# Creating the DataFrame
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'F', 'G'],
        'State': ['California', 'Missouri', 'Missouri', 'California', 'California', 'New Jersey', 'New Jersey', 'Ohio'],
        'Year': [2019, 2018, 2019, 2018, 2020, 2021, 2020, 2021],
        'Amount': [1200, 2200, 1200, 5200, 3300, 1100, 4500, 1200]
       }
df = pd.DataFrame(data)

And that returns the following DataFrame:

  Name        State  Year  Amount
0    A   California  2019    1200
1    B     Missouri  2018    2200
2    C     Missouri  2019    1200
3    D   California  2018    5200
4    E   California  2020    3300
5    F   New Jersey  2021    1100
6    F   New Jersey  2020    4500
7    G         Ohio  2021    1200

Notice that rows 5 and 6 are duplicates based on the Name and State columns. Suppose we want to remove duplicated rows based on those two columns to create a clean dataset like this:

  Name        State  Year  Amount
0    A   California  2019    1200
1    B     Missouri  2018    2200
2    C     Missouri  2019    1200
3    D   California  2018    5200
4    E   California  2020    3300
6    F   New Jersey  2020    4500
7    G         Ohio  2021    1200

Using the syntax we used earlier, we can remove the duplicated rows as shown below:

df.drop_duplicates(subset=['Name', 'State'], keep='last', inplace=True)

Here, keep is set to ‘last’, meaning that when several rows have the same values in the subset columns, the last one is kept while the rest are dropped. Row 5 is dropped because it has the same Name and State values as row 6.

The resulting DataFrame looks like this:

  Name        State  Year  Amount
0    A   California  2019    1200
1    B     Missouri  2018    2200
2    C     Missouri  2019    1200
3    D   California  2018    5200
4    E   California  2020    3300
6    F   New Jersey  2020    4500
7    G         Ohio  2021    1200
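As a quick sanity check, we can confirm that no duplicates remain on the chosen subset (this should print False):

print(df.duplicated(subset=['Name', 'State']).any())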

Conclusion

In this article, we explored how to remove duplicate rows from a Pandas DataFrame using the drop_duplicates() function. We covered three approaches: removing duplicates based on a specific column, keeping the row with the latest timestamp, and removing rows with duplicate values across multiple columns.

Removing duplicate rows is an essential cleaning step, since duplicates can skew analysis results and lead to incorrect conclusions. With these skills, you can confidently clean and transform large datasets with potential duplicate values, improving the accuracy of your analysis. Overall, drop_duplicates() is a powerful tool that every data analyst should know.
