Removing Duplicate Rows in Pandas DataFrames
Dataframe manipulation in pandas is a core skill in data analysis. Duplicate rows are a common problem: they inflate counts, skew averages and other aggregates, and can lead to misleading results.
Pandas provides the ‘drop_duplicates()’ method to deal with this issue. It can remove rows that are duplicated across all columns, or across a chosen subset of columns, in any pandas dataframe.
In this article, we will discuss the different ways to remove duplicate rows in a pandas dataframe.
Using drop_duplicates() function:
The ‘drop_duplicates()’ function is a built-in dataframe method that returns a new dataframe with the duplicate rows removed; the original dataframe is left unchanged unless the result is assigned back to it.
By default, it compares all the columns in the dataframe when deciding whether two rows are duplicates. The syntax for using this function is as follows:
df.drop_duplicates()
Where ‘df’ is the variable that stores the dataframe.
However, we need to specify additional arguments to perform specific tasks such as removing duplicates across specific columns.
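As a minimal sketch of the basic call, the snippet below builds a small dataframe and removes its repeated row. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: the second and third rows are identical.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Lima"], "visits": [3, 5, 5]})

# drop_duplicates() returns a new DataFrame; df itself is not modified.
deduped = df.drop_duplicates()

print(len(df))       # 3
print(len(deduped))  # 2
```

Assigning the result to a new variable, as done here, keeps the original data available for comparison.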
Removing Duplicates Across All Columns:
When no columns are specified, ‘drop_duplicates()’ compares rows across all columns. The ‘keep’ parameter controls which occurrence of each duplicated row is retained.
By default, keep is set to ‘first’, which means it keeps the first occurrence and deletes the subsequent occurrences. To keep the last occurrence instead, we can set the keep parameter to ‘last’.
To keep none, i.e., delete every row that has a duplicate, we can set the keep parameter to False (the boolean value, not a string). The syntax for this is as follows:
df.drop_duplicates(keep=False)
This removes every row that appears more than once in the dataframe.
Note that when we set the keep parameter to False, no copy of a duplicated row survives, not even its first occurrence; only rows that were unique to begin with remain.
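The three settings of keep can be compared side by side. The following is a small sketch on hypothetical data in which rows 0 and 2 are identical and row 1 is unique.

```python
import pandas as pd

# Hypothetical data: index 0 and index 2 hold the same values.
df = pd.DataFrame({"item": ["pen", "ink", "pen"], "qty": [1, 2, 1]})

first = df.drop_duplicates(keep="first")  # keeps the earlier "pen" row
last = df.drop_duplicates(keep="last")    # keeps the later "pen" row
none = df.drop_duplicates(keep=False)     # drops both "pen" rows

print(first.index.tolist())  # [0, 1]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [1]
```

The printed index labels show which original rows survive under each setting.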
Removing Duplicates Across Specific Columns:
We can also remove duplicates across specific columns of the dataframe.
To achieve this, we need to specify the column(s) we want to consider while removing duplicates. We pass a list of columns to the ‘subset’ parameter of the ‘drop_duplicates()’ function.
The syntax for removing duplicates across specific columns is as follows:
df.drop_duplicates(subset=['Column_name1', 'Column_name2', 'Column_name3'])
Here, we have specified the list of columns we want to consider while removing duplicates.
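To see how the choice of columns changes the result, the sketch below runs the same call with one column and then with two. The dataframe and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: two rows share a name but differ in city.
df = pd.DataFrame({
    "name": ["Ana", "Ana", "Ben"],
    "city": ["Rome", "Oslo", "Rome"],
})

# Only "name" is compared, so the second "Ana" row counts as a duplicate.
by_name = df.drop_duplicates(subset=["name"])

# Both columns are compared, so every row is unique and nothing is dropped.
by_both = df.drop_duplicates(subset=["name", "city"])

print(by_name.index.tolist())  # [0, 2]
print(by_both.index.tolist())  # [0, 1, 2]
```

Rows are treated as duplicates only when they match on every column listed in subset; the other columns are ignored for the comparison but still appear in the output.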
Example 1: Remove Duplicates Across All Columns
Let’s take a simple example to illustrate how to remove duplicate rows from a pandas dataframe.
Suppose we have a dataframe ‘df’ as follows:
ID | Name | Age |
---|---|---|
1 | Jane | 20 |
2 | Bob | 22 |
3 | Sam | 20 |
1 | Jane | 20 |
2 | Bob | 22 |
We can remove the duplicate rows using the following code:
df.drop_duplicates()
The resulting output would be:
ID | Name | Age |
---|---|---|
1 | Jane | 20 |
2 | Bob | 22 |
3 | Sam | 20 |
Result Explanation:
Using the ‘drop_duplicates()’ function on the dataframe ‘df’ removed the repeated rows that have the same values across all columns.
Because keep defaults to ‘first’, the first occurrence (also called the original) of each row is kept and the later copies are dropped. With ‘keep=False’ the result would be different: the Jane and Bob rows would be removed entirely, and only the Sam row would remain.
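A runnable sketch of this example, building the dataframe from the values in the input table and comparing the default call with ‘keep=False’. The variable names kept_first and no_repeats are introduced here for illustration.

```python
import pandas as pd

# The five rows from the input table of Example 1.
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2],
    "Name": ["Jane", "Bob", "Sam", "Jane", "Bob"],
    "Age": [20, 22, 20, 20, 22],
})

# Default keep='first': one copy of each distinct row is retained.
kept_first = df.drop_duplicates()

# keep=False: every row that has a duplicate is dropped.
no_repeats = df.drop_duplicates(keep=False)

print(kept_first["Name"].tolist())  # ['Jane', 'Bob', 'Sam']
print(no_repeats["Name"].tolist())  # ['Sam']
```

The default call yields the three rows Jane, Bob, and Sam, while ‘keep=False’ leaves only Sam, the one row that never repeats.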
Example 2: Remove Duplicates Across Specific Columns
Let us take another example to show how to remove duplicate rows across specific columns.
Suppose we have a dataframe ‘df’ as follows:
Name | DOB | Age |
---|---|---|
John | 02-05-2000 | 21 |
Amanda | 09-06-1998 | 23 |
John | 02-05-2000 | 21 |
Ryan | 12-08-2001 | 20 |
We can remove duplicates across specific columns ‘Name’ and ‘DOB’ using the following code:
df.drop_duplicates(subset=['Name', 'DOB'])
The resulting output would be:
Name | DOB | Age |
---|---|---|
John | 02-05-2000 | 21 |
Amanda | 09-06-1998 | 23 |
Ryan | 12-08-2001 | 20 |
Result Explanation:
The resulting output shows the first occurrence of each unique row based on ‘Name’ and ‘DOB’ columns.
When we specify specific column(s) in the ‘subset’ parameter, the duplicate rows will be compared based on those specific columns.
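The example can be reproduced with the sketch below, which builds the dataframe from the table values. It also shows the ‘ignore_index’ argument, which is not used in the example itself and assumes pandas 1.0 or later; it renumbers the rows of the result.

```python
import pandas as pd

# The four rows from the input table of Example 2.
df = pd.DataFrame({
    "Name": ["John", "Amanda", "John", "Ryan"],
    "DOB": ["02-05-2000", "09-06-1998", "02-05-2000", "12-08-2001"],
    "Age": [21, 23, 21, 20],
})

# Rows are compared on Name and DOB only; the second John row is dropped.
result = df.drop_duplicates(subset=["Name", "DOB"])
print(result.index.tolist())  # [0, 1, 3]

# ignore_index=True relabels the surviving rows 0, 1, 2.
renumbered = df.drop_duplicates(subset=["Name", "DOB"], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Without ‘ignore_index’, the surviving rows keep their original index labels, which leaves a gap where the dropped row used to be.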
Additional Resources:
The pandas documentation describes the ‘drop_duplicates()’ function in detail, including its syntax, its parameters, and worked examples.
It is a useful reference when applying the function during exploratory data analysis and data preprocessing.
Conclusion:
In this article, we discussed how to remove duplicate rows in a pandas dataframe using the ‘drop_duplicates()’ function.
We explained how to use this function to remove duplicates across all columns and across a specific subset of columns, and how the keep parameter decides which occurrence is retained.
Removing duplicate rows is an essential step in cleaning and preprocessing data, because repeated records can distort the results of an analysis. By understanding how to use this function, data analysts and data scientists can clean their data efficiently and improve the accuracy of their analysis.
Remember to check the pandas documentation for further information on removing duplicates and on other useful functions.