Dropping Rows with NaN Values in Pandas DataFrame
Have you ever opened up a large dataset, only to find that it’s riddled with missing values? If so, you’re not alone.
Missing data is a common problem, because data is often collected from a variety of sources and may not always be complete. Luckily, Pandas makes it easy to drop rows with NaN values from a DataFrame, so you can clean up your data before you analyze it.
This article covers several ways to drop rows that contain NaN values in a Pandas DataFrame.
Dropping Rows with Any NaN Values
The easiest way to drop rows with missing values in a Pandas DataFrame is by using the dropna() method. By default, dropna() drops the rows that contain any NaN values.
For example, consider a DataFrame df with NaN values:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,2,np.nan], 'B': [4,np.nan,np.nan], 'C': [7,8,9]})
print(df)
Output:
A B C
0 1.0 4.0 7
1 2.0 NaN 8
2 NaN NaN 9
To drop rows with any NaN values, simply call the dropna() method without any arguments:
df = df.dropna()
print(df)
Output:
A B C
0 1.0 4.0 7
As we can see, rows 1 and 2 were dropped because they contained NaN values. Row 0 remains because it is the only complete row in the DataFrame.
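The default behavior can also be written out explicitly. The sketch below assumes the same example data and uses the name cleaned (chosen here for illustration) to show that dropna() returns a new DataFrame and leaves the original unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, np.nan], 'C': [7, 8, 9]})

# how='any' is the default, so this selects the same rows as df.dropna()
cleaned = df.dropna(how='any')

# dropna() returns a new DataFrame; df itself still has all three rows
print(len(df), len(cleaned))

# Counting the NaN values per column can help decide what to drop
print(df.isna().sum())
```

Assigning the result back to df, as in the example above, is what makes the change stick.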
Dropping Rows with All NaN Values
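Sometimes, you may want to drop only the rows in which every value is NaN. To do this, you can pass how='all' to dropna(). Setting thresh=1 has the same effect, because it keeps every row that has at least one non-NaN value.
For example, let’s build a new df that includes a column D containing only NaN values: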
df = pd.DataFrame({'A': [1,np.nan,3], 'B': [4,5,np.nan], 'C': [7,np.nan,np.nan], 'D': [np.nan,np.nan,np.nan]})
print(df)
Output:
A B C D
0 1.0 4.0 7.0 NaN
1 NaN 5.0 NaN NaN
2 3.0 NaN NaN NaN
To drop rows that contain only NaN values, we can set thresh=1 in dropna(), which keeps every row that has at least one non-NaN value:
df = df.dropna(thresh=1)
print(df)
Output:
A B C D
0 1.0 4.0 7.0 NaN
1 NaN 5.0 NaN NaN
2 3.0 NaN NaN NaN
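As we can see, no rows were dropped, because every row contains at least one non-NaN value (row 1, for instance, has a value in column B). A row would be removed only if all four of its columns were NaN. Note that column D is kept even though it is entirely NaN, because dropna() works on rows by default.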
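To see a row actually being removed, here is a minimal sketch with made-up data (the name df_all and its values are hypothetical) in which row 1 is NaN in every column. It also checks that how='all' and thresh=1 select the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is NaN in every column
df_all = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, np.nan, np.nan]})

# Drop a row only when all of its values are NaN
by_how = df_all.dropna(how='all')

# Keep rows that have at least one non-NaN value
by_thresh = df_all.dropna(thresh=1)

print(by_how)
print(by_how.equals(by_thresh))
```

Row 2 survives in both cases because it still has a value in column A.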
Dropping Rows Below a Certain Threshold
Another useful approach is to drop rows that have fewer than a specified minimum number of non-NaN values. This can be done by using the thresh argument in dropna().
For example:
df = pd.DataFrame({'A': [1,np.nan,3], 'B': [4,5,np.nan], 'C': [7,np.nan,np.nan], 'D': [np.nan,np.nan,np.nan]})
print(df)
Output:
A B C D
0 1.0 4.0 7.0 NaN
1 NaN 5.0 NaN NaN
2 3.0 NaN NaN NaN
Suppose we want to drop rows that have fewer than 2 non-NaN values. We can set thresh=2 in dropna():
df = df.dropna(thresh=2)
print(df)
Output:
A B C D
0 1.0 4.0 7.0 NaN
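As we can see, only row 0 was retained. It contains three non-NaN values (in columns A, B, and C), which meets the threshold of 2. Rows 1 and 2 each contain only one non-NaN value, so they were dropped.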
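One way to check what thresh will do is to count the non-NaN values in each row yourself. The sketch below assumes the same example data. The names non_nan_counts and manual are chosen here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan],
                   'C': [7, np.nan, np.nan], 'D': [np.nan, np.nan, np.nan]})

# Number of non-NaN values in each row
non_nan_counts = df.notna().sum(axis=1)
print(non_nan_counts)

# Keeping rows with at least 2 non-NaN values by hand...
manual = df[non_nan_counts >= 2]

# ...selects the same rows as thresh=2
print(manual.equals(df.dropna(thresh=2)))
```

Printing the counts first makes it easier to pick a sensible threshold for a larger dataset.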
Dropping Rows with NaN Values in a Specific Column
Lastly, you may want to drop rows that contain NaN values in a specific column. This can be done by specifying the column name(s) in the subset argument of dropna().
For example:
df = pd.DataFrame({'A': [1,np.nan,3], 'B': [4,5,np.nan], 'C': [7,np.nan,9], 'D': [np.nan,2,np.nan]})
print(df)
Output:
A B C D
0 1.0 4.0 7.0 NaN
1 NaN 5.0 NaN 2.0
2 3.0 NaN 9.0 NaN
Suppose we want to drop rows that contain NaN values in column A. We can specify subset=['A'] in dropna():
df = df.dropna(subset=['A'])
print(df)
Output:
A B C D
0 1.0 4.0 7.0 NaN
2 3.0 NaN 9.0 NaN
As we can see, row 1 was dropped because it had a NaN value in column A. Rows 0 and 2 were retained even though they contain NaN values in other columns.
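The subset argument also accepts several columns and can be combined with how. The following sketch assumes the same example data. The names either, both, and masked are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan],
                   'C': [7, np.nan, 9], 'D': [np.nan, 2, np.nan]})

# Drop a row if column A or column C is NaN
either = df.dropna(subset=['A', 'C'])

# Drop a row only if columns B and D are both NaN
both = df.dropna(subset=['B', 'D'], how='all')

# A boolean mask selects the same rows as subset=['A']
masked = df[df['A'].notna()]

print(either)
print(both)
```

The boolean mask is handy when the NaN check needs to be combined with other row conditions.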
Resetting Index After Dropping Rows with NaNs
When you drop rows with NaN values, the remaining rows keep their original index labels, so the index can end up with gaps (for example, 0 and 2 with no 1). To renumber the rows, you can use the reset_index() method.
For example:
df = pd.DataFrame({'A': [1,2,np.nan], 'B': [4,np.nan,np.nan], 'C': [7,8,9]})
df = df.dropna().reset_index(drop=True)
print(df)
Output:
A B C
0 1.0 4.0 7
In the above example, we first drop all rows that contain at least one NaN value and then reset the index to start from 0 using reset_index(). The drop=True argument discards the old index instead of adding it to the DataFrame as a new column.
Conclusion
Missing data is a common problem in data science. In this article, we demonstrated several approaches to dropping rows that contain NaN values in a Pandas DataFrame.
We covered how to drop rows that contain any NaN values, rows that contain only NaN values, and rows that have fewer than a given number of non-NaN values. Additionally, we showed how to drop rows that contain NaN values in a specific column.
Lastly, we covered how to reset the index after dropping NaN values. By employing these techniques, you can clean your data and ensure that your analyses are accurate.
Remember, taking care of missing values is a critical aspect of data analysis, and the tools provided by Pandas DataFrame make it an easy process.