Drop Columns with Missing Values in Pandas DataFrame
Data cleaning is an essential task in any data analysis process. One common issue that data analysts face is handling missing values in the data set.
In a pandas DataFrame, this can be done using the dropna() function, which removes rows or columns containing missing (NA) values. In this article, we will focus on removing columns from a pandas DataFrame.
We will explore several ways to drop columns based on different criteria. Removing columns with missing or NA values matters because it can improve the quality of the data set.
It reduces the risk of biased results or erroneous interpretations caused by incomplete data. Pandas provides several ways to deal with missing values, and we will explore them one by one.
1. Drop Column Where at Least One Value is Missing
The first approach is the simplest: call the dropna() function with axis set to ‘columns’. With the default how='any', every column that contains at least one missing value is dropped. For example, consider the following DataFrame.
import pandas as pd
import numpy as np
data = {'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, np.nan],
'C': [np.nan, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1.0 5.0 NaN
1 2.0 NaN 10.0
2 NaN 7.0 11.0
3 4.0 NaN 12.0
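Before dropping anything, it can help to count the missing values in each column. The following is a minimal sketch using the same data; the variable names missing_per_column and columns_with_nan are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, np.nan],
        'C': [np.nan, 10, 11, 12]}
df = pd.DataFrame(data)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Labels of the columns that contain at least one missing value.
columns_with_nan = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_nan)  # ['A', 'B', 'C']
```

In this data set every column has at least one NaN, which matters for how dropna() behaves with its default settings.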
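Every column in this DataFrame contains at least one missing value, so calling the dropna() function with axis='columns' removes all three columns and leaves an empty DataFrame that keeps only the row index.
df.dropna(axis='columns', inplace=True)
print(df)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]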
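The same call is easier to follow on a frame where only one column has a missing value. This is a small sketch, not part of the original example: the DataFrame df_mixed and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only column 'B' has a missing value.
df_mixed = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.0, np.nan, 6.0],
    'C': ['x', 'y', 'z'],
})

# how='any' is the default, so any column with at least one NaN is dropped.
cleaned = df_mixed.dropna(axis='columns')

print(list(cleaned.columns))  # ['A', 'C']
```

Since inplace is not set here, df_mixed still holds all three columns and the result lives in cleaned.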
2. Drop Column Where All Values are Missing
The second approach is to drop a column only when all of its values are missing. This can be done by setting the how parameter to ‘all’.
df = pd.DataFrame(data)
df.dropna(axis='columns', how='all', inplace=True)
print(df)
A B C
0 1.0 5.0 NaN
1 2.0 NaN 10.0
2 NaN 7.0 11.0
3 4.0 NaN 12.0
Notice that all three columns remain, because each of them has at least one non-missing value.
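To see how='all' actually remove something, the frame needs a column with no values at all. A minimal sketch under that assumption; the column ‘D’ and the name df_empty_col are hypothetical and added only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 'D' has no values at all.
df_empty_col = pd.DataFrame({
    'A': [1.0, 2.0, np.nan],
    'D': [np.nan, np.nan, np.nan],
})

# how='all' drops a column only when every value in it is missing.
result = df_empty_col.dropna(axis='columns', how='all')

print(list(result.columns))  # ['A']
```

Column ‘A’ survives even though it contains a NaN, because it is not entirely empty.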
3. Drop Column Based on the Number of Non-NA Values
The third approach is to keep columns based on a threshold number of non-missing values. This can be done using the thresh parameter.
For example, to keep columns that have at least 3 non-NA values, we can set thresh=3.
df = pd.DataFrame(data)
df.dropna(axis='columns', thresh=3, inplace=True)
print(df)
A C
0 1.0 NaN
1 2.0 10.0
2 NaN 11.0
3 4.0 12.0
Columns ‘A’, ‘B’, and ‘C’ have 3, 2, and 3 non-missing values, respectively. Only columns ‘A’ and ‘C’ have at least 3 non-missing values, so column ‘B’ is dropped.
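The counts behind thresh can be checked directly, and a percentage rule can be turned into a thresh value. The sketch below assumes a made-up rule of "keep columns that are at least 75% filled"; min_fraction and min_count are illustrative names, not pandas parameters.

```python
import math

import numpy as np
import pandas as pd

data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, np.nan],
        'C': [np.nan, 10, 11, 12]}
df = pd.DataFrame(data)

# Non-missing values per column: 'A' has 3, 'B' has 2, 'C' has 3.
non_na_counts = df.notna().sum()
print(non_na_counts)

# Assumed rule for illustration: keep columns that are at least 75% filled.
min_fraction = 0.75
min_count = math.ceil(min_fraction * len(df))  # 3 of the 4 rows

# thresh expects an integer count of required non-NA values.
kept = df.dropna(axis='columns', thresh=min_count)
print(list(kept.columns))  # ['A', 'C']
```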
4. Drop NA from Defined Rows
The fourth approach is to drop columns only if the missing value occurs in specific rows. This can be done using the subset parameter. When axis='columns', subset takes a list of row labels rather than column names.
For example, to drop any column that has a missing value in rows 1 or 3, we can pass subset=[1, 3] to dropna.
df = pd.DataFrame(data)
df.dropna(axis='columns', subset=[1, 3], inplace=True)
print(df)
A C
0 1.0 NaN
1 2.0 10.0
2 NaN 11.0
3 4.0 12.0
Only column ‘B’ was removed because it has missing values in rows 1 and 3. Columns ‘A’ and ‘C’ also contain missing values, but in rows 2 and 0, which are outside the subset.
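With axis='columns', the subset parameter is read as a list of row labels. The sketch below checks rows 1 and 3 (chosen for illustration) with a boolean mask and compares the result against dropna; rows_to_check and has_nan are names introduced here.

```python
import numpy as np
import pandas as pd

data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, np.nan],
        'C': [np.nan, 10, 11, 12]}
df = pd.DataFrame(data)

# With axis='columns', subset lists the ROW labels to inspect.
rows_to_check = [1, 3]

# True for each column that has a NaN in any of those rows.
has_nan = df.loc[rows_to_check].isna().any()
print(list(has_nan[has_nan].index))  # ['B']

# Boolean-mask equivalent of restricting the check to those rows.
kept = df.loc[:, ~has_nan]
print(list(kept.columns))  # ['A', 'C']

# The same selection through dropna itself.
via_dropna = df.dropna(axis='columns', subset=rows_to_check)
print(kept.equals(via_dropna))  # True
```

The mask version makes the rule explicit, which can be handy when the condition is more complex than "any NaN in these rows".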
5. Drop Column with Missing Values in Place
Finally, there is also an option to update the original DataFrame in place by using the inplace parameter of the dropna function. For example:
df = pd.DataFrame(data)
df.dropna(axis='columns', inplace=True)
print(df)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
The original DataFrame has been modified. Because every column in this example contains at least one missing value, no columns remain.
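An alternative to inplace=True is to assign the result of dropna to a new name, which leaves the original untouched. A short sketch contrasting the two; the small DataFrame and the names cleaned and columns_before are invented for this comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only column 'B' has a missing value.
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [np.nan, 3.0]})

# Without inplace, dropna returns a new DataFrame and df is unchanged.
cleaned = df.dropna(axis='columns')
columns_before = list(df.columns)
print(columns_before)         # ['A', 'B']
print(list(cleaned.columns))  # ['A']

# With inplace=True, df itself is modified.
df.dropna(axis='columns', inplace=True)
print(list(df.columns))       # ['A']
```

Assigning to a new name keeps the original data available for comparison, which can make a cleaning step easier to review.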
Conclusion
In summary, there are different ways to remove columns with missing or NA values in a pandas DataFrame. By using the dropna() function with parameters such as axis, how, thresh, subset, and inplace, we can customize our data cleaning process to fit our specific needs.
Removing columns should be done thoughtfully, and the impact on the overall analysis and results should be considered before any column is removed. This article explored how to use the dropna() function to drop columns based on different criteria: any missing value, entirely empty columns, missing values in defined rows, and a threshold number of non-missing values.
Used with care, these options help improve the quality of the data set and the validity of the final analysis.