Data Cleaning with dropna() Function: Removing Missing Values in Rows and Columns
Data cleaning is a crucial aspect of data analysis. Missing values can cause significant errors in calculations and may lead to incorrect conclusions.
The Pandas dropna() function can help eliminate missing values. In this article, we will explore how to use the dropna() function, specifically the thresh argument, to remove rows and columns with missing values.
Using thresh argument to drop rows with missing values:
The dropna() function has several optional parameters, including thresh, which specifies the minimum number of nonNaN values required to keep a row or column. The thresh argument is useful when we require a minimum number of data points within a column or row before it is retained.
By default, this parameter is set to None, and it drops any rows or columns with missing values. To drop rows with missing values, we can modify the thresh parameter to set a minimum number of nonNaN values.
This ensures that each row has a minimum number of required values before it is retained. For example, to remove rows with less than two nonNaN values, we can set the thresh parameter to 2.
Using thresh argument to drop columns with missing values:
Similarly, we can use the dropna() function to eliminate columns with missing values. To retain a column, we can modify the thresh parameter to set a minimum number of nonNaN values within that column.
Suppose we have a dataset with ten columns, and a column has only one nonNaN value, then we can use the following code;
df.dropna(thresh=2, axis=1)
This code would keep only the columns that have at least two nonNaN values.
Example 1: Only Keep Rows with the Minimum Number of nonNaN Values:
Let’s work on an example to better understand the process.
Consider a dataset containing information about students’ grades in four subjects. The data includes some missing values, as shown below.
Student ID  Math  Science  English  History 

1  67  83  70  55 
2  75  87  NaN  90 
3  NaN  75  NaN  76 
4  48  NaN  55  67 
5  NaN  75  78  89 
6  67  NaN  80  44 
To remove rows with less than two nonNaN values, we can use the following code:
df.dropna(thresh=2)
This code would eliminate rows that have less than two nonNaN values, producing the resulting data as shown:
Student ID  Math  Science  English  History 

1  67  83  70  55 
As we can see, only the row with the index value of 1 fulfills the minimum nonNaN requirement.
Example 2: Only Keep Rows with Minimum % of NonNaN Values
Sometimes, we may prefer to drop rows based on a certain percentage of nonNaN values instead of a fixed number. If a row has fewer missing values than a particular percentage threshold, we can choose to keep it.
The thresh argument calculates the nonNaN values’ percentage and compares it with the set threshold. Consider the following example:
ID  Math  Science  English  History 

1  67  83  70  55 
2  75  87  NaN  90 
3  NaN  75  NaN  76 
4  48  NaN  55  67 
5  NaN  75  78  89 
6  67  NaN  80  44 
7  NaN  95  88  89 
8  90  85  78  90 
9  92  87  80  NaN 
10  90  74  71  81 
If we want to drop rows with less than 70% nonNaN values, we can do so by setting the thresh parameter to 0.7.
df.dropna(thresh=0.7*df.shape[1])
The dropna() function calculates the percentage of nonNaN values based on the number of columns multiplied by the threshold value.
In this case, the dataset has four columns, so the function drops any row with less than 70% of its four columns being nonNaN. As a result, rows 3 and 5 get dropped:
ID  Math  Science  English  History 

1  67  83  70  55 
2  75  87  NaN  90 
4  48  NaN  55  67 
6  67  NaN  80  44 
7  NaN  95  88  89 
8  90  85  78  90 
9  92  87  80  NaN 
10  90  74  71  81 
Example 3: Only Keep Columns with Minimum Number of nonNaN Values
The dropna() function can also eliminate columns based on the number of nonNaN values.
For instance, if we want to filter out columns with less than six nonNaN values in our dataset, we can set the thresh parameter to 6. “`df.dropna(thresh=6, axis=1)“`
This code would eliminate columns with less than six nonNaN values, producing the resulting data as shown below:
ID  Science  History 

1  83  55 
2  87  90 
3  75  76 
4  NaN  67 
5  75  89 
6  NaN  44 
7  95  89 
8  85  90 
9  87  NaN 
10  74  81 
This code keeps only the required columns, Science and History.
The function removes Math and English because they have fewer than six nonNaN values. Similarly, the column labeled ID gets dropped since it contains only nonNaN values, and there is no useful data to analyze.
Example 4: Only Keep Columns with Minimum % of nonNaN Values
In some cases, we may need to remove columns that have too many missing values. Instead of defining a specific number of nonNaN values, we can use the thresh argument to specify the minimum percentage of nonNaN values required to keep a column.
For instance, if we want to remove columns with less than 70% nonNaN values, we can do so by setting the thresh parameter to 0.7.
df.dropna(thresh=int(df.shape[0]*0.7), axis=1)
In this code, we use the .shape[0] method to get the number of rows in the dataset and multiply it with the threshold value of 0.7. The int() function converts the product into an integer to be used for the minimum nonNaN value calculation. This code will drop any columns with less than 70% nonNaN values, producing the resulting data:
ID  English  History 

1  70  55 
2  NaN  90 
3  NaN  76 
4  55  67 
5  78  89 
6  80  44 
7  88  89 
8  78  90 
9  80  NaN 
10  71  81 
We retained only the columns with at least 70% of nonNaN values, keeping English and History columns and removing the Math and Science columns.
Conclusion:
The dropna() function in Pandas is an essential tool to eliminate missing values from datasets. Using the thresh argument to filter rows and columns based on nonNaN values’ minimum percentage or count can help standardize data.
By removing irrelevant data, we achieve cleaner data that produce more accurate insights without sacrificing valuable data points. Always remember to validate the dataset after cleaning to ensure that we keep enough data points to support our analysis.
Additional Resources
Cleaning data requires different techniques and methods, and exploring new resources can help improve the process. There are numerous resources available to assist with data cleaning and Pandas.
Some resources include documentation, tutorials, and StackOverflow threads. Here are some resources to help with data cleaning:

Pandas Documentation: The official Pandas website has extensive documentation to guide users in using the Pandas library effectively. The documentation provides detailed explanations and examples of using different Pandas functions, including dropna().

Kaggle: Kaggle hosts numerous datasets and kernels on different topics, including data cleaning.
These resources include detailed code explanations and insights on different techniques for data cleaning.

DataQuest: DataQuest is an online learning platform that offers data analysis and data science courses. Their courses cover the basics of data analysis and cleaning using Pandas functions.

StackOverflow: StackOverflow is a trusted online forum where data scientists and programmers can ask and answer questions about data cleaning and other related topics.
In conclusion, Data cleaning is a fundamental aspect of data analysis. Pandas’ dropna() function, with the thresh argument, is a useful tool in eliminating missing values in rows and columns based on a specific number or percentage of nonNaN values.
Additionally, accessing and exploring new resources can help improve the data cleaning process and provide insights into using different techniques. In conclusion, data cleaning is crucial for accurate data analysis, and the dropna() function in Pandas can help eliminate missing values in rows and columns based on specific criteria.
We explored using the thresh argument to drop rows and columns based on a minimum number or percentage of nonNaN values. Additionally, we provided some resources for improving the data cleaning process, including documentation, tutorials, and online forums.
Proper data cleaning with Pandas can help produce meaningful insights that lead to accurate conclusions, making this method a valuable tool for data analysts. Remember to check your dataset thoroughly after cleaning and explore multiple resources to ensure that you are using the best techniques for your data.