Counting Missing Values in a Pandas DataFrame
Missing values are problematic because they can cause errors or skew results during data analysis. Incomplete or missing data can arise for many reasons, such as data corruption, human error during data entry, or features that simply do not exist for some records.
One popular tool for working with data in Python is the Pandas library, which provides powerful data structures and tools for data manipulation and analysis. In this article, we will discuss how to count missing values in a Pandas DataFrame.
We will start by looking at how to count the total missing values in a DataFrame. Then, we will move on to counting the missing values for each column and row, analyzing and interpreting the results with the help of examples.
Total Missing Values
Counting the total number of missing values in a DataFrame is an important first step in data analysis and cleaning. Pandas provides the .isnull() method to check whether a value is missing.
Calling this method on a DataFrame returns a DataFrame with the same structure in which each value is replaced with the boolean True if the value is missing and False otherwise. We can then use the .sum() method to count the number of missing values in each column, and finally use the .sum() method once more to count the total number of missing values in the entire DataFrame.
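To make the boolean mask concrete, here is a minimal sketch using a small, hypothetical Series (the values are made up for illustration). It shows that NaN and None are flagged as missing, while an empty string is not, which is worth keeping in mind when blanks in your data were read in as text.

```python
import numpy as np
import pandas as pd

# Hypothetical Series mixing real values, missing markers, and an empty string
s = pd.Series([1.5, np.nan, None, "", "text"], dtype="object")

mask = s.isnull()   # True for NaN and None, False for everything else
count = int(mask.sum())  # True counts as 1 when summed

print(mask.tolist())  # [False, True, True, False, False]
print(count)          # 2
```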
For example, let’s create a DataFrame with missing values:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 6, 7, np.nan], 'C': [8, np.nan, 10, 11]})
This gives us the following DataFrame:
     A    B     C
0  1.0  NaN   8.0
1  2.0  6.0   NaN
2  NaN  7.0  10.0
3  4.0  NaN  11.0
Now, let’s count the total number of missing values in the DataFrame:
print(df.isnull().sum().sum())
Output: 4
This tells us that there are four missing values in the entire DataFrame.
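The chained call can be unpacked into its intermediate steps. The sketch below rebuilds the same DataFrame, names each step, and cross-checks the total in two other ways; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [np.nan, 6, 7, np.nan],
                   'C': [8, np.nan, 10, 11]})

mask = df.isnull()       # boolean DataFrame, True where a value is missing
per_column = mask.sum()  # first .sum(): one count per column

total_chained = int(per_column.sum())               # second .sum(): grand total
total_numpy = int(mask.to_numpy().sum())            # sum the mask as a single NumPy array
total_from_count = int(df.size - df.count().sum())  # all cells minus non-missing cells

print(total_chained, total_numpy, total_from_count)  # 4 4 4
```

All three approaches agree; which one to use is mostly a matter of readability.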
Missing Values per Column
The next step is to count the number of missing values for each column in the DataFrame. This can be done by calling the .isnull() method on the DataFrame, followed by the .sum() method with the argument axis=0.
For example, let’s use the same DataFrame as before:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 6, 7, np.nan], 'C': [8, np.nan, 10, 11]})
Now, let’s count the number of missing values in each column:
print(df.isnull().sum(axis=0))
Output:
A    1
B    2
C    1
dtype: int64
This tells us that there is one missing value in column A, two missing values in column B, and one missing value in column C. We can also calculate the percentage of missing values in each column by dividing the number of missing values by the total number of values in the column and multiplying by 100.
This is useful in determining which columns have the highest percentage of missing values and might need to be dropped from the analysis. For example, let’s calculate the percentage of missing values in each column:
print(round(df.isnull().sum(axis=0) / len(df) * 100, 2))
Output:
A    25.0
B    50.0
C    25.0
dtype: float64
This tells us that column B has the highest percentage of missing values at 50%.
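Because the mean of a boolean column is the fraction of True values, the same percentages can be obtained with .mean(). The sketch below also ranks the columns and picks out those above a cutoff; the 40% threshold is an arbitrary value chosen for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [np.nan, 6, 7, np.nan],
                   'C': [8, np.nan, 10, 11]})

# Fraction of missing values per column, expressed as a percentage
pct_missing = (df.isnull().mean() * 100).round(2)

# Columns with the most missing data first
ranked = pct_missing.sort_values(ascending=False)

threshold = 40  # hypothetical cutoff, in percent
to_review = ranked[ranked > threshold].index.tolist()

print(ranked)
print(to_review)  # ['B']
```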
Missing Values per Row
Finally, we can count the number of missing values for each row in the DataFrame. This can be done by calling the .isnull() method on the DataFrame, followed by the .sum() method with the argument axis=1.
For example, let’s use the same DataFrame as before:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 6, 7, np.nan], 'C': [8, np.nan, 10, 11]})
Now, let’s count the number of missing values in each row:
print(df.isnull().sum(axis=1))
Output:
0    1
1    1
2    1
3    1
dtype: int64
This tells us that there is one missing value in each row. We can also calculate the percentage of missing values in each row by dividing the number of missing values by the total number of values in the row and multiplying by 100.
This is useful in determining which rows have the highest percentage of missing values and might need to be dropped from the analysis. For example, let’s calculate the percentage of missing values in each row:
print(round(df.isnull().sum(axis=1) / len(df.columns) * 100, 2))
Output:
0    33.33
1    33.33
2    33.33
3    33.33
dtype: float64
This tells us that 33.33% of the values in each row are missing.
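Since every row in the example has the same number of missing values, the sketch below uses a different, hypothetical DataFrame to show how per-row counts can drive a decision about which rows to keep. The limit of one missing value per row is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which rows differ in how much is missing
df2 = pd.DataFrame({'A': [1, np.nan, np.nan, 4],
                    'B': [np.nan, np.nan, 7, 8],
                    'C': [8, np.nan, 10, 11]})

missing_per_row = df2.isnull().sum(axis=1)  # 1, 3, 1, 0

max_missing = 1  # hypothetical limit on missing values per row
kept = df2[missing_per_row <= max_missing]

# Equivalent with dropna: thresh is the minimum number of NON-missing values required
kept_dropna = df2.dropna(thresh=df2.shape[1] - max_missing)

print(missing_per_row.tolist())  # [1, 3, 1, 0]
print(kept.index.tolist())       # [0, 2, 3]
```

The boolean filter makes the rule explicit, while dropna(thresh=...) expresses the same rule in terms of how many values must be present.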
Conclusion
Counting missing values in a Pandas DataFrame is an important step in data analysis and preparation. In this article, we learned how to count the total number of missing values in a DataFrame, as well as the number of missing values per column and per row.
We also learned how to calculate the percentage of missing values in each column and row. These counts help identify where data is missing and decide on an appropriate course of action, such as imputing values or dropping the affected rows or columns.
Remember, missing data needs to be addressed, and thorough data cleaning and preparation are essential for accurate data analysis.