Adventures in Machine Learning

Mastering Missing Data: Fill NaN Values with Median in Pandas

Using the fillna() Function in a Pandas DataFrame

Data analysis and management are essential to any business, and missing values are one of the most common problems encountered when working with data.

Values can be missing for various reasons, such as poor-quality sources, incomplete collection, or measurement errors. Left untreated, they can hinder analysis and produce erroneous results.

Pandas makes missing values straightforward to handle: its fillna() method replaces them with values you choose.

In this article, we will discuss three ways of using fillna() on a pandas DataFrame to fill NaN values with the median: in one column, in multiple columns, and in all columns.

Method 1: Fill NaN Values in One Column with Median

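The fillna() method replaces NaN values with a value you supply: a single scalar applied everywhere, or a dict or Series that gives a separate fill value for each column.

Here's the basic syntax:

```
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
```

Note that the method and downcast parameters are deprecated in recent pandas releases; the median fills in this article only need value.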
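To make the forms accepted by the value argument concrete, here is a minimal sketch. The DataFrame and its column names (a, b) are made up for illustration; only fillna() itself comes from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only to illustrate the call
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# A scalar fills every NaN in the DataFrame with the same value
filled_scalar = df.fillna(0)

# A dict (or a Series) keyed by column name fills each column with its own value
filled_per_column = df.fillna({"a": 2.0, "b": 5.5})

print(filled_scalar)
print(filled_per_column)
```

The per-column form is what makes median filling work: df.median() returns a Series indexed by column name, so fillna() can match each median to its column.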

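To fill NaN values in one column with that column's median, we can use the following code:

```
df['column_name'] = df['column_name'].fillna(df['column_name'].median())
```

Here, df is the DataFrame and 'column_name' is the column whose NaN values we want to replace with its median. Assigning the result back to the column is more dependable than calling fillna(..., inplace=True) on a selected column, which recent pandas versions warn may not update the original DataFrame.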

Example 1:

Let’s say we have a DataFrame with a column ‘Age’ that has some NaN values.

We can fill these NaN values with median using the following code.

```
import pandas as pd

# data.csv is assumed to contain an 'Age' column
df = pd.read_csv('data.csv')

# Print the original DataFrame
print("Original DataFrame:\n", df)

# Fill NaN values in the Age column with the column's median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Print the updated DataFrame
print("\n\nUpdated DataFrame:\n", df)
```
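Since data.csv is not included with the article, the same single-column fill can be tried on inline data. This is a minimal sketch; the names and ages are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the values are made up for illustration
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di", "Ed"],
    "Age": [22, np.nan, 27, np.nan, 30],
})

# median() skips NaN values by default, so this is the median of 22, 27, 30
age_median = df["Age"].median()

# Assign the filled column back to the DataFrame
df["Age"] = df["Age"].fillna(age_median)

print(age_median)
print(df)
```

Computing the median once and storing it in a variable also makes it easy to log or reuse the fill value later.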

Method 2: Fill NaN Values in Multiple Columns with Median

Using the same fillna() method, we can fill NaN values in several columns at once, each with its own median. To restrict the fill to a chosen set of columns, we can use the following code:

```
cols = ['col1', 'col2']
df[cols] = df[cols].fillna(df[cols].median())
```

Here, df is the DataFrame and cols lists the numeric columns whose NaN values should be filled.

Example 2:

Let’s consider an example where we have a DataFrame with multiple columns ‘Age’, ‘Salary’, and ‘Contribution’ with several NaN values. We can fill these NaN values in all three columns by applying the same fillna() method to the DataFrame.

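```
import pandas as pd

# data.csv is assumed to contain 'Age', 'Salary', and 'Contribution' columns
df = pd.read_csv('data.csv')

# Print original DataFrame
print("Original DataFrame:\n", df)

# Fill NaN values in the selected columns with each column's median
cols = ['Age', 'Salary', 'Contribution']
df[cols] = df[cols].fillna(df[cols].median())

# Print updated DataFrame
print("\n\nUpdated DataFrame:\n", df)
```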
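To see that a multi-column fill can be limited to some columns while leaving others untouched, here is a small self-contained sketch. The data and the choice of columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are invented for illustration
df = pd.DataFrame({
    "Age": [22, 30, np.nan, 26],
    "Salary": [50000, np.nan, 60000, 55000],
    "Contribution": [100, np.nan, np.nan, 150],
})

# Only these columns are filled; Contribution is deliberately left alone
cols = ["Age", "Salary"]
df[cols] = df[cols].fillna(df[cols].median())

print(df)
```

After this runs, Age and Salary have no NaN values, while Contribution still contains its original gaps.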

Method 3: Fill NaN Values in All Columns with Median

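In some cases, it may be useful to replace every NaN value in the dataset with the median of its own column. Because df.median() returns one median per column, passing that result to fillna() fills each column with its own median. In pandas 2.0 and later, pass numeric_only=True so that non-numeric columns are skipped rather than raising an error.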

Example 3:

Let’s consider an example where the DataFrame has NaN values in every column. We can replace all NaN values with median as shown below.

```
import pandas as pd

df = pd.read_csv('data.csv')

# Print original DataFrame
print("Original DataFrame:\n", df)

# Fill NaN values in every numeric column with that column's median
df.fillna(df.median(numeric_only=True), inplace=True)

# Print updated DataFrame
print("\n\nUpdated DataFrame:\n", df)
```

Conclusion

In this article, we discussed how to use the fillna() method in Pandas DataFrame to replace NaN values with median. Method 1 focused on updating NaN values in a single column, Method 2 updated NaN values in multiple columns, and Method 3 updated NaN values in the entire dataset.

There are other ways to fill NaN values in a DataFrame, such as the interpolate() method or forward and backward fill. Nevertheless, median replacement with fillna() is a common approach because it is simple and the median is less sensitive to outliers than the mean.
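For comparison, the alternatives mentioned above can be sketched on a small Series next to a median fill. The values are made up for illustration, and which strategy suits a dataset depends on the data (for example, whether row order is meaningful).

```python
import numpy as np
import pandas as pd

# Hypothetical ordered measurements with two gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0, 10.0])

forward = s.ffill()                  # carry the last valid value forward
backward = s.bfill()                 # pull the next valid value backward
linear = s.interpolate()             # linear interpolation between neighbours
median_fill = s.fillna(s.median())   # constant fill with the median (4.0)

print(pd.DataFrame({
    "original": s,
    "ffill": forward,
    "bfill": backward,
    "interpolate": linear,
    "median": median_fill,
}))
```

Forward fill, backward fill, and interpolation depend on the order of the rows, whereas the median fill gives the same result however the rows are sorted.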

We hope this article has provided useful insights into managing NaN values encountered while working with Pandas DataFrame. By adopting the right approach, data quality can be maintained, and accurate results can be obtained while performing data analysis.

Example 2: Fill NaN Values in Multiple Columns with Median

Data analysis requires working with accurate and complete data. However, often the data obtained contains missing values, commonly referred to as NaN values.

These missing values need to be dealt with for accurate data analysis, and one approach is to replace the missing values with a median value of the respective column. In Pandas DataFrame, this can be achieved using the fillna() method, as discussed in the previous sections of this article.

In this section, we will walk through an example of how to fill NaN values in multiple columns using the median value. Let’s consider an example where we have a dataset containing information about employees, including their names, age, salary, and the amount of their contribution to a company’s pension scheme.

The dataset may look something like this:

```
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mike', 'Molly', 'Sam', 'Charlie', 'Lucas'],
    'Age': [22, 30, 27, 26, 29],
    'Salary': [50000, None, 60000, None, 55000],
    'Contribution': [100, 150, None, None, None]
})

print(df)
```

Output:

```
      Name  Age   Salary  Contribution
0     Mike   22  50000.0         100.0
1    Molly   30      NaN         150.0
2      Sam   27  60000.0           NaN
3  Charlie   26      NaN           NaN
4    Lucas   29  55000.0           NaN
```

In the dataset above, the Age column has no missing values, while the Salary and Contribution columns contain NaN values. To fill NaN values in multiple columns with median values, we can use the fillna() method again.

However, in this case, we pass in the median of each column rather than a single value. The numeric_only=True argument makes median() skip the non-numeric Name column, which pandas 2.0 and later would otherwise reject with an error.

```
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mike', 'Molly', 'Sam', 'Charlie', 'Lucas'],
    'Age': [22, 30, 27, 26, 29],
    'Salary': [50000, None, 60000, None, 55000],
    'Contribution': [100, 150, None, None, None]
})

# Fill NaN values in Salary and Contribution with each column's median
df.fillna(df.median(numeric_only=True), inplace=True)

print(df)
```

Output:

```
      Name  Age   Salary  Contribution
0     Mike   22  50000.0         100.0
1    Molly   30  55000.0         150.0
2      Sam   27  60000.0         125.0
3  Charlie   26  55000.0         125.0
4    Lucas   29  55000.0         125.0
```

As we can see in the output, the NaN values in the Salary and Contribution columns are now replaced with the median values calculated from the respective columns.

Example 3: Fill NaN Values in All Columns with Median

In some cases, a dataset may contain NaN values in most or all of its columns, and we may want to fill every one of them with the median of its column.

In such cases, a single fillna() call on the whole DataFrame replaces the NaN values in every numeric column, as shown in the example below. Let’s consider a dataset containing student grades for multiple subjects:

```
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mike', 'Dave', 'Lucy', 'Molly', 'Sarah'],
    'English': [80, 78, 65, None, 72],
    'Math': [82, 85, None, 90, None],
    'Physics': [74, None, 80, None, 65],
})

print(df)
```

Output:

```
    Name  English  Math  Physics
0   Mike     80.0  82.0     74.0
1   Dave     78.0  85.0      NaN
2   Lucy     65.0   NaN     80.0
3  Molly      NaN  90.0      NaN
4  Sarah     72.0   NaN     65.0
```

As we can see from the output, the dataset contains NaN values in multiple columns. To fill all of them, we can call fillna() on the whole DataFrame and pass in the per-column medians returned by df.median(numeric_only=True), as shown below.

```
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mike', 'Dave', 'Lucy', 'Molly', 'Sarah'],
    'English': [80, 78, 65, None, 72],
    'Math': [82, 85, None, 90, None],
    'Physics': [74, None, 80, None, 65],
})

# Fill every NaN value with the median of its own column
df.fillna(df.median(numeric_only=True), inplace=True)

print(df)
```

Output:

```
    Name  English  Math  Physics
0   Mike     80.0  82.0     74.0
1   Dave     78.0  85.0     74.0
2   Lucy     65.0  85.0     80.0
3  Molly     75.0  90.0     74.0
4  Sarah     72.0  85.0     65.0
```

As we can see from the output, the NaN values in all columns are now replaced with the median of their own column: 75.0 for English, 85.0 for Math, and 74.0 for Physics. The medians are computed per column, not from the dataset as a whole.
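A quick way to check a fill like this is to print the per-column medians and count the NaN values that remain. The sketch below reuses the grades data from this example; the variable names medians and filled are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mike", "Dave", "Lucy", "Molly", "Sarah"],
    "English": [80, 78, 65, None, 72],
    "Math": [82, 85, None, 90, None],
    "Physics": [74, None, 80, None, 65],
})

# One median per numeric column; NaN values are ignored in the calculation
medians = df.median(numeric_only=True)

# Non-inplace fill, so the original df stays available for comparison
filled = df.fillna(medians)

print(medians)
print(filled.isna().sum())  # count of remaining NaN values per column
```

Keeping the original DataFrame around makes it easy to compare the filled values against the gaps they replaced.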

Conclusion

In this article, we discussed three methods of using the fillna() method in Pandas DataFrame to fill NaN values in one column, multiple columns, and all columns. We walked through examples to demonstrate how to fill the NaN values with median values and how to implement these methods in a real-life scenario.

Handling missing data is crucial to obtaining accurate results from any data analysis. These methods allow for the replacement of missing values with median values, which is a useful technique that can lead to more reliable results in statistical analysis.
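One reason the median is often preferred over the mean as a fill value is that a single extreme value barely moves it. The sketch below illustrates this with an invented salary column containing one outlier; the numbers are assumptions, not data from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme value and one missing entry
salary = pd.Series([48000, 50000, 52000, np.nan, 1_000_000], dtype="float64")

mean_fill = salary.fillna(salary.mean())      # pulled upward by the outlier
median_fill = salary.fillna(salary.median())  # stays near the typical values

print("mean fill value:  ", mean_fill[3])
print("median fill value:", median_fill[3])
```

Here the mean-based fill lands far above every typical salary, while the median-based fill stays between the two middle values.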

We hope this article provided helpful insights into handling missing data. The key takeaways are how to fill NaN values with the median in one column, in several columns, or across an entire pandas DataFrame, and how to apply these techniques to real datasets.

Handling missing data can be challenging, but fillna() makes it a manageable task and helps keep the results of an analysis reliable.
