Adventures in Machine Learning

Mastering Pandas: Efficient Data Manipulation for Data Analysis

The pandas groupby() method lets you split a DataFrame into groups and compute a statistic for each group. In this article, we use groupby() to calculate the mean of a column per group and look at how NaN values are handled: by default, they are ignored.

We also show what happens when every value in a group is NaN, and how to drop such groups from the result.

Grouping by one column and calculating mean of another column

Suppose we have a DataFrame containing information about basketball players. We want to group the DataFrame by team name and calculate the average height for each team. We can use the groupby() function in pandas to group the DataFrame by the team column and then apply the mean() function to the height column.


import pandas as pd
import numpy as np
data = {'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics', 'Raptors', 'Warriors', 'Lakers', 'Celtics'],
'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Kyrie Irving', 'Marc Gasol', 'Kevin Durant', 'Anthony Davis', 'Gordon Hayward'],
'height': [201, 191, 203, np.nan, 216, 208, 208, 202]}
df = pd.DataFrame(data)
grouped = df.groupby('team')['height'].mean()
print(grouped)

In this example, we group the DataFrame by the team column and then apply the mean() function to the height column. The output shows the mean height for each team.


team
Celtics     202.0
Lakers      205.5
Raptors     208.5
Warriors    199.5
Name: height, dtype: float64
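To see what groupby() computes for a single group, the sketch below reproduces the Celtics value by hand with a boolean mask. It reuses the team and height data from the example above; the names by_team, celtics_heights, and manual_mean are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics',
             'Raptors', 'Warriors', 'Lakers', 'Celtics'],
    'height': [201, 191, 203, np.nan, 216, 208, 208, 202],
})

# Mean height per team, as computed by groupby()
by_team = df.groupby('team')['height'].mean()

# The same figure for one team, computed by hand with a boolean mask
celtics_heights = df.loc[df['team'] == 'Celtics', 'height']
manual_mean = celtics_heights.mean()  # the NaN entry is skipped

print(by_team['Celtics'], manual_mean)
```

Both values come out as 202.0, because the one missing Celtics height is left out of the average.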

Ignoring NaN values by default

By default, pandas skips NaN (Not a Number) values when it aggregates. A missing value is left out of both the sum and the count, so the mean is computed from the remaining values only. The first example already relied on this: Kyrie Irving's height was NaN, and the Celtics mean of 202.0 came from Gordon Hayward's height alone.

Skipping only helps when at least one value remains. If we also set Gordon Hayward's height to NaN, every Celtics height is missing and there is nothing left to average.


data = {'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics', 'Raptors', 'Warriors', 'Lakers', 'Celtics'],
'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Kyrie Irving', 'Marc Gasol', 'Kevin Durant', 'Anthony Davis', 'Gordon Hayward'],
'height': [201, 191, 203, np.nan, 216, 208, 208, np.nan]}
df = pd.DataFrame(data)
grouped = df.groupby('team')['height'].mean()
print(grouped)

The output shows NaN for the Celtics, because both of their heights are missing. The other teams are unaffected.


team
Celtics       NaN
Lakers      205.5
Raptors     208.5
Warriors    199.5
Name: height, dtype: float64
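A quick way to check how many values actually contribute to each mean is to aggregate count and size alongside it. This is a minimal sketch on the same team and height data; count excludes NaN values, while size counts every row in the group. The name summary is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics',
             'Raptors', 'Warriors', 'Lakers', 'Celtics'],
    'height': [201, 191, 203, np.nan, 216, 208, 208, np.nan],
})

# mean skips NaN, count ignores NaN, size counts all rows in the group
summary = df.groupby('team')['height'].agg(['mean', 'count', 'size'])
print(summary)
```

A group whose count is 0 while its size is greater than 0 contains only missing values, which explains a NaN mean.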

Displaying NaN if NaN values are present

In some cases, you may want to keep only the groups that produced a usable mean. Calling dropna() on the grouped result removes every group whose mean is NaN. Note that this does not display NaN values; it hides them, so a team missing from the output is one whose heights were all NaN.

For example, suppose one Warriors height is missing in addition to both Celtics heights. We can chain dropna() after mean():


data = {'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics', 'Raptors', 'Warriors', 'Lakers', 'Celtics'],
'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Kyrie Irving', 'Marc Gasol', 'Kevin Durant', 'Anthony Davis', 'Gordon Hayward'],
'height': [201, 191, 203, np.nan, 216, np.nan, 208, np.nan]}
df = pd.DataFrame(data)
grouped = df.groupby('team')['height'].mean().dropna()
print(grouped)

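The output no longer lists the Celtics, because their mean was NaN and dropna() removed it. The Warriors remain: their one missing height is skipped, so their mean is the single remaining value, 191.0.


team
Lakers      205.5
Raptors     208.5
Warriors    191.0
Name: height, dtype: float64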
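If the goal is instead to make NaN visible whenever a group contains any missing value, one option is to turn off NaN skipping inside each group. The sketch below is one way to do that, assuming the same data as above; it passes skipna=False to Series.mean through a lambda, and the names strict and has_missing are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics',
             'Raptors', 'Warriors', 'Lakers', 'Celtics'],
    'height': [201, 191, 203, np.nan, 216, np.nan, 208, np.nan],
})

# Series.mean(skipna=False) returns NaN as soon as the group holds any NaN
strict = df.groupby('team')['height'].agg(lambda s: s.mean(skipna=False))
print(strict)

# Flag the teams that have at least one missing height
has_missing = df['height'].isna().groupby(df['team']).any()
print(has_missing)
```

Here both the Celtics and the Warriors show NaN, since each has at least one missing height, while the Lakers and Raptors keep their means.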

Conclusion

In this section, we used the groupby() method to calculate the mean of a column per group, with NaN values ignored by default. We also saw that a group whose values are all NaN yields a NaN mean, and that dropna() removes such groups from the result.

The groupby() method is one of many pandas tools for working with tabular data. In the rest of this article, we explore some other common DataFrame tasks and explain how to perform them.

Filtering Data

One of the most common tasks you may encounter when working with large datasets is filtering data. In pandas, you can filter data using boolean indexing.

Boolean indexing allows you to select data based on certain conditions. For example, if we have a DataFrame containing information about basketball players and we want to filter the data to only show players above a certain height, we can use the following code:


import pandas as pd
import numpy as np
data = {'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Kyrie Irving', 'Marc Gasol', 'Kevin Durant', 'Anthony Davis', 'Gordon Hayward'],
'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics', 'Raptors', 'Warriors', 'Lakers', 'Celtics'],
'height': [201, 191, 203, 188, 216, 208, 208, 203]}
df = pd.DataFrame(data)
tall_players = df[df['height'] > 200]
print(tall_players)

The output shows only the players with height above 200.


           player      team  height
0   Kawhi Leonard   Raptors     201
2    LeBron James    Lakers     203
4      Marc Gasol   Raptors     216
5    Kevin Durant  Warriors     208
6   Anthony Davis    Lakers     208
7  Gordon Hayward   Celtics     203
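Boolean indexing also handles several conditions at once. The sketch below, which assumes the same player data, combines two masks with & (each condition needs its own parentheses) and shows the equivalent query() call. The names tall_raptors and same_result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Kyrie Irving',
               'Marc Gasol', 'Kevin Durant', 'Anthony Davis', 'Gordon Hayward'],
    'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics',
             'Raptors', 'Warriors', 'Lakers', 'Celtics'],
    'height': [201, 191, 203, 188, 216, 208, 208, 203],
})

# Combine conditions with & (and) or | (or); wrap each one in parentheses
tall_raptors = df[(df['height'] > 200) & (df['team'] == 'Raptors')]

# The same filter written as a query string
same_result = df.query("height > 200 and team == 'Raptors'")

print(tall_raptors)
```

Python's plain and/or keywords do not work on the masks themselves, which is why the element-wise operators & and | are used outside of query().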

Reordering Columns

Sometimes, you may want to change the order of columns in your DataFrame. To do this in pandas, you can use the reindex() function.

For example, if we have a DataFrame containing the same information about basketball players, but we want to reorder the columns so that the team column comes first, we can use the following code:


data = {'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Kyrie Irving', 'Marc Gasol', 'Kevin Durant', 'Anthony Davis', 'Gordon Hayward'],
'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics', 'Raptors', 'Warriors', 'Lakers', 'Celtics'],
'height': [201, 191, 203, 188, 216, 208, 208, 203]}
df = pd.DataFrame(data)
cols = ['team', 'player', 'height']
df = df.reindex(columns=cols)
print(df)

The output shows the new DataFrame with the team column coming first.


       team          player  height
0   Raptors   Kawhi Leonard     201
1  Warriors   Stephen Curry     191
2    Lakers    LeBron James     203
3   Celtics    Kyrie Irving     188
4   Raptors      Marc Gasol     216
5  Warriors    Kevin Durant     208
6    Lakers   Anthony Davis     208
7   Celtics  Gordon Hayward     203
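Selecting with a list of column names is a common alternative to reindex() for reordering. The sketch below uses a shortened version of the player data; the variable names and the misspelled label 'teem' are illustrative. It also shows one difference worth knowing: a misspelled name in a list selection raises KeyError, whereas reindex() silently adds an all-NaN column.

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James'],
    'team': ['Raptors', 'Warriors', 'Lakers'],
    'height': [201, 191, 203],
})

# Select the columns in the desired order
reordered = df[['team', 'player', 'height']]

# Move one column to the front without typing out the others
front = 'team'
moved = df[[front] + [c for c in df.columns if c != front]]

# reindex() does not raise on an unknown label; it creates an all-NaN column
typo = df.reindex(columns=['teem', 'player', 'height'])  # 'teem' is a deliberate typo

print(list(moved.columns))
print(typo['teem'].isna().all())
```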

Renaming Columns

You may also want to change the name of columns in your DataFrame. To do this in pandas, you can use the rename() function.

For example, if we have a DataFrame with the same information about basketball players, but we want to change the name of the player column to name, we can use the following code:


data = {'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Kyrie Irving', 'Marc Gasol', 'Kevin Durant', 'Anthony Davis', 'Gordon Hayward'],
'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics', 'Raptors', 'Warriors', 'Lakers', 'Celtics'],
'height': [201, 191, 203, 188, 216, 208, 208, 203]}
df = pd.DataFrame(data)
df.rename(columns={'player': 'name'}, inplace=True)
print(df)

The output shows the new DataFrame with the player column renamed to name.


             name      team  height
0   Kawhi Leonard   Raptors     201
1   Stephen Curry  Warriors     191
2    LeBron James    Lakers     203
3    Kyrie Irving   Celtics     188
4      Marc Gasol   Raptors     216
5    Kevin Durant  Warriors     208
6   Anthony Davis    Lakers     208
7  Gordon Hayward   Celtics     203
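rename() can also return a new DataFrame instead of modifying the original, which is often easier to reason about. The sketch below uses a shortened version of the player data and also passes errors='raise' so that a misspelled column name fails loudly; by default, rename() ignores labels it cannot find. The names renamed and typo_caught, and the misspelling 'playr', are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['Kawhi Leonard', 'Stephen Curry'],
    'team': ['Raptors', 'Warriors'],
    'height': [201, 191],
})

# Without inplace=True, rename() returns a new DataFrame and leaves df untouched
renamed = df.rename(columns={'player': 'name'})

# errors='raise' turns an unknown column name into a KeyError
try:
    df.rename(columns={'playr': 'name'}, errors='raise')  # deliberate typo
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns), list(df.columns), typo_caught)
```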

Replacing Values

Another common task you may encounter when working with data is replacing certain values. To do this in pandas, you can use the replace() function.

For example, suppose we have a DataFrame containing information about the same basketball players, but we want to replace all occurrences of Raptors with Toronto Raptors in the team column. We can use the following code:


data = {'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Kyrie Irving', 'Marc Gasol', 'Kevin Durant', 'Anthony Davis', 'Gordon Hayward'],
'team': ['Raptors', 'Warriors', 'Lakers', 'Celtics', 'Raptors', 'Warriors', 'Lakers', 'Celtics'],
'height': [201, 191, 203, 188, 216, 208, 208, 203]}
df = pd.DataFrame(data)
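# Assign the result back: an inplace replace on df['team'] may act on a temporary copy and leave df unchanged
df['team'] = df['team'].replace({'Raptors': 'Toronto Raptors'})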
print(df)

The output shows the new DataFrame with all occurrences of Raptors in the team column replaced with Toronto Raptors.


           player             team  height
0   Kawhi Leonard  Toronto Raptors     201
1   Stephen Curry         Warriors     191
2    LeBron James           Lakers     203
3    Kyrie Irving          Celtics     188
4      Marc Gasol  Toronto Raptors     216
5    Kevin Durant         Warriors     208
6   Anthony Davis           Lakers     208
7  Gordon Hayward          Celtics     203
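replace() accepts a dictionary with several entries, so multiple values can be swapped in one call. The sketch below uses a shortened version of the player data; the second mapping (Lakers to Los Angeles Lakers) and the variable names are added purely for illustration. It shows the column-level form and an equivalent DataFrame-level form with a nested dictionary.

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['Kawhi Leonard', 'Stephen Curry', 'LeBron James', 'Marc Gasol'],
    'team': ['Raptors', 'Warriors', 'Lakers', 'Raptors'],
    'height': [201, 191, 203, 216],
})

# Illustrative mapping from short names to full names
full_names = {'Raptors': 'Toronto Raptors', 'Lakers': 'Los Angeles Lakers'}

# Column-level: returns a new Series; values not in the mapping are kept
by_column = df['team'].replace(full_names)

# DataFrame-level: the outer key restricts the replacement to one column
whole_frame = df.replace({'team': full_names})

print(whole_frame)
```

Neither call modifies df; assign the result back (for example, df['team'] = by_column) to keep the change.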

Conclusion

In this section, we explored some of the most common tasks you may encounter when working with pandas DataFrames, including filtering data, reordering columns, renaming columns, and replacing values. Understanding how to perform these tasks in pandas will give you the ability to manipulate and prepare data efficiently and effectively.

In this article, we explored the pandas groupby() method: how to calculate the mean of a column per group, how NaN values are ignored by default, and what happens when a group contains only NaN values. We also covered common DataFrame tasks: filtering rows, reordering and renaming columns, and replacing values.

Performing these tasks efficiently in pandas is central to data analysis and preparation. The main takeaways are familiarity with the core pandas methods, boolean indexing, and basic data manipulation, skills that save time and improve the quality of analysis in any professional setting.
