Adventures in Machine Learning

Mastering Pandas GroupBy: An Essential Skill for Every Data Scientist

Data science and analytics have become an essential part of many industries in recent years. Handling large datasets, sorting and analyzing data, and obtaining meaningful insights are crucial skills for aspiring data scientists.

One of the most useful operations in data analysis is grouping data. In pandas, we can group rows that share a value in one or more key columns and perform specific operations on each group.

In this article, we will explore pandas GroupBy basics and learn how to use it to group data in various ways.

Dataset Description

Before we proceed, let’s briefly describe the dataset we will be using throughout this article. We will be using a dataset containing information about various movies.

The dataset includes columns such as the name of the movie, its genre, director’s name, release year, and box-office collection. The objective of using this dataset is to explore how we can use GroupBy to obtain meaningful insights from it.

Prerequisites

To understand the concepts discussed in this article, you should have a basic understanding of Python programming and the pandas library. Familiarity with data analysis and manipulation concepts will also be useful.

Dataset Loading

Let’s start by loading the dataset into our Python environment. We will be using the pandas library to load and manipulate the dataset.

python
import pandas as pd
# Load the dataset
movie_data = pd.read_csv('movie_data.csv')

pandas GroupBy Basics

pandas provides a powerful and flexible way to group data based on one or more columns using the GroupBy function. The GroupBy function essentially splits the dataset into smaller groups based on the key column(s), applies some function to each group, and then combines the results back into a new DataFrame.

Split-Apply-Combine Process

The GroupBy function follows the split-apply-combine process, which can be summarized in the following steps:

  1. Split the dataset into groups based on key criteria
  2. Apply a function to each group to perform aggregate calculations
  3. Combine the results of step 2 back into a new DataFrame
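
We can see these three steps directly on the movie dataset. Here is a minimal sketch (it assumes the data has been loaded as shown above and that 'Action' is one of the genres):

python
# Split: build the groups by Genre
grouped = movie_data.groupby('Genre')
print(grouped.get_group('Action'))  # inspect the rows of a single group
# Apply + Combine: aggregate each group; results come back as one Series
print(grouped['BoxOffice'].sum())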

DataFrameGroupBy Object

Before we start grouping data, let’s first understand the DataFrameGroupBy object returned when we call `groupby()` on a DataFrame. It is not a DataFrame itself, but a special object that stores information about the groups and computes nothing until an aggregation function is called.

python
grouped_by_genre = movie_data.groupby('Genre')

type(grouped_by_genre)

Output:

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
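
Because the object is lazy, we can inspect its structure before computing anything; a quick sketch:

python
print(grouped_by_genre.ngroups)       # number of groups
print(list(grouped_by_genre.groups))  # the group labels (the genres)
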
Grouping on a Single Column

The most basic use of GroupBy is to group data based on a single column. The `groupby()` method can be called on a DataFrame with the column name as the argument.

Once the data is grouped, we can apply any of the following methods for aggregation:

  • `count`: Number of non-null observations in each group
  • `sum`: Sum of values in each group
  • `mean`: Mean of values in each group
  • `min`: Minimum value in each group
  • `max`: Maximum value in each group
  • `median`: Median value in each group
  • `std`: Standard deviation of values in each group
  • `var`: Variance of values in each group

python
# Group the data by genre and calculate count
grouped_by_genre_count = movie_data.groupby('Genre').count()
print(grouped_by_genre_count[['Name', 'BoxOffice']])

# Group the data by genre and calculate the mean of the numeric columns
grouped_by_genre_mean = movie_data.groupby('Genre').mean(numeric_only=True)
print(grouped_by_genre_mean[['BoxOffice']])

Output:

           Name  BoxOffice
Genre                     
Action        4          4
Adventure     5          5
Comedy        5          5
Drama         4          4
Thriller      2          2

            BoxOffice
Genre                
Action     197.500000
Adventure  194.200000
Comedy     246.600000
Drama      104.250000
Thriller   332.500000
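
Several of these aggregations can also be computed in a single call with `.agg()`. A minimal sketch on the same data:

python
# Count, mean, and max of box office per genre in one pass
genre_stats = movie_data.groupby('Genre')['BoxOffice'].agg(['count', 'mean', 'max'])
print(genre_stats)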

Grouping on Multiple Columns

We can also group data based on multiple columns. To do so, we pass a list of column names to the `groupby()` method.

python
grouped_by_year_genre = movie_data.groupby(['Year', 'Genre'], as_index=False).size()
print(grouped_by_year_genre.head())

Output:

   Year   Genre  size
0  2010  Comedy     3
1  2010   Drama     1
2  2011  Action     1
3  2011  Comedy     2
4  2011   Drama     1
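
Because we passed `as_index=False`, the result is a plain DataFrame, which we can reshape into a Year-by-Genre table. A sketch:

python
# Pivot the long result into a wide Year x Genre count table
year_genre_table = grouped_by_year_genre.pivot(index='Year', columns='Genre', values='size').fillna(0)
print(year_genre_table)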

Using Lambda Functions with GroupBy

One of the most powerful features of pandas GroupBy is the ability to use lambda functions and apply them to the grouped data. We can define our lambda functions to perform custom calculations or transformations on the grouped data.

python
# Lambda to compute each movie's share of its genre's total box office
percent_boxoffice = lambda x: (x['BoxOffice'] / x['BoxOffice'].sum()) * 100
# Group the data by genre and apply the lambda to each group
grouped_by_genre.apply(percent_boxoffice)

Output:

     BoxOffice
0    16.567870
1    25.040701
2    30.065359
3    19.326511
4     8.999559
..         ...
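
For per-row results like these, `transform()` is often a cleaner alternative to `apply()`, because it broadcasts each group's sum back to the original rows. A sketch:

python
# Same percentages, computed with transform and stored as a new column
movie_data['BoxOfficePct'] = (
    movie_data['BoxOffice']
    / movie_data.groupby('Genre')['BoxOffice'].transform('sum')
    * 100
)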

Conclusion

In conclusion, pandas GroupBy is a powerful and flexible function that enables us to group data based on single or multiple columns and apply various aggregation functions or custom lambda functions for data transformation. While we’ve only touched the surface of what we can do with pandas GroupBy, this article provides a good starting point for anyone who wants to learn more about handling and manipulating data with pandas.

Grouping with Resampling

In the previous section, we learned about grouping data based on one or more columns. In this section, we will explore time-based grouping operations using resampling.

Resampling is a powerful tool that allows us to convert a time series from one frequency to another.

Time-based Grouping Operations

Data collected over time often contains time-related information such as timestamps. Grouping data by time allows for efficient analysis of data over specific time intervals.

For instance, we might want to group daily sales data by month or week. pandas provides several time-based grouping options that enable us to aggregate data over specific time intervals such as hour, day, week, month, year, etc.

Resampling with Frequency Strings

Resampling lets us group data by time intervals other than the one at which it was collected; in pandas, it refers to the process of converting a time series from one frequency to another.

With resampling, we can downsample our data (reduce its frequency) or upsample our data (increase its frequency). We can use the `.resample()` method in pandas to achieve resampling.

python
# Resample daily stock prices to weekly frequency
# (assumes stock_prices is a Series or DataFrame with a DatetimeIndex)
weekly_stock_prices = stock_prices.resample('W').mean()

The `'W'` argument passed to `resample()` indicates a weekly frequency. We can also pass other frequency strings such as `'D'` (day), `'M'` (month), `'Q'` (quarter), and `'Y'` (year).
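
To see resampling end to end, here is a self-contained toy example; the synthetic prices are made up purely for illustration:

python
import numpy as np
# 30 days of synthetic daily prices
idx = pd.date_range('2023-01-01', periods=30, freq='D')
stock_prices = pd.Series(np.random.default_rng(0).normal(100, 5, size=30), index=idx)
# Downsample to weekly means
print(stock_prices.resample('W').mean())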

U.S. Congress Dataset Example

Let’s explore how we can use resampling and grouping operations on a publicly available dataset – the U.S. Congress dataset. This dataset contains information about bills introduced in the U.S. Congress and their status at various stages of the legislative process.

We will be focusing on the timestamp information to group the data.

Dataset Description

The U.S. Congress dataset contains the following columns:

  • congress: the session of Congress (identified by the year they took office).
  • chamber: the chamber where the bill was introduced, the Senate or House of Representatives.
  • bill_number: the unique identification number of the bill.
  • bill_title: the title of the bill.
  • introduction_date: the date the bill was introduced to Congress.
  • latest_major_action_date: the date of the latest significant action taken on the bill.
  • latest_major_action: a description of the latest significant action taken on the bill.

Grouping on Derived Arrays

We can calculate derived arrays based on the timestamps to perform grouping operations. In this section, we will explore how we can group the U.S. Congress dataset by day of the week and hour of the day.

Grouping by Day of the Week

python
# Convert introduction_date column to datetime format
congress_data['introduction_date'] = pd.to_datetime(congress_data['introduction_date'])
# Group data by day of the week
congress_data_weekday_grouped = congress_data.groupby(congress_data['introduction_date'].dt.day_name())['congress'].count()
# Sort data by day of the week
congress_data_weekday_grouped = congress_data_weekday_grouped.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
# Plot the results
congress_data_weekday_grouped.plot(kind='bar', xlabel='Day of the Week', ylabel='Bill Count', title='Bills by Day of the Week')

Output:

[Bar chart: number of bills introduced, grouped by day of the week]

In the above code snippet, we first convert the introduction_date column to a datetime format. We then group the data by the day of the week using the `.dt.day_name()` method, which extracts the day name from the datetime column.

We then count the number of bills grouped by each day of the week and sort them in the right order. Finally, we plot the results using a bar plot.
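
The same counts can be obtained without an explicit groupby by calling `value_counts()` on the derived day names; note that it orders by frequency rather than by weekday. A sketch:

python
# Equivalent day-of-week counts, ordered by frequency
day_counts = congress_data['introduction_date'].dt.day_name().value_counts()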

Grouping by Hour of the Day

python
# Convert latest_major_action_date column to datetime format
congress_data['latest_major_action_date'] = pd.to_datetime(congress_data['latest_major_action_date'])
# Create bins for hours of the day (closed='left' so hour 0 lands in the first bin)
hours_of_day_bins = pd.IntervalIndex.from_tuples([(0, 4), (4, 8), (8, 12), (12, 16), (16, 20), (20, 24)], closed='left')
# Group data by hour of the day
congress_data_hour_grouped = congress_data.groupby(pd.cut(congress_data['latest_major_action_date'].dt.hour, bins=hours_of_day_bins))['congress'].count()
# Plot the results
congress_data_hour_grouped.plot(kind='bar', xlabel='Hour of the Day', ylabel='Bill Count', title='Bill Major Action by Hour of the Day')

Output:

[Bar chart: number of bill major actions, grouped by hour of the day]

In the above code snippet, we first convert the latest_major_action_date column to a datetime format. We then create bins for the hours of the day using the `IntervalIndex.from_tuples()` method.

We then group the data by the hour of the day using the `pd.cut()` function and count the number of bills grouped by each interval. Finally, we plot the results using a bar plot.
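
The resampling approach from earlier applies here as well, for example to count bills introduced per month. A sketch, assuming introduction_date has already been converted to datetime as above:

python
# Monthly bill counts via resample on a datetime index
monthly_bills = (
    congress_data.set_index('introduction_date')
    .resample('M')['bill_number']
    .count()
)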

Conclusion

In this article, we explored time-based grouping operations using resampling and derived arrays. We learned how to resample time series data using the `.resample()` method and group data based on specific intervals such as day of the week and hour of the day.

We applied these concepts to a real-world dataset – the U.S. Congress dataset – to demonstrate how we can use time-based grouping operations in practice to perform meaningful analysis on data with timestamps.

Air Quality Dataset Example

In this section, we will explore another example of using grouping and resampling operations on the Air Quality dataset. The Air Quality dataset contains daily air quality concentrations for several pollutants in the United States.

Dataset Description

The Air Quality dataset contains the following columns:

  • date: the date the observation was taken.
  • ozone: the ozone concentration in parts per billion (ppb).
  • pm25: the particulate matter concentration (PM2.5) in micrograms per cubic meter (ug/m3).
  • wind: the wind speed in miles per hour (mph).

Grouping with Resampling

As we saw earlier, resampling converts a time series from one frequency to another: we can downsample our data (reduce its frequency) or upsample it (increase its frequency) using the `.resample()` method in pandas.

python
# Ensure the date column is in datetime format
air_quality['date'] = pd.to_datetime(air_quality['date'])
# Aggregate daily air quality data to monthly means
air_quality_monthly = air_quality.groupby(pd.Grouper(key='date', freq='M')).mean()

In the above code snippet, we group the data by month using `pd.Grouper()` with the `'M'` frequency argument, which acts as a time-based grouping key.

We then calculate the mean air quality concentrations for each month using the `.mean()` method.
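
The same monthly aggregation can be written with `.resample()` by first making the date column the index. A sketch:

python
# Equivalent monthly means using resample
air_quality_monthly = air_quality.set_index('date').resample('M').mean()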

News Aggregator Dataset Example

In this example, we will explore the News Aggregator dataset to demonstrate how we can group data by categories and perform various operations on them.

Dataset Description

The News Aggregator dataset contains a collection of news headlines and their corresponding categories. The dataset contains the following columns:

  • ID: the unique identifier for the article.
  • title: the headline of the article.
  • url: the URL of the article.
  • publisher: the publisher of the article.
  • category: the category of the article.
  • story: the original newswire story that the article appeared in.
  • hostname: the hostname of the URL.
  • timestamp: the approximate time the article was published, stored as a Unix timestamp in milliseconds.

Grouping by Categories

Counting Articles by Category

python
# Count articles per category
category_counts = news_data.groupby('category').size()
# Sort categories by count
category_counts = category_counts.sort_values(ascending=False)
# Plot the results
category_counts.plot(kind='bar', xlabel='Category', ylabel='Article Count', title='Article Count by Category')

In the above code snippet, we first group the data by the category column and then calculate the size of each group using the `.size()` method. We then sort the categories in descending order based on the size and plot the results using a bar plot.

Finding the Earliest and Latest Articles by Category

python
# Convert the millisecond Unix timestamps to datetimes
news_data['timestamp'] = pd.to_datetime(news_data['timestamp'], unit='ms')
# Find the earliest and latest articles by category
earliest_article = news_data.groupby('category')['timestamp'].min()
latest_article = news_data.groupby('category')['timestamp'].max()
# Combine the results into a DataFrame
article_dates = pd.concat([earliest_article, latest_article], axis=1, keys=['Earliest', 'Latest'])

In the above code snippet, we group the data by the category column, compute the minimum and maximum timestamp within each group, and combine the two resulting Series into a single DataFrame with `pd.concat()`.
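
The two groupbys and the concat can also be collapsed into a single named aggregation; a sketch:

python
# One pass with named aggregation instead of two groupbys plus concat
article_dates = news_data.groupby('category')['timestamp'].agg(Earliest='min', Latest='max')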
