Data science and analytics have become an essential part of many industries in recent years. Handling large datasets, sorting and analyzing data, and obtaining meaningful insights are crucial skills for aspiring data scientists.
One of the most useful functions in data analysis is grouping data. In pandas, one can group data based on some criteria or key elements and perform specific operations on them.
In this article, we will explore pandas GroupBy basics and learn how to use it to group data in various ways.
Dataset Description
Before we proceed, let’s briefly describe the dataset we will be using throughout this article. We will be using a dataset containing information about various movies.
The dataset includes columns such as the name of the movie, its genre, director’s name, release year, and box-office collection. The objective of using this dataset is to explore how we can use GroupBy to obtain meaningful insights from it.
Prerequisites
To understand the concepts discussed in this article, you should have a basic understanding of Python programming and the pandas library. Familiarity with data analysis and manipulation concepts will also be useful.
Dataset Loading
Let’s start by loading the dataset into our Python environment. We will be using the pandas library to load and manipulate the dataset.
python
import pandas as pd
# Load the dataset
movie_data = pd.read_csv('movie_data.csv')
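To verify that the file loaded correctly, we can peek at the first few rows. The column names in the comment below follow the dataset description above and are assumed for illustration.
python
# Inspect the first few rows (expected columns: Name, Genre, Director, Year, BoxOffice)
print(movie_data.head())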
pandas GroupBy Basics
pandas provides a powerful and flexible way to group data based on one or more columns using the `groupby()` method. A `groupby()` call essentially splits the dataset into smaller groups based on the key column(s), applies some function to each group, and then combines the results into a new DataFrame or Series.
Split-Apply-Combine Process
The `groupby()` operation follows the split-apply-combine process, which can be summarized in the following steps (a hands-on sketch follows the list):
- Split the dataset into groups based on key criteria
- Apply a function to each group to perform aggregate calculations
- Combine the results of step 2 back into a new DataFrame
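To make these steps concrete, here is a minimal sketch that replicates the process by hand, assuming the movie_data DataFrame loaded above:
python
# Split: iterate over the groups produced by groupby
results = {}
for genre, group in movie_data.groupby('Genre'):
    # Apply: compute an aggregate for each group
    results[genre] = group['BoxOffice'].mean()
# Combine: collect the per-group results into a single Series
combined = pd.Series(results, name='BoxOffice')
print(combined)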
DataFrameGroupBy Object
Before we start grouping data, let’s first understand the DataFrameGroupBy object that’s returned when we call `groupby()` on a DataFrame. A DataFrameGroupBy object is not a DataFrame, but rather a lazy intermediate object: it stores information about the groups and does not compute anything until an aggregation function is called.
python
grouped_by_genre = movie_data.groupby('Genre')
type(grouped_by_genre)
Output:
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
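We can confirm that the object holds only group bookkeeping rather than computed results: `.groups` maps each group label to the row indices it contains, and `get_group()` pulls out a single group as a regular DataFrame.
python
# Labels of the groups found in the 'Genre' column
print(list(grouped_by_genre.groups.keys()))
# Retrieve one group as an ordinary DataFrame
print(grouped_by_genre.get_group('Action'))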
Grouping on a Single Column
The most basic use of GroupBy is to group data based on a single column. The `groupby()` method can be called on a DataFrame with the column name as the argument.
Once the data is grouped, we can apply any of the following methods for aggregation:
- `count`: Number of non-null observations in each group
- `sum`: Sum of values in each group
- `mean`: Mean of values in each group
- `min`: Minimum value in each group
- `max`: Maximum value in each group
- `median`: Median value in each group
- `std`: Standard deviation of values in each group
- `var`: Variance of values in each group
python
# Group the data by genre and calculate count
grouped_by_genre_count = movie_data.groupby('Genre').count()
print(grouped_by_genre_count[['Name','BoxOffice']])
# Group the data by genre and calculate the mean of the numeric columns
grouped_by_genre_mean = movie_data.groupby('Genre').mean(numeric_only=True)
print(grouped_by_genre_mean[['BoxOffice']])
Output:
            Name  BoxOffice
Genre
Action         4          4
Adventure      5          5
Comedy         5          5
Drama          4          4
Thriller       2          2

           BoxOffice
Genre
Action    197.500000
Adventure 194.200000
Comedy    246.600000
Drama     104.250000
Thriller  332.500000
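Rather than calling these methods one at a time, we can also compute several aggregations in a single pass with `.agg()`. A short sketch:
python
# Compute several statistics for BoxOffice per genre at once
genre_stats = movie_data.groupby('Genre')['BoxOffice'].agg(['count', 'mean', 'min', 'max'])
print(genre_stats)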
Grouping on Multiple Columns
We can also group data based on multiple columns. To do so, we pass a list of column names to the `groupby()` method.
python
grouped_by_year_genre = movie_data.groupby(['Year', 'Genre'], as_index=False).size()
print(grouped_by_year_genre.head())
Output:
   Year   Genre  size
0  2010  Comedy     3
1  2010   Drama     1
2  2011  Action     1
3  2011  Comedy     2
4  2011   Drama     1
Using Lambda Functions with GroupBy
One of the most powerful features of pandas GroupBy is the ability to apply lambda functions to the grouped data. We can define our own lambda functions to perform custom calculations or transformations on each group.
python
# Lambda to calculate each movie's share of its genre's total box office
percent_boxoffice = lambda x: (x['BoxOffice'] / x['BoxOffice'].sum()) * 100
# Group the data by genre and apply the lambda function to calculate the percentage of box-office collection
grouped_by_genre.apply(percent_boxoffice)
Output:
BoxOffice
0 16.567870
1 25.040701
2 30.065359
3 19.326511
4 8.999559
5 48.353736
6 22.748801
7 28.895462
8 15.850479
9 32.502968
10 22.638274
11 17.008280
12 13.803144
13 7.816787
14 26.303813
15 27.839099
16 20.573426
17 25.767662
18 27.131800
19 9.162781
20 38.319506
21 12.022169
22 15.274405
23 23.267586
24 31.275234
25 18.110498
26 11.438719
27 21.024294
28 17.284277
29 32.253555
30 15.755932
31 14.378822
32 27.520074
33 19.731927
34 36.796338
35 20.951728
36 31.765806
37 22.740385
38 15.077024
39 24.716632
40 29.859020
41 12.860090
42 10.500352
43 36.973354
44 20.637204
45 39.795760
46 15.246437
47 15.568441
48 22.972843
49 25.417658
50 20.794823
51 18.073707
52 13.498512
53 13.261507
54 27.803296
55 21.684161
56 24.731234
57 23.740020
58 43.315835
59 17.652426
60 17.312492
61 26.616528
62 20.230020
63 21.232681
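As a side note, for this particular calculation a `transform()`-based approach is often cleaner than `apply()`: `transform('sum')` broadcasts each genre's total back to the original rows, so the result aligns with `movie_data` row for row. A minimal sketch:
python
# Each movie's share of its genre's total box office, aligned with the original rows
genre_totals = movie_data.groupby('Genre')['BoxOffice'].transform('sum')
movie_data['BoxOfficePercent'] = movie_data['BoxOffice'] / genre_totals * 100
print(movie_data[['Name', 'Genre', 'BoxOfficePercent']].head())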
Conclusion
In conclusion, pandas GroupBy is a powerful and flexible function that enables us to group data based on single or multiple columns and apply various aggregation functions or custom lambda functions for data transformation. While we’ve only touched the surface of what we can do with pandas GroupBy, this article provides a good starting point for anyone who wants to learn more about handling and manipulating data with pandas.
Grouping with Resampling
In the previous section, we learned about grouping data based on one or more columns. In this section, we will explore time-based grouping operations using resampling.
Resampling is a powerful tool that allows us to convert a time series from one frequency to another.
Time-based Grouping Operations
Data collected over time often contains time-related information such as timestamps. Grouping data by time allows for efficient analysis of data over specific time intervals.
For instance, we might want to group daily sales data by month or week. pandas provides several time-based grouping options that enable us to aggregate data over specific time intervals such as hour, day, week, month, year, etc.
Resampling with Frequency Strings
Resampling means grouping data by time intervals other than the one at which it was collected; in pandas, it refers to the process of converting a time series from one frequency to another.
With resampling, we can downsample our data (reduce its frequency) or upsample our data (increase its frequency). We can use the `.resample()` method in pandas to achieve resampling.
python
# Assume stock_prices is a DataFrame of daily prices with a DatetimeIndex
# Resample daily stock prices to weekly frequency
weekly_stock_prices = stock_prices.resample('W').mean()
The `'W'` argument passed to `resample()` indicates a weekly frequency. We can also pass other frequency strings such as `'D'` (day), `'M'` (month), `'Q'` (quarter), and `'Y'` (year).
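Upsampling works the same way, except that the new rows appear with no data, so we must tell pandas how to fill them. For example, to go back from weekly to daily frequency using the result of the previous snippet:
python
# Upsample weekly prices to daily frequency, forward-filling the gaps
daily_stock_prices = weekly_stock_prices.resample('D').ffill()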
U.S. Congress Dataset Example
Let’s explore how we can use resampling and grouping operations on a publicly available dataset – the U.S. Congress dataset. This dataset contains information about bills introduced in the U.S. Congress and their status at various stages of the legislative process.
We will be focusing on the timestamp information to group the data.
Dataset Description
The U.S. Congress dataset contains the following columns:
- congress: the session of Congress (identified by the year it took office).
- chamber: the chamber where the bill was introduced, the Senate or House of Representatives.
- bill_number: the unique identification number of the bill.
- bill_title: the title of the bill.
- introduction_date: the date the bill was introduced to Congress.
- latest_major_action_date: the date of the latest significant action taken on the bill.
- latest_major_action: a description of the latest significant action taken on the bill.
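As before, we load the dataset with pandas. The file name below is hypothetical; substitute the path to your copy of the dataset.
python
# Load the U.S. Congress dataset (hypothetical file name)
congress_data = pd.read_csv('congress_bills.csv')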
Grouping on Derived Arrays
We can calculate derived arrays based on the timestamps to perform grouping operations. In this section, we will explore how we can group the U.S. Congress dataset by day of the week and hour of the day.
Grouping by Day of the Week
python
# Convert introduction_date column to datetime format
congress_data['introduction_date'] = pd.to_datetime(congress_data['introduction_date'])
# Group data by day of the week
congress_data_weekday_grouped = congress_data.groupby(congress_data['introduction_date'].dt.day_name())['congress'].count()
# Sort data by day of the week
congress_data_weekday_grouped = congress_data_weekday_grouped.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
# Plot the results
congress_data_weekday_grouped.plot(kind='bar', xlabel='Day of the Week', ylabel='Bill Count', title='Bills by Day of the Week')
Output: a bar chart of bill counts for each day of the week.
In the above code snippet, we first convert the introduction_date column to a datetime format. We then group the data by the day of the week using the `.dt.day_name()` method, which extracts the day name from the datetime column.
We then count the number of bills grouped by each day of the week and sort them in the right order. Finally, we plot the results using a bar plot.
Grouping by Hour of the Day
python
# Convert latest_major_action_date column to datetime format
congress_data['latest_major_action_date'] = pd.to_datetime(congress_data['latest_major_action_date'])
# Create bins for the hours of the day (closed on the left so hour 0 is included)
hours_of_day_bins = pd.IntervalIndex.from_tuples([(0, 4), (4, 8), (8, 12), (12, 16), (16, 20), (20, 24)], closed='left')
# Group data by hour of the day
congress_data_hour_grouped = congress_data.groupby(pd.cut(congress_data['latest_major_action_date'].dt.hour, bins=hours_of_day_bins))['congress'].count()
# Plot the results
congress_data_hour_grouped.plot(kind='bar', xlabel='Hour of the Day', ylabel='Bill Count', title='Bill Major Action by Hour of the Day')
Output: a bar chart of bill counts for each four-hour interval of the day.
In the above code snippet, we first convert the latest_major_action_date column to a datetime format. We then create bins for the hours of the day using the `IntervalIndex.from_tuples()` method, closing the intervals on the left so that hour 0 falls into the first bin.
We then group the data by the hour of the day using the `pd.cut()` function and count the number of bills grouped by each interval. Finally, we plot the results using a bar plot.
Conclusion
In this article, we explored time-based grouping operations using resampling and derived arrays. We learned how to resample time series data using the `.resample()` method and group data based on specific intervals such as day of the week and hour of the day.
We applied these concepts to a real-world dataset – the U.S. Congress dataset – to demonstrate how we can use time-based grouping operations in practice to perform meaningful analysis on data with timestamps.
Air Quality Dataset Example
In this section, we will explore another example of using grouping and resampling operations on the Air Quality dataset. The Air Quality dataset contains daily air quality concentrations for several pollutants in the United States.
Dataset Description
The Air Quality dataset contains the following columns:
- date: the date the observation was taken.
- ozone: the ozone concentration in parts per billion (ppb).
- pm25: the particulate matter concentration (PM2.5) in micrograms per cubic meter (ug/m3).
- wind: the wind speed in miles per hour (mph).
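We load the dataset as before; the file name is hypothetical. Parsing the date column up front saves a separate `pd.to_datetime()` step, since the grouping below relies on it being a datetime.
python
# Load the Air Quality dataset (hypothetical file name), parsing dates on read
air_quality = pd.read_csv('air_quality.csv', parse_dates=['date'])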
Grouping with Resampling
Resampling is a powerful tool for data analysis that enables us to convert a time series from one frequency to another. With resampling, we can downsample our data (reduce its frequency) or upsample our data (increase its frequency).
We can use the `.resample()` method in pandas to achieve resampling.
python
# Resample daily air quality data to monthly frequency
air_quality_monthly = air_quality.groupby(pd.Grouper(key='date', freq='M')).mean()
In the above code snippet, we group the data by month using a `pd.Grouper` object with the `'M'` frequency passed to `freq`.
We then calculate the mean air quality concentrations for each month using the `.mean()` method.
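For reference, `pd.Grouper` with a `key` column is effectively equivalent to setting that column as the index and calling `.resample()` directly:
python
# Equivalent monthly aggregation using resample() on a DatetimeIndex
air_quality_monthly_alt = air_quality.set_index('date').resample('M').mean()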
News Aggregator Dataset Example
In this example, we will explore the News Aggregator dataset to demonstrate how we can group data by categories and perform various operations on them.
Dataset Description
The News Aggregator dataset contains a collection of news headlines and their corresponding categories. The dataset contains the following columns:
- ID: the unique identifier for the article.
- title: the headline of the article.
- url: the URL of the article.
- publisher: the publisher of the article.
- category: the category of the article.
- story: an alphanumeric ID of the cluster of articles that cover the same story.
- hostname: the hostname of the URL.
- timestamp: the approximate publication time, stored as Unix time in milliseconds.
Grouping by Categories
Counting Articles by Category
python
# Count articles per category
category_counts = news_data.groupby('category').size()
# Sort categories by count
category_counts = category_counts.sort_values(ascending=False)
# Plot the results
category_counts.plot(kind='bar', xlabel='Category', ylabel='Article Count', title='Article Count by Category')
In the above code snippet, we first group the data by the category column and then calculate the size of each group using the `.size()` method. We then sort the categories in descending order based on the size and plot the results using a bar plot.
Finding the Earliest and Latest Articles by Category
python
# Find the earliest and latest articles by category
earliest_article = news_data.groupby('category')['timestamp'].min()
latest_article = news_data.groupby('category')['timestamp'].max()
# Combine the results into a dataframe
article_dates = pd.concat([earliest_article, latest_article], axis=1, keys=['Earliest', 'Latest'])
In the above code snippet, we group the data by the category column and take the minimum and maximum of the timestamp column to find the earliest and latest article in each category. We then combine the two resulting Series into a single DataFrame with `pd.concat()`, using the `keys` argument to name the columns.
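One caveat: in the publicly available version of this dataset the timestamp column is stored as Unix time in milliseconds, so the values should be converted to datetimes before the grouping above if we want the Earliest and Latest columns to be readable. This is an assumption about the file format; adjust the `unit` if your copy differs.
python
# Convert Unix-millisecond timestamps to datetimes (assumed storage format)
news_data['timestamp'] = pd.to_datetime(news_data['timestamp'], unit='ms')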