Grouping and Plotting Data with Pandas
Are you tired of manually plotting your sales data by product and day in Microsoft Excel? Do you want to learn a more efficient way to visualize your data using Python?
Lucky for you, pandas dataframes provide an easy and effective method to group and plot data. In this article, we will cover two methods to group and plot data: Method 1 – Group By & Plot Multiple Lines in One Plot, and Method 2 – Group By & Plot Lines in Individual Subplots.
By the end of this article, you will have a better understanding of how to manipulate your data and create visually appealing graphs using pandas dataframes.
Creating a Pandas DataFrame
Before we delve into the methods of grouping and plotting data, we need a dataset to work with. In this example, we will create a pandas dataframe to analyze the sales performance of two products over a five-day period.
Let’s start by importing the pandas library and creating our dataframe:
import pandas as pd
# Create dictionary with sales data
data = {"day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
"product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
"sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550]}
# Convert dictionary to pandas dataframe
df = pd.DataFrame(data)
# Preview dataframe
print(df)
   day product  sales
0    1       A    100
1    1       B    200
2    2       A    150
3    2       B    250
4    3       A    300
5    3       B    350
6    4       A    400
7    4       B    450
8    5       A    500
9    5       B    550
As you can see, we have created a dataframe with three columns: day, product, and sales. The day column represents the day of the sales, the product column represents the product being sold, and the sales column represents the amount of sales for that product on that day.
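Before plotting, it can help to see what grouping does to this table. The following is a minimal sketch that rebuilds the same dataframe and totals the sales for each product; the variable name totals is introduced here purely for illustration.

```python
import pandas as pd

# Same sales data as in the dataframe above
df = pd.DataFrame({
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550],
})

# Split the rows by product, then sum the sales within each group
totals = df.groupby("product")["sales"].sum()
print(totals)
```

For this data, product A totals 1450 and product B totals 1800. The plotting methods in this article rely on the same split-by-key step.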
Method 1 – Group By & Plot Multiple Lines in One Plot
Now that we have our dataframe, let’s group and plot our data to gain insights into our sales performance. Method 1 involves grouping the data by the product column and plotting multiple lines on one graph to compare sales performance over time.
import matplotlib.pyplot as plt
# Index by day so the x-axis shows days, then draw one line per product
df.set_index("day").groupby("product")["sales"].plot(kind="line", legend=True)
# Add axis labels and title
plt.xlabel("Day")
plt.ylabel("Sales")
plt.title("Sales Performance by Product")
# Show plot
plt.show()
This code will group the sales data by the product column and create a line chart with two lines representing products A and B. The legend parameter is set to True to show the product labels in the graph.
We also added axis labels and a title to the graph to provide context for the data being presented.
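To make the grouping step more visible, the same chart can be sketched with an explicit loop over the groups. This is an illustrative alternative rather than a replacement for the one-liner; the marker style and the "Product ..." legend labels are choices made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display;
                       # delete this line and call plt.show() to view the chart
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550],
})

fig, ax = plt.subplots()

# Iterating over a groupby yields (key, sub-dataframe) pairs, one per product
for product, group in df.groupby("product"):
    ax.plot(group["day"], group["sales"], marker="o", label=f"Product {product}")

ax.set_xlabel("Day")
ax.set_ylabel("Sales")
ax.set_title("Sales Performance by Product")
ax.legend()
```

Passing the day column explicitly as the x values keeps the horizontal axis tied to the day rather than to the dataframe's row index.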
Method 2 – Group By & Plot Lines in Individual Subplots
Method 2 involves reshaping the data so that each day is a row and each product is a column, then plotting each product’s sales on a separate subplot to compare sales performance by day and product.
# Pivot dataframe so each day is a row and each product is a column
pivot = df.pivot_table(index="day", columns="product", values="sales")
# Reset index to flatten pivot table
reset = pivot.reset_index()
# Create subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
# Plot data on each subplot
reset.plot(x="day", y=["A", "B"], subplots=True, ax=axs, kind="line", legend=False)
# Add axis labels and title
for ax, product in zip(axs, ["A", "B"]):
    ax.set_xlabel("Day")
    ax.set_ylabel("Sales")
    ax.set_title(f"Sales Performance: Product {product}")
# Show plots
plt.show()
In this code, we first create a pivot table with the day column as the index, the product column as the columns, and the sales column as the values. We then reset the index to flatten the pivot table.
This creates a dataframe that we can easily plot on multiple subplots. We create a figure with two subplots and plot the sales data for each product on its own subplot.
We added axis labels and titles to provide context for the data being presented.
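As a shorter variant, pandas can create the subplots itself when the pivoted table is plotted with subplots=True. The sketch below assumes the same sales data; the layout, figure size, and subplot titles are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display;
                       # delete this line and call plt.show() to view the chart
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550],
})

# Wide format: one row per day, one column per product
pivot = df.pivot_table(index="day", columns="product", values="sales")
print(pivot)

# subplots=True draws each column on its own axes; layout sets the grid shape
axes = pivot.plot(
    subplots=True,
    layout=(1, 2),
    figsize=(10, 5),
    title=["Product A", "Product B"],
)

for ax in axes.ravel():
    ax.set_xlabel("Day")
    ax.set_ylabel("Sales")
```

Because day stays in the index here, there is no need to reset it: pandas uses the index as the x-axis by default.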
Conclusion
In this article, we covered two methods for grouping and plotting data using pandas dataframes. Method 1 involved grouping the data by the product column and plotting multiple lines on one graph to compare sales performance over time.
Method 2 involved pivoting the data so that each day is a row and each product is a column, then plotting each product’s sales on a separate subplot to compare sales performance by day and product. By manipulating data and visualizing it with pandas dataframes, we can quickly gain insights into our sales data.
We hope this article has helped you learn a more efficient way to visualize your data and make data-driven decisions.
Creating Common Visualizations in Pandas
Pandas offers a wide range of visualization options to represent your data effectively. In this section, we will discuss some of the most commonly used visualizations in pandas: bar charts, pie charts, scatter plots, and histograms.
Bar Charts
A bar chart is a useful tool for comparing values across the categories of a categorical variable. It displays one bar per category, with the height representing that category's value, such as a frequency count or a sales total.
To create a bar chart in pandas, we can use the plot.bar() method and specify the column to plot.
import pandas as pd
import matplotlib.pyplot as plt
# Create data
data = {"product": ["A", "B", "C", "D", "E"],
"sales": [100, 200, 150, 250, 300]}
# Convert dictionary to dataframe
df = pd.DataFrame(data)
# Create bar chart
df.plot.bar(x="product", y="sales", color="blue")
# Add title and axis labels
plt.title("Sales by Product")
plt.xlabel("Product")
plt.ylabel("Sales (in dollars)")
# Show plot
plt.show()
This code will create a bar chart with the product on the x-axis and sales on the y-axis. We set the color of the bars to blue for better visibility.
We then added a title and axis labels to provide context for the data being presented.
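When the quantity of interest is how often each category occurs, the counts can be computed first and then plotted. The sketch below uses a small, made-up list of orders (hypothetical data, not from the example above) and value_counts() to get the frequency of each product.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display;
                       # delete this line and call plt.show() to view the chart
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one entry per order, naming the product sold
orders = pd.Series(["A", "B", "A", "C", "B", "A"], name="product")

# Count how many times each product appears, sorted by product name
counts = orders.value_counts().sort_index()
print(counts)

# One bar per product; bar height is the number of orders
ax = counts.plot.bar(color="blue", rot=0)
ax.set_title("Orders by Product")
ax.set_xlabel("Product")
ax.set_ylabel("Number of orders")
```

Here product A appears three times, B twice, and C once, so the bars have heights 3, 2, and 1.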
Pie Charts
A pie chart is a circular chart that displays the percentage distribution of a categorical variable. It is useful for comparing the proportion of each category and can be created in pandas using the plot.pie() method.
import pandas as pd
import matplotlib.pyplot as plt
# Create data
data = {"product": ["A", "B", "C", "D", "E"],
"sales": [100, 200, 150, 250, 300]}
# Convert dictionary to dataframe
df = pd.DataFrame(data)
# Create pie chart
df.plot.pie(y="sales", labels=df["product"], autopct="%1.1f%%")
# Add title
plt.title("Sales by Product")
# Show plot
plt.show()
This code will create a pie chart with the product names as labels and each slice sized in proportion to that product's share of total sales. The autopct parameter is set to "%1.1f%%" to display each percentage with one decimal place.
We added a title to the graph to provide context for the data being presented.
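The percentages shown on the slices can be checked by hand: each one is the product's sales divided by the total. A minimal sketch using the same data follows; the name share is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "sales": [100, 200, 150, 250, 300],
})

# Each product's sales as a percentage of the total, rounded to one decimal
share = (df.set_index("product")["sales"] / df["sales"].sum() * 100).round(1)
print(share)
```

Total sales are 1000, so the shares come out to 10.0, 20.0, 15.0, 25.0, and 30.0 percent for products A through E.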
Scatter Plots
Scatter plots are useful when we want to compare two continuous variables. They display the relationship between two variables as data points on a two-dimensional plane.
We can use the plot.scatter() method to create scatter plots in pandas.
import pandas as pd
import matplotlib.pyplot as plt
# Create data
data = {"age": [25, 30, 35, 40, 45],
"income": [35000, 50000, 75000, 90000, 100000]}
# Convert dictionary to dataframe
df = pd.DataFrame(data)
# Create scatter plot
df.plot.scatter(x="age", y="income", color="purple")
# Add title and axis labels
plt.title("Income by Age")
plt.xlabel("Age (in years)")
plt.ylabel("Income (in dollars)")
# Show plot
plt.show()
This code will create a scatter plot with age on the x-axis and income on the y-axis. We set the color of the points to purple for better visibility.
We then added a title and axis labels to provide context for the data being presented.
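A scatter plot shows the relationship visually; if a single number is also wanted, one option is the Pearson correlation coefficient, which pandas computes with Series.corr(). This is offered as an optional companion to the plot, using the same data.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "income": [35000, 50000, 75000, 90000, 100000],
})

# Pearson correlation between the two columns (ranges from -1 to 1)
r = df["age"].corr(df["income"])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 matches what the scatter plot suggests: in this small sample, income rises steadily with age.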
Histograms
Histograms are useful when we want to visualize the distribution of a continuous variable. They display the frequency of data points falling into certain intervals or bins.
We can use the plot.hist() method to create histograms in pandas.
import pandas as pd
import matplotlib.pyplot as plt
# Create data
data = {"age": [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
"income": [35000, 50000, 75000, 90000, 100000, 120000, 140000, 160000, 170000, 180000]}
# Convert dictionary to dataframe
df = pd.DataFrame(data)
# Create histogram
df.plot.hist(y="income", bins=5, color="green")
# Add title and axis labels
plt.title("Income Distribution")
plt.xlabel("Income (in dollars)")
plt.ylabel("Frequency")
# Show plot
plt.show()
This code will create a histogram with income on the x-axis and the frequency of data points on the y-axis. We set the number of bins to five to group the data points into five equal-width intervals.
We then added a title and axis labels to provide context for the data being presented.
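To see what bins=5 does to the data, the bin edges and counts can be computed directly with NumPy's histogram function, which splits the range from the minimum to the maximum into equal-width intervals. The sketch below uses the same income values.

```python
import numpy as np

income = [35000, 50000, 75000, 90000, 100000,
          120000, 140000, 160000, 170000, 180000]

# Split the range [min, max] into 5 equal-width bins and count values in each
counts, edges = np.histogram(income, bins=5)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>9.0f} to {right:>9.0f}: {n}")
```

Each bin is 29,000 dollars wide, and the counts are 2, 2, 2, 1, and 3. These are the bar heights the histogram displays.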
Conclusion
In conclusion, pandas provides a variety of visualization tools, and the four discussed in this section (bar charts, pie charts, scatter plots, and histograms) are only some of the most commonly used.
Together with the two grouping approaches covered earlier, Method 1 – Group By & Plot Multiple Lines in One Plot and Method 2 – Group By & Plot Lines in Individual Subplots, they offer an efficient way to group and plot data.
With these tools, we can explore our data quickly, make data-driven decisions, and communicate our findings to colleagues and stakeholders.