Adventures in Machine Learning

Efficient Data Visualization with Pandas: Grouping and Plotting Sales Data in Python

Are you tired of manually plotting your sales data by product and day in Microsoft Excel? Do you want to learn a more efficient way to visualize your data using Python?

Lucky for you, pandas dataframes provide an easy and effective method to group and plot data. In this article, we will cover two methods to group and plot data: Method 1 – Group By & Plot Multiple Lines in One Plot, and Method 2 – Group By & Plot Lines in Individual Subplots.

By the end of this article, you will have a better understanding of how to manipulate your data and create visually appealing graphs using pandas dataframes.

Creating a Pandas DataFrame

Before we delve into the methods of grouping and plotting data, we need a dataset to work with. In this example, we will create a pandas dataframe to analyze the sales performance of two products over a five-day period.

Let’s start by importing the pandas library and creating our dataframe:

```python
import pandas as pd

# Create dictionary with sales data
data = {"day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
        "sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550]}

# Convert dictionary to pandas dataframe
df = pd.DataFrame(data)

# Preview dataframe
print(df)
```

```
   day product  sales
0    1       A    100
1    1       B    200
2    2       A    150
3    2       B    250
4    3       A    300
5    3       B    350
6    4       A    400
7    4       B    450
8    5       A    500
9    5       B    550
```

As you can see, we have created a dataframe with three columns: day, product, and sales. The day column represents the day of the sales, the product column represents the product being sold, and the sales column represents the amount of sales for that product on that day.
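Before plotting, it can help to check what the grouping will produce. The following is a minimal sketch, assuming the same sales data as above; the variable names `products` and `summary` are illustrative and not part of the original example.

```python
import pandas as pd

# Same sales data as in the article
df = pd.DataFrame({
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550],
})

# Confirm the shape and the distinct products before grouping
n_rows, n_cols = df.shape
products = sorted(df["product"].unique())

# Aggregate sales per product: total, mean, and best single day
summary = df.groupby("product")["sales"].agg(["sum", "mean", "max"])
print(summary)
```

Each row of `summary` corresponds to one group, which is the same split that the plotting methods below draw as separate lines.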

Method 1 – Group By & Plot Multiple Lines in One Plot

Now that we have our dataframe, let’s group and plot our data to gain insights into our sales performance. Method 1 involves grouping the data by the product column and plotting multiple lines on one graph to compare sales performance over time.

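```python
import matplotlib.pyplot as plt

# Use day as the index so it becomes the x-axis of each line
sales_by_day = df.set_index("day")

# Group by product and draw one line per product on the same axes
sales_by_day.groupby("product")["sales"].plot(kind="line", legend=True)

# Add axis labels and title
plt.xlabel("Day")
plt.ylabel("Sales")
plt.title("Sales Performance by Product")

# Show plot
plt.show()
```

This code sets the day column as the index, groups the sales data by the product column, and draws a line chart with two lines, one each for products A and B. Because day is the index, it becomes the x-axis; without that step, the lines would be plotted against the default row numbers 0 to 9. The legend parameter is set to True to show the product labels in the graph.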
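The same chart can also be built with an explicit loop over the groups, which makes it clearer what each line contains. This is a sketch of one possible alternative, not the only way to do it; it assumes the sales data from earlier in the article, and the marker style and label text are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same sales data as in the article
df = pd.DataFrame({
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550],
})

fig, ax = plt.subplots()

# Iterating over a groupby yields (group key, sub-dataframe) pairs
for product, group in df.groupby("product"):
    ax.plot(group["day"], group["sales"], marker="o", label=f"Product {product}")

ax.set_xlabel("Day")
ax.set_ylabel("Sales")
ax.set_title("Sales Performance by Product")
ax.legend()
```

Call `plt.show()` afterwards to display the figure. The loop version is longer, but it gives direct control over the x-values, labels, and style of each line.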

The axis labels and title give the Method 1 graph context for the data being presented.

Method 2 – Group By & Plot Lines in Individual Subplots

Method 2 reshapes the data with a pivot table so that each product has its own column, then plots each product’s sales on a separate subplot to compare sales performance by day and product.

```python
# Pivot so each day is a row and each product is its own column
pivot = df.pivot_table(index="day", columns="product", values="sales")

# Reset the index so day becomes a regular column again
reset = pivot.reset_index()

# Create a figure with one subplot per product
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

# Plot each product on its own subplot
for ax, product in zip(axs, ["A", "B"]):
    reset.plot(x="day", y=product, ax=ax, kind="line", legend=False)
    # Add axis labels and a title
    ax.set_xlabel("Day")
    ax.set_ylabel("Sales")
    ax.set_title(f"Sales Performance: Product {product}")

# Show plots
plt.show()
```

In this code, we first create a pivot table with the day column as the index, the product column as the columns, and the sales column as the values. We then reset the index so that day becomes a regular column again.

This creates a dataframe that we can easily plot on multiple subplots. We create a figure with two subplots and loop over the products, plotting the sales data for each product on its own subplot.

Inside the loop, we add axis labels and a title to each subplot to provide context for the data being presented.
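A shorter route to the same layout is to let pandas create the subplots itself. The sketch below assumes the same sales data and uses the `subplots`, `layout`, and `title` parameters of `DataFrame.plot`; the panel titles are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same sales data as in the article
df = pd.DataFrame({
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550],
})

# One row per day, one column per product
pivot = df.pivot_table(index="day", columns="product", values="sales")

# subplots=True draws each column on its own axes;
# layout=(1, 2) arranges them side by side
axes = pivot.plot(
    subplots=True,
    layout=(1, 2),
    figsize=(10, 5),
    sharey=True,
    legend=False,
    title=["Product A", "Product B"],
)

# Label every subplot
for ax in axes.ravel():
    ax.set_xlabel("Day")
    ax.set_ylabel("Sales")
```

Call `plt.show()` to display the figure. Setting `sharey=True` puts both panels on the same vertical scale, which makes the two products easier to compare at a glance.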

Conclusion

In this article, we covered two methods for grouping and plotting data using pandas dataframes. Method 1 involved grouping the data by the product column and plotting multiple lines on one graph to compare sales performance over time.

Method 2 involved pivoting the data so that each product has its own column and plotting each product’s sales on a separate subplot. By reshaping data and plotting it directly from pandas dataframes, we can quickly gain insights into our sales data.

We hope this article has helped you learn a more efficient way to visualize your data and make data-driven decisions.

Creating Common Visualizations in Pandas

Pandas offers a wide range of visualization options to represent your data effectively. In this section, we will discuss some of the most commonly used visualizations in pandas: bar charts, pie charts, scatter plots, and histograms.

Bar Charts

A bar chart compares a numeric value across the categories of a categorical variable. Each category gets one bar, and the height of the bar encodes the value, such as a count or, in this example, the sales for each product.

To create a bar chart in pandas, we can use the plot.bar() method and specify the column to plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Create data
data = {"product": ["A", "B", "C", "D", "E"],
        "sales": [100, 200, 150, 250, 300]}

# Convert dictionary to dataframe
df = pd.DataFrame(data)

# Create bar chart
df.plot.bar(x="product", y="sales", color="blue")

# Add title and axis labels
plt.title("Sales by Product")
plt.xlabel("Product")
plt.ylabel("Sales (in dollars)")

# Show plot
plt.show()
```

This code will create a bar chart with the product on the x-axis and sales on the y-axis. We set the color of the bars to blue for better visibility.

We then added a title and axis labels to provide context for the data being presented.
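Bar charts often follow an aggregation step, because raw data usually has several rows per category. As a sketch, assuming the daily sales data from the first part of the article, the totals per product can be computed with groupby and plotted directly; the names `daily` and `totals` are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Daily sales data: several rows per product
daily = pd.DataFrame({
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 200, 150, 250, 300, 350, 400, 450, 500, 550],
})

# Collapse to one total per product, then draw one bar per product
totals = daily.groupby("product")["sales"].sum()
ax = totals.plot.bar(color="blue")

ax.set_title("Total Sales by Product")
ax.set_xlabel("Product")
ax.set_ylabel("Sales (in dollars)")
```

Call `plt.show()` to display the chart. Because `totals` is a Series indexed by product, the index supplies the bar labels and no `x` or `y` arguments are needed.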

Pie Charts

A pie chart is a circular chart that displays the percentage distribution of a categorical variable. It is useful for comparing the proportion of each category and can be created in pandas using the plot.pie() method.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Create data
data = {"product": ["A", "B", "C", "D", "E"],
        "sales": [100, 200, 150, 250, 300]}

# Convert dictionary to dataframe
df = pd.DataFrame(data)

# Create pie chart
df.plot.pie(y="sales", labels=df["product"], autopct="%1.1f%%")

# Add title
plt.title("Sales by Product")

# Show plot
plt.show()
```

This code will create a pie chart with the product names as labels and the sales percentage as each slice’s value. The autopct parameter is set to “%1.1f%%” to display the percentage with one decimal place.

We added a title to the graph to provide context for the data being presented.
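The percentages shown on the slices are each value divided by the column total. A quick way to check them is to compute the shares directly; this sketch assumes the same five-product data, and the name `share` is illustrative.

```python
import pandas as pd

# Same data as the pie chart example
df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "sales": [100, 200, 150, 250, 300],
})

# Share of total sales per product, in percent, rounded to one decimal
sales = df.set_index("product")["sales"]
share = (sales * 100 / sales.sum()).round(1)
print(share)
```

With total sales of 1000, the shares come out to 10.0, 20.0, 15.0, 25.0, and 30.0 percent for products A through E, which should match the labels drawn by autopct.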

Scatter Plots

Scatter plots are useful when we want to compare two continuous variables. They display the relationship between two variables as data points on a two-dimensional plane.

We can use the plot.scatter() method to create scatter plots in pandas.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Create data
data = {"age": [25, 30, 35, 40, 45],
        "income": [35000, 50000, 75000, 90000, 100000]}

# Convert dictionary to dataframe
df = pd.DataFrame(data)

# Create scatter plot
df.plot.scatter(x="age", y="income", color="purple")

# Add title and axis labels
plt.title("Income by Age")
plt.xlabel("Age (in years)")
plt.ylabel("Income (in dollars)")

# Show plot
plt.show()
```

This code will create a scatter plot with age on the x-axis and income on the y-axis. We set the color of the points to purple for better visibility.

We then added a title and axis labels to provide context for the data being presented.
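A scatter plot shows the relationship visually; a correlation coefficient puts a number on it. The sketch below assumes the same age and income data and uses the `Series.corr` method, which computes the Pearson correlation by default.

```python
import pandas as pd

# Same data as the scatter plot example
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "income": [35000, 50000, 75000, 90000, 100000],
})

# Pearson correlation between the two plotted variables
r = df["age"].corr(df["income"])
print(f"Correlation between age and income: {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship, which agrees with the upward trend of the points in the plot.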

Histograms

Histograms are useful when we want to visualize the distribution of a continuous variable. They display the frequency of data points falling into certain intervals or bins.

We can use the plot.hist() method to create histograms in pandas.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Create data
data = {"age": [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
        "income": [35000, 50000, 75000, 90000, 100000, 120000, 140000, 160000, 170000, 180000]}

# Convert dictionary to dataframe
df = pd.DataFrame(data)

# Create histogram
df.plot.hist(y="income", bins=5, color="green")

# Add title and axis labels
plt.title("Income Distribution")
plt.xlabel("Income (in dollars)")
plt.ylabel("Frequency")

# Show plot
plt.show()
```

This code will create a histogram with income on the x-axis and the frequency of data points on the y-axis. We set the number of bins to five to group the data points into five equal-width intervals.

We then added a title and axis labels to provide context for the data being presented.
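To see exactly which intervals the bars represent, the bin edges and counts can be computed without drawing anything. This is a sketch using `numpy.histogram` on the same income data; it assumes equal-width bins spanning the minimum to the maximum value, which is what passing an integer `bins` produces.

```python
import numpy as np
import pandas as pd

# Same income data as the histogram example
df = pd.DataFrame({
    "income": [35000, 50000, 75000, 90000, 100000,
               120000, 140000, 160000, 170000, 180000],
})

# Five equal-width bins between the minimum and maximum income
counts, edges = np.histogram(df["income"], bins=5)

# Print each interval with the number of values that fall into it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>9.0f} to {right:>9.0f}: {count}")
```

Each bin is 29,000 dollars wide here, and the counts per bin are 2, 2, 2, 1, and 3, which are the bar heights in the histogram.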

Conclusion

In conclusion, pandas provides a variety of visualization tools to represent data effectively. The visualizations discussed in this section are some of the most commonly used, but pandas offers many more options to explore.

In summary, pandas dataframes offer an efficient and effective method to group and plot data. We covered two main methods for grouping and plotting data: Method 1 – Group By & Plot Multiple Lines in One Plot and Method 2 – Group By & Plot Lines in Individual Subplots. We also discussed four commonly used visualizations in pandas: bar charts, pie charts, scatter plots, and histograms.

Together, Python and pandas provide an easy and powerful way to analyze data, gain insights from it, and create visualizations that help us communicate our findings to colleagues and stakeholders.
