Creating Bar Plots in Python
Data visualization is an essential part of data analysis. Graphs and charts help us understand and analyze data effectively.
Bar plots are one of the most popular data visualization techniques. They help you visualize categorical data for comparison purposes.
In Python, there are several ways to create bar plots. In this article, we will look at two types of bar plots that can be created using the pandas crosstab function: the Grouped Bar Plot and the Stacked Bar Plot.
Creating Grouped Bar Plot
A Grouped Bar Plot shows the frequency count of two variables across factors. In simple terms, it shows how the data is distributed across different categories, with one bar per combination of categories.
To create a Grouped Bar Plot, we will use the pandas crosstab function.
This function computes a simple cross-tabulation of two variables. By default, it counts how many rows fall into each combination of categories.
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain Python script
%matplotlib inline
# Data
df = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female'],
'Age': ['20-30', '20-30', '20-30', '30-40', '30-40', '20-30', '30-40', '20-30'],
'Count': [10, 15, 12, 8, 9, 7, 11, 6]})
# Compute Frequency Count
freq_count = pd.crosstab(df['Gender'], df['Age'])
# Create Grouped Bar Plot
freq_count.plot(kind='bar', figsize=(8, 6))
plt.title('Count of Age Group across Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
In the above code, we have defined a dataframe df with columns Gender, Age, and Count. Using the pandas crosstab function, we have computed a frequency count for the Gender and Age columns. Note that crosstab counts rows here, so the Count column is not used.
Finally, we have created a Grouped Bar plot using the plot function with the kind argument set to ‘bar’.
The resulting plot shows the count of age groups, i.e., 20-30 and 30-40, across Gender. You can see that males have the higher count in the 20-30 group (3 versus 2), while females have the higher count in the 30-40 group (2 versus 1).
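To see exactly what the bars represent, it can help to print the cross-tabulated table itself. The sketch below reuses the same data and prints the row counts. The second call, which sums the Count column through the `values` and `aggfunc` parameters, is an illustrative alternative and assumes Count holds a quantity worth adding up.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female'],
    'Age': ['20-30', '20-30', '20-30', '30-40', '30-40', '20-30', '30-40', '20-30'],
    'Count': [10, 15, 12, 8, 9, 7, 11, 6],
})

# Number of rows per (Gender, Age) pair: this is what the grouped bars show
freq_count = pd.crosstab(df['Gender'], df['Age'])
print(freq_count)

# Alternative (assumption: Count is a quantity to be summed per pair)
count_sum = pd.crosstab(df['Gender'], df['Age'], values=df['Count'], aggfunc='sum')
print(count_sum)
```

Plotting `count_sum` instead of `freq_count` would produce bars based on the summed Count values rather than on the number of rows.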
Creating a Stacked Bar Plot
A Stacked Bar Plot is another way to visualize two variables across factors. Each bar shows the total value for one factor, and the segments stacked inside the bar show how much each variable contributes to that total.
In simple terms, the Stacked Bar Plot shows both the total for each factor and the breakdown of that total. Let’s understand this with an example.
Suppose we have a dataset of a company’s sales recorded over several months for two different products. We want to know the total sales made during a particular month along with the sales made by each product during that month.
A Stacked Bar plot can help us visualize this data. To create a Stacked Bar Plot, we will use Pandas Crosstab function again.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Data
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug'],
'Product A': [4000, 5500, 3000, 7000, 6200, 8000, 6000, 5500],
'Product B': [2000, 4000, 1500, 3000, 2900, 4000, 2200, 3000]})
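# Compute Monthly Sales for each Product
# Reshape to long format so crosstab can sum the sales per month and product
long_df = df.melt(id_vars='Month', var_name='Product', value_name='Sales')
monthly_sales = pd.crosstab(long_df['Month'], long_df['Product'], values=long_df['Sales'], aggfunc='sum')
# crosstab sorts the months alphabetically; restore calendar order
monthly_sales = monthly_sales.reindex(df['Month'])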
# Create Stacked Bar Plot
monthly_sales.plot(kind='bar', stacked=True, figsize=(8, 6))
plt.title('Monthly Sales of Products A and B')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
In the above code, we created a dataframe with columns Month, Product A, and Product B. We used the pandas crosstab function to compute the monthly sales for each product.
Finally, we created a Stacked Bar plot using the plot function, with the stacked attribute set to True. The resulting plot shows the total sales made during each month along with the sales made by each product during that month.
You can see that the total sales for the month of June are the highest, with Product A contributing more to the total sales than Product B.
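The reading of the plot can be checked numerically. The following is a minimal sketch that reuses the sales data and computes the bar heights and Product A's share directly; the variable names `sales`, `totals`, and `share_a` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug'],
    'Product A': [4000, 5500, 3000, 7000, 6200, 8000, 6000, 5500],
    'Product B': [2000, 4000, 1500, 3000, 2900, 4000, 2200, 3000],
})

sales = df.set_index('Month')

# Total sales per month: the full height of each stacked bar
totals = sales.sum(axis=1)

# Fraction of each month's total that comes from Product A
share_a = sales['Product A'] / totals

print(totals.idxmax(), totals.max())
print(share_a.round(2))
```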
Grouped Bar Plot Example
Let’s take another example to create a Grouped Bar Plot. Suppose we have a dataset that represents the age and gender distribution of customers who buy a particular product.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Data
df = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male'],
'Age': ['20-30', '20-30', '20-30', '30-40', '30-40', '20-30', '30-40', '20-30', '30-40']})
# Compute Frequency Count
freq_count = pd.crosstab(df['Gender'], df['Age'])
# Create Grouped Bar Plot
freq_count.plot(kind='bar', figsize=(8, 6))
plt.title('Age Distribution across Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
In the above code, we have created a dataframe with columns Gender and Age. Using pandas crosstab function, we have computed the frequency count for Gender and Age.
Finally, we have created a Grouped Bar plot to show the Age Distribution across Gender. The resulting plot shows the count of customers across Gender for age groups 20-30 and 30-40.
You can see that the 20-30 age group contains more males than females (3 versus 2), while the 30-40 age group contains two customers of each gender.
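Because the two genders have different numbers of customers, raw counts and within-group shares can tell slightly different stories. As a hedged sketch, the `normalize='index'` option of crosstab converts each row to proportions, so each gender's row sums to 1; the variable name `share_within_gender` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Age': ['20-30', '20-30', '20-30', '30-40', '30-40', '20-30', '30-40', '20-30', '30-40'],
})

# Proportion of each age group within each gender (rows sum to 1)
share_within_gender = pd.crosstab(df['Gender'], df['Age'], normalize='index')
print(share_within_gender)
```

Calling `share_within_gender.plot(kind='bar')` would draw the same grouped layout, with the y-axis showing proportions instead of counts.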
Stacked Bar Plot Example
Let’s take another example to create a Stacked Bar Plot. Suppose we have a dataset that records, for each customer, whether they bought Product A and Product B (1 = bought, 0 = did not buy).
We want to visualize the percentage of customers who bought both products compared to the percentage of customers who bought only one of the products.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Data
df = pd.DataFrame({'Product A': [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
'Product B': [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]})
# Compute Percentage
total_customers = len(df)
combinations_df = df.groupby(['Product A', 'Product B']).size().reset_index()
combinations_df.columns = ['Product A', 'Product B', 'Count']
combinations_df['Percentage'] = (combinations_df['Count'] / total_customers) * 100
# Compute Composition
composition = pd.crosstab(df['Product A'], df['Product B'], normalize='all') * 100
composition.loc['Total'] = composition.sum()
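# Create Bar Plot for Percentage (one bar per purchase combination)
# groupby sorts the keys, so the rows are ordered (0, 0), (0, 1), (1, 0), (1, 1)
combinations_df['Combination'] = ['Neither', 'Only B', 'Only A', 'Both']
combinations_df.plot(kind='bar', x='Combination', y='Percentage', color='g', figsize=(8, 6))
plt.title('Percentage of Customers who bought both Products A and B compared to only one of the products')
plt.xlabel('Purchase Combination')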
plt.ylabel('Percentage')
plt.legend(loc=1)
plt.show()
# Create Stacked Bar Plot for Composition
composition.plot(kind='bar', stacked=True, figsize=(8, 6))
plt.title('Composition of Customers who bought Products A and B')
plt.xlabel('Product A')
plt.ylabel('Percentage')
plt.xticks(rotation=0)
plt.show()
In the above code, we created a dataframe with columns Product A and Product B. Using groupby and the pandas crosstab function, we computed the percentage of customers in each purchase combination: both products, only one of them, or neither.
Finally, we created a bar plot for the percentages and a stacked bar plot for the composition. The percentage plot shows that 20% of customers bought both products, 20% bought only Product A, and 20% bought only Product B. The remaining 40% bought neither.
The composition plot shows the same data as stacked bars: one bar for each value of Product A (0 = did not buy, 1 = bought), split by whether the customer bought Product B, plus a Total bar that sums to 100%.
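The percentages can be double-checked without any plotting. The sketch below labels each customer with a purchase combination and counts the labels; the label names in `labels` are hypothetical and chosen here only for readability.

```python
import pandas as pd

df = pd.DataFrame({
    'Product A': [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    'Product B': [1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Hypothetical, human-readable names for the four (A, B) combinations
labels = {(0, 0): 'Neither', (0, 1): 'Only B', (1, 0): 'Only A', (1, 1): 'Both'}

pairs = zip(df['Product A'].tolist(), df['Product B'].tolist())
combo = pd.Series([labels[pair] for pair in pairs], name='Combination')

# Share of customers in each combination, as a percentage
percentages = combo.value_counts(normalize=True).mul(100).round(1)
print(percentages)
```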
Conclusion
In this article, we learned how to create two types of Bar plots using Pandas Crosstab function. Grouped Bar Plots and Stacked Bar Plots are used to visualize categorical data and can be useful for comparing two variables across factors.
We also looked at some examples to help understand these types of Bar plots better. Visualizing data through Bar plots is an effective way to communicate insights and analysis to stakeholders.
With the right tools, such as Pandas Crosstab function, we can create and customize powerful graphs to help us better understand our data.
Reading and Writing Data with Pandas
The first and foremost task in any data analysis project is reading and writing data from and to different formats. Pandas provides tools for reading and writing data from multiple sources, including CSV, Excel, SQL, and more.
Reading Data
Reading data is a crucial step in any data analysis project. Pandas provides several functions to read data in different formats.
The most common function used for reading data is `pd.read_csv()`. This function reads data from a CSV file and returns a Pandas DataFrame.
import pandas as pd
# Read CSV File
df = pd.read_csv('data.csv')
# Print First Five Rows
print(df.head())
In the above example, we have imported the Pandas library and used the `pd.read_csv()` function to read the data from a CSV file. We have then printed the first five rows of the DataFrame using the `head()` method.
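The contents of `data.csv` are not shown in this article, so the following sketch reads a small in-memory CSV instead. The columns (Name, Age, Salary, Zip Code) and all values are assumptions made up for illustration. It also shows the `dtype` parameter, which can keep identifier-like columns such as zip codes as text.

```python
import io
import pandas as pd

# Hypothetical file contents; a real data.csv may look different
csv_text = """Name,Age,Salary,Zip Code
Alice,34,72000,10001
Bob,28,58000,94105
Carol,41,91000,60601
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text), dtype={'Zip Code': str})

print(df.dtypes)
print(df.head())
```

Reading zip codes as strings preserves any leading zeros, which would be lost if the column were parsed as integers.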
Writing Data
After analyzing and manipulating data, we may sometimes need to save the results in a different format for future usage. Pandas provides several functions for this task.
One of the commonly used functions is `to_csv()`, which writes data from a Pandas DataFrame to a CSV file.
import pandas as pd
# Read CSV File
df = pd.read_csv('data.csv')
# Modify Data (placeholder: apply your own transformations to df here)
# Write Data back to CSV File
df.to_csv('new_data.csv', index=False)
In the above example, we have used the `pd.read_csv()` function to read the data from a CSV file. We have then modified the data as required and used the `to_csv()` function to write the modified data back to a CSV file.
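As a self-contained sketch of the full cycle, the example below builds a small DataFrame, modifies it, writes it to a temporary file, and reads it back. The data and the added column are assumptions for illustration only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data standing in for the contents of data.csv
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [34, 28]})

# Example modification: add a derived column
df['Age Next Year'] = df['Age'] + 1

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'new_data.csv')
    # index=False leaves the row index out of the file
    df.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

print(round_trip)
```

Without `index=False`, the row index would be written as an extra unnamed column and would reappear as a data column when the file is read back.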
Data Manipulation
Data Manipulation is one of the core tasks in data analysis. Pandas provides several functions and methods to manipulate data in a DataFrame.
Let’s take a look at some of the most commonly used data manipulation tasks.
Selecting Data
Selecting Data is a crucial task in data analysis. Pandas provides several methods to select data from a DataFrame.
The most commonly used tools are the `loc[]` and `iloc[]` indexers. `loc[]` selects by row and column labels, while `iloc[]` selects by integer position.
import pandas as pd
# Read CSV File
df = pd.read_csv('data.csv')
# Select Data using loc[]
df1 = df.loc[:, ['Name', 'Age']]
# Select Data using iloc[]
df2 = df.iloc[:, [0, 1]]
# Print First Five Rows of Both DataFrames
print(df1.head())
print(df2.head())
In the above example, we have used the `pd.read_csv()` function to read the data from a CSV file. We have then used the `loc[]` indexer to select two columns by name and created a new DataFrame.
Similarly, we have used the `iloc[]` indexer to select the first two columns by position and created a new DataFrame. Finally, we have printed the first five rows of both DataFrames to check if the selection is correct.
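The difference between label-based and position-based selection is easiest to see when the row index is not the default 0, 1, 2. The sketch below uses a made-up DataFrame with index labels 10, 20, and 30.

```python
import pandas as pd

# Hypothetical data with a non-default row index
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Carol'], 'Age': [34, 28, 41]},
    index=[10, 20, 30],
)

by_label = df.loc[10, 'Name']     # row labelled 10, column 'Name'
by_position = df.iloc[0, 0]       # first row, first column

label_slice = df.loc[10:20]       # label slices include both endpoints
position_slice = df.iloc[0:1]     # position slices exclude the stop

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```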
Filtering Data
Filtering data is an essential task in data analysis. Pandas provides several methods to filter data in a DataFrame.
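A convenient option is the `query()` method, which takes the filter condition as a string. Boolean indexing, such as `df[df['Age'] >= 30]`, gives the same result.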
import pandas as pd
# Read CSV File
df = pd.read_csv('data.csv')
# Filter Data using Query
df_filtered = df.query('Age >= 30')
# Print First Five Rows of Filtered Data
print(df_filtered.head())
In the above example, we have used the `pd.read_csv()` function to read the data from a CSV file. We have then used the `query()` method to filter the data based on a condition.
Finally, we have printed the first five rows of the filtered DataFrame to check if the filter is correct.
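The following sketch compares `query()` with the equivalent boolean mask on a small made-up DataFrame. It also uses the `@` prefix, which lets a query string refer to a Python variable; the name `min_age` is chosen here for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the contents of data.csv
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [34, 28, 41]})

min_age = 30

# query() evaluates the string; @min_age refers to the Python variable
via_query = df.query('Age >= @min_age')

# The same filter written as a boolean mask
via_mask = df[df['Age'] >= min_age]

print(via_query)
print(via_query.equals(via_mask))
```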
Transforming Data
Transforming data is a crucial task in data analysis. Pandas provides several functions and methods to transform data in a DataFrame.
Let’s take a look at some of the most commonly used data transformation tasks.
Renaming Columns
Renaming columns is an essential task when we have to standardize the column names or change them to more meaningful names. Pandas provides an easy way to rename columns using the `rename()` method.
import pandas as pd
# Read CSV File
df = pd.read_csv('data.csv')
# Rename Columns
df_renamed = df.rename(columns={'Name': 'Full Name', 'Zip Code': 'Postal Code'})
# Print First Five Rows of Renamed DataFrame
print(df_renamed.head())
In the above example, we have used the `pd.read_csv()` function to read the data from a CSV file. We have then used the `rename()` method to rename two columns of the DataFrame.
Finally, we have printed the first five rows of the renamed DataFrame to check if the renaming is correct.
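One detail worth knowing is that `rename()` silently skips keys that do not match any column, so a typo can go unnoticed. The sketch below, on made-up data, shows this default behaviour and the `errors='raise'` option that turns a missing key into a `KeyError`.

```python
import pandas as pd

# Hypothetical data standing in for the contents of data.csv
df = pd.DataFrame({'Name': ['Alice'], 'Zip Code': ['10001']})

# 'Zipcode' is a deliberate typo: by default it is ignored without warning
renamed = df.rename(columns={'Name': 'Full Name', 'Zipcode': 'Postal Code'})
print(list(renamed.columns))

# errors='raise' reports keys that match no column
try:
    df.rename(columns={'Zipcode': 'Postal Code'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```

Note that `rename()` returns a new DataFrame and leaves `df` unchanged.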
Grouping Data
Grouping data is an essential task in data analysis. Pandas provides several functions and methods to group data in a DataFrame.
The most commonly used methods are `groupby()` and `agg()`.
import pandas as pd
# Read CSV File
df = pd.read_csv('data.csv')
# Group Data by Age
df_grouped = df.groupby('Age').agg({'Salary': ['mean', 'sum'], 'Zip Code': 'nunique'})
# Print Grouped Data
print(df_grouped)
In the above example, we have used the `pd.read_csv()` function to read the data from a CSV file. We have then used the `groupby()` method to group the data based on age and used the `agg()` method to aggregate the salary and zip code data for each age group.
Finally, we have printed the grouped data to check if the grouping and aggregation are correct.
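Passing a dictionary of lists to `agg()` produces two-level column names such as `('Salary', 'mean')`. As an alternative sketch, named aggregation gives each result a flat column name. The data and the output names (`mean_salary`, `total_salary`, `unique_zip_codes`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the contents of data.csv
df = pd.DataFrame({
    'Age': [30, 30, 40, 40],
    'Salary': [50000, 70000, 80000, 100000],
    'Zip Code': ['10001', '10001', '94105', '60601'],
})

# Named aggregation: output_name=(input_column, aggregation)
summary = df.groupby('Age').agg(
    mean_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
    unique_zip_codes=('Zip Code', 'nunique'),
)
print(summary)
```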
Conclusion
In this article, we have covered some of the most common pandas tasks: reading and writing data, selecting and filtering data, and transforming data by renaming columns and grouping.
The examples used several pandas functions and methods: `pd.read_csv()`, `to_csv()`, `loc[]`, `iloc[]`, `query()`, `rename()`, `groupby()`, and `agg()`.
With the knowledge of these common tasks, you can start analyzing and manipulating data quickly and efficiently using pandas.