Data Visualization with Python: A Guide to Pandas and Matplotlib
Python is one of the most popular programming languages used for data analysis. One of its most valuable tools is a data manipulation package known as pandas.
Pandas allows the user to easily perform data analysis operations such as data extraction, transformation, and preparation. Additionally, it includes data visualization capabilities, which make it possible to view the extracted insights quickly.
Getting Started with Pandas and Matplotlib
To get started with Python, you first need to set up your environment by installing the necessary libraries. There are many online resources that will guide you through this process.
Pandas and Matplotlib are the essential libraries for the plots in this guide, and Jupyter Notebook is a convenient environment for running them interactively. Once you have everything set up, you’re ready to create your first pandas plot.
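A quick way to confirm the setup is to import each library and print its version. The snippet below is a minimal sketch; it assumes pandas, NumPy, and Matplotlib are the packages you installed, and the versions dictionary is only an illustrative helper.

```python
# Import each library and print its version to confirm the installation works.
import matplotlib
import numpy as np
import pandas as pd

versions = {
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "matplotlib": matplotlib.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails with an ImportError, that package still needs to be installed.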
Types of Pandas Plots
- Histograms: A chart that represents data distribution, providing an overview of data.
- Scatter Plots: An excellent way to explore the relationship between two variables. Each point pairs a value of variable 1 on the x-axis with a value of variable 2 on the y-axis, so patterns such as correlation become visible. Correlation alone does not show that one variable affects the other.
- Bar Charts: Graphically represent categorical data with rectangular bars whose height or length is proportional to the values they represent.
- Pie Charts: Represent data as a pie or circle divided into slices, where each category is represented by a slice that proportionally follows the value of the category.
DataFrame.hist() is an easy-to-use method for creating histograms; it draws one histogram for each numeric column. The DataFrame.plot.scatter() method is a straightforward way to generate a scatter plot from two columns.
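As a rough illustration of these two methods, the sketch below builds a small DataFrame and calls both. The data is made up for the example: the people DataFrame and its height_cm and weight_kg columns are assumptions, not part of any real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 200 people with made-up heights and weights.
rng = np.random.default_rng(seed=0)
height_cm = rng.normal(loc=170, scale=10, size=200)
weight_kg = 0.9 * height_cm - 85 + rng.normal(scale=8, size=200)
people = pd.DataFrame({"height_cm": height_cm, "weight_kg": weight_kg})

# DataFrame.hist() draws one histogram per numeric column
# and returns an array of Matplotlib Axes.
hist_axes = people.hist(bins=20)

# DataFrame.plot.scatter() needs the names of the x and y columns.
scatter_ax = people.plot.scatter(x="height_cm", y="weight_kg")

plt.show()
```

Both calls return Matplotlib Axes objects, so titles and labels can be adjusted afterwards with the usual Matplotlib methods.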
Using Matplotlib for Data Visualization
Python’s data visualization capabilities can be augmented by using the Matplotlib library. Matplotlib is a comprehensive library that provides a wide range of visualization tools.
The .plot() Method in Matplotlib
Let’s take a closer look at how to plot data with Matplotlib’s plot() function. The pyplot module provides plot(), which draws line plots and, when only markers are shown, simple scatter-style plots; sibling functions such as scatter() and bar() cover other chart types. Matplotlib is an extremely flexible and versatile library that provides granular control over the layout and stylistic aspects of a plot. To use plot(), we first need to install the Matplotlib library.
This can be easily done using pip or another package installation tool. Once Matplotlib is installed, we can start creating our first plot.
Creating a Line Plot
To create a line plot using the .plot() method, we first need to specify the x-axis and y-axis values. We can do this by passing one list of values for each axis as arguments to the function.
import matplotlib.pyplot as plt
x_values = [1, 2, 3, 4, 5]
y_values = [10, 8, 6, 4, 2]
plt.plot(x_values, y_values)
plt.show()
This will create a line plot with x-axis values ranging from 1 to 5 and y-axis values ranging from 10 to 2. We can customize the plot by changing the line style, color, and marker size, among other things.
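The customizations mentioned above can be sketched with keyword arguments to plt.plot(). The specific color, line style, and marker choices below are arbitrary picks for illustration.

```python
import matplotlib.pyplot as plt

x_values = [1, 2, 3, 4, 5]
y_values = [10, 8, 6, 4, 2]

# Keyword arguments control the appearance of the line and its markers.
lines = plt.plot(
    x_values,
    y_values,
    color="tab:red",   # line and marker color
    linestyle="--",    # dashed line
    linewidth=2,
    marker="o",        # circle at each data point
    markersize=8,
)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Customized line plot")
plt.show()
```

plt.plot() returns a list of Line2D objects, which is why the result is stored in lines; each object can be restyled later through its setter methods.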
Comparing Matplotlib .plot() with Pandas DataFrame .plot()
Now that we’ve seen how to create a line plot using Matplotlib, let’s compare this approach to the pandas DataFrame object’s .plot() method. One of the benefits of using the .plot() method provided by the DataFrame object in pandas is that it allows us to quickly create a variety of plots.
We don’t have to worry about the details of plotting the data; the DataFrame object takes care of it for us. For instance, let’s try plotting some data with both the Matplotlib .plot() method and the pandas DataFrame object’s .plot() method.
We’ll start by creating a DataFrame with some random data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create a random DataFrame with 100 rows and 3 columns
data = pd.DataFrame(np.random.randn(100, 3), columns=['A', 'B', 'C'])
# plot the data with Matplotlib
plt.plot(data)
plt.show()
# plot the data with pandas' .plot() method
data.plot(kind='line')
plt.show()
In the first plot, we used the Matplotlib .plot() method to plot all three columns of the DataFrame. In the second plot, we used the DataFrame object’s .plot() method and specified the kind of chart we want to create as a line plot.
As you can see, both calls produce a line plot of the three columns, but the pandas .plot() method also labels each line with its column name and adds a legend automatically, whereas with the Matplotlib .plot() method we would have to add the labels and the legend ourselves. In return, Matplotlib gives us more granular control over the plot and is better suited for creating complex charts.
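The two approaches can also be combined. The sketch below draws the same random data twice, side by side: once with Matplotlib alone, adding labels by hand, and once with pandas drawing onto an existing Matplotlib Axes through the ax parameter. The names ax_mpl and ax_pd and the fixed seed are choices made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
data = pd.DataFrame(rng.standard_normal((100, 3)), columns=["A", "B", "C"])

# One figure with two Axes that share the y-axis.
fig, (ax_mpl, ax_pd) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Matplotlib: plot each column and add the labels and legend by hand.
for column in data.columns:
    ax_mpl.plot(data.index, data[column], label=column)
ax_mpl.legend()
ax_mpl.set_title("Matplotlib")
ax_mpl.set_xlabel("row index")

# pandas: draw onto the second Axes; the legend is added automatically.
data.plot(kind="line", ax=ax_pd, title="pandas")

fig.tight_layout()
plt.show()
```

Passing ax= keeps the convenience of the pandas call while leaving the figure layout under Matplotlib's control.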
Surveying Data with Visualizations
Now let’s discuss how we can use visualizations to survey our data. One of the most common ways to survey data is to examine the data distribution and outliers with histograms and bar plots.
Histograms
A histogram is a chart that represents the distribution of data by dividing the data into intervals known as bins. We can create a histogram of a particular column in a DataFrame using the .hist() method.
For example, assuming a DataFrame named data that has an ‘age’ column, the following code plots a histogram of that column:
data['age'].hist()
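A self-contained version of this call might look like the sketch below. The ages are randomly generated for illustration, and the people DataFrame, the choice of 10 bins, and the axis labels are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 500 made-up ages between 18 and 79.
rng = np.random.default_rng(seed=1)
people = pd.DataFrame({"age": rng.integers(18, 80, size=500)})

# bins=10 splits the observed range into 10 equal-width intervals.
ax = people["age"].hist(bins=10, edgecolor="black")
ax.set_xlabel("age")
ax.set_ylabel("number of people")
plt.show()
```

Changing bins trades detail for smoothness: more bins show finer structure but also more noise.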
Bar Plots
We can also use a bar plot to visualize the distribution of a categorical variable. A bar plot represents the data by using rectangular bars where the height of the bars corresponds to the value of the variable.
For instance, let’s say we have a DataFrame with two columns ‘gender’ and ‘count.’ In this case, we can use a bar plot to visualize the number of males and females in our dataset:
data.groupby('gender')['count'].sum().plot(kind='bar')
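To make the one-liner above runnable, the sketch below supplies a small DataFrame with the two columns it expects. The survey values are invented for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical survey results: several count rows per gender.
survey = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "count": [12, 9, 8, 14],
})

# Sum the counts within each gender, then draw one bar per group.
totals = survey.groupby("gender")["count"].sum()
ax = totals.plot(kind="bar", rot=0)
ax.set_ylabel("count")
plt.show()
```

The groupby step matters when a category appears in more than one row; without it, each row would get its own bar.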
Scatter Plots and the .corr() Method
Another way to survey data is to check for correlation using scatter plots and the .corr() method. A scatter plot represents the relationship between two variables by displaying them as points on a two-dimensional plot.
We can use scatter plots to spot pairs of variables that are strongly correlated. The .corr() method calculates the pairwise correlation between the columns of a DataFrame (the Pearson coefficient by default) and returns a matrix that shows the correlation coefficient for every pair of columns.
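The sketch below pairs a correlation matrix with a scatter plot on synthetic data. The column names x, related, and unrelated are made up: related is constructed from x plus noise, while unrelated is independent of it.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: "related" depends on x, "unrelated" does not.
rng = np.random.default_rng(seed=2)
x = rng.normal(size=300)
df = pd.DataFrame({
    "x": x,
    "related": 2 * x + rng.normal(scale=0.5, size=300),
    "unrelated": rng.normal(size=300),
})

# Pairwise Pearson correlation between all columns.
corr_matrix = df.corr()
print(corr_matrix.round(2))

# The strong correlation shows up as a tight, rising band of points.
ax = df.plot.scatter(x="x", y="related")
plt.show()
```

The diagonal of the matrix is always 1, since every column is perfectly correlated with itself, and the matrix is symmetric.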
Analyzing Categorical Data with Visualizations
When working with data, we often need to analyze categorical data to draw insights. Categorical data consists of information that we can divide into discrete groups such as gender, age groups, or educational qualifications.
Pie Plots
Pie plots are an excellent visualization tool for examining ratios in categorical data. A pie plot is a circular chart divided into slices that represent the relative proportion of the categories. The sizes of the slices correspond to the relative frequency of each category.
We can use pie plots to identify which categories make up the majority and minority of the data. To create a pie plot, we first need to import the Matplotlib library and specify the data.
We can use the .plot.pie() method to create a pie chart easily. Here’s an example of how to create a pie plot for categorical data in a pandas DataFrame:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({
'class': ['A', 'B', 'C', 'D'],
'count': [20, 45, 30, 10]
})
df.set_index('class')['count'].plot.pie()
plt.show()
In this example, we create a DataFrame with categories ‘class’ and ‘count.’ We use the .set_index() method to set the ‘class’ column as the index and specify the ‘count’ column to be plotted in the pie chart. We then call the .plot.pie() method to create the pie plot.
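Since a pie plot is about proportions, it can help to print the share of each category and to label the slices with percentages. The variant below is a sketch that reuses the same data; the autopct format string and startangle value are stylistic choices, not requirements.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "class": ["A", "B", "C", "D"],
    "count": [20, 45, 30, 10],
})
counts = df.set_index("class")["count"]

# Each slice's share is its count divided by the total.
shares = counts / counts.sum()
print(shares.round(3))

# autopct formats the percentage shown inside each slice.
ax = counts.plot.pie(autopct="%1.1f%%", startangle=90)
ax.set_ylabel("")  # drop the default y-label, which is the column name
plt.show()
```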
Bar Plots and Scatter Plots for Categorical Data
Apart from pie charts, we can also use bar plots and scatter plots to analyze categorical data. Bar plots are useful for representing discrete categories, while scatter plots can help us examine the relationship between two or more variables.
For instance, let’s say we have a pandas DataFrame containing data on different car models, including the model name, price, horsepower, and fuel efficiency, among other things. We can use a bar plot to represent the model name as the x-axis and the price as the y-axis.
We can also use different colors to represent the number of cylinders in each model. Here’s how to create the plot:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({
'model': ['Corolla', 'Civic', 'Accord', 'Camry', 'Sentra'],
'price': [20000, 22000, 25000, 26000, 23000],
'horsepower': [198, 224, 252, 285, 300],
'mpg': [28, 32, 26, 25, 29],
'cylinders': [4, 4, 6, 6, 4]
})
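# color expects actual color values, not a column name, so map cylinders to colors
bar_colors = df['cylinders'].map({4: 'tab:blue', 6: 'tab:orange'}).tolist()
# plot price as a Series indexed by model so that each bar takes its own color
df.set_index('model')['price'].plot(kind='bar', color=bar_colors)
plt.show()
In this example, we create a DataFrame with data on different car models such as the car model name, price, horsepower, fuel efficiency, and the number of cylinders. Because the color argument takes color values rather than a column name, we first map each cylinder count to a color. We then use the .plot() method on the price column, indexed by model, to create a bar plot that shows the price of each car model, with different colors representing the number of cylinders.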
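Colors that encode a category are easier to read with a legend. One way to add it, sketched below, is to build legend entries by hand from Matplotlib Patch objects. The palette dictionary and its two colors are assumptions chosen for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Patch

df = pd.DataFrame({
    "model": ["Corolla", "Civic", "Accord", "Camry", "Sentra"],
    "price": [20000, 22000, 25000, 26000, 23000],
    "cylinders": [4, 4, 6, 6, 4],
})

# Illustrative palette: one color per cylinder count.
palette = {4: "tab:blue", 6: "tab:orange"}
bar_colors = df["cylinders"].map(palette).tolist()

# Plot price as a Series indexed by model, one color per bar.
ax = df.set_index("model")["price"].plot(kind="bar", color=bar_colors, rot=45)
ax.set_ylabel("price")

# Build one legend entry per cylinder count.
handles = [Patch(color=color, label=f"{n} cylinders") for n, color in palette.items()]
ax.legend(handles=handles, title="Engine")

plt.tight_layout()
plt.show()
```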
We can also use scatter plots to compare numeric attributes across the members of a category, such as cities. For example, let’s say we have a pandas DataFrame with data on different cities in the US, including population, average temperature, and average income.
We can use a scatter plot to examine the relationship between population and average income. Here’s how to create the plot:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({
'city': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Phoenix'],
'population': [8537673, 2705994, 3979576, 2303482, 1680992],
'income': [85000, 75000, 90000, 65000, 60000],
'temperature': [15, 10, 25, 30, 35]
})
df.plot(kind='scatter', x='population', y='income')
plt.show()
In this example, we create a DataFrame with data on different cities, including population, income, and temperature. We use the .plot() method to create a scatter plot that shows the relationship between population and average income.
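The same scatter plot can carry a third variable and the category labels. The sketch below colors each point by temperature and annotates it with the city name; the colormap choice is arbitrary, and the figures are the illustrative values from the example above.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "Chicago", "Los Angeles", "Houston", "Phoenix"],
    "population": [8537673, 2705994, 3979576, 2303482, 1680992],
    "income": [85000, 75000, 90000, 65000, 60000],
    "temperature": [15, 10, 25, 30, 35],
})

# c= names the column that sets each point's color; colormap picks the scale.
ax = df.plot.scatter(x="population", y="income", c="temperature", colormap="viridis")

# Label each point with its city name.
for _, row in df.iterrows():
    ax.annotate(row["city"], (row["population"], row["income"]))

plt.show()
```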
Conclusion
Different types of plots suit different questions about categorical data. Pie plots help us examine the relative proportions of the categories, while bar plots are effective for comparing values across discrete categories.
Scatter plots allow us to examine the relationship between two or more variables. By using these tools, we can gain insights into the data and make informed decisions.
More broadly, data visualization is a crucial part of data analysis that helps us identify patterns, trends, and insights that might otherwise go unnoticed when looking at raw data. Python provides libraries such as pandas and Matplotlib that make it easy to create different types of visualizations for categorical and numerical data.
By using visualizations such as histograms, bar plots, pie plots, and scatter plots, together with correlation coefficients, we can gain a deeper understanding of the data. The key takeaway is that data visualization helps build a clear, concise, and informative picture of complex data, making it easier to grasp and more accessible to others.