Creating Scatterplots using Pandas and Matplotlib
Do you want to visualize the relationship between two variables? Scatterplots are commonly used for this purpose.
In this article, we will explore creating scatterplots using two popular Python libraries – Pandas and Matplotlib. We will walk you through the basics of creating scatterplots and modifying the size and color of points.
Let’s dive in!
Using pandas.DataFrame.plot.scatter
Pandas is a library for data manipulation and analysis. It provides functions for data visualization, including scatterplots.
The pandas.DataFrame.plot.scatter function creates a scatterplot directly from a DataFrame. It requires two arguments: x, the name of the column to plot on the x-axis, and y, the name of the column to plot on the y-axis.
Here’s an example:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
data.plot.scatter(x='age', y='height')
plt.show()
This code creates a scatterplot with ‘age’ on the x-axis and ‘height’ on the y-axis. The plt.show() function displays the plot.
You can modify the plot to your liking using the following keyword arguments:
- xlabel: label for the x-axis
- ylabel: label for the y-axis
- title: plot title
- color: color of the points
- alpha: opacity of the points
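As a rough sketch of how these keyword arguments fit together, the snippet below builds a small made-up DataFrame (standing in for data.csv, which is not reproduced here) and passes all five options in one call. It assumes a reasonably recent pandas release in which plot accepts xlabel and ylabel; the label text and color are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up sample data standing in for data.csv
data = pd.DataFrame({
    "age": [5, 8, 11, 14, 17],
    "height": [110, 128, 145, 162, 172],
})

# xlabel, ylabel and title are handled by pandas;
# color and alpha are passed on to Matplotlib's scatter
ax = data.plot.scatter(
    x="age",
    y="height",
    xlabel="Age (years)",
    ylabel="Height (cm)",
    title="Height by Age",
    color="green",
    alpha=0.6,
)
```

The call returns a Matplotlib Axes object, so further tweaks can be made on ax afterwards. In a script, call plt.show() from matplotlib.pyplot to display the figure.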
Using matplotlib.pyplot.scatter
Matplotlib is a plotting library for Python. It provides low-level plotting functions for creating complex plots.
The matplotlib.pyplot.scatter function creates a scatterplot from raw data rather than column names. It requires two arguments: x, the values for the x-axis, and y, the values for the y-axis. Both can be lists, arrays, or Pandas Series.
Here’s an example:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
plt.scatter(x=data['age'], y=data['height'])
plt.show()
This code creates a scatterplot with ‘age’ on the x-axis and ‘height’ on the y-axis. The plt.show() function displays the plot.
You can modify the plot to your liking using the following keyword arguments:
- s: size of the points
- c: color of the points
- alpha: opacity of the points
- marker: shape of the points
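The four options above can be combined in a single call. The following is a minimal sketch using made-up values in place of the ‘age’ and ‘height’ columns; the specific size, color, and marker are arbitrary.

```python
import matplotlib.pyplot as plt

# Made-up sample values standing in for the 'age' and 'height' columns
age = [5, 8, 11, 14, 17]
height = [110, 128, 145, 162, 172]

fig, ax = plt.subplots()
points = ax.scatter(
    age,
    height,
    s=80,            # marker area in points squared
    c="tab:orange",  # one color for every point
    alpha=0.7,       # 0 is fully transparent, 1 is fully opaque
    marker="^",      # upward-pointing triangles
)
ax.set_xlabel("age")
ax.set_ylabel("height")
```

Both s and c also accept one value per point, which is what the later examples rely on. Call plt.show() to display the figure.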
Example 1: Use Pandas
Creating a Simple Scatterplot
Suppose we have a dataset containing the weight and height of people. We want to create a scatterplot to visualize the relationship between weight and height.
Here’s the code using Pandas:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
data.plot.scatter(x='weight', y='height')
plt.show()
This code creates a scatterplot with ‘weight’ on the x-axis and ‘height’ on the y-axis. The plt.show() function displays the plot.
Modifying the Size and Color of Points
We can modify the size and color of points to make the plot more informative. Suppose we want to make the points larger and color them according to gender.
Here’s the modified code:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
colors = {'male': 'blue', 'female': 'red'}
sizes = {'male': 50, 'female': 100}
fig, ax = plt.subplots()
for gender, group in data.groupby('gender'):
    ax.scatter(x=group['weight'], y=group['height'],
               s=sizes[gender], c=colors[gender], alpha=0.5, label=gender)
ax.legend()
plt.show()
This code creates a scatterplot with ‘weight’ on the x-axis and ‘height’ on the y-axis. It groups the rows by gender with Pandas and calls Matplotlib’s ax.scatter once per group, so both the color and the size of the points are determined by gender.
The alpha keyword determines the opacity of the points. The label keyword names each group so that ax.legend() can build a legend.
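The loop above calls Matplotlib directly. A variant that stays within the Pandas plotting interface is also possible: as a sketch, the snippet below maps each row's gender to a color and a size and passes the results to plot.scatter. The DataFrame is made up for illustration, and this version does not produce a legend on its own.

```python
import pandas as pd

# Made-up sample data; the column names mirror the example above
data = pd.DataFrame({
    "weight": [82, 58, 91, 63, 77, 55],
    "height": [180, 162, 185, 168, 175, 158],
    "gender": ["male", "female", "male", "female", "male", "female"],
})

colors = {"male": "blue", "female": "red"}
sizes = {"male": 50, "female": 100}

# Series.map looks each gender up in the dictionary, giving one value per row
ax = data.plot.scatter(
    x="weight",
    y="height",
    c=data["gender"].map(colors),
    s=data["gender"].map(sizes),
    alpha=0.5,
)
```

The per-group loop remains the simpler choice when a legend is needed, since each ax.scatter call can carry its own label.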
Example 2: Use Matplotlib
In this section, we will use Matplotlib to create a scatterplot from a sample dataset. Matplotlib is a powerful library for creating static, animated, and interactive visualizations in Python.
The pyplot module of Matplotlib provides a simple interface for creating plots. Let’s get started!
Creating a Scatterplot
For this example, let’s assume we have a dataset of students’ grades from three different courses – math, science, and literature. We want to create a scatterplot to see the correlation between the grades in math and science courses.
Here’s the code:
import pandas as pd
import matplotlib.pyplot as plt
# load the dataset
df = pd.read_csv('grades.csv')
# create the scatterplot
plt.scatter(df['math'], df['science'])
# set the axis labels
plt.xlabel('Math Grades')
plt.ylabel('Science Grades')
# set the plot title
plt.title('Math vs Science Grades')
# show the plot
plt.show()
In the above code, we first load the dataset using Pandas. We then use the plt.scatter() function to create a scatterplot with the math grades on the x-axis and the science grades on the y-axis.
Next, we set the axis labels and the plot title using the xlabel(), ylabel(), and title() functions, respectively. Finally, we use the show() function to display the plot.
If the grades in math and science are positively correlated, the points in the plot trend upward from left to right: students who score high in math also tend to score high in science.
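A correlation can also be checked numerically instead of being read off the plot. As a small sketch, the snippet below computes the Pearson correlation coefficient with Series.corr; the grades are made up for illustration, since grades.csv is not reproduced here.

```python
import pandas as pd

# Made-up grades for illustration; grades.csv is not reproduced here
df = pd.DataFrame({
    "math": [55, 62, 70, 78, 85, 93],
    "science": [58, 60, 74, 75, 88, 90],
})

# Pearson correlation coefficient: +1 is a perfect upward linear trend,
# -1 a perfect downward one, and values near 0 indicate no linear trend
r = df["math"].corr(df["science"])
print(f"Pearson r between math and science: {r:.2f}")
```

A value close to +1 would support the visual impression of an upward trend.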
Modifying the Size and Color of Points
We can modify the size and color of the scatterplot points to make the plot more informative. For instance, we can use the size of the points to reflect the grades in literature, and we can use the color of the points to reflect the gender of the students.
Here’s the modified code:
import pandas as pd
import matplotlib.pyplot as plt
# load the dataset
df = pd.read_csv('grades.csv')
# create a dictionary to map gender to color
colors = {'M': 'blue', 'F': 'red'}
# create a dictionary to map literature grades to point size
sizes = {0:10, 1:20, 2:30, 3:40, 4:50}
# create the scatterplot
plt.scatter(df['math'], df['science'], s=df['literature'].apply(lambda x: sizes[x]),
c=df['gender'].apply(lambda x: colors[x]), alpha=0.5)
# set the axis labels
plt.xlabel('Math Grades')
plt.ylabel('Science Grades')
# set the plot title
plt.title('Math vs Science Grades by Literature and Gender')
# create the legend
for gender in colors.keys():
    plt.scatter([], [], c=colors[gender], label=gender)
plt.legend()
# show the plot
plt.show()
In the above code, we first define two dictionaries: the colors dictionary to map gender to color, and the sizes dictionary to map literature grades to point size. We then use the apply() method of a Pandas Series to look up each row in these dictionaries and pass the results as the s and c arguments of the scatter() function. Note that sizes[x] raises a KeyError for any grade that is not a key of the dictionary, so this code assumes the literature grades are integers from 0 to 4.
We also set the alpha argument to 0.5 to make the points semi-transparent. Moreover, we’ve created a custom legend for the plot by creating empty scatterplots with the appropriate colors and labels.
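The dictionary lookups can also be written with Series.map, which accepts a dictionary directly. The sketch below uses a few made-up rows with the same column names and dictionaries as the example; the explicit check for missing values is an added precaution, not part of the original code.

```python
import pandas as pd

# Made-up rows using the same column names as the example
df = pd.DataFrame({
    "literature": [0, 2, 4, 1],
    "gender": ["M", "F", "F", "M"],
})
colors = {"M": "blue", "F": "red"}
sizes = {0: 10, 1: 20, 2: 30, 3: 40, 4: 50}

point_colors = df["gender"].map(colors)
point_sizes = df["literature"].map(sizes)

# Unlike sizes[x], map() yields NaN for keys missing from the dictionary
# instead of raising KeyError, so check for gaps before plotting
if point_sizes.isna().any() or point_colors.isna().any():
    raise ValueError("some rows have no entry in the sizes or colors dictionary")
```

The resulting point_sizes and point_colors can then be passed as the s and c arguments of scatter().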
In the resulting plot, the size of each point reflects the grade in literature, and the color reflects gender. The larger the point, the higher the grade in literature.
A plot like this also makes it easy to compare groups, for example whether students of one gender tend to cluster toward higher math and science grades. Any such pattern depends on the dataset and is worth confirming numerically.
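A difference between groups is easier to trust when it is backed by numbers. One simple check, sketched below with made-up rows that say nothing about real students, is to compare the mean grades per gender with groupby.

```python
import pandas as pd

# Made-up rows for illustration only; they say nothing about real students
df = pd.DataFrame({
    "math": [70, 85, 60, 90],
    "science": [75, 80, 65, 95],
    "gender": ["M", "F", "M", "F"],
})

# Mean grade per course for each gender
means = df.groupby("gender")[["math", "science"]].mean()
print(means)
```

With only a handful of rows a difference in means is not conclusive, so treat this as a starting point rather than a test.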
Conclusion
In this section, we used Matplotlib to create a scatterplot from a sample dataset and modified the size and color of its points to encode additional information about the data.
Across the article, we covered creating scatterplots with two popular Python libraries, Pandas and Matplotlib, from a basic plot to one that varies point size and color.
Scatterplots are an essential tool for visualizing the relationship between two continuous variables. Whether for data exploration, hypothesis testing, or reporting, they can be a powerful ally for anyone working with data.