Adventures in Machine Learning

Mastering Scatterplots with Pandas and Matplotlib

Creating Scatterplots using Pandas and Matplotlib

Do you want to visualize the relation between two variables? Scatterplots are commonly used for this purpose.

In this article, we will explore creating scatterplots using two popular Python libraries – Pandas and Matplotlib. We will walk you through the basics of creating scatterplots and modifying the size and color of points.

Let’s dive in!

Using pandas.DataFrame.plot.scatter

Pandas is a library for data manipulation and analysis. It also provides plotting methods, built on Matplotlib by default, including one for scatterplots.

The pandas.DataFrame.plot.scatter function creates a scatterplot directly from a DataFrame. It takes two required arguments:

  • x: the name of the column containing the x-axis data
  • y: the name of the column containing the y-axis data

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')

data.plot.scatter(x='age', y='height')

plt.show()

This code creates a scatterplot of the ‘height’ variable (y-axis) against the ‘age’ variable (x-axis). The plt.show() function displays the plot.

You can customize the plot by passing additional keyword arguments:

  • xlabel: label for the x-axis
  • ylabel: label for the y-axis
  • title: plot title
  • color: color of the points
  • alpha: opacity of the points, from 0 (transparent) to 1 (opaque)
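As a minimal sketch of these keywords in use, the snippet below builds a small DataFrame by hand instead of reading data.csv. The sample values, label text, and color are assumptions made for illustration, and it assumes a reasonably recent pandas release, since the xlabel and ylabel keywords are not available in old versions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data standing in for data.csv
data = pd.DataFrame({
    "age": [5, 8, 11, 14, 17],
    "height": [110, 128, 145, 162, 174],
})

# Labels, title, color, and opacity are all passed as keyword arguments
ax = data.plot.scatter(
    x="age",
    y="height",
    xlabel="Age (years)",
    ylabel="Height (cm)",
    title="Height by Age",
    color="teal",
    alpha=0.6,
)
```

The call returns a Matplotlib Axes object, so further adjustments can be made on ax afterwards. To see the plot in a window, remove the matplotlib.use("Agg") line and call plt.show().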

Using matplotlib.pyplot.scatter

Matplotlib is a plotting library for Python. It provides low-level plotting functions for creating complex plots.

The matplotlib.pyplot.scatter function creates a scatterplot from the data you pass to it. It takes two required arguments:

  • x: the x-axis data
  • y: the y-axis data

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')

plt.scatter(x=data['age'], y=data['height'])

plt.show()

This code creates a scatterplot of the ‘height’ variable (y-axis) against the ‘age’ variable (x-axis). The plt.show() function displays the plot.

You can customize the plot by passing additional keyword arguments:

  • s: size of the points (marker area, in points squared)
  • c: color of the points
  • alpha: opacity of the points, from 0 (transparent) to 1 (opaque)
  • marker: shape of the points
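A short sketch of these four keywords together might look like the following. The values in the two lists are made up for illustration and stand in for the ‘age’ and ‘height’ columns; the particular size, color, and marker are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical values standing in for the 'age' and 'height' columns
age = [5, 8, 11, 14, 17]
height = [110, 128, 145, 162, 174]

points = plt.scatter(
    age,
    height,
    s=80,             # one size for every point
    c="darkorange",   # one color for every point
    alpha=0.7,        # slightly transparent
    marker="^",       # triangles instead of the default circles
)
plt.xlabel("Age (years)")
plt.ylabel("Height (cm)")
```

Both s and c also accept one value per point, which is what makes it possible to encode extra variables as size and color.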

Example 1: Use Pandas

Creating a Simple Scatterplot

Suppose we have a dataset containing the weight and height of people. We want to create a scatterplot to visualize the relationship between weight and height.

Here’s the code using Pandas:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')

data.plot.scatter(x='weight', y='height')

plt.show()

This code creates a scatterplot of the ‘height’ variable (y-axis) against the ‘weight’ variable (x-axis). The plt.show() function displays the plot.

Modifying the Size and Color of Points

We can modify the size and color of points to make the plot more informative. Suppose we want to make the points larger and color them according to gender.

Here’s the modified code:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')

colors = {'male': 'blue', 'female': 'red'}
sizes = {'male': 50, 'female': 100}

fig, ax = plt.subplots()

for gender, group in data.groupby('gender'):
    ax.scatter(x=group['weight'], y=group['height'],
               s=sizes[gender], c=colors[gender], alpha=0.5, label=gender)

ax.legend()

plt.show()

This code creates a scatterplot of the ‘height’ variable (y-axis) against the ‘weight’ variable (x-axis). It draws each gender group with a separate ax.scatter call, so the color and size of the points are both determined by gender.

The alpha keyword determines the opacity of the points. The label keyword is used to create a legend.
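The same encoding can also be sketched with a single pandas call, by mapping the gender column to per-point colors and sizes. The DataFrame below is hypothetical sample data standing in for data.csv; the colors and sizes dictionaries are the ones from the example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data standing in for data.csv
data = pd.DataFrame({
    "weight": [82, 58, 75, 63, 90, 55],
    "height": [180, 165, 176, 170, 185, 160],
    "gender": ["male", "female", "male", "female", "male", "female"],
})

colors = {"male": "blue", "female": "red"}
sizes = {"male": 50, "female": 100}

# Translate each row's gender into a color and a size, one value per point
ax = data.plot.scatter(
    x="weight",
    y="height",
    c=data["gender"].map(colors).tolist(),
    s=data["gender"].map(sizes).tolist(),
    alpha=0.5,
)
```

The trade-off is that a single call draws all points as one collection, so there are no per-group labels for a legend. The groupby loop is the simpler route when a legend is needed.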

Conclusion

In this example, we used Pandas to load the data and create a simple scatterplot of height against weight. We then modified the size and color of the points by drawing each gender group separately on a Matplotlib axes.

Example 2: Use Matplotlib

In this section, we will use Matplotlib to create a scatterplot from a sample dataset. Matplotlib is a powerful library for creating static, animated, and interactive visualizations in Python.

The pyplot module of Matplotlib provides a simple interface for creating plots. Let’s get started!

Creating a Scatterplot

For this example, let’s assume we have a dataset of students’ grades from three different courses – math, science, and literature. We want to create a scatterplot to see the correlation between the grades in math and science courses.

Here’s the code:

import pandas as pd
import matplotlib.pyplot as plt

# load the dataset
df = pd.read_csv('grades.csv')

# create the scatterplot
plt.scatter(df['math'], df['science'])

# set the axis labels
plt.xlabel('Math Grades')
plt.ylabel('Science Grades')

# set the plot title
plt.title('Math vs Science Grades')

# show the plot
plt.show()

In the above code, we first load the dataset using Pandas. We then use the plt.scatter() function to create a scatterplot of the science grades (y-axis) against the math grades (x-axis).

Next, we set the axis labels and the plot title using the xlabel(), ylabel(), and title() functions, respectively. Finally, we use the show() function to display the plot.

If the points trend upward from left to right, the plot indicates a positive correlation between the grades in math and science: students who score high in math also tend to score high in science.
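A visual impression of correlation can be checked numerically. As a rough sketch, the snippet below computes the Pearson correlation coefficient with the Series.corr method; the grades are invented for illustration and stand in for grades.csv.

```python
import pandas as pd

# Hypothetical grades standing in for grades.csv
df = pd.DataFrame({
    "math": [55, 62, 70, 78, 85, 93],
    "science": [58, 60, 73, 75, 88, 90],
})

# Series.corr computes the Pearson correlation coefficient by default
r = df["math"].corr(df["science"])
print(f"Pearson r = {r:.2f}")
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.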

Modifying the Size and Color of Points

We can modify the size and color of the scatterplot points to make the plot more informative. For instance, we can use the size of the points to reflect the grades in literature, and we can use the color of the points to reflect the gender of the students.

Here’s the modified code:

import pandas as pd
import matplotlib.pyplot as plt

# load the dataset
df = pd.read_csv('grades.csv')

# create a dictionary to map gender to color
colors = {'M': 'blue', 'F': 'red'}

# create a dictionary to map literature grades to point size
sizes = {0:10, 1:20, 2:30, 3:40, 4:50}

# create the scatterplot
plt.scatter(df['math'], df['science'], s=df['literature'].apply(lambda x: sizes[x]), 
            c=df['gender'].apply(lambda x: colors[x]), alpha=0.5)

# set the axis labels
plt.xlabel('Math Grades')
plt.ylabel('Science Grades')

# set the plot title
plt.title('Math vs Science Grades by Literature and Gender')

# create the legend
for gender in colors.keys():
    plt.scatter([], [], c=colors[gender], label=gender)
plt.legend()

# show the plot
plt.show()

In the above code, we first define two dictionaries: colors maps gender to color, and sizes maps literature grades to point size. We then use the apply() method of a Pandas Series to look up a size and a color for every row, and pass the results as the s and c arguments of the scatter() function. Note that a grade or gender value missing from the dictionaries raises a KeyError.

We also set the alpha argument to 0.5 to make the points semi-transparent. Moreover, we’ve created a custom legend for the plot by creating empty scatterplots with the appropriate colors and labels.
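One possible variation, sketched below, uses Series.map for the dictionary lookups and builds the legend from Line2D proxy artists instead of empty scatterplots. The DataFrame is hypothetical sample data standing in for grades.csv, and the legend title is an added assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.lines import Line2D

# Hypothetical sample data standing in for grades.csv
df = pd.DataFrame({
    "math": [60, 72, 85, 90, 55],
    "science": [65, 70, 88, 92, 58],
    "literature": [0, 2, 4, 3, 1],
    "gender": ["M", "F", "F", "M", "M"],
})

colors = {"M": "blue", "F": "red"}
sizes = {0: 10, 1: 20, 2: 30, 3: 40, 4: 50}

fig, ax = plt.subplots()
ax.scatter(
    df["math"],
    df["science"],
    s=df["literature"].map(sizes).tolist(),  # one size per point
    c=df["gender"].map(colors).tolist(),     # one color per point
    alpha=0.5,
)

# Proxy artists: markers that appear only in the legend, one per gender
handles = [
    Line2D([], [], marker="o", linestyle="", color=color, label=gender)
    for gender, color in colors.items()
]
legend = ax.legend(handles=handles, title="Gender")
```

Unlike the lambda lookup, Series.map yields NaN for a value that is missing from the dictionary instead of raising a KeyError, so it is worth checking the mapped columns when the data may contain unexpected values.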

In the resulting plot, the size of each point reflects the grade in literature and the color reflects gender: the larger the point, the higher the grade in literature.

This encoding makes it possible to compare groups at a glance, for example to check whether students of one gender tend to score higher in both math and science.

Conclusion

In this section, we used Matplotlib to create a scatterplot from a sample dataset and modified the size and color of its points to encode additional information about the data.

In this article, we learned how to create scatterplots using two popular Python libraries, Pandas and Matplotlib. We covered the basics of creating scatterplots and modifying the size and color of points. Scatterplots are a powerful tool for visualizing the relationship between two continuous variables, and encoding further variables as point size and color can reveal more about the data.

Whether it be for data exploration, hypothesis testing, or reporting, scatterplots can be a powerful ally for anyone working with data.
