Adventures in Machine Learning

Mastering Scatter Plots in Python: Combining and Analyzing Data with Pandas

Scatter Plots in Pandas: Visualizing Relationships and Overlaying Datasets

Scatter plots are essential tools for data analysis because they allow us to visualize the relationship between two numerical variables. Scatter plots can also help us identify patterns and trends in data that may not be evident by looking at the numbers alone.

In this article, we will explore how to create scatter plots using pandas, a popular library for data manipulation and analysis in Python. We will also learn how to overlay multiple scatter plots on the same graph, giving us the ability to compare multiple datasets.

Part 1: Creating a Scatter Plot using Multiple Columns in a Pandas DataFrame

A scatter plot is a type of graph that shows the relationship between two variables. Pandas provides a convenient way to create scatter plots from data stored in a DataFrame.

To create a scatter plot from two columns, we call the DataFrame's plot() method with kind='scatter' and pass the column names as the x and y arguments.

Syntax for creating a scatter plot with multiple columns:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('my_data.csv') # read data from csv file
df.plot(kind='scatter', x='col1', y='col2')
plt.show()

In the above example, we read data from a CSV file using the read_csv() function and then call plot() to draw a scatter plot of two columns, col1 and col2. The file name and column names are placeholders for your own data.

Example of creating a scatter plot with multiple columns:

Suppose we have a pandas DataFrame with data for basketball players, including the number of points they scored and the number of assists they made in a season.

import pandas as pd
import matplotlib.pyplot as plt

data = {'Player': ['LeBron James', 'Stephen Curry', 'Kevin Durant', 'James Harden', 'Russell Westbrook'],
        'Points': [2251, 2336, 2027, 2717, 2558],
        'Assists': [512, 514, 300, 750, 820]}
df = pd.DataFrame(data)
df.plot(kind='scatter', x='Points', y='Assists')
plt.title('Basketball Players: Points vs Assists')
plt.show()

In the above example, we create a pandas DataFrame with data for five basketball players. We then create a scatter plot to show the relationship between the number of points they scored and the number of assists they made in a season.

We also add a title to the graph using matplotlib's plt.title() function.
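As a complement to the title, the sketch below labels the axes and the individual points. It relies on the fact that plot() returns a matplotlib Axes object; the axis label wording, the annotation offsets, and the Agg backend line are illustrative choices, not part of the original example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import pandas as pd

# Same figures as the example above
df = pd.DataFrame({
    'Player': ['LeBron James', 'Stephen Curry', 'Kevin Durant',
               'James Harden', 'Russell Westbrook'],
    'Points': [2251, 2336, 2027, 2717, 2558],
    'Assists': [512, 514, 300, 750, 820],
})

# plot() returns the Axes it drew on, so we can keep customizing it
ax = df.plot(kind='scatter', x='Points', y='Assists')
ax.set_title('Basketball Players: Points vs Assists')
ax.set_xlabel('Points scored (season)')   # illustrative label text
ax.set_ylabel('Assists (season)')         # illustrative label text

# Write each player's name next to their point, shifted 5 pixels up and right
for name, x, y in zip(df['Player'], df['Points'], df['Assists']):
    ax.annotate(name, (x, y), textcoords='offset points', xytext=(5, 5))
```

In an interactive session, drop the matplotlib.use("Agg") line and finish with plt.show() (after importing matplotlib.pyplot as plt) to display the figure.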

Part 2: Overlaying Scatter Plots on the Same Graph

Overlaying multiple scatter plots on the same graph allows us to compare multiple datasets visually.

Pandas makes this easy: plot() returns a matplotlib Axes object, and passing that object to a later plot() call draws the new points on the same axes.

Syntax for overlaying scatter plots on the same graph:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('my_data.csv') # read data from csv file
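# first scatter plot; keep the Axes object that plot() returns
ax = df.plot(kind='scatter', x='col1', y='col2', color='red')
# second scatter plot, drawn on the same Axes via ax=ax
df.plot(kind='scatter', x='col3', y='col4', color='blue', ax=ax)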
plt.show()

In the above example, we first create a scatter plot of one pair of columns, col1 and col2, and store the Axes object that plot() returns in ax. We then create a second scatter plot of a different pair of columns, col3 and col4, passing ax=ax so that it is drawn on the same Axes instead of in a new figure.
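To make the role of the ax argument concrete, here is a minimal sketch with made-up numbers standing in for my_data.csv. The column values, the axis label text, and the Agg backend line are assumptions for illustration only. Because each plot() call can set the axis labels from its own x and y columns, the sketch sets shared labels explicitly at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical data in place of my_data.csv
df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [2, 4, 1, 3],
    'col3': [1.5, 2.5, 3.5, 4.5],
    'col4': [3, 1, 4, 2],
})

# First call creates a new figure and returns its Axes
ax = df.plot(kind='scatter', x='col1', y='col2', color='red')

# Second call draws on the Axes we pass in and returns that same object
ax2 = df.plot(kind='scatter', x='col3', y='col4', color='blue', ax=ax)

# Both point sets now share one Axes, so give it neutral labels
ax.set_xlabel('x value')
ax.set_ylabel('y value')

print(ax2 is ax)            # the second plot reused the first Axes
print(len(ax.collections))  # one point collection per scatter call
```

Leaving out ax=ax in the second call would open a separate figure, and the two point sets could no longer be compared directly.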

Example of overlaying scatter plots on the same graph:

Suppose we have two pandas DataFrames, one per team, each holding the points and rebounds recorded by that team's players in a season. We can create two scatter plots, one for each team, and overlay them on the same graph to compare the performance of players across teams.

import pandas as pd
import matplotlib.pyplot as plt

data1 = {'Player': ['LeBron James', 'Stephen Curry', 'Kevin Durant', 'James Harden', 'Russell Westbrook'],
         'Points': [2200, 2300, 2000, 2700, 2500],
         'Rebounds': [700, 500, 300, 900, 800],
         'Team': ['Team A']*5}
data2 = {'Player': ['Trae Young', 'Damian Lillard', 'Donovan Mitchell', 'Jayson Tatum', 'Zion Williamson'],
         'Points': [2400, 2500, 2000, 2200, 2300],
         'Rebounds': [400, 400, 200, 700, 800],
         'Team': ['Team B']*5}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Team A in red; the label is what the legend will show
ax = df1.plot(kind='scatter', x='Points', y='Rebounds', color='red', label='Team A')
# Team B in blue, drawn on the same Axes
df2.plot(kind='scatter', x='Points', y='Rebounds', color='blue', label='Team B', ax=ax)
plt.title('Basketball Players: Points vs Rebounds')
plt.legend()
plt.show()

In the above example, we create two pandas DataFrames, one for each team, each with its own set of five players. We then create two scatter plots, one per team, and overlay them by passing the Axes returned by the first plot() call to the second.

We call plt.legend() to add a legend to the plot; it uses the label given to each plot() call to indicate which team is which.
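When both teams live in a single DataFrame with a Team column, the same overlay can be built in a loop. The sketch below is one possible approach, not the only one: it reuses the points and rebounds figures from the example, and the colors dictionary and the Agg backend line are assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import pandas as pd

# Both teams in one DataFrame, distinguished by the Team column
df = pd.DataFrame({
    'Points': [2200, 2300, 2000, 2700, 2500, 2400, 2500, 2000, 2200, 2300],
    'Rebounds': [700, 500, 300, 900, 800, 400, 400, 200, 700, 800],
    'Team': ['Team A'] * 5 + ['Team B'] * 5,
})

# Hypothetical mapping from team name to marker color
colors = {'Team A': 'red', 'Team B': 'blue'}

ax = None  # the first plot() call creates the Axes; later calls reuse it
for team, group in df.groupby('Team'):
    ax = group.plot(kind='scatter', x='Points', y='Rebounds',
                    color=colors[team], label=team, ax=ax)

ax.set_title('Basketball Players: Points vs Rebounds')
ax.legend()  # one legend entry per labelled scatter
```

This version scales to any number of teams without adding another plot() line per team, provided the colors dictionary has an entry for each one.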

Conclusion

Scatter plots are valuable tools for visualizing the relationship between two numerical variables.

In this article, we explored how to create scatter plots with pandas from the columns of a DataFrame and how to overlay several scatter plots on the same graph to compare datasets. Both techniques use the same plot() call; overlaying only adds the ax argument so that later plots are drawn on an existing Axes.

Remember to add a title, axis labels, and a legend to your scatter plots so that readers can interpret them.
