Adventures in Machine Learning

Data Visualization with Pandas: Plotting Two Columns in Scatter and Line Charts

Data visualization plays a significant role in data analysis. It helps to uncover hidden insights, communicate data-driven decisions, and convey information in a concise and understandable way.

Pandas is a popular library in Python for data manipulation and analysis. Pandas’ DataFrame allows for easy manipulation of tabular data.

Method 1: Scatter Plot

A scatter plot displays data as points on a two-dimensional graph, with one variable on each axis. It is particularly useful for analyzing the relationship between two variables. In Pandas, we can plot two columns as a scatter plot by calling the plot method of a DataFrame with kind='scatter' and the x and y parameters set to the column names, or by passing the two columns to Matplotlib's scatter function.
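As a minimal sketch of the Pandas route, the snippet below plots one column against another with kind='scatter'. The column names x_col and y_col and their values are made up for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line and call plt.show() to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, used only to illustrate the call
df = pd.DataFrame({"x_col": [1, 2, 3, 4, 5], "y_col": [2, 4, 5, 4, 6]})

# kind="scatter" requires both x and y column names
ax = df.plot(kind="scatter", x="x_col", y="y_col")
# Equivalent accessor form: df.plot.scatter(x="x_col", y="y_col")
ax.set_title("y_col vs x_col")
```

The call returns a Matplotlib Axes object, so any further styling can be done with ordinary Matplotlib methods.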

Method 2: Line Chart

A line chart, also known as a line graph or curve chart, shows data as a series of points connected by lines. It is commonly used to visualize trends over time. In Pandas, we can plot columns as a line chart using the plot method of a DataFrame. Line is the default chart type, so setting kind='line' is optional.
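A small sketch of the line chart call, assuming a hypothetical DataFrame with a day column for the x-axis and a value column for the y-axis:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line and call plt.show() to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, used only to illustrate the call
df = pd.DataFrame({"day": [1, 2, 3, 4, 5], "value": [10, 12, 9, 14, 15]})

# kind="line" is the default, so it could be omitted here
ax = df.plot(x="day", y="value", kind="line")
ax.set_ylabel("value")
```

If x is left out, Pandas uses the DataFrame index for the x-axis instead.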

Example 1: Plotting Two Columns on Scatter Plot

Consider a dataset of basketball players containing their heights and weights. We want to create a scatter plot that shows the relationship between height and weight.

Creating a DataFrame in Pandas

To create a DataFrame in Pandas, we can start by defining two lists containing the height and weight values of the players. We can then pass a dictionary that maps column names to these lists to the DataFrame constructor.

import pandas as pd
heights = [78, 72, 68, 71, 75, 70, 73, 72, 74, 79]
weights = [250, 215, 210, 195, 225, 190, 195, 200, 210, 240]
df = pd.DataFrame({'height': heights, 'weight': weights})

We can use the head method to display the first five rows of the DataFrame:

print(df.head())

The output should look like this:

   height  weight
0      78     250
1      72     215
2      68     210
3      71     195
4      75     225

Creating a Scatter Plot in Matplotlib

To create a scatter plot in Matplotlib, we can use the scatter function. We’ll need to provide the values for the x-axis and y-axis, which in this case are the height and weight columns of the DataFrame.

import matplotlib.pyplot as plt
plt.scatter(df['height'], df['weight'])
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Basketball Players Height vs Weight')
plt.show()

The output should display a scatter plot of height versus weight for the basketball players.
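The same chart can also be drawn with the DataFrame's own plot.scatter accessor. The sketch below repeats the data from above so that it runs on its own, and adds a Pearson correlation via Series.corr as one possible way to put a number on the relationship the plot shows. The correlation step is an addition for illustration, not part of the original example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line and call plt.show() to open a window
import matplotlib.pyplot as plt
import pandas as pd

heights = [78, 72, 68, 71, 75, 70, 73, 72, 74, 79]
weights = [250, 215, 210, 195, 225, 190, 195, 200, 210, 240]
df = pd.DataFrame({"height": heights, "weight": weights})

# Pandas-native scatter plot of the two columns
ax = df.plot.scatter(x="height", y="weight",
                     title="Basketball Players Height vs Weight")
ax.set_xlabel("Height")
ax.set_ylabel("Weight")

# Pearson correlation between the two columns (ranges from -1 to 1)
r = df["height"].corr(df["weight"])
print(f"Pearson correlation: {r:.2f}")
```

A value close to 1 indicates that taller players in this sample tend to be heavier, which matches the upward pattern in the scatter plot.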

Example 2: Plotting Two Columns on Line Chart

Consider a dataset of monthly sales for a company for the year 2021, containing the sales volume and the revenue generated for each month. We want to create a line chart that shows the trend of sales volume and revenue over time.

Creating a DataFrame in Pandas

To create a DataFrame in Pandas, we can start by defining two lists containing the sales volume and revenue values for each month. We can then pass a dictionary that maps column names to these lists to the DataFrame constructor.

import pandas as pd
sales_volume = [100, 120, 140, 130, 160, 170, 180, 200, 210, 220, 240, 260]
revenue = [10000, 12000, 14000, 13000, 16000, 17000, 18000, 20000, 21000, 22000, 24000, 26000]
df = pd.DataFrame({'sales_volume': sales_volume, 'revenue': revenue})

We can use the head method to display the first five rows of the DataFrame:

print(df.head())

The output should look like this:

   sales_volume  revenue
0           100    10000
1           120    12000
2           140    14000
3           130    13000
4           160    16000

Creating a Line Chart in Pandas and Matplotlib

To create the line chart, we can call the plot method of the DataFrame with the kind parameter set to 'line' and then use Matplotlib functions to label the chart. Pandas draws one line per numeric column, here sales volume and revenue, and uses the DataFrame index (0 through 11, one value per month) for the x-axis.

import matplotlib.pyplot as plt
df.plot(xticks=range(len(df.index)), kind='line', grid=True)
plt.xlabel('Months')
plt.ylabel('Sales Volume and Revenue')
plt.title('Monthly Sales for 2021')
plt.legend(['Sales Volume', 'Revenue'])
plt.show()

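The output should display a line chart of sales volume and revenue for each month of 2021. Because both columns share one y-axis and every revenue value is 100 times the matching sales volume, the sales volume line appears nearly flat along the bottom of the chart.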
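When two columns differ this much in scale, one option is to give the second column its own y-axis. The sketch below uses the secondary_y argument of DataFrame.plot for that and labels the ticks with month abbreviations. The months list is an assumption added for illustration, since the original data has no month column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line and call plt.show() to open a window
import matplotlib.pyplot as plt
import pandas as pd

sales_volume = [100, 120, 140, 130, 160, 170, 180, 200, 210, 220, 240, 260]
revenue = [10000, 12000, 14000, 13000, 16000, 17000,
           18000, 20000, 21000, 22000, 24000, 26000]
df = pd.DataFrame({"sales_volume": sales_volume, "revenue": revenue})

# Assumed month labels for the twelve rows (not part of the original data)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Draw revenue against a second y-axis on the right-hand side
ax = df.plot(secondary_y=["revenue"], grid=True, title="Monthly Sales for 2021")
ax.set_xticks(range(len(df)))
ax.set_xticklabels(months)
ax.set_xlabel("Month")
ax.set_ylabel("Sales Volume")
ax.right_ax.set_ylabel("Revenue")  # right_ax is the secondary axis created by Pandas
```

With separate axes, both lines use the full height of the chart, at the cost of the reader having to check which axis belongs to which line.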

Example 3: Plotting Two Columns on Line Chart with Pandas Only

In this example, we will explore how to create a line chart using Pandas to plot two columns in a basketball team dataset.

We are interested in visualizing the performance of the team over the season by plotting the points scored and the points conceded.

Creating a DataFrame in Pandas

To create a DataFrame in Pandas, we can start by defining two lists containing the points scored and the points conceded by the team. We can then pass a dictionary that maps column names to these lists to the DataFrame constructor.

import pandas as pd
points_scored = [110, 102, 105, 120, 112, 118, 122, 114, 128, 130, 132, 125]
points_conceded = [100, 98, 110, 115, 103, 108, 112, 118, 105, 122, 123, 117]
df = pd.DataFrame({'Points Scored': points_scored, 'Points Conceded': points_conceded})

We can use the head method to display the first five rows of the DataFrame:

print(df.head())

The output should look like this:

   Points Scored  Points Conceded
0            110              100
1            102               98
2            105              110
3            120              115
4            112              103

Creating a Line Chart in Pandas

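To create a line chart in Pandas, we can use the plot method of a DataFrame. By default, Pandas uses the DataFrame index for the x-axis and draws one line for each numeric column, so we only need to supply the title and axis labels. The xlabel and ylabel parameters require Pandas 1.1 or later.

In this case, the x-axis represents the game number within the season (the index, which starts at 0), while the y-axis represents the points scored and points conceded.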

df.plot(title='Basketball Team Performance', xlabel='Game #', ylabel='Points', grid=True)
plt.show()

The output should display a line chart of points scored and points conceded for each game of the season.
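To choose the x-axis and y-axis columns explicitly rather than relying on the index, the x and y parameters can be passed to plot. The sketch below assumes a hypothetical Game column numbered from 1, which is not in the original dataset, and sets the axis labels through the returned Axes so that it also works on Pandas versions older than 1.1.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line and call plt.show() to open a window
import matplotlib.pyplot as plt
import pandas as pd

points_scored = [110, 102, 105, 120, 112, 118, 122, 114, 128, 130, 132, 125]
points_conceded = [100, 98, 110, 115, 103, 108, 112, 118, 105, 122, 123, 117]
df = pd.DataFrame({"Points Scored": points_scored,
                   "Points Conceded": points_conceded})

# Hypothetical game-number column so the x-axis starts at 1 instead of 0
df["Game"] = range(1, len(df) + 1)

# x names one column; y can be a list of columns, one line each
ax = df.plot(x="Game", y=["Points Scored", "Points Conceded"],
             title="Basketball Team Performance", grid=True)
ax.set_xlabel("Game #")
ax.set_ylabel("Points")
```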

Conclusion

In this article, we explored two methods for plotting two columns in a Pandas DataFrame: scatter plot and line chart. We provided examples of how to create a DataFrame in Pandas and how to create a scatter plot and a line chart using Matplotlib and Pandas.

The Additional Resources section below lists commonly used Pandas functions and visualization tools. Data visualization is essential in data analysis because it helps convey information in a concise and understandable way.

With Pandas and Matplotlib, data visualization becomes more accessible and intuitive, even for those without extensive programming experience. By using Pandas visualization tools, analysts can uncover hidden insights and communicate data-driven decisions in a more impactful way.

Additional Resources

Commonly Used Pandas Functions

  • head(n): returns the first n rows of a DataFrame
  • tail(n): returns the last n rows of a DataFrame
  • info(): prints a concise summary of a DataFrame including column names, non-null values, and data types
  • describe(): generates a summary of statistics for numerical columns in a DataFrame
  • groupby(): groups a DataFrame by one or more columns and returns a GroupBy object for further processing
  • merge(): combines two DataFrames based on one or more common columns
  • fillna(): fills missing values in a DataFrame with a specified value or method
  • astype(): converts column data types in a DataFrame to a specified data type
  • apply(): applies a function to each row or column of a DataFrame
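A few of the functions listed above can be combined as in the following sketch. The team, points, and city values are invented for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value in the points column
games = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "points": [110, 100, None, 120],
})
teams = pd.DataFrame({"team": ["A", "B"], "city": ["Austin", "Boston"]})

print(games.head(2))   # first two rows
games.info()           # column names, non-null counts, dtypes

# fillna(): replace the missing value with the column mean (110.0 here)
filled = games.copy()
filled["points"] = filled["points"].fillna(filled["points"].mean())

# astype(): convert the float column to integers once nothing is missing
filled["points"] = filled["points"].astype(int)

# groupby(): mean points per team
team_means = filled.groupby("team")["points"].mean()

# merge(): attach each team's city using the shared "team" column
merged = filled.merge(teams, on="team")
print(merged)
```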

Pandas Visualization Tools

  • plot(): creates a variety of plots including line, bar, scatter, and histogram
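  • hist(): creates a histogram of a column, or of every numeric column, in a DataFrame
  • boxplot(): creates a box and whisker plot of one or more columns in a DataFrame
  • scatter_matrix(): creates a scatter plot matrix of selected columns in a DataFrame; it is imported from pandas.plotting
  • pivot_table(): not a plotting function; it creates a pivot table that summarizes and aggregates data, which is often used to prepare data for a chart
  • heatmap(): not part of Pandas; heatmaps of DataFrame values are usually drawn with seaborn's heatmap() or Matplotlib's imshow()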

Pandas visualization tools are built on top of Matplotlib, a popular data visualization library in Python. These tools allow for quick and easy creation of charts and graphs with minimal coding. With these tools, data analysis becomes more intuitive and accessible, even for those without extensive programming experience.
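As a rough sketch of three of these tools, the snippet below reuses the height and weight data from Example 1 to draw a histogram, a box plot, and a scatter plot matrix. The bin count and figure sizes are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line and call plt.show() to open a window
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

heights = [78, 72, 68, 71, 75, 70, 73, 72, 74, 79]
weights = [250, 215, 210, 195, 225, 190, 195, 200, 210, 240]
df = pd.DataFrame({"height": heights, "weight": weights})

# Two side-by-side axes so each chart has its own panel
fig, (hist_ax, box_ax) = plt.subplots(1, 2, figsize=(8, 4))

# Histogram of one column
df["height"].plot(kind="hist", bins=5, ax=hist_ax, title="Height distribution")

# Box and whisker plot of one column
df.boxplot(column="weight", ax=box_ax)

# Scatter plot matrix of all numeric columns (creates its own figure)
grid = scatter_matrix(df, figsize=(6, 6))
```

scatter_matrix returns an array of Axes with one row and one column per plotted DataFrame column, so here it is a 2 by 2 grid.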
