Plotting Two Columns in Pandas DataFrame: Exploring Scatter Plot and Line Chart
Data visualization plays a significant role in data analysis. It helps to uncover hidden insights, communicate data-driven decisions, and convey information in a concise and understandable way.
Pandas is a popular library in Python for data manipulation and analysis. Pandas’ DataFrame allows for easy manipulation of tabular data.
Method 1: Scatter Plot
A scatter plot displays data as points on a two-dimensional graph, with one variable on each axis. It is particularly useful for analyzing the relationship between two variables. In Pandas, we can plot two columns as a scatter plot by calling the plot method of a DataFrame with the kind parameter set to “scatter” and the x and y parameters set to the names of the two columns.
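As a minimal sketch of that call, the snippet below builds a small DataFrame and plots one column against the other. The column names x_col and y_col and their values are hypothetical and serve only to illustrate the parameters.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the call.
df = pd.DataFrame({"x_col": [1, 2, 3, 4, 5], "y_col": [2, 4, 5, 4, 6]})

# kind="scatter" needs both x and y, each given as a column name.
ax = df.plot(kind="scatter", x="x_col", y="y_col", title="y_col vs x_col")

# df.plot.scatter(x="x_col", y="y_col") is an equivalent spelling.
```

The call returns a Matplotlib Axes object, so labels and limits can be adjusted afterwards. In a script, call plt.show() from matplotlib.pyplot to display the figure; notebooks usually render it automatically.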
Method 2: Line Chart
A line chart, also known as a line graph or curve chart, shows data as a series of points connected by lines. It is commonly used to visualize trends over time. In Pandas, we can plot two columns in a line chart using the plot method of a DataFrame with the kind parameter set to “line”, which is also the default when kind is omitted.
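The following sketch shows one way to draw two columns as lines against a third column used for the x-axis. The DataFrame and its column names (month, units, returns) are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the call.
df = pd.DataFrame({
    "month": [1, 2, 3, 4],
    "units": [10, 12, 9, 15],
    "returns": [1, 2, 1, 3],
})

# x names the column for the horizontal axis; y lists the columns to draw.
# Each column in y becomes one line, labeled with its column name.
ax = df.plot(kind="line", x="month", y=["units", "returns"], marker="o")
```

If x is omitted, Pandas uses the DataFrame’s index for the horizontal axis and draws every numeric column as a line.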
Example 1: Plotting Two Columns on Scatter Plot
Consider a dataset of basketball players containing their heights and weights. We want to create a scatter plot of the two variables that shows the relationship between height and weight.
Creating a DataFrame in Pandas
To create a DataFrame in Pandas, we can start by defining two lists containing the height and weight values of the players. We can then pass a dictionary that maps column names to these lists to the pd.DataFrame constructor.
import pandas as pd
heights = [78, 72, 68, 71, 75, 70, 73, 72, 74, 79]
weights = [250, 215, 210, 195, 225, 190, 195, 200, 210, 240]
df = pd.DataFrame({'height': heights, 'weight': weights})
We can use the head method to display the first five rows of the DataFrame:
print(df.head())
The output should look like this:
   height  weight
0      78     250
1      72     215
2      68     210
3      71     195
4      75     225
Creating a Scatter Plot in Matplotlib
To create a scatter plot in Matplotlib, we can use the scatter function from matplotlib.pyplot. We pass it the values for the x-axis and y-axis, which in this case are the height and weight columns of the DataFrame.
import matplotlib.pyplot as plt
plt.scatter(df['height'], df['weight'])
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Basketball Players Height vs Weight')
plt.show()
The output should display a scatter plot of height versus weight for the basketball players.
Example 2: Plotting Two Columns on Line Chart
Consider a dataset of monthly sales for a company for the year 2021, containing the sales volume and the revenue generated in each month. We want to create a line chart of the two variables that shows the trend of sales volume and revenue over time.
Creating a DataFrame in Pandas
To create a DataFrame in Pandas, we can start by defining two lists containing the sales volume and revenue values for each month. We can then pass a dictionary that maps column names to these lists to the pd.DataFrame constructor.
import pandas as pd
sales_volume = [100, 120, 140, 130, 160, 170, 180, 200, 210, 220, 240, 260]
revenue = [10000, 12000, 14000, 13000, 16000, 17000, 18000, 20000, 21000, 22000, 24000, 26000]
df = pd.DataFrame({'sales_volume': sales_volume, 'revenue': revenue})
We can use the head method to display the first five rows of the DataFrame:
print(df.head())
The output should look like this:
   sales_volume  revenue
0           100    10000
1           120    12000
2           140    14000
3           130    13000
4           160    16000
Creating a Line Chart with Pandas and Matplotlib
To create the line chart, we can call the plot method of the DataFrame with the kind parameter set to “line” and then use Matplotlib functions to add the axis labels, title, and legend. No x column is specified, so the x-axis is the DataFrame’s integer index (0 to 11, one position per month), and the sales volume and revenue columns are each drawn as a separate line on the y-axis.
import matplotlib.pyplot as plt
df.plot(xticks=range(len(df.index)), kind='line', grid=True)
plt.xlabel('Months')
plt.ylabel('Sales Volume and Revenue')
plt.title('Monthly Sales for 2021')
plt.legend(['Sales Volume', 'Revenue'])
plt.show()
The output should display a line chart of sales volume and revenue for each month of 2021.
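In this data, revenue is 100 times the sales volume, so on a shared y-axis the sales volume line is squeezed almost flat near the bottom of the chart. One possible remedy, sketched below, is to draw revenue on a secondary y-axis with the secondary_y parameter of plot. The 1-to-12 month index is an assumption added here so that the x-axis reads as month numbers.

```python
import pandas as pd

sales_volume = [100, 120, 140, 130, 160, 170, 180, 200, 210, 220, 240, 260]
revenue = [10000, 12000, 14000, 13000, 16000, 17000,
           18000, 20000, 21000, 22000, 24000, 26000]

# Assumption: rows are in calendar order, so index them by month number 1-12.
df = pd.DataFrame(
    {"sales_volume": sales_volume, "revenue": revenue},
    index=pd.RangeIndex(start=1, stop=13, name="month"),
)

# Columns listed in secondary_y are drawn against a second y-axis on the right.
ax = df.plot(
    kind="line",
    secondary_y=["revenue"],
    xticks=list(df.index),
    grid=True,
    title="Monthly Sales for 2021",
)
ax.set_xlabel("Month")
ax.set_ylabel("Sales Volume")
ax.right_ax.set_ylabel("Revenue")  # right_ax is the secondary axis Pandas adds
```

Pandas marks the secondary column in the legend as “revenue (right)” by default, which makes it clear which axis each line belongs to.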
Example 3: Plotting Two Columns on Line Chart with Pandas
In this example, we will explore how to create a line chart using Pandas to plot two columns in a basketball team dataset.
We are interested in visualizing the performance of the team over the season by plotting the points scored and the points conceded.
Creating a DataFrame in Pandas
To create a DataFrame in Pandas, we can start by defining two lists containing the points scored and the points conceded by the team. We can then pass a dictionary that maps column names to these lists to the pd.DataFrame constructor.
import pandas as pd
points_scored = [110, 102, 105, 120, 112, 118, 122, 114, 128, 130, 132, 125]
points_conceded = [100, 98, 110, 115, 103, 108, 112, 118, 105, 122, 123, 117]
df = pd.DataFrame({'Points Scored': points_scored, 'Points Conceded': points_conceded})
We can use the head method to display the first five rows of the DataFrame:
print(df.head())
The output should look like this:
   Points Scored  Points Conceded
0            110              100
1            102               98
2            105              110
3            120              115
4            112              103
Creating a Line Chart in Pandas
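To create a line chart in Pandas, we can use the plot method of a DataFrame. Because no x column is given, the x-axis is the DataFrame’s index, which here stands for the game number (starting at 0), while the y-axis shows the points scored and points conceded, one line per column.
The title, xlabel, and ylabel parameters set the chart title and axis labels directly; xlabel and ylabel require pandas 1.1 or later.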
df.plot(title='Basketball Team Performance', xlabel='Game #', ylabel='Points', grid=True)
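The output should display a line chart of points scored and points conceded for each game of the season. When running this as a script rather than in a notebook, import matplotlib.pyplot as plt and call plt.show() afterwards to open the figure window.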
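Since the default index starts at 0, the first game appears at position 0 on the x-axis. A small variation, sketched here under the assumption that the rows are in game order, renumbers the index from 1 and labels the axes through the Axes object that plot returns, which also works on pandas versions older than 1.1.

```python
import pandas as pd

points_scored = [110, 102, 105, 120, 112, 118, 122, 114, 128, 130, 132, 125]
points_conceded = [100, 98, 110, 115, 103, 108, 112, 118, 105, 122, 123, 117]
df = pd.DataFrame({"Points Scored": points_scored,
                   "Points Conceded": points_conceded})

# Assumption: rows are in game order, so number the games from 1 instead of 0.
df.index = pd.RangeIndex(start=1, stop=len(df) + 1, name="Game #")

# plot returns a Matplotlib Axes, which can be customized further.
ax = df.plot(title="Basketball Team Performance", grid=True, marker="o")
ax.set_xlabel("Game #")
ax.set_ylabel("Points")
ax.set_xticks(list(df.index))  # one tick per game
```

Working with the returned Axes keeps all customization in one place and avoids relying on the global pyplot state.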
Conclusion
In this article, we explored two methods for plotting two columns in a Pandas DataFrame: scatter plot and line chart. We provided examples of how to create a DataFrame in Pandas and how to create a scatter plot and a line chart using Matplotlib and Pandas.
We also discussed commonly used Pandas functions and visualization tools. Data visualization is essential in data analysis because it helps convey information in a concise and understandable way.
With Pandas and Matplotlib, data visualization becomes more accessible and intuitive, even for those without extensive programming experience. By using Pandas visualization tools, analysts can uncover hidden insights and communicate data-driven decisions in a more impactful way.
Additional Resources
Commonly Used Pandas Functions
head(n): returns the first n rows of a DataFrame
tail(n): returns the last n rows of a DataFrame
info(): prints a concise summary of a DataFrame, including column names, non-null counts, and data types
describe(): generates summary statistics for the numerical columns in a DataFrame
groupby(): groups a DataFrame by one or more columns and returns a GroupBy object for further processing
merge(): combines two DataFrames based on one or more common columns
fillna(): fills missing values in a DataFrame with a specified value or method
astype(): converts column data types in a DataFrame to a specified data type
apply(): applies a function to each row or column of a DataFrame
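The short sketch below exercises several of these functions on a small, hypothetical DataFrame; the team, points, and city values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "points": [10.0, None, 7.0, 9.0],
})

first_two = df.head(2)                         # first 2 rows
last_one = df.tail(1)                          # last row
filled = df.fillna({"points": 0.0})            # replace the missing value with 0
as_int = filled.astype({"points": "int64"})    # convert float column to integer
summary = as_int.describe()                    # count, mean, std, min, ...
totals = as_int.groupby("team")["points"].sum()    # total points per team
doubled = as_int["points"].apply(lambda p: p * 2)  # element-wise function

# Hypothetical lookup table, joined on the shared "team" column.
cities = pd.DataFrame({"team": ["A", "B"], "city": ["Austin", "Boston"]})
merged = as_int.merge(cities, on="team")
```

Each call returns a new object and leaves the original DataFrame unchanged, which is why the intermediate results are assigned to new names.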
Pandas Visualization Tools
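plot(): creates a variety of plots, including line, bar, scatter, and histogram, selected with the kind parameter
hist(): creates a histogram of one or more columns in a DataFrame
boxplot(): creates a box-and-whisker plot of one or more columns in a DataFrame
scatter_matrix(): creates a scatter plot matrix of selected columns in a DataFrame (available as pandas.plotting.scatter_matrix)
pivot_table(): creates a pivot table to summarize and aggregate data in a DataFrame; it does not draw anything itself, but its result can be plotted
heatmap(): creates a heatmap of values in a DataFrame; this is not a Pandas function but is provided by the Seaborn library as seaborn.heatmap()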
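As a rough illustration of three of these tools, the sketch below reuses the height and weight values from Example 1. The two-panel layout created with plt.subplots is a choice made here so that the histogram and the box plot each get their own axes.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

heights = [78, 72, 68, 71, 75, 70, 73, 72, 74, 79]
weights = [250, 215, 210, 195, 225, 190, 195, 200, 210, 240]
df = pd.DataFrame({"height": heights, "weight": weights})

# One figure with two panels: histogram on the left, box plot on the right.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
df["height"].hist(bins=5, ax=axes[0])
axes[0].set_title("Height distribution")
df.boxplot(column="weight", ax=axes[1])
axes[1].set_title("Weight spread")

# scatter_matrix creates its own figure: a grid with one row and one column
# per DataFrame column, histograms on the diagonal by default.
matrix_axes = scatter_matrix(df)
```

Passing ax explicitly tells each function where to draw, which avoids plots landing on whichever axes happens to be current.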
Pandas visualization tools are built on top of Matplotlib, a popular data visualization library in Python. These tools allow for quick and easy creation of charts and graphs with minimal coding. With these tools, data analysis becomes more intuitive and accessible, even for those without extensive programming experience.