Performing Polynomial Regression in Python
Regression analysis is an essential tool for analyzing the relationship between a response variable and one or more explanatory variables. In practice, regression models help in making predictions and identifying underlying trends within data.
Polynomial regression is a specific method used to model non-linear relationships between variables. In this article, we will introduce you to the basics of polynomial regression and how to perform it using Python.
Linear vs Nonlinear Regression
Before diving into the details of polynomial regression, it is important to understand the difference between linear and nonlinear regression models. Simple linear regression is a method that models a linear relationship between a single explanatory variable and a response variable.
On the other hand, nonlinear regression models capture relationships between variables that cannot be described by a straight line, such as curved or parabolic trends.
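To make the contrast concrete, here is a minimal sketch of a straight-line fit using numpy's polyfit with degree 1; the data below is made up for illustration.
import numpy as np
# Hypothetical data following a roughly linear trend
x = [0, 1, 2, 3, 4, 5]
y = [1.1, 3.0, 4.8, 7.2, 9.1, 10.9]
# Degree 1 fits a straight line: y = m*x + b
slope, intercept = np.polyfit(x, y, 1)
print(f'Fitted line: y = {slope:.2f}x + {intercept:.2f}')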
Polynomial Regression
Polynomial regression is a special case of nonlinear regression that can efficiently capture complex nonlinear relationships between variables. The model is defined as:
y = β₀ + β₁x + β₂x² + … + βₙxⁿ
where y is the response variable, x is the explanatory variable, β₀, β₁, …, βₙ are the coefficients, and n is the degree of the polynomial equation. For instance, if n = 2, the model can be expressed as:
y = β₀ + β₁x + β₂x²
This is referred to as quadratic regression, and if n = 3, it is referred to as cubic regression.
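To make the notation concrete, here is a minimal sketch that evaluates a quadratic model at a single point; the coefficient values below are purely illustrative.
# Hypothetical coefficients for y = b0 + b1*x + b2*x^2
b0, b1, b2 = 2.0, 7.0, 5.0
def quadratic(x):
    # Evaluate the degree-2 polynomial term by term
    return b0 + b1 * x + b2 * x ** 2
print(quadratic(3))  # 2 + 21 + 45 = 68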
Example
Let us consider the following data comprising the number of vehicles per neighborhood in a city:
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [2, 14, 36, 68, 110, 162, 224, 296, 378, 470]
We can plot it using a scatterplot to visualize the relationship between the two variables. A scatterplot provides a clear picture of the data points and allows you to identify any underlying relationship between the variables.
import matplotlib.pyplot as plt
# Scatterplot
plt.scatter(X, Y)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
The above code will create a scatter plot. From the plot, we can observe that the relationship between the two variables is not linear: Y grows faster and faster as X increases.
Therefore, we can perform polynomial regression to get a better understanding of the variables’ relationship.
# numpy is a library in Python used to perform numerical operations
import numpy as np
degree = 2 # quadratic regression
coefficients = np.polyfit(X, Y, degree)
model = np.poly1d(coefficients)
# Scatterplot with regression line
plt.scatter(X, Y)
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(X, model(X), 'r')
plt.show()
Here, we fit the data using the numpy.polyfit() function and create a model equation using numpy.poly1d().
We then visualize the data using the scatter plot and the regression line. From the plot, we can see that the quadratic model accurately captures the relationship between the variables.
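As a quick check, we can also print the fitted coefficients and use the model to predict Y at a new X value. The Y values in this example happen to lie exactly on the quadratic 5x² + 7x + 2, so the recovered coefficients should match it almost exactly.
# Inspect the fitted coefficients (highest degree first)
print(coefficients)  # approximately [5. 7. 2.]
# The model can also predict Y for unseen X values
print(model(10))     # approximately 5*100 + 7*10 + 2 = 572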
R-squared
R-squared is a measure of how close the data is to the fitted regression line. It ranges from 0 to 1, where 0 indicates that the model does not fit the data well.
In contrast, 1 indicates that the model perfectly fits the data. The higher the R-squared, the better the model’s ability to predict the response variable accurately.
# Calculate R-squared
from sklearn.metrics import r2_score
r2 = r2_score(Y, model(X))
print(f'R-squared value: {r2:.2f}')
We used the r2_score() function to calculate the R-squared value. In this example, the R-squared value is 1.00: the data points lie exactly on a quadratic curve, so the quadratic model fits them perfectly.
Scatterplot Visualization
In data analytics, a scatterplot is one of the most critical tools for identifying relationships between variables. A scatterplot is a two-dimensional plot that shows the relationship between two variables, where each point in the plot represents one observation.
Example
Suppose we have collected data on hourly temperatures during the summer season at a beach resort. We want to visualize the relationship between the temperature and the number of beachgoers.
By plotting a scatterplot between these two variables, we can identify any underlying patterns in the data.
# Data preparation
Temps = [22.3, 23.0, 22.5, 21.8, 24.0, 25.0, 25.2, 25.5, 26.3, 28.0, 30.1, 33.3, 33.1, 31.2]
Beachgoers = [40, 50, 55, 60, 65, 80, 90, 95, 105, 110, 120, 130, 130, 130]
# Scatterplot
plt.scatter(Temps, Beachgoers, color='red')
plt.title('Beachgoers vs. Temperature')
plt.xlabel('Temperature (°C)')
plt.ylabel('Number of beachgoers')
plt.show()
In this example, we are plotting a scatterplot between temperature and the number of beachgoers.
The scatterplot helps to visualize the relationship between temperature and the number of beachgoers at the resort. From the scatterplot, we can observe that the number of beachgoers increases as the temperature rises.
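To make that upward trend explicit, one option is to overlay a straight-line fit on the scatterplot. The snippet below is a sketch using np.polyfit; it is not part of the original example.
import numpy as np
# Fit and overlay a degree-1 trend line to highlight the pattern
trend = np.poly1d(np.polyfit(Temps, Beachgoers, 1))
temps_sorted = sorted(Temps)
plt.scatter(Temps, Beachgoers, color='red')
plt.plot(temps_sorted, trend(temps_sorted), 'b--')
plt.xlabel('Temperature (°C)')
plt.ylabel('Number of beachgoers')
plt.show()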
Conclusion
In conclusion, polynomial regression is a powerful technique for modeling non-linear relationships between variables. It helps in providing a better understanding of the data and improves the accuracy of predictions.
On the other hand, scatterplots are widely used in data analytics to visualize the relationship between two variables. In both cases, Python offers a variety of functions and libraries that make it easy to perform these operations.
3) Polynomial Regression Model
Polynomial regression is a curve-fitting method used to model nonlinear relationships between a response variable and one or more explanatory variables. In polynomial regression, the relationship between the variables is captured through a polynomial function of degree n.
The polynomial function can be expressed as:
y = β₀ + β₁x + β₂x² + … + βₙxⁿ
where y is the predicted response variable, x is the explanatory variable, and β₀, β₁, β₂, …, βₙ are the coefficients of the polynomial regression model.
The degree of the polynomial equation determines how accurately the model fits the data. Higher degree polynomial equations can fit the data closely, but they can also lead to overfitting.
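To see this trade-off in practice, we can compare fits of increasing degree on the same data. The snippet below is a self-contained sketch with made-up noisy data; the degree-9 fit passes through all ten points exactly but would generalize poorly to new data.
import numpy as np
# Hypothetical noisy data that is roughly linear
rng = np.random.default_rng(0)
x = np.arange(10)
y = 2 * x + 1 + rng.normal(0, 1, size=10)
# Higher degrees chase the noise; degree 9 interpolates all 10 points
# (np.polyfit may warn that the high-degree fit is poorly conditioned)
for degree in (1, 3, 9):
    fit = np.poly1d(np.polyfit(x, y, degree))
    residual = np.sum((y - fit(x)) ** 2)
    print(f'degree {degree}: residual sum of squares = {residual:.3f}')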
Example
Let’s create a simple polynomial regression model using the numpy.poly1d() function. Suppose we have the following data:
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [4.7, 5.2, 6.5, 7.2, 8.6, 10.1, 11.5, 12.7, 14.1, 15.3]
We can create a polynomial regression model using numpy.poly1d() by specifying the degree of the polynomial equation.
For instance, let’s create a quadratic regression model, which has a degree of 2.
import numpy as np
degree = 2 # Quadratic Regression
coefficients = np.polyfit(X, Y, degree)
model = np.poly1d(coefficients)
print(f'Fitted model: {model}')
The code above prints the fitted quadratic regression equation. With this data, the coefficients come out to approximately:
Fitted model: 0.0322 x^2 + 0.9411 x + 4.437
Now that we have a fitted polynomial regression equation, we can predict the value of Y for a given value of X using the equation above.
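For example, a poly1d model can be called like a function to evaluate the fitted polynomial at any X.
# Predict Y at new X values by evaluating the fitted polynomial
print(model(10))          # prediction for X = 10
print(model([2.5, 4.5]))  # predictions for several X values at once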
4) R-squared Calculation
In statistics, R-squared is a metric used to determine how well a regression model fits the data.
It measures the proportion of variance in the response variable that can be explained by the explanatory variables. R-squared ranges from 0% to 100%, where 0% indicates that the model does not fit the data at all, and 100% indicates a perfect fit.
R-squared can be calculated using the residual sum of squares (SSRes) and the total sum of squares (SSTot). SSRes is the sum of the squared differences between the actual and the predicted values of the response variable.
SSTot is the sum of the squared differences between the actual and the mean values of the response variable. The R-squared calculation can be expressed mathematically as follows:
R-squared = 1 − (SSRes / SSTot)
Example
We can use the numpy library in Python to calculate R-squared. Let’s use the polynomial regression model we developed in example 3 to calculate the R-squared of the regression model.
# Calculate R-squared manually
Y = np.asarray(Y)                     # convert the list to an array for arithmetic
y_predicted = model(X)                # fitted values
y_mean = np.mean(Y)
SSRes = np.sum((Y - y_predicted)**2)  # residual sum of squares
SSTot = np.sum((Y - y_mean)**2)       # total sum of squares
R_squared = 1 - (SSRes / SSTot)
print(f'R-squared value: {R_squared:.3f}')
In the code above, we first calculated the predicted values of y and the mean value of y. We then calculated the residual sum of squares (SSRes) and the total sum of squares (SSTot).
Finally, we calculated R-squared by dividing SSRes by SSTot and subtracting the result from 1. The R-squared value for the quadratic regression model is 0.997, indicating that the model explains 99.7% of the variation in the response variable.
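As a sanity check, the same value can be obtained with scikit-learn's r2_score, which implements the same formula.
from sklearn.metrics import r2_score
# Should match the manual calculation above
print(f'R-squared (sklearn): {r2_score(Y, model(X)):.3f}')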
Conclusion
Polynomial regression models are useful tools for modeling nonlinear relationships between variables: they capture nonlinearities in the data by fitting a polynomial function to it. With the numpy library in Python, it is easy to fit polynomial regression models and calculate R-squared.
We can use R-squared to determine how well the model fits the data. By understanding the basics of polynomial regression and R-squared, we can improve our ability to make accurate predictions and understand complex relationships between variables.
Takeaway: A working knowledge of polynomial regression and R-squared is an essential skill for data analysts in today's data-driven world.