Linear Regression in Python
Have you ever wondered how machines can predict values based on given data? The answer lies in one of the most widely used machine learning techniques – Linear Regression.
In this article, we will discuss the basics of Linear Regression and how to implement it in Python, working through a complete example.
Before diving into the implementation, let’s get a clear understanding of the concept of Linear Regression.
Linear Regression is a statistical method that is used to establish a relationship between two variables – Predictor Variables and Response Variable. The Predictor Variables are independent variables that are used to predict the Response Variable, which is a dependent variable.
To better understand this, let’s consider an example of predicting the price of a house based on its area. Here, the area of the house is the Predictor Variable and the price of the house is the Response Variable.
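In equation form, this relationship is modeled as:
Price = β0 + β1 * Area + error
where β0 is the Intercept (the baseline price), β1 is the slope (the change in price per additional unit of area), and the error term captures whatever the straight line does not explain. Fitting the model means estimating β0 and β1 from the data.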
In Python, we can perform Linear Regression using various libraries such as scikit-learn, TensorFlow, and statsmodels. Let's consider the example of using statsmodels to implement Linear Regression.
Steps to Perform Linear Regression:
Step 1: Enter the Data
The first step in implementing Linear Regression is to create the necessary data. This can be done using the Pandas library, which is a popular Python data analysis library.
Pandas provides a DataFrame class that is used to store and manipulate data in tabular form.
Creating and Viewing Data:
In the example of predicting the price of a house based on its area, we can create a DataFrame as shown below:
import pandas as pd
data = {'Area': [2104, 1600, 2400, 1416, 3000],
        'Price': [400000, 330000, 369000, 232000, 540000]}
df = pd.DataFrame(data)
print(df.head())
Output:
   Area   Price
0  2104  400000
1  1600  330000
2  2400  369000
3  1416  232000
4  3000  540000
Here, we have created a DataFrame with two columns – Area and Price. We can view the data using the `head()` function, which displays the first five rows of the DataFrame.
Step 2: Fit the Model
After creating the data, we need to fit the model to the data. This can be done using the Ordinary Least Squares (OLS) function in the statsmodels library.
OLS is a method used to estimate the unknown parameters in a linear regression model.
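As an aside, OLS chooses the coefficients that minimize the sum of squared residuals, and for this it has a closed-form least-squares solution. A minimal NumPy sketch of that computation, for illustration only, since the library handles it for us:
import numpy as np
# design matrix: a column of ones (the intercept) next to the Area values
X_mat = np.column_stack([np.ones(len(df)), df['Area'].to_numpy()])
y = df['Price'].to_numpy()
# least-squares solution to X_mat @ beta = y
beta, *_ = np.linalg.lstsq(X_mat, y, rcond=None)
print(beta)  # [intercept, slope]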
import statsmodels.api as sm
X = df[['Area']]
Y = df['Price']
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
print(model.summary())
Output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Price   R-squared:                       0.772
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     10.39
Date:                Sun, 17 Oct 2021   Prob (F-statistic):             0.0230
Time:                        10:00:00   Log-Likelihood:                -56.747
No. Observations:                   5   AIC:                             117.5
Df Residuals:                       3   BIC:                             116.5
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.66e+05   8.59e+04      1.932      0.150   -1.16e+05    4.49e+05
Area        135.7877     42.249      3.224      0.023      32.920     238.655
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.797
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.469
Skew:                          -0.352   Prob(JB):                        0.791
Kurtosis:                       1.597   Cond. No.                     7.64e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.64e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Here, we have fitted the model using the OLS function by providing the Predictor Variable `X` and Response Variable `Y`.
We also added a constant column to `X` using the `add_constant()` function. Finally, we called the `fit()` function to fit the model to the data and printed the summary of the model using the `summary()` function.
The summary provides useful information such as R-squared, Adjusted R-squared, F-statistic, and p-values.
Let's now break down the fitting step in more detail, walking through each part of the Linear Regression analysis performed with the OLS function from the statsmodels library.
Defining Response and Predictor Variables:
The next step involved in performing Linear Regression analysis is defining the Response and Predictor Variables. The Response Variable is the variable that we are trying to predict, and Predictor Variables are the variables used to make predictions.
In our example of predicting the price of a house based on its area, the Response Variable is the price of the house, and the Predictor Variable is the area of the house.
import statsmodels.api as sm
X = df[['Area']]
Y = df['Price']
Here, we have defined the Predictor Variable `X` as the `Area` column from our DataFrame, and the Response Variable `Y` as the `Price` column.
Adding Constant to Predictor Variables:
Before fitting the Linear Regression Model, it is necessary to add a constant to the Predictor Variable using the `add_constant()` function. This is because the OLS function does not include a constant by default.
X = sm.add_constant(X)
This code prepends a column named `const`, filled with the value 1, to the Predictor Variable `X`; its coefficient becomes the model's Intercept.
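Printing the first rows of `X` confirms the effect; statsmodels names the new column const:
print(X.head())
Output:
   const  Area
0    1.0  2104
1    1.0  1600
2    1.0  2400
3    1.0  1416
4    1.0  3000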
Fitting Linear Regression Model:
The final step in performing Linear Regression Analysis is fitting the Linear Regression Model using the OLS function.
model = sm.OLS(Y, X).fit()
Here, we have fit the model using the OLS function by providing the Response Variable `Y` and Predictor Variable `X`. Once the model is fit, we can proceed to interpret the results.
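For comparison, here is a minimal sketch of the same fit using scikit-learn, one of the alternative libraries mentioned earlier. Note that scikit-learn adds the intercept itself, so no constant column is needed:
from sklearn.linear_model import LinearRegression
# fit on the raw Area column; fit_intercept=True is the default
lr = LinearRegression()
lr.fit(df[['Area']], df['Price'])
print(lr.intercept_, lr.coef_)  # should agree with the statsmodels const and Area coefficients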
Step 3: Interpret the Results
After fitting the Linear Regression Model, it is essential to interpret the results to gain insights from the data. The following are some of the key aspects to consider while interpreting the results.
Coefficient of Determination:
The Coefficient of Determination, also known as R-squared, is a measure that indicates the proportion of variance in the Response Variable that is explained by the Predictor Variables. The R-squared ranges from 0 to 1, with a higher value indicating a better fit of the model to the data.
print(model.summary())
The summary of the model provides information about the R-squared value. In our example, the R-squared value is 0.772, indicating that approximately 77.2% of the variance in the price of the house is explained by the area of the house.
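The R-squared value can also be read directly from the fitted results object, rather than from the printed summary:
print(model.rsquared)      # R-squared, as shown in the summary
print(model.rsquared_adj)  # Adjusted R-squared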
F-Statistic and P-Value:
The F-Statistic measures the overall statistical significance of the Linear Regression Model. A low F-Statistic suggests that the Predictor Variables, taken together, do not explain the Response Variable much better than a model with no predictors.
The P-value indicates the statistical significance of the F-Statistic: it is the probability of observing an F-Statistic this large if the Predictor Variables had no real effect. A P-value less than 0.05 is conventionally considered statistically significant.
print(model.summary())
The summary of the model provides information about the F-Statistic and P-value. In our example, the F-Statistic is 10.39, with a P-value of 0.023, indicating that the model is statistically significant.
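Both values are likewise available as attributes on the results object:
print(model.fvalue)    # F-statistic
print(model.f_pvalue)  # its P-value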
Coefficients for Each Predictor Variable:
The Coefficients for each Predictor Variable indicate the expected change in the Response Variable for a unit change in the Predictor Variable, keeping all other variables constant. The Intercept is the expected value of the Response Variable when all Predictor Variables are equal to zero.
print(model.params)
The `params` attribute provides the coefficients for each Predictor Variable and the Intercept. In our example, the Intercept is approximately 166,000, and the Coefficient of the Predictor Variable `Area` is 135.7877, indicating that each additional unit of Area is expected to increase the Price of the house by about $135.79.
Individual P-Values:
Individual P-Values indicate the statistical significance of each Predictor Variable. A P-value less than 0.05 is considered statistically significant.
print(model.summary())
The summary of the model provides information about the individual P-values of each Predictor Variable. In our example, the P-value of the Predictor Variable `Area` is 0.023, indicating that it is statistically significant.
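The individual P-values are also exposed as a pandas Series, indexed by variable name:
print(model.pvalues)  # P-values for const and Area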
Estimated Regression Equation:
The Estimated Regression Equation provides a formula that can be used to calculate the expected value of the Response Variable based on the Predictor Variables.
# estimated equation: Price = 166000 + 135.7877 * Area, evaluated at Area = 2000
y_hat = 166000 + 135.7877 * 2000
print(y_hat)
Here, we have calculated the expected Price of the house for an Area of 2000 square feet, using the Estimated Regression Equation. In our example, the expected Price is $437,575.
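The fitted model can also generate this prediction directly. A minimal sketch, where the leading 1.0 supplies the constant term:
import numpy as np
# one exog row in the form [const, Area]
print(model.predict(np.array([[1.0, 2000.0]])))  # close to the hand-computed 437575.4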
So far, we have entered the data, fitted the model, and interpreted the results. One step remains: checking that the assumptions behind Linear Regression actually hold for our data.
Step 4: Check Model Assumptions
After performing Linear Regression Analysis, it is essential to check if our model satisfies the assumptions of Linear Regression.
These assumptions include a linear relationship between Predictor Variables and Response Variable, independence of residuals, homoscedasticity of residuals, and normality of residuals. The following are some of the techniques used to check these assumptions.
Linear Relationship between Predictor Variables and Response Variable:
The first assumption in Linear Regression is a linear relationship between Predictor Variables and the Response Variable. To check this assumption, we can plot the Residuals against the Predictor Variable.
Residuals are the differences between the observed Response Variable and the Predicted Response Variable.
import matplotlib.pyplot as plt
# residuals: observed Price minus the Price predicted by the model
residuals = model.resid
plt.scatter(X['Area'], residuals)
plt.xlabel("Area")
plt.ylabel("Residuals")
plt.show()
Here, we have plotted the residuals against the Predictor Variable `Area`.
The plot should show no clear pattern, indicating that there is a linear relationship between Predictor Variables and the Response Variable. If there is a pattern, it indicates that the Linear Regression Model may not fit the data well, and we may need to consider adding more Predictor Variables or transforming the data.
Independence of Residuals:
The second assumption in Linear Regression is the independence of residuals. This means that the residuals should not be correlated with each other.
To check this assumption, we can use the Durbin-Watson Test.
from statsmodels.stats.stattools import durbin_watson
print("Durbin-Watson Test: ", durbin_watson(residuals))
The Durbin-Watson statistic ranges from 0 to 4. A value near 2 indicates no correlation between residuals, while values toward 0 or 4 indicate positive or negative autocorrelation, respectively.
Homoscedasticity of Residuals:
The third assumption in Linear Regression is homoscedasticity of residuals. This means that the variance of residuals should be constant across observations.
To check this assumption, we can use the Breusch-Pagan Test.
from statsmodels.stats.diagnostic import het_breuschpagan
# het_breuschpagan returns (LM statistic, LM P-value, F statistic, F P-value)
print("Breusch-Pagan Test: ", het_breuschpagan(residuals, X)[1])
The Breusch-Pagan Test checks for heteroscedasticity by testing whether the residual variance is constant across observations. A P-value less than 0.05 indicates that we cannot assume that the residual variance is constant.
Normality of Residuals:
The final assumption in Linear Regression is normality of residuals. This means that the residuals should be normally distributed.
To check this assumption, we can use the Q-Q Plot, the Jarque-Bera Test, and the Anderson-Darling Test. (The VIF Value, printed alongside them below, checks a separate issue: multicollinearity among the Predictor Variables.)
from statsmodels.graphics.gofplots import qqplot
from scipy.stats import jarque_bera
from scipy.stats import anderson
from statsmodels.stats.outliers_influence import variance_inflation_factor
qqplot(residuals, line='s')
plt.show()
print("Jarque-Bera Test P-value: ", jarque_bera(residuals)[1])
# scipy's anderson() returns a test statistic and critical values, not a P-value
ad_result = anderson(residuals)
print("Anderson-Darling statistic: ", ad_result.statistic)
print("Critical values: ", ad_result.critical_values)
print("VIF Value: ", variance_inflation_factor(X.values, 1))
The Q-Q Plot should show that the residuals follow the diagonal line, indicating normality.
The Jarque-Bera Test checks for normality by testing whether the residuals have a normal distribution. A P-value less than 0.05 indicates that we cannot assume that the residuals have a normal distribution.
The Anderson-Darling Test also checks for normality. In SciPy it returns a test statistic and critical values rather than a P-value: if the statistic exceeds the critical value at the 5% significance level, we cannot assume that the residuals are normally distributed.
The VIF Value checks for multicollinearity by measuring how strongly each Predictor Variable is correlated with the others. A VIF Value below 5 is commonly taken to mean that multicollinearity is not a problem.
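Our example has only one Predictor Variable, so multicollinearity is not a concern here; the check matters when there are several predictors. A sketch of computing one VIF per predictor, assuming `X` is the design matrix with the constant in column 0:
# skip column 0 (the constant) and report a VIF for each predictor
for i in range(1, X.shape[1]):
    print(X.columns[i], variance_inflation_factor(X.values, i))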
In conclusion, checking the assumptions of Linear Regression is essential to ensure that the model fits the data well and that our predictions are accurate. By checking the linear relationship between Predictor Variables and the Response Variable, independence of residuals, homoscedasticity of residuals, and normality of residuals, we can gain insights from the data and improve our predictions.
The techniques discussed in this article, such as Residual Plots, the Durbin-Watson Test, the Breusch-Pagan Test, the Q-Q Plot, the Jarque-Bera Test, the Anderson-Darling Test, and the VIF Value, can be used to check the assumptions of Linear Regression in Python. Linear Regression is a popular statistical method used to establish a relationship between variables for prediction purposes.
In Python, implementing Linear Regression is easy and efficient, thanks to libraries such as statsmodels, TensorFlow, and scikit-learn. The four steps covered here (entering the data, fitting the model, interpreting the results, and checking the model assumptions) provide a complete workflow for building and validating a Linear Regression model.