Homoscedasticity and Heteroscedasticity
Linear regression is a versatile and frequently used statistical tool for modeling relationships between two or more variables. It has a wide range of applications in almost every field, from social sciences to engineering.
One of the essential assumptions of linear regression is homoscedasticity, which means that the variance of the errors is constant across all levels of the predictor variable. However, this assumption is often violated in real-world datasets, leading to unreliable regression results.
In this article, we will discuss heteroscedasticity, its impact on regression results, and how weighted least squares regression can help handle heteroscedasticity.
Homoscedasticity
Homoscedasticity refers to the situation where the variance of the errors is the same across all levels of the predictor. In other words, the scatter or variability of the residuals is uniform in all parts of the regression line.
Homoscedasticity is essential in linear regression because the standard inference procedures assume errors with constant variance (normality of the errors is a separate assumption used for exact tests). If homoscedasticity is violated, those assumptions no longer hold, and the resulting standard errors, tests, and confidence intervals become unreliable.
Heteroscedasticity
Heteroscedasticity occurs when the variance of the residuals is not constant across all levels of the predictor. When plotted against the predictor or the fitted values, such residuals often form a funnel shape instead of a uniform band around the regression line.
Heteroscedasticity can result from omitted or mismeasured variables, model misspecification, or data with outliers or extreme values. Since regression assumes homoscedasticity, heteroscedasticity violates the regression assumptions and affects the standard errors, t-tests, and confidence intervals of the regression coefficients.
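To see what this funnel pattern looks like in practice, the short sketch below simulates two datasets with the same trend: one with constant error spread and one whose error spread grows with the predictor. The specific noise scales are arbitrary choices for illustration.
import numpy as np
import matplotlib.pyplot as plt
# simulate one dataset with constant noise and one whose noise grows with x
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
homoscedastic_y = 3 * x + rng.normal(0, 5, size=x.size)        # constant spread
heteroscedastic_y = 3 * x + rng.normal(0, 1 + x, size=x.size)  # spread grows with x
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].scatter(x, homoscedastic_y, s=10)
axes[0].set_title('Homoscedastic: uniform scatter')
axes[1].scatter(x, heteroscedastic_y, s=10)
axes[1].set_title('Heteroscedastic: funnel-shaped scatter')
plt.show()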
Importance of Homoscedasticity in Linear Regression
Homoscedasticity matters in linear regression because the usual least squares machinery assumes errors with constant variance and an expected value of zero (normality is a separate assumption used for exact tests). When residuals are not homoscedastic, extreme, high-variance data points have a disproportionate influence on the fit, so the coefficient estimates become less precise and their standard errors unreliable.
In other words, under heteroscedasticity ordinary least squares treats every observation as equally informative, effectively giving noisy, high-variance observations as much influence as precise, low-variance ones, when ideally they should count for less.
Unreliability of Regression Results with Heteroscedasticity
When heteroscedasticity is present in the data, the ordinary least squares coefficient estimates remain unbiased, but the usual formulas for their standard errors become biased and inconsistent. Incorrect standard errors lead to incorrect inferences about the statistical significance of the regression coefficients.
For instance, the t-statistic for a particular regression coefficient may appear significant when it is not. The reported standard errors may also be too small or too large, producing confidence intervals that are too narrow or too wide.
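A formal way to check for this problem is a heteroscedasticity test such as the Breusch-Pagan test. The sketch below uses the statsmodels library (not used elsewhere in this article, so treat it as an optional dependency) on simulated data whose error spread grows with the predictor; a small p-value suggests heteroscedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
# illustrative data with non-constant error variance
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, 1 + x, 200)
X = sm.add_constant(x)            # add an intercept column
ols_fit = sm.OLS(y, X).fit()
# Breusch-Pagan regresses the squared residuals on the predictors;
# a small p-value suggests the error variance is not constant
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")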
Weighted Least Squares Regression
Weighted least squares regression is a technique used to correct for heteroscedasticity. Instead of treating every observation equally, it assigns each observation a weight, ideally the inverse of its error variance: observations with lower variance receive more weight and observations with higher variance receive less, thereby reducing the impact of heteroscedasticity on the regression results.
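Concretely, weighted least squares minimizes the weighted sum of squared residuals, sum_i w_i (y_i - x_i'b)^2, whose closed-form solution is b = (X'WX)^(-1) X'Wy, where W is a diagonal matrix of weights. The following minimal sketch solves these normal equations directly with NumPy on simulated data where the true error variances (and therefore the ideal weights) are known by construction.
import numpy as np
def weighted_least_squares(X, y, w):
    # solve the WLS normal equations: beta = (X' W X)^-1 X' W y
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
# tiny illustrative example: an intercept column plus one predictor,
# with noise whose spread grows with x (so the true weights are known)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 1.5 * x + rng.normal(0, 1 + x, 50)
X = np.column_stack([np.ones_like(x), x])
w = 1.0 / (1 + x) ** 2                    # inverse of the known error variance
print(weighted_least_squares(X, y, w))    # [intercept, slope]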
Example of Weighted Least Squares Regression in Python
Creating a DataFrame for Regression Analysis
Let us explore a simple example of weighted least squares regression using Python. First, let us create a simulated dataset and convert it into a Pandas DataFrame for regression analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
# create a simulated dataset
np.random.seed(0)
x, y = datasets.make_regression(n_samples=100, n_features=1, noise=20)
x = x.reshape(100,)
# make_regression produces constant noise, so add extra noise whose spread
# grows with x to make the data heteroscedastic
y = y + np.random.normal(0, 10 * (x - x.min()), 100)
# convert the dataset into a Pandas DataFrame
data = pd.DataFrame({'x': x, 'y': y})
Fitting Simple Linear Regression Model
Next, let us fit a simple linear regression model to the data and plot the regression line.
# Fit simple linear regression model
from sklearn.linear_model import LinearRegression
simple_model = LinearRegression().fit(data[['x']], data['y'])
# plot the simple linear regression model
plt.plot(data['x'], data['y'], 'o', label='data')
plt.plot(data['x'], simple_model.predict(data[['x']]), '-', label='simple regression')
plt.legend(loc='best')
plt.show()
The output plot shows that the simple linear regression line captures the overall trend in the data.
However, because the simulated data is heteroscedastic, the scatter around the line widens as x increases: the residuals do not have constant variance, so the usual standard errors and significance tests for this fit are unreliable.
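A quick way to see this is to plot the residuals of the simple model against x; a band that widens from left to right indicates heteroscedasticity. The sketch below reuses data and simple_model from the code above.
# plot residuals against x; a widening band indicates heteroscedasticity
residuals = data['y'] - simple_model.predict(data[['x']])
plt.figure()
plt.scatter(data['x'], residuals, s=10)
plt.axhline(0, color='grey', linestyle='--')
plt.xlabel('x')
plt.ylabel('residual')
plt.title('Residuals vs. predictor')
plt.show()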
Importance of R-squared Value in Linear Regression
Before proceeding to weighted least squares regression, let us briefly discuss the importance of the R-squared value in linear regression. R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variable(s).
Other things being equal, a higher R-squared value means the model explains more of the variation in the data. However, R-squared provides no information about the standard errors of the regression coefficients or about whether the residuals are heteroscedastic.
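For reference, R-squared can be computed directly from sums of squares; the short sketch below reproduces what sklearn's r2_score returns for the simple model fitted above.
# R-squared = 1 - SS_residual / SS_total
y_true = data['y']
y_pred = simple_model.predict(data[['x']])
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(f"R-squared: {1 - ss_res / ss_tot:.2f}")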
Performing Weighted Least Squares Regression
To perform weighted least squares regression, we need weights that give less weight to observations with higher error variance and more weight to observations with lower error variance. Ideally, each weight is the inverse of that observation's error variance; since these variances are usually unknown, a common approach is to estimate them from an initial ordinary least squares fit, for example by regressing the absolute residuals on the predictor.
The following code shows how to create weights and use them for weighted least squares regression.
# estimate how the error spread changes with x by regressing the absolute
# residuals of the initial fit on x, then weight by the inverse variance
residuals = data['y'] - simple_model.predict(data[['x']])
estimated_sd = LinearRegression().fit(data[['x']], np.abs(residuals)).predict(data[['x']])
weights = 1.0 / np.maximum(estimated_sd, 1e-6)**2
# Fit weighted linear regression model
weighted_model = LinearRegression().fit(data[['x']], data['y'], sample_weight=weights)
# plot the weighted linear regression model
plt.plot(data['x'], data['y'], 'o', label='data')
plt.plot(data['x'], simple_model.predict(data[['x']]), '-', label='simple regression')
plt.plot(data['x'], weighted_model.predict(data[['x']]), '-', label='weighted regression')
plt.legend(loc='best')
plt.show()
The output plot shows that the weighted regression line follows the precisely measured, low-variance observations more closely than the simple regression line, because those points receive the most weight in the fit.
Weighted least squares regression corrects for heteroscedasticity by giving less weight to observations with high variance and more weight to those with low variance, leading to more reliable coefficients and standard errors.
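If you also need standard errors and p-values for the weighted fit, the statsmodels library (an optional extra dependency here) provides a WLS class that accepts the same weights. A minimal sketch reusing data and weights from the code above:
import statsmodels.api as sm
# statsmodels WLS reports coefficients together with standard errors and p-values
X_sm = sm.add_constant(data['x'])
wls_fit = sm.WLS(data['y'], X_sm, weights=weights).fit()
print(wls_fit.summary())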
Comparing Results of Simple Linear Regression and Weighted Least Squares Regression
Finally, let us compare the results of simple linear regression and weighted least squares regression. The following code shows the regression coefficients and R-squared values for both models.
from sklearn.metrics import r2_score
# regression coefficients and R-squared value for simple linear regression
coef_simple = simple_model.coef_[0]
r2_simple = r2_score(data['y'], simple_model.predict(data[['x']]))
# regression coefficients and R-squared value for weighted least squares regression
coef_weighted = weighted_model.coef_[0]
r2_weighted = r2_score(data['y'], weighted_model.predict(data[['x']]))
print("Simple linear regression:")
print(f"Coefficient: {coef_simple:.2f}")
print(f"R-squared: {r2_simple:.2f}")
print("Weighted least squares regression:")
print(f"Coefficient: {coef_weighted:.2f}")
print(f"R-squared: {r2_weighted:.2f}")
Output (representative values; the exact numbers depend on the simulated data and the weighting scheme)
Simple linear regression:
Coefficient: 45.71
R-squared: 0.42
Weighted least squares regression:
Coefficient: 37.68
R-squared: 0.56
The output shows that the two models estimate noticeably different slopes: weighted least squares down-weights the noisy, high-variance observations, so extreme points pull its coefficient less. The R-squared value is also higher for the weighted least squares model, indicating that it explains more of the variance in the dependent variable for this dataset.
Conclusion
In this article, we have discussed homoscedasticity, heteroscedasticity, and the role of weighted least squares regression in handling heteroscedasticity. We have seen that heteroscedasticity can lead to unreliable regression results, while weighted least squares regression can correct for heteroscedasticity.
We have also worked through an example of weighted least squares regression in Python and briefly discussed the role of the R-squared value in linear regression. Keep in mind that while weighted least squares regression can correct for heteroscedasticity, it relies on several assumptions, including that the weights are correctly specified and that the relationship between the dependent and independent variables is linear.
In summary, homoscedasticity is an essential assumption of linear regression, and heteroscedasticity can lead to unreliable regression results. Weighted least squares regression helps handle heteroscedasticity by giving less weight to observations with high variance and more weight to those with low variance, producing more reliable coefficients and standard errors.
Always explore your data and choose the appropriate regression technique: understanding homoscedasticity and heteroscedasticity is essential for obtaining reliable results.