Adventures in Machine Learning

Mastering Multiple Linear Regression with Python’s Statsmodels

Have you ever wanted to predict an outcome from several variables that each have a meaningful effect on the result?

Enter multiple linear regression, a powerful statistical method that can help you analyze how multiple independent variables affect a dependent variable. In this article, we’ll explore how to use the statsmodels library in Python to fit a multiple linear regression model and make predictions.

Fitting a Multiple Linear Regression Model

Before we dive into fitting a multiple linear regression model, let’s understand what it is. Multiple linear regression is a method used to predict the value of the dependent variable based on the values of multiple independent or predictor variables.

The formula for multiple linear regression is:

Y = b0 + b1X1 + b2X2 + … + bNXN

Where Y is the dependent variable, X1, X2, …, XN are the independent variables, and b0, b1, b2, …, bN are the coefficients.
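
To make the equation concrete, here is a minimal sketch that evaluates it for one observation with two predictors. The coefficient and predictor values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients, for illustration only
b0 = 30.0                   # intercept
b = np.array([10.0, 0.5])   # b1, b2

# One observation with two predictor values: X1, X2
x = np.array([4.0, 2.0])

# Y = b0 + b1*X1 + b2*X2
y_hat = b0 + b @ x
print(y_hat)  # 30 + 10*4 + 0.5*2 = 71.0
```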

Now that we understand what multiple linear regression is, let’s explore how to fit a model in Python using the statsmodels library. The first step is to load the necessary libraries.

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
```

Next, let’s create a DataFrame with our data. We’ll use a hypothetical scenario where we’re trying to predict the scores of students based on their age, the number of hours they studied, and the number of extracurricular activities they participated in.

```python
data = {'age': [18, 19, 20, 21, 18, 19, 20, 21, 18, 19, 20, 21],
        'hours_studied': [5, 4, 6, 3, 4, 6, 5, 3, 5, 6, 4, 3],
        'extracurricular_activities': [2, 1, 3, 2, 1, 3, 2, 2, 2, 3, 1, 2],
        'score': [80, 70, 90, 60, 70, 90, 80, 60, 80, 90, 70, 60]}

df = pd.DataFrame(data)
```

Our DataFrame has four columns, i.e., age, hours_studied, extracurricular_activities, and score, where score is the dependent variable. Now that we have our DataFrame, let’s fit our model using the Ordinary Least Squares (OLS) method in statsmodels.

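```python
# Predictors (independent variables) and response (dependent variable)
X = df[['age', 'hours_studied', 'extracurricular_activities']]
y = df['score']

# Add a column of ones so the model estimates the intercept b0
X = sm.add_constant(X)

# Fit by ordinary least squares; fit() returns the results object
model = sm.OLS(y, X).fit()
```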

In the above code snippet, we separate the dependent and independent variables and then add a constant value to the independent variables. The reason for adding the constant is to estimate the intercept b0 in the multiple linear regression equation.

We then fit the model: sm.OLS(y, X) sets up the ordinary least squares model, and fit() estimates the coefficients and returns a RegressionResults object, which we store in the variable model.

Making Predictions with the Fitted Model

Now that we have our fitted model, let’s use it to make predictions. We’ll create a new DataFrame with data for new students to predict their scores.

```python
new_data = {'age': [19, 20, 22],
            'hours_studied': [4, 5, 6],
            'extracurricular_activities': [3, 2, 1]}

new_df = pd.DataFrame(new_data)
```

Our new DataFrame has three rows with data for three new students.

```
   age  hours_studied  extracurricular_activities
0   19              4                           3
1   20              5                           2
2   22              6                           1
```

We’ll use the predict() method of the RegressionResults object to make predictions. It takes the predictor values for the new observations, here a DataFrame with the same columns that were used to fit the model.

```python
new_X = sm.add_constant(new_df)
predictions = model.predict(new_X)
```

We add the constant column to the new data so that it has the same columns, in the same order, as the data used to fit the model; predict() needs this to apply the intercept correctly. We then make predictions for the new students and store the results in the predictions object.

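Our predictions are (up to floating-point rounding):

```
0    70.0
1    80.0
2    90.0
dtype: float64
```

Based on our multiple linear regression model, we predict that the first student will score 70, the second student will score 80, and the third student will score 90. The toy data follow score = 30 + 10 × hours_studied exactly, so the fitted model recovers that relationship and the coefficients for age and extracurricular_activities come out as essentially zero.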

Conclusion

Multiple linear regression estimates how several independent variables jointly relate to a dependent variable, which makes it a useful tool for prediction. With the statsmodels library in Python, fitting such a model and making predictions takes only a few lines of code.

In this article, we covered how to fit a multiple linear regression model using OLS() and how to make predictions using predict(). Armed with this knowledge, you can explore how multiple linear regression can help you and your organization gain insights and make data-driven decisions.

Happy modeling!