Adventures in Machine Learning

Maximizing Model Performance: Understanding Logistic Regression and ROC Curve

Logistic Regression and ROC Curve

Logistic regression is a statistical method used to analyze a dataset where the response variable is binary (either 0 or 1). The purpose is to predict the probability of a response being in a particular category based on one or more predictor variables.
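As a minimal sketch of how that probability is produced, the snippet below applies the logistic (sigmoid) function to a linear combination of the predictors. The function name `predict_probability` and the coefficient values are hypothetical and chosen only for illustration; a fitted model would estimate them from data.

```python
import math

def predict_probability(features, weights, intercept):
    """Return P(y = 1 | features) under a logistic regression model."""
    # Linear score: intercept plus the weighted sum of the predictors.
    z = intercept + sum(w * x for w, x in zip(weights, features))
    # The logistic (sigmoid) function maps any real score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two predictors (illustration only).
p = predict_probability([2.0, 1.0], weights=[0.5, -1.0], intercept=0.0)
print(p)
```

Here the linear score is zero, so the predicted probability sits exactly halfway between the two classes; larger positive scores push it toward 1 and larger negative scores toward 0.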

Logistic regression is widely used in the medical, social, and business domains, especially when the outcome of interest is dichotomous. The ROC (Receiver Operating Characteristic) curve is a graph that plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate) for every possible threshold of a given classification model.

The purpose of the ROC curve is to evaluate the performance of a binary classifier, which assigns a positive or negative label to each instance based on a calculated probability. The ROC curve can be used to choose a classification threshold that balances sensitivity against specificity, or to compare the performance of different classifiers.
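One common way to pick a threshold from ROC output is Youden's J statistic (TPR minus FPR). It is only one criterion among several, and nothing here prescribes it. The sketch below assumes arrays shaped like those returned by scikit-learn's `roc_curve`; the values and the function name `best_threshold_youden` are hypothetical.

```python
def best_threshold_youden(fpr, tpr, thresholds):
    """Return (threshold, J) for the point with the largest TPR - FPR."""
    j_scores = [t - f for f, t in zip(fpr, tpr)]
    # Index of the largest J; ties resolve to the first occurrence.
    best_index = max(range(len(j_scores)), key=j_scores.__getitem__)
    return thresholds[best_index], j_scores[best_index]

# Hypothetical ROC points (illustration only).
fpr = [0.0, 0.0, 0.25, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0, 1.0]
thresholds = [1.1, 0.8, 0.6, 0.35, 0.1]

threshold, j = best_threshold_youden(fpr, tpr, thresholds)
print(threshold, j)
```

Youden's J weights false positives and false negatives equally; when the two kinds of error have different costs, a different criterion may be more appropriate.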

Creating a ROC Curve

To create an ROC curve, we first fit a binary classifier to a dataset and obtain the predicted probability of each instance being in the positive class. We then sweep a range of threshold values, typically from 0 to 1, and for each threshold calculate the true positive rate (TPR, which equals sensitivity) and the false positive rate (FPR, which equals 1 - specificity).

Finally, we plot the TPR versus FPR for each threshold, resulting in a curve that represents how well the classifier distinguishes between the two classes.
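The steps above can be sketched without any library, which may help make clear what a function such as `roc_curve` computes. The labels, scores, and the helper name `roc_points` below are made up for illustration, and the sketch assumes an instance is predicted positive when its score is at or above the threshold.

```python
def roc_points(y_true, scores, thresholds):
    """Return a list of (FPR, TPR) pairs, one per threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = []
    for t in thresholds:
        # Predict positive when the score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy data (illustration only): two negatives and two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Sweep from a threshold above every score down to the lowest score.
points = roc_points(y_true, scores, thresholds=[1.1, 0.8, 0.4, 0.35, 0.1])
for fpr_value, tpr_value in points:
    print(fpr_value, tpr_value)
```

The curve starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.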

Example in Python

Here’s an example of how to create an ROC curve in Python using the scikit-learn library:

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
df = pd.read_csv(url, header=None)

# add column names
df.columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# replace missing values (encoded as "?") with NaN
df = df.replace("?", np.nan)

# drop rows with missing values, then make every column numeric
# (columns containing "?" are read in as strings)
df = df.dropna().astype(float)

# convert the target column to binary (0 = no disease, 1 = disease)
df["target"] = (df["target"] >= 1).astype(int)

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop("target", axis=1), df["target"], test_size=0.2, random_state=42)

# fit a logistic regression model to the training data
# (a higher max_iter helps the solver converge on unscaled features)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train, y_train)

# predict probabilities of the positive class for the testing data
probs = clf.predict_proba(X_test)[:, 1]

# calculate the FPR, TPR, and threshold values
fpr, tpr, thresholds = roc_curve(y_test, probs)

# calculate the AUC score
roc_auc = auc(fpr, tpr)

# plot the ROC curve together with the diagonal of a random classifier
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="darkorange", lw=2, label="ROC curve (AUC = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.show()
```

In this example, we load the heart disease dataset from the UCI Machine Learning Repository, which contains 303 instances and 14 attributes, and convert the target variable to binary (0 or 1) based on a threshold of 1. We then split the dataset into 80% training and 20% testing sets, fit a logistic regression model on the training data using scikit-learn’s `LogisticRegression` class, and obtain the predicted probabilities for the testing data using the `predict_proba` method.

We then calculate the FPR, TPR, and threshold values using scikit-learn’s `roc_curve` function and the AUC score using scikit-learn’s `auc` function. Finally, we plot the ROC curve using matplotlib’s `plot` function and display the AUC score in the legend.

Interpretation of ROC Curve

Once we have created an ROC curve, we can use it to evaluate the performance of a binary classifier based on its sensitivity and specificity scores. Sensitivity is the proportion of actual positives that are correctly identified as positive, TP / (TP + FN), while specificity is the proportion of actual negatives that are correctly identified as negative, TN / (TN + FP).

A perfect classifier would have sensitivity and specificity scores of 1, meaning that it correctly identifies all positive and negative instances, respectively; its ROC curve passes through the top-left corner (FPR = 0, TPR = 1). A random classifier cannot distinguish between the two classes: at every threshold its TPR equals its FPR, so its ROC curve follows the diagonal line.
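To make the two definitions concrete, here is a small sketch that counts the four confusion-matrix cells for a fixed set of predicted labels. The labels and the helper name `sensitivity_specificity` are hypothetical and serve only as an illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary 0/1 labels."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of actual positives found
    specificity = tn / (tn + fp)  # share of actual negatives found
    return sensitivity, specificity

# Toy labels (illustration only): 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec, 1 - spec)  # the last value is the false positive rate
```

Each threshold yields one such pair, and the pair (1 - specificity, sensitivity) is a single point on the ROC curve.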

Model Evaluation

One way to evaluate the performance of a binary classifier using an ROC curve is to calculate the area under the curve (AUC), which measures the overall quality of the classifier at all possible classification thresholds. The AUC score ranges from 0 to 1, with a score of 0.5 indicating a random classifier and a score of 1 indicating a perfect classifier.

A score above 0.5 means that the classifier performs better than random guessing. The AUC score can be interpreted as the probability that a randomly chosen positive instance is ranked higher by the classifier than a randomly chosen negative instance.
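This ranking interpretation can be checked directly by comparing every positive instance with every negative one. The sketch below uses made-up labels and scores and a hypothetical helper name, `auc_by_ranking`; it counts ties as half a win, which is the usual convention.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Toy data (illustration only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

ranking_auc = auc_by_ranking(y_true, scores)
print(ranking_auc)
```

Three of the four positive-negative pairs are ordered correctly here, so the score is three quarters. The pairwise loop is quadratic in the number of instances, so it is meant for illustration rather than for large datasets.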

AUC Calculation

To calculate the AUC score, we can use the `auc` function from scikit-learn, which takes the FPR and TPR arrays as input and integrates the ROC curve using the trapezoidal rule. Alternatively, scikit-learn’s `roc_auc_score` function computes the same quantity directly from the true labels and the predicted scores.
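The area under a piecewise-linear ROC curve can be computed with the trapezoidal rule. The following sketch applies the rule by hand to a handful of made-up ROC points; the helper name `trapezoid_auc` is hypothetical, and the points are assumed to be sorted by increasing FPR.

```python
def trapezoid_auc(fpr, tpr):
    """Area under the curve through (fpr[i], tpr[i]) by the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * mean_height
    return area

# Hypothetical ROC points, sorted by increasing FPR (illustration only).
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]

area = trapezoid_auc(fpr, tpr)
print(area)
```

Vertical segments of the curve have zero width and add nothing to the area; only the horizontal and sloped segments contribute.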

If the dataset is highly imbalanced (i.e., one class is much more frequent than the other), the AUC score can give an overly optimistic picture. TPR and FPR are each computed within a single class, so the score does not reflect how rare the positive class is, and a classifier with a high AUC can still produce many false positives relative to true positives at a given threshold. Therefore, it is important to consider the prior probabilities of the classes when choosing the threshold.

Example in Python

Here’s an example of how to calculate the AUC score and plot the ROC curve in Python using the scikit-learn library:

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
df = pd.read_csv(url, header=None)

# add column names
df.columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# replace missing values (encoded as "?") with NaN
df = df.replace("?", np.nan)

# drop rows with missing values, then make every column numeric
# (columns containing "?" are read in as strings)
df = df.dropna().astype(float)

# convert the target column to binary (0 = no disease, 1 = disease)
df["target"] = (df["target"] >= 1).astype(int)

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop("target", axis=1), df["target"], test_size=0.2, random_state=42)

# fit a logistic regression model to the training data
# (a higher max_iter helps the solver converge on unscaled features)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train, y_train)

# predict probabilities of the positive class for the testing data
probs = clf.predict_proba(X_test)[:, 1]

# calculate the AUC score directly from labels and probabilities
roc_auc = roc_auc_score(y_test, probs)

# calculate the FPR, TPR, and threshold values
fpr, tpr, thresholds = roc_curve(y_test, probs)

# plot the ROC curve together with the diagonal of a random classifier
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="darkorange", lw=2, label="ROC curve (AUC = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.show()
```

This example uses the same heart disease dataset, preprocessing, 80%/20% train/test split, and logistic regression model as the first example, and again obtains the predicted probabilities for the testing data using the `predict_proba` method.

We then calculate the AUC score using scikit-learn’s `roc_auc_score` function, which takes the true labels and predicted probabilities as input, and calculate the FPR, TPR, and threshold values using scikit-learn’s `roc_curve` function. Finally, we plot the ROC curve using matplotlib’s `plot` function and display the AUC score in the legend.

Conclusion

In this article, we discussed logistic regression and the ROC curve, including their purposes, metrics, and interpretation. We also provided step-by-step examples in Python using the scikit-learn library and explained how to calculate the AUC score and interpret its results.

Logistic regression and the ROC curve are widely used in various fields to model and evaluate binary classification problems. Understanding their concepts and metrics can improve our ability to analyze complex datasets, build more effective binary classification models, and make informed decisions.
