Logistic Regression and ROC Curve
Logistic regression is a statistical method used to analyze a dataset where the response variable is binary (either 0 or 1). The purpose is to predict the probability of a response being in a particular category based on one or more predictor variables.
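Concretely, logistic regression passes a linear combination of the predictors through the sigmoid function to produce a probability between 0 and 1. A minimal sketch (the coefficients here are invented purely for illustration, not fitted to any data):

```python
import numpy as np

def sigmoid(z):
    # maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficients: intercept b0 and slope b1
b0, b1 = -1.5, 0.8

x = np.array([0.0, 1.0, 2.0, 3.0])
p = sigmoid(b0 + b1 * x)  # predicted probability of class 1 for each x
print(p)
```

As the linear score `b0 + b1 * x` increases, the predicted probability rises smoothly from near 0 toward 1, which is what makes the model suitable for a binary response.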
Logistic regression is widely used in the medical, social, and business domains, especially when the outcome of interest is dichotomous. The ROC (Receiver Operating Characteristic) curve is a graph that plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for every possible threshold of a given classification model.
The purpose of the ROC curve is to evaluate the performance of a binary classifier, which assigns a positive or negative label to each instance based on a predicted probability. The ROC curve can be used to find a classification threshold that best balances sensitivity and specificity, or to compare the performance of different classifiers under different scenarios.
Creating a ROC Curve
To create an ROC curve, we first fit a binary classifier to a dataset and obtain the predicted probability of each instance being in the positive class. Then we sweep a range of threshold values, typically from 0 to 1, and at each threshold calculate the true positive rate (TPR, i.e., sensitivity) and false positive rate (FPR, i.e., 1 − specificity).
Finally, we plot the TPR versus FPR for each threshold, resulting in a curve that represents how well the classifier distinguishes between the two classes.
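The threshold sweep described above can be sketched by hand on a toy set of labels and scores (the values below are invented for illustration):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1])               # true binary labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # predicted probabilities

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    y_pred = (scores >= t).astype(int)              # label positive above the threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))      # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))      # false positives
    tpr = tp / np.sum(y_true == 1)                  # sensitivity
    fpr = fp / np.sum(y_true == 0)                  # 1 - specificity
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

At threshold 0 everything is labeled positive (TPR = FPR = 1), and at threshold 1 nothing is (TPR = FPR = 0); the interesting tradeoffs lie in between, and plotting these (FPR, TPR) pairs traces out the ROC curve.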
Example in Python
Here’s an example of how to create an ROC curve in Python using the scikit-learn library:
```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
df = pd.read_csv(url, header=None)

# add column names
df.columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# replace missing values (encoded as "?") with NaN and drop those rows
df = df.replace("?", np.nan)
df = df.dropna()

# the "?" placeholders leave some columns as strings, so convert everything to numeric
df = df.apply(pd.to_numeric)

# convert the target column to binary
df["target"] = df["target"].apply(lambda x: 1 if x >= 1 else 0)

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop("target", axis=1), df["target"], test_size=0.2, random_state=42)

# fit a logistic regression model to the training data
# (max_iter is raised because the default 100 iterations may not converge on unscaled features)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train, y_train)

# predict probabilities for the testing data
probs = clf.predict_proba(X_test)[:, 1]

# calculate the FPR, TPR, and threshold values
fpr, tpr, thresholds = roc_curve(y_test, probs)

# calculate the AUC score
roc_auc = auc(fpr, tpr)

# plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="darkorange", lw=2, label="ROC curve (AUC = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.show()
```
In this example, we load the heart disease dataset from the UCI Machine Learning Repository, which contains 303 instances and 14 attributes, and convert the target variable to binary (0 or 1) based on a threshold of 1. We then split the dataset into 80% training and 20% testing sets, fit a logistic regression model on the training data using scikit-learn’s `LogisticRegression` class, and obtain the predicted probabilities for the testing data using the `predict_proba` method.
We then calculate the FPR, TPR, and threshold values using scikit-learn’s `roc_curve` function, and plot the ROC curve using matplotlib’s `plot` function. Finally, we calculate the AUC score using scikit-learn’s `auc` function and display it on the plot as well.
Interpretation of ROC Curve
Once we have created an ROC curve, we can use it to evaluate the performance of a binary classifier based on its sensitivity and specificity scores. Sensitivity is the proportion of actual positive instances that are correctly identified as positive, while specificity is the proportion of actual negative instances that are correctly identified as negative.
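Both quantities can be read straight off a confusion matrix. A small sketch with made-up labels and predictions:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # actual labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])  # predicted labels

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives

sensitivity = tp / (tp + fn)  # fraction of actual positives caught
specificity = tn / (tn + fp)  # fraction of actual negatives caught
print(sensitivity, specificity)
```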
A perfect classifier would have sensitivity and specificity scores of 1, meaning that it correctly identifies all positive and negative instances. A random classifier, by contrast, produces a TPR equal to its FPR at every threshold, so its ROC curve falls along the diagonal line: it cannot distinguish between the two classes.
Model Evaluation
One way to evaluate the performance of a binary classifier using an ROC curve is to calculate the area under the curve (AUC), which measures the overall quality of the classifier at all possible classification thresholds. The AUC score ranges from 0 to 1, with a score of 0.5 indicating a random classifier and a score of 1 indicating a perfect classifier.
A score above 0.5 means that the classifier performs better than random guessing. The AUC score can be interpreted as the probability that a positive instance is ranked higher than a negative instance by the classifier.
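This ranking interpretation can be verified directly: the fraction of (positive, negative) instance pairs in which the positive one receives the higher score equals the AUC. A toy check (scores invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]  # scores of positive instances
neg = scores[y_true == 0]  # scores of negative instances

# count pairs where the positive outranks the negative (ties count half)
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
manual_auc = np.mean(pairs)

print(manual_auc, roc_auc_score(y_true, scores))  # the two values agree
```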
AUC Calculation
To calculate the AUC score, we can use the `auc` function from scikit-learn, which takes the FPR and TPR arrays as input and computes the area under the ROC curve using the trapezoidal rule.
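Since the AUC is literally the area under the plotted curve, trapezoidal integration of the (FPR, TPR) points reproduces scikit-learn's result. A sketch on toy data (the labels and scores are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

fpr, tpr, _ = roc_curve(y_true, scores)

# trapezoidal rule: sum of segment widths times average segment heights
manual = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

print(manual, auc(fpr, tpr))  # both give the same area
```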
Note, however, that if the dataset is highly imbalanced (i.e., one class is much more frequent than the other), the AUC score can paint an overly optimistic picture: a large pool of true negatives keeps the false positive rate low even when the classifier produces many false positives relative to the number of positive instances. In such cases it is important to consider the prior probabilities of the classes, choose the classification threshold accordingly, and consult complementary views such as the precision-recall curve.
Example in Python
Here’s an example of how to calculate the AUC score and plot the ROC curve in Python using the scikit-learn library:
```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
df = pd.read_csv(url, header=None)

# add column names
df.columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# replace missing values (encoded as "?") with NaN and drop those rows
df = df.replace("?", np.nan)
df = df.dropna()

# convert the remaining string columns to numeric
df = df.apply(pd.to_numeric)

# convert the target column to binary
df["target"] = df["target"].apply(lambda x: 1 if x >= 1 else 0)

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop("target", axis=1), df["target"], test_size=0.2, random_state=42)

# fit a logistic regression model to the training data
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train, y_train)

# predict probabilities for the testing data
probs = clf.predict_proba(X_test)[:, 1]

# calculate the AUC score directly from labels and probabilities
roc_auc = roc_auc_score(y_test, probs)

# calculate the FPR, TPR, and threshold values
fpr, tpr, thresholds = roc_curve(y_test, probs)

# plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="darkorange", lw=2, label="ROC curve (AUC = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.show()
```
As in the previous example, we load and clean the heart disease dataset, split it into 80% training and 20% testing sets, fit a logistic regression model using scikit-learn’s `LogisticRegression` class, and obtain predicted probabilities for the testing data using the `predict_proba` method.
We then calculate the AUC score using scikit-learn’s `roc_auc_score` function, which takes the true labels and predicted probabilities as input, and calculate the FPR, TPR, and threshold values using scikit-learn’s `roc_curve` function. Finally, we plot the ROC curve using matplotlib’s `plot` function and display the AUC score in the legend.
Conclusion
In this article, we explored the concepts of logistic regression and the ROC curve, discussed their purposes and interpretation, and walked through step-by-step examples in Python using scikit-learn, including how to calculate the AUC score and interpret its results.
Logistic regression and the ROC curve are widely used to model and evaluate binary classification problems in many fields. Understanding their metrics and techniques can improve our ability to analyze complex datasets, build more effective classifiers, and make informed decisions.