Matthews Correlation Coefficient: An Important Metric for Classification Models
As data becomes increasingly valuable across industries, there is growing interest in using it to make predictions about the future. One common way to do this is to build classification models that take in various inputs and predict which category each observation belongs to.
Examples of classification models include predicting whether an email is spam or not, diagnosing a patient with a disease, or even predicting which college basketball players will be drafted into the NBA. However, some classification problems can be inherently imbalanced, where one class of outputs might rarely occur compared to the other.
For example, in the NBA draft example, there are only 60 draft picks each year despite thousands of college basketball players across the nation. This is why it’s crucial to have a performance metric that can account for imbalanced classes and properly evaluate the effectiveness of a classification model.
One such metric that is particularly useful for imbalanced classes is the Matthews correlation coefficient (MCC). MCC is a statistical measure that evaluates the quality of binary (two-class) classifications, taking into account true positives, false positives, true negatives, and false negatives.
MCC ranges from -1 to 1, with 1 indicating perfect prediction and -1 indicating complete disagreement between predictions and outcomes.
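To see the endpoints of that range concretely, here is a minimal sketch using Scikit-learn's matthews_corrcoef function (covered in more detail in the implementation section below); the labels are made up purely for illustration:

```python
from sklearn.metrics import matthews_corrcoef

y_actual = [1, 0, 1, 0]

# Predictions that agree perfectly with the actual labels
print(matthews_corrcoef(y_actual, [1, 0, 1, 0]))  # 1.0

# Predictions that disagree on every label
print(matthews_corrcoef(y_actual, [0, 1, 0, 1]))  # -1.0
```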
Calculation of MCC
1. Formula
To calculate MCC, the following formula is used:
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. If any of the four sums in the denominator is zero, the denominator is zero; in that case MCC is conventionally defined as 0.
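As a quick illustration of the formula itself, here is a minimal sketch that computes MCC directly from a set of counts; the TP, TN, FP, and FN values below are invented for demonstration:

```python
import math

# Hypothetical confusion-matrix counts, invented for illustration
TP, TN, FP, FN = 90, 5, 10, 3

numerator = TP * TN - FP * FN
denominator = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

# By convention, MCC is defined as 0 when the denominator is 0
mcc = numerator / denominator if denominator != 0 else 0.0
print(f"MCC: {mcc:.3f}")
```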
2. Confusion Matrix
MCC can be calculated using a confusion matrix, which is a table that summarizes the number of correct and incorrect predictions by the model:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
From the confusion matrix, we can obtain the values of TP, TN, FP, and FN, which are then used in the MCC formula to get the final score.
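In Python, one convenient way to extract those four values is Scikit-learn's confusion_matrix function; the labels below are synthetic and chosen only to populate the matrix:

```python
from sklearn.metrics import confusion_matrix

# Synthetic actual and predicted labels for illustration
y_actual = [1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

# Rows are actual classes and columns are predicted classes, so with
# labels [0, 1] the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```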
Importance of MCC in Imbalanced Classes
When dealing with imbalanced classes, accuracy can be a misleading performance metric. For example, consider a model that simply assigns every observation to the majority class: its accuracy will be high, yet it provides no useful information about the minority class.
MCC, on the other hand, considers all four categories of the confusion matrix and gives a more honest assessment of prediction quality. Another useful property of MCC is that it is far less sensitive to the class distribution than accuracy.
This matters for data whose class imbalance shifts over time. As the distribution changes, accuracy can drift simply because the majority class grows or shrinks, while MCC, which rewards a model only when it predicts both classes well, provides a more consistent evaluation of performance over time.
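To make the contrast concrete, here is a small sketch with synthetic, deliberately imbalanced labels that compares accuracy and MCC for a model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Synthetic imbalanced data: 95 negatives, 5 positives
y_actual = [0] * 95 + [1] * 5

# A model with no skill that always predicts the majority class
y_pred = [0] * 100

print("Accuracy:", accuracy_score(y_actual, y_pred))  # 0.95, looks impressive
print("MCC:", matthews_corrcoef(y_actual, y_pred))    # 0.0, reveals no skill
```

Accuracy rewards the model for the 95 easy negatives, while MCC, which requires correct predictions in both classes, scores it at zero.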
Example of Using MCC to Evaluate a Classification Model
Let’s consider a scenario where a researcher is building a classification model to predict which college basketball players will be drafted into the NBA. The researcher has collected data from previous years’ drafts, including features such as player height, weight, college, and position played.
The researcher may decide to use logistic regression, a type of classification algorithm, to train the model on this data. In logistic regression, the model generates a probability that each player will be drafted, and a threshold is applied to that probability to classify each record as drafted or not drafted.
To evaluate the performance of this model, the researcher can calculate MCC. Using the confusion matrix layout shown earlier, the researcher obtains the values of TP, TN, FP, and FN for the model. The predicted class is drafted or not drafted, while the actual class is determined by the real draft outcome.
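Here is a hedged sketch of what that pipeline might look like; the feature values and labels below are entirely synthetic placeholders, not real draft data, so the resulting MCC will hover near zero and only the mechanics of thresholding and scoring are meaningful:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Entirely synthetic stand-ins for player features:
# [height_cm, weight_kg, points_per_game]
rng = np.random.default_rng(0)
X = rng.normal(loc=[198, 95, 12], scale=[8, 10, 5], size=(500, 3))

# Synthetic "drafted" labels, imbalanced like the real problem (~10% positive)
y = (rng.random(500) < 0.1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the logistic regression classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Turn predicted probabilities into classes with a 0.5 threshold
proba = model.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.5).astype(int)

print("MCC:", matthews_corrcoef(y_test, y_pred))
```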
Implementing MCC Calculation in Python
If you’re using Python, packages such as Scikit-learn provide convenient functions for calculating MCC. The matthews_corrcoef() function in Scikit-learn can be used as follows:
```python
from sklearn.metrics import matthews_corrcoef

# Actual classes
y_actual = [1, 0, 0, 1, 0, 1, 1, 0]

# Predicted classes
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

# Calculate MCC
mcc = matthews_corrcoef(y_actual, y_pred)
print("MCC:", mcc)
```
The output is a value between -1 and 1, with higher values indicating better performance; for the labels above (TP = 3, TN = 3, FP = 1, FN = 1), the result is 0.5.
In conclusion, when evaluating classification models it’s important to have a performance metric that accurately reflects the quality of predictions, especially when the classes are imbalanced. Unlike accuracy, the Matthews correlation coefficient takes into account all four categories of the confusion matrix, and its relative insensitivity to class distribution makes it particularly useful for data sets whose class imbalance changes over time.
By incorporating MCC into the evaluation process, researchers and data scientists can ensure that they are generating reliable assessments and avoid misleading results that could have significant consequences.