
Mastering Confusion Matrix in Python: Tools and Techniques

Creating and Displaying Confusion Matrix in Python

In machine learning, the confusion matrix is an essential evaluation tool. It is a table that summarizes the performance of a classifier on a particular dataset by showing the number of true positives, true negatives, false positives, and false negatives.

The confusion matrix helps data scientists identify the strengths and weaknesses of a model, allowing for better fine-tuning and optimization. In Python, creating and displaying a confusion matrix is easy, and it can be done using libraries like pandas, seaborn, and pandas_ml.

In this article, we will explore how to create and display a confusion matrix using Python and its libraries.

Creating Confusion Matrix using pandas

The first step in creating a confusion matrix in Python is to use the pandas library. Pandas is a powerful library for data manipulation, and it provides a convenient way of creating a confusion matrix by using the crosstab function.

The crosstab function takes two arrays, the actual labels and the predicted labels, and returns a table showing the count of each combination. For instance, suppose we have a multiclass classification problem with three classes and the following arrays of actual and predicted labels:

import pandas as pd
import numpy as np
actual = np.array([0, 1, 2, 1, 2, 2, 0, 0, 1])
predicted = np.array([0, 1, 1, 1, 2, 2, 0, 1, 1])
cf_matrix = pd.crosstab(actual, predicted, rownames=['Actual'], colnames=['Predicted'])

print(cf_matrix)

The output of the code above produces the following table:

Predicted  0  1  2
Actual            
0          2  1  0
1          0  3  0
2          0  1  2

From the table, we can see that the model correctly predicted two instances of class 0, three instances of class 1, and two instances of class 2. It also misclassified one class 0 instance and one class 2 instance, predicting class 1 in both cases.
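As a quick sanity check, the overall accuracy can be read straight off the matrix by dividing the diagonal (correct) counts by the total. A minimal sketch, reusing the cf_matrix from above:

import numpy as np
# The diagonal holds the correct predictions; the sum covers all predictions.
accuracy = np.trace(cf_matrix.values) / cf_matrix.values.sum()
print(accuracy)  # 7 correct out of 9, roughly 0.778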

Displaying Confusion Matrix using seaborn

After creating the confusion matrix using pandas, the next step is to display the table using a heatmap. Heatmaps are a great way of visualizing the data in a confusion matrix, and they can be created using the seaborn library.

Seaborn is a Python data visualization library based on matplotlib, and it provides a high-level interface for creating informative and beautiful visualizations. To display the confusion matrix using seaborn, we can call the heatmap function and pass in the confusion matrix as an argument.

Additionally, we can set the cmap parameter to a specific color map to customize the appearance of the heatmap.

import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.show()

The output of this code produces a heatmap showing the count for each combination of actual and predicted class. The diagonal cells represent the correctly predicted instances, while the off-diagonal cells represent the incorrectly predicted instances.

[Figure: Confusion matrix heatmap]
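If the annotations render in an unexpected numeric format, passing fmt='d' keeps the counts as plain integers, and the axes can be labeled through matplotlib as usual. A minimal sketch of this optional styling:

ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')
ax.set_xlabel('Predicted')  # label the axes via the returned matplotlib Axes
ax.set_ylabel('Actual')
plt.show()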

Getting Additional Stats via pandas_ml

In addition to displaying the confusion matrix in a heatmap, pandas_ml provides additional functionality for computing evaluation metrics from the confusion matrix. The ConfusionMatrix object in pandas_ml allows us to compute precision, recall, F1 score, and accuracy, among other metrics.

pandas_ml must be installed first (the leading ! runs pip from a Jupyter notebook cell; drop it in a regular terminal):

!pip install pandas_ml

from pandas_ml import ConfusionMatrix
confusion_matrix = ConfusionMatrix(actual, predicted)
confusion_matrix.print_stats()  # print_stats() prints the report itself; no outer print() is needed

The output of the code above produces the following table of evaluation metrics:

Confusion Matrix and Statistics
          Predicted          
                 0    1    2
Actual                     
0                2    1    0
1                0    3    0
2                0    1    2
Overall Statistics:
                                         
               Accuracy : 0.7778          
                 95% CI : (0.3967, 0.9636)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 0.0069          
                                         
                  Kappa : 0.6667          
 Mcnemar's Test P-Value : NA              
Statistics by Class:
                     0          1          2
... 
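Because pandas_ml has not kept pace with recent pandas releases, the import may fail on newer environments. In that case, scikit-learn offers similar per-class statistics; a sketch assuming scikit-learn is installed:

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(actual, predicted))        # the same counts as a NumPy array
print(classification_report(actual, predicted))   # per-class precision, recall, and F1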

Working with Non-numeric Data

So far, we have seen how to create and display a confusion matrix using numeric data. But what if you have non-numeric data?

For example, suppose you have a binary classification task where the labels are “yes” and “no”. How do you create and display a confusion matrix in this scenario?

The solution is to map the non-numeric labels to numerical values. For instance, we can map “yes” to 1 and “no” to 0.

Once we have mapped the labels to numerical values, we can create the confusion matrix as we did before.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
actual = np.array(['yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no'])
predicted = np.array(['no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'yes'])
mapping = {'yes': 1, 'no': 0}
actual = np.array([mapping[x] for x in actual])
predicted = np.array([mapping[x] for x in predicted])
cf_matrix = pd.crosstab(actual, predicted, rownames=['Actual'], colnames=['Predicted'])

print(cf_matrix)
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.show()

The code above produces the following confusion matrix:

Predicted  0  1
Actual         
0          4  1
1          1  3
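As an aside, the same dictionary lookup can be expressed with pandas' Series.map, which avoids the explicit list comprehension when the labels already live in a Series or DataFrame column. A minimal sketch:

# Series.map applies the dictionary to every element of the Series.
actual_mapped = pd.Series(['yes', 'no', 'yes']).map({'yes': 1, 'no': 0})
print(actual_mapped.tolist())  # [1, 0, 1]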

Conclusion

In conclusion, creating and displaying a confusion matrix in Python is relatively straightforward. Using libraries such as pandas, seaborn, and pandas_ml, data scientists can evaluate the performance of their models and make informed decisions on how to improve them.

Additionally, when working with non-numeric data, we need to map the labels to numerical values before creating the confusion matrix.

Creating and Displaying Confusion Matrix in Python: Part 2

In the previous section, we explored how to create and display a confusion matrix in Python using the pandas and seaborn libraries. In this section, we will dive deeper into those libraries and explore additional ways to create and display a confusion matrix.

Creating Confusion Matrix using pandas

To create a confusion matrix using pandas, the first step is to import the pandas library and create a DataFrame that contains the actual labels and predicted labels.

import pandas as pd
import numpy as np
actual_labels = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0])
predicted_labels = np.array([1, 0, 1, 0, 1, 1, 0, 0, 0])
df = pd.DataFrame({'actual': actual_labels, 'predicted': predicted_labels})

In the code above, we create two arrays of actual and predicted labels and then create a DataFrame using the pd.DataFrame() function, which takes a dictionary of column names and values. Once we have the DataFrame, we can use the pd.crosstab() function to create a confusion matrix that shows the count of actual and predicted labels.

cm = pd.crosstab(df['actual'], df['predicted'])

The pd.crosstab() function takes two Series objects as input and returns a DataFrame that shows the count of occurrences of each combination of actual and predicted labels.
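pd.crosstab can also express the matrix as proportions rather than raw counts through its normalize parameter, which is often easier to interpret for imbalanced classes. A minimal sketch:

# normalize='index' scales each row to sum to 1, giving a per-class view.
cm_normalized = pd.crosstab(df['actual'], df['predicted'], normalize='index')
print(cm_normalized)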

Displaying Confusion Matrix using seaborn

To display a confusion matrix using seaborn, the first step is to import the seaborn library and create a heatmap that visualizes our confusion matrix.

import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(cm, annot=True, cmap='Blues')
plt.show()

In the code above, we use the sns.heatmap() function to create a heatmap that visualizes the count of occurrences for each combination of actual and predicted labels. We also set the annot parameter to True to display the counts in each cell of the heatmap, and the cmap parameter to ‘Blues’ to set the color scheme to blue hues.
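To keep a copy of the figure for a report, matplotlib's savefig can be called before plt.show(); a minimal sketch (the filename here is arbitrary):

sns.heatmap(cm, annot=True, cmap='Blues')
plt.savefig('confusion_matrix.png', dpi=150, bbox_inches='tight')  # write the figure to disk first
plt.show()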

Adding Margins for Additional Stats

In addition to displaying a confusion matrix, we can also compute some additional metrics to evaluate the performance of our classifier. One way of doing this is by adding margins to our confusion matrix, which append the total counts of actual and predicted labels for each class.

cm = pd.crosstab(df['actual'], df['predicted'], rownames=['Actual'], colnames=['Predicted'], margins=True)

In the code above, we add the rownames and colnames arguments to specify the names of the rows and columns in our confusion matrix. We also set the margins parameter to True to add margins to the matrix.

The margins will be displayed as an additional column and row that show the total counts of actual and predicted labels. We can use these counts to compute additional evaluation metrics such as precision, recall, and F1-score.

total_counts = cm.loc['All', 'All']
true_positive = cm.loc[1, 1]
false_positive = cm.loc[0, 1]
true_negative = cm.loc[0, 0]
false_negative = cm.loc[1, 0]
accuracy = (true_positive + true_negative) / total_counts
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1_score = (2 * precision * recall) / (precision + recall)
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1-score:', f1_score)

In the code above, we compute the evaluation metrics using the counts from our confusion matrix. We compute the accuracy by summing up the counts of true positives and true negatives and dividing by the total number of instances.

We compute precision, recall, and F1-score following standard formulas that use the counts of true positives, false positives, and false negatives.
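These hand-computed values can be cross-checked against scikit-learn's metric functions, assuming scikit-learn is available:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print(accuracy_score(df['actual'], df['predicted']))   # should match the manual accuracy
print(precision_score(df['actual'], df['predicted']))
print(recall_score(df['actual'], df['predicted']))
print(f1_score(df['actual'], df['predicted']))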

Conclusion

In this article, we explored how to create and display a confusion matrix in Python using the pandas and seaborn libraries. We also learned how to add margins to our confusion matrix to compute additional evaluation metrics.

The confusion matrix is an essential tool for evaluating the performance of a classifier, and it helps data scientists identify the strengths and weaknesses of their models.

Creating and Displaying Confusion Matrix in Python: Part 3

In the previous sections, we explored how to create and display a confusion matrix using Python’s pandas and seaborn libraries. We also saw how to add margins to our confusion matrix to compute additional evaluation metrics.

In this section, we will dive deeper into these topics and explore two more ways to create and display a confusion matrix.

Getting Additional Stats using pandas_ml

The pandas_ml library is another popular library for working with machine learning models in Python. It provides various utilities for computing evaluation metrics, including the confusion matrix.

The library provides a ConfusionMatrix class that takes in the actual and predicted labels and exposes additional statistics such as precision, recall, F1-score, and accuracy. Before we can use pandas_ml, we need to install it.

Open up your terminal and type in the following command to install pandas_ml via pip.

pip install pandas_ml

Now that we’ve installed pandas_ml, let’s see an example of how to use it.

from pandas_ml import ConfusionMatrix
# Creating data
actual = [1, 1, 0, 0, 1, 0, 1]
predicted = [0, 1, 0, 1, 1, 0, 0]
# Creating confusion matrix using pandas_ml's ConfusionMatrix class
confusion_matrix = ConfusionMatrix(actual, predicted)
# Printing statistics of the confusion matrix
confusion_matrix.print_stats()  # again, print_stats() prints directly

The code above produces the following output:

Confusion Matrix and Statistics
          Predicted      
                 0    1
Actual                 
0                2    1
1                2    2
Overall Statistics:

               Accuracy : 0.5714
                 95% CI : (0.1702, 0.9142)
    No Information Rate : 0.5714
    P-Value [Acc > NIR] : 0.8745

                  Kappa : 0.16
 Mcnemar's Test P-Value : NA
Statistics by Class:
                     0          1
... 

From the output, we can see that the actual labels are [1, 1, 0 , 0, 1, 0, 1] and the predicted labels are [0, 1, 0, 1, 1, 0, 0].

The confusion matrix shows that we have 2 true positives, 2 true negatives, 1 false positive, and 2 false negatives. Additionally, the statistics show that we have an accuracy of 0.5714 and a kappa of 0.16.

Working with non-numeric data

Sometimes, we might encounter classifiers that have non-numeric labels. For example, in the context of sentiment analysis, the labels might be positive and negative.

In such cases, we can map the non-numeric labels to numeric values and then create the confusion matrix as usual. (Strictly speaking, pd.crosstab can tabulate string labels directly, but an explicit mapping keeps the class ordering predictable and makes numeric metrics easier to compute downstream.)

Let’s consider an example where the labels are positive and negative.

import pandas as pd
# Creating data
actual = ['positive', 'positive', 'negative', 'negative', 'positive', 'positive', 'positive', 'negative']
predicted = ['negative', 'positive', 'negative', 'negative', 'positive', 'positive', 'positive', 'negative']
# Creating a mapping from non-numeric labels to numeric values
mapping = {'positive': 1, 'negative': 0}
actual = [mapping[i] for i in actual]
predicted = [mapping[i] for i in predicted]
# Creating confusion matrix using pandas
cm = pd.crosstab(pd.Series(actual), pd.Series(predicted), rownames=['Actual'], colnames=['Predicted'])
# Creating a heatmap using seaborn
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(cm, annot=True, cmap='Blues')
plt.show()

In the code above, we create a mapping from the non-numeric labels (positive and negative) to numeric values (1 and 0). We then map the actual and predicted values to numeric values using this mapping.

Finally, we create a confusion matrix using the pd.crosstab() function as before.
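For larger label sets, building the dictionary by hand becomes tedious; scikit-learn's LabelEncoder can automate the mapping, assuming scikit-learn is installed. A minimal sketch:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# fit_transform learns a label-to-integer mapping and applies it in one step.
encoded = encoder.fit_transform(['positive', 'negative', 'positive'])
print(encoded)           # [1 0 1]
print(encoder.classes_)  # ['negative' 'positive'], the labels in sorted order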

Conclusion

In conclusion, the confusion matrix is an important tool for evaluating the performance of machine learning models. We explored how to use the pandas and seaborn libraries to create and display confusion matrices.

We also saw how to add margins to our confusion matrix to compute additional evaluation metrics, and explored two further approaches: computing detailed statistics with pandas_ml and mapping non-numeric labels to numeric values.

By using these libraries, data scientists can analyze the results of a classifier, identify its strengths and weaknesses, and make informed decisions on how to improve it. In this way, confusion matrices help practitioners continually refine their machine learning models, obtain better accuracy, and make better-informed decisions in data-driven projects.
