Uncovering Patterns with K-Means Clustering: A Visual Guide

Visualizing K-means Clusters and K-means Clustering

Clustering is a common task in machine learning and data science that involves grouping similar data points together. The K-means algorithm is an iterative clustering method that seeks to partition a set of observations into a pre-determined number of clusters (K).

Visualizing K-means Clusters:

Data visualization is an essential aspect of data analysis that enables us to communicate patterns, trends, and insights to decision-makers.

This section shows how to plot K-means clusters in Python using scikit-learn (its load_digits loader and its PCA and KMeans classes) together with matplotlib.

Preparing Data for Plotting:

The first step is to load the digits dataset and prepare it for plotting.

The sklearn library provides a built-in loader function, load_digits, which returns a dataset of 8×8 images of handwritten digits from 0 to 9. We can load the dataset with the following code:

from sklearn.datasets import load_digits
digits = load_digits()
data = digits["data"]

The data variable now contains 1797 rows and 64 columns.
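As a quick check of that layout, the sketch below confirms the shape and reshapes one row back into its 8×8 image. The name first_image is introduced here only for illustration.

```python
from sklearn.datasets import load_digits

digits = load_digits()
data = digits["data"]

# Each row holds 64 pixel intensities; reshaping recovers the 8x8 image.
first_image = data[0].reshape(8, 8)

print(data.shape)         # (1797, 64)
print(first_image.shape)  # (8, 8)
```

The reshaped row matches the corresponding entry of digits["images"], which stores the same pixels in 8×8 form.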

Each row represents an image of a digit, and each column represents a pixel’s grayscale intensity. We can reduce the dimension of the dataset from 64 to 2 using Principal Component Analysis (PCA) to plot the clusters.

PCA is a dimensionality reduction technique that projects the data onto the directions along which it varies the most. We will use the first two principal components to plot the clusters.

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(data)

The pca_data variable now contains 1797 rows and 2 columns, representing the first two principal components.
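Two components cannot retain everything that 64 pixel columns contain, so it helps to check how much variance the projection keeps. A minimal sketch, using the explained_variance_ratio_ attribute of the fitted PCA object (the name ratios is only illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

data = load_digits()["data"]
pca = PCA(n_components=2)
pca_data = pca.fit_transform(data)

# Fraction of the total variance captured by each of the two components.
ratios = pca.explained_variance_ratio_
print(ratios)
print("total retained:", ratios.sum())
```

If the total is low, clusters that overlap in the 2D plot may still be better separated in the original 64 dimensions.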

Apply K-means to the Data:

Next, we will apply K-means clustering to the reduced dataset.

The K-means algorithm requires us to specify the number of clusters (K) we want to create. In this example, we will set K=10 since we have 10 digits (0-9).

We can use the KMeans class from the sklearn.cluster library to apply the algorithm to our dataset.

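from sklearn.cluster import KMeans
# n_init and random_state are set explicitly so repeated runs give the same clusters
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
kmeans.fit(pca_data)
labels = kmeans.labels_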

The labels variable now contains the cluster index (0-9) assigned to each image. These indices are arbitrary cluster numbers and do not correspond to the digits themselves.
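To see how the cluster numbers relate to the actual digits, one option is to count, for each cluster, how many images of each true digit it received. The sketch below is only an illustration: the table name, the n_init and random_state settings, and the use of digits["target"] are additions that do not appear in the code above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca_data = PCA(n_components=2).fit_transform(digits["data"])

# n_init and random_state are assumptions added for reproducibility.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(pca_data)

# table[c, d] counts the images of true digit d that landed in cluster c.
table = np.zeros((10, 10), dtype=int)
np.add.at(table, (labels, digits["target"]), 1)
print(table)
```

A row dominated by one column means the cluster mostly captured a single digit, while a spread-out row means several digits overlap in the 2D projection.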

Plotting Label 0 K-means Clusters:

We can visualize the K-means clusters by plotting the pca_data points and color-coding them based on their labels. In this section, we will plot label 0.

import matplotlib.pyplot as plt
filtered_label0 = pca_data[labels == 0]
plt.scatter(filtered_label0[:, 0], filtered_label0[:, 1])
plt.title("Label 0 K-Means Clusters")
plt.show()

The above code will plot the K-means clusters for label 0.

Plotting Additional K-means Clusters:

We can plot the clusters for other labels (clusters) in the same way.

filtered_label2 = pca_data[labels == 2]
filtered_label8 = pca_data[labels == 8]
plt.scatter(filtered_label2[:, 0], filtered_label2[:, 1])
plt.scatter(filtered_label8[:, 0], filtered_label8[:, 1])
plt.title("Additional K-Means Clusters")
plt.show()

The above code will plot the K-means clusters for labels 2 and 8.

Plot All K-means Clusters:

We can plot all the K-means clusters and add a legend to differentiate the clusters.

import numpy as np
u_labels = np.unique(labels)
for i in u_labels:
    filtered_label = pca_data[labels == i]
    plt.scatter(filtered_label[:, 0], filtered_label[:, 1], label=i)
plt.legend()
plt.title("K-Means Clusters")
plt.show()

The above code will plot all K-means clusters.

Plotting the Cluster Centroids:

Finally, we can plot the cluster centroids to see the center of each cluster.

centroids = kmeans.cluster_centers_
for i in u_labels:
    filtered_label = pca_data[labels == i]
    plt.scatter(filtered_label[:, 0], filtered_label[:, 1], label=i)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=3, color="red")
plt.legend()
plt.title("K-Means Clusters with Centroids")
plt.show()

The above code will plot all K-means clusters with their respective centroids.

K-means Clustering:

K-means clustering is an important machine learning algorithm that can be applied to various real-life challenges such as customer segmentation, image compression, anomaly detection, and much more.

In this section, we will learn about K-means clustering and how to determine the optimal number of clusters.

Iterative Clustering Method:

K-means clustering is an iterative clustering method that aims to partition a dataset into K clusters.

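The algorithm starts by selecting K random points from the dataset as initial centroids. (The KMeans class in sklearn defaults to k-means++ initialization, which picks starting centroids that tend to be spread apart.) It then assigns each data point to the nearest centroid to form K clusters.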

It then updates the centroids by computing the mean of the data points in each cluster and repeats the assignment and update steps until the centroids no longer move or a maximum number of iterations is reached.
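The steps above can be sketched in plain NumPy. This is an illustrative version under simple assumptions (random data points as starting centroids, Euclidean distance, empty clusters keep their old centroid), not how sklearn implements KMeans. The function name kmeans_sketch and the toy data are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Illustrative K-means loop; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        assignments = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[assignments == j].mean(axis=0)
            if np.any(assignments == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments

# Toy data: two blobs of 50 points each.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
centroids, assignments = kmeans_sketch(points, k=2)
print(centroids)
```

Because the result depends on the starting centroids, practical implementations usually run the loop several times from different starting points and keep the best result.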

Determining the Number of Clusters:

One of the challenges of K-means clustering is determining the optimal number of clusters (K) for a dataset.

Two common methods are the elbow method and the average silhouette method.

The elbow method involves plotting the sum of squared errors (SSE), which sklearn exposes as inertia_, against the number of clusters (K) and selecting the K value where the curve bends and the SSE starts to flatten out. Beyond that point, adding more clusters yields only small further reductions in SSE. The code to plot the elbow curve is as follows:

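sse = []
for k in range(1, 11):
    # Use a separate name so the K=10 model fitted earlier is not overwritten
    model = KMeans(n_clusters=k, n_init=10, max_iter=1000, random_state=0).fit(pca_data)
    sse.append(model.inertia_)  # inertia_ is the within-cluster SSE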
plt.plot(range(1, 11), sse)
plt.xlabel("Number of Clusters")
plt.ylabel("Sum of Squared Errors")
plt.title("Elbow Method")
plt.show()

The above code will plot the elbow curve.
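To make the SSE concrete, the following sketch recomputes it by hand for a single model and compares it with inertia_. The variable names model, assigned_centroids, and manual_sse are illustrative, and n_init and random_state are set only for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

pca_data = PCA(n_components=2).fit_transform(load_digits()["data"])
model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(pca_data)

# Look up the centroid of the cluster each point was assigned to.
assigned_centroids = model.cluster_centers_[model.labels_]

# SSE: squared distance from every point to its own centroid, summed.
manual_sse = np.sum((pca_data - assigned_centroids) ** 2)

print(manual_sse, model.inertia_)
```

The two printed values should agree up to floating-point rounding, which shows that the elbow curve tracks how tightly the points sit around their centroids.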

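The average silhouette method involves calculating the silhouette score for each K value, which measures how similar an observation is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1: a score near 1 indicates that the observation is well-matched to its own cluster and poorly matched to neighboring clusters, a score near 0 indicates that it lies between clusters, and a negative score suggests it may have been assigned to the wrong cluster. The K value with the highest average score is the preferred choice.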

The code to calculate and plot the silhouette scores is as follows:

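from sklearn.metrics import silhouette_score
silhouette_scores = []
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, n_init=10, max_iter=1000, random_state=0).fit(pca_data)
    # Separate names keep the earlier kmeans and labels variables intact
    silhouette_scores.append(silhouette_score(pca_data, model.labels_))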
plt.plot(range(2, 11), silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("Average Silhouette Method")
plt.show()

The above code will plot the silhouette scores.
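Instead of reading the best K off the plot, it can also be selected programmatically. The sketch below stores the scores in a dictionary keyed by K and picks the K with the highest average silhouette score. The names scores and best_k are illustrative, and the result applies only to this 2D projection of the data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

pca_data = PCA(n_components=2).fit_transform(load_digits()["data"])

# Map each candidate K to its average silhouette score.
scores = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pca_data)
    scores[k] = silhouette_score(pca_data, model.labels_)

# The K with the highest average score is the preferred candidate.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The chosen K need not equal the number of true digit classes, since the score only reflects how well separated the clusters are in the projected data.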

Conclusion:

In this article, we have explored K-means clustering in two parts: visualizing K-means clusters and understanding the algorithm.

We have learned how to prepare data for plotting, apply K-means clustering, plot K-means clusters, and determine the optimal number of clusters using the elbow method and the average silhouette method. K-means clustering is a powerful machine learning algorithm that is widely used in industry and academia to solve various real-life challenges.

Data visualization is central to this workflow because it communicates the patterns and insights found by clustering to decision-makers.

Understanding K-means clustering empowers data scientists and business professionals to solve complex challenges and make informed decisions.

Adventures in Machine Learning