Machine Learning - K-Means - The Coding College

Clustering is a fundamental technique in unsupervised learning, and K-Means is one of the most popular clustering algorithms. It’s simple, efficient, and widely used for grouping data points into clusters based on their similarity.

At The Coding College, we simplify machine learning concepts so you can master them with ease. In this guide, we’ll dive into how K-Means works and how to implement it in Python.

What Is K-Means?

K-Means is an iterative clustering algorithm that partitions a dataset into K clusters. Each cluster is represented by its centroid, and data points are grouped based on proximity to these centroids.

Key Characteristics of K-Means

Requires the number of clusters (K) to be specified in advance.
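Works on numerical features, because it relies on distances and means; features on very different scales should usually be standardized first.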
Iteratively minimizes the sum of squared distances between data points and their cluster centroids.
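
To make the last point concrete, here is a minimal sketch that recomputes the sum of squared distances by hand and compares it with the value scikit-learn reports in inertia_. The small array X is made up for illustration, and n_init=10 is simply an explicit choice for the number of restarts.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small illustrative dataset (made up for this sketch)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up the centroid of the cluster each point was assigned to
assigned_centroids = model.cluster_centers_[model.labels_]

# Sum of squared distances between points and their own centroid
manual_sse = np.sum((X - assigned_centroids) ** 2)

print("Manual SSE:    ", manual_sse)
print("model.inertia_:", model.inertia_)
```

The two numbers should agree up to floating-point rounding, which is why inertia_ can serve as the SSE when comparing different values of K.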

How K-Means Works

The algorithm follows these steps:

1. Initialize Centroids: Randomly select K data points as initial centroids.
2. Assign Clusters: Assign each data point to the nearest centroid.
3. Update Centroids: Calculate the mean of all points in each cluster to update centroids.
4. Repeat: Iterate steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached.
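
The steps above can be sketched in plain NumPy. This is a teaching sketch, not how scikit-learn implements K-Means internally; the function name kmeans_sketch and its parameters (max_iter, tol, seed) are hypothetical choices made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Basic K-Means loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iter):
        # Assign: distance from every point to every centroid, then take the nearest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update: move each centroid to the mean of the points assigned to it
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)

        # Repeat until the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, labels

# Illustrative data (made up for this sketch)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
centroids, labels = kmeans_sketch(X, k=2)
print("Centroids:\n", centroids)
print("Labels:", labels)
```

Because the starting centroids are random, different seeds can give different label numbering, and on harder datasets they can give different clusters altogether.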

K-Means in Python

Let’s implement K-Means using the popular library scikit-learn.

Example: Clustering 2D Data

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sample data
data = np.array([
    [1, 2], [1.5, 1.8], [5, 8],
    [8, 8], [1, 0.6], [9, 11]
])

# Create K-Means model (n_init set explicitly so results match across scikit-learn versions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)

# Cluster centroids and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Plot results
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='rainbow')
plt.scatter(centroids[:, 0], centroids[:, 1], color='black', marker='x')
plt.title("K-Means Clustering")
plt.show()

Output: A scatter plot showing data points colored by cluster and centroids marked as black crosses.

Evaluating K-Means

1. Elbow Method

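The Elbow Method helps choose the number of clusters by plotting the sum of squared errors (SSE) against K. SSE always decreases as K grows, so look for the "elbow": the point after which adding more clusters yields only small improvements.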

# K cannot exceed the number of samples (6 here), so cap the range
k_values = range(1, len(data) + 1)

sse = []
for k in k_values:
    model = KMeans(n_clusters=k, random_state=0)
    model.fit(data)
    sse.append(model.inertia_)  # inertia_ is the SSE for this K

plt.plot(k_values, sse, marker='o')
plt.title("Elbow Method")
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()

2. Silhouette Score

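The Silhouette Score measures how similar a data point is to its own cluster compared to the nearest other cluster. It ranges from -1 to 1: values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.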

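from sklearn.metrics import silhouette_score

# Refit with the K being evaluated; the score needs between 2 and n_samples - 1 clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)

# Calculate the mean silhouette score over all points
score = silhouette_score(data, kmeans.labels_)
print("Silhouette Score:", score)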
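
A single score is most useful when compared across several values of K. The sketch below loops over the valid range for the six-point sample dataset and keeps the K with the highest score; the names scores and best_k are hypothetical, and n_init=10 is an explicit choice rather than a requirement.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same sample data as in the clustering example
data = np.array([
    [1, 2], [1.5, 1.8], [5, 8],
    [8, 8], [1, 0.6], [9, 11]
])

scores = {}
# Silhouette is only defined for 2 <= K <= n_samples - 1
for k in range(2, len(data)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)
    print(f"K={k}: silhouette={scores[k]:.3f}")

# Pick the K with the highest mean silhouette
best_k = max(scores, key=scores.get)
print("Best K by silhouette:", best_k)
```

On real datasets it is worth reading this alongside the elbow plot, since the two methods do not always agree.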

Applications of K-Means

Customer Segmentation: Group customers based on purchasing behavior.
Image Compression: Reduce image size by clustering pixel colors.
Anomaly Detection: Flag points that lie unusually far from their nearest centroid as potential outliers.
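
As one illustration of the anomaly detection idea, the sketch below flags points whose distance to the nearest centroid is unusually large. The synthetic data, the planted outlier, and the 99th-percentile cutoff are all assumptions made for this example; a real project would need a threshold chosen for its own data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic blobs plus one far-away point (all made up for this sketch)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
outlier = np.array([[10.0, -5.0]])
X = np.vstack([blob_a, blob_b, outlier])  # the outlier is row 100

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() gives the distance from each sample to every centroid;
# the row minimum is the distance to the nearest one
dist_to_nearest = model.transform(X).min(axis=1)

# Hypothetical cutoff: anything beyond the 99th percentile of distances
threshold = np.percentile(dist_to_nearest, 99)
flagged = np.where(dist_to_nearest > threshold)[0]

print("Flagged row indices:", flagged)
```

Keep in mind that outliers also pull centroids toward themselves, so this approach works best when anomalies are rare.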

Challenges and Limitations

Predefined K: Choosing the correct number of clusters can be challenging.
Sensitivity to Initialization: Random initialization can lead to suboptimal clusters.
Non-Spherical Clusters: K-Means struggles with irregularly shaped clusters.

Solutions

Use the Elbow Method or Silhouette Score to determine K.
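Use K-Means++ for smarter centroid initialization. scikit-learn's KMeans already does this by default (init="k-means++"), and the n_init parameter can rerun the algorithm from several starting points and keep the best result.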
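
The following sketch compares a single purely random initialization with K-Means++ and several restarts. The dataset comes from make_blobs with arbitrary settings chosen for illustration, and the variable names random_init and plus_plus are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four blobs (parameters chosen arbitrarily)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# One run from purely random starting centroids
random_init = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)

# K-Means++ seeding, keeping the best of 10 runs
plus_plus = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

print("SSE with random init, 1 run:", random_init.inertia_)
print("SSE with k-means++, 10 runs:", plus_plus.inertia_)
```

A lower SSE means a tighter clustering. K-Means++ with restarts usually matches or beats a single random start, although on easy datasets both may reach the same solution.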

Exercises

Exercise 1: Implement K-Means with Different K

Cluster a dataset using K=3 and K=5. Compare the results visually and using evaluation metrics.

Exercise 2: Real-World Dataset

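Apply K-Means to the famous Iris dataset from sklearn.datasets. Identify clusters and evaluate how well they match the known species labels. Because cluster numbers are arbitrary, use a label-invariant metric such as adjusted_rand_score from sklearn.metrics rather than plain accuracy.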

Exercise 3: Handling Non-Spherical Data

Create a dataset with non-spherical clusters using make_moons or make_circles from sklearn.datasets. Apply K-Means and observe its limitations.

Why Learn K-Means at The Coding College?

At The Coding College, we focus on practical learning with clear examples and exercises. Our step-by-step tutorials empower you to master essential machine learning techniques like K-Means Clustering.

Conclusion

K-Means is a powerful clustering algorithm widely used in various industries. By understanding its working principles, implementation, and evaluation techniques, you can leverage it to uncover hidden patterns in your data.
