Hierarchical Clustering is a popular unsupervised machine learning algorithm used for grouping data into a tree of clusters. This algorithm is highly intuitive and particularly useful when the number of clusters in a dataset is unknown.
In this tutorial by The Coding College, we’ll explore hierarchical clustering, its working mechanism, and how to implement it in Python.
What Is Hierarchical Clustering?
Hierarchical Clustering builds a hierarchy of clusters using either:
- Agglomerative Approach: A bottom-up approach in which each data point starts as its own cluster, and the closest clusters are repeatedly merged into larger ones.
- Divisive Approach: A top-down approach in which all data points start in one cluster, which is recursively split into smaller clusters.
How Does Hierarchical Clustering Work?
Steps for Agglomerative Clustering
- Start with Individual Clusters: Each data point is treated as its own cluster.
- Compute Distance: Calculate the distance between every pair of clusters.
- Merge Closest Clusters: Combine two clusters that are closest to each other.
- Repeat Until Done: Continue merging until all points form a single cluster, or stop early once the desired number of clusters remains.
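The steps above can be sketched directly in Python. This is a deliberately naive, educational implementation (single linkage, brute-force pair search); in practice you would use scipy or scikit-learn, which we do later in this tutorial.

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Naive single-linkage agglomerative clustering (O(n^3) sketch)."""
    # Step 1: each data point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Step 2: compute the distance between every pair of clusters
        # (single linkage = distance between their closest points).
        best, best_dist = (0, 1), float("inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_dist:
                    best_dist, best = d, (a, b)
        # Step 3: merge the two closest clusters.
        a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
print(agglomerative(points, 2))  # [[0, 1, 4], [2, 3, 5]]
```

On this toy data the three points near the origin end up in one cluster and the three distant points in the other, exactly as the step-by-step description predicts.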
Linkage Criteria
The method for determining the distance between clusters is known as the linkage criterion. Common types include:
- Single Linkage: Distance between the closest points of two clusters.
- Complete Linkage: Distance between the farthest points of two clusters.
- Average Linkage: Average distance between all points in two clusters.
- Ward’s Linkage: Minimizes the variance within clusters.
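To see how the choice of linkage criterion affects the result, we can run scipy's `linkage` with each method on a small sample (the same toy data used in the main example below) and compare the flat cluster labels:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Build a hierarchy with each linkage criterion, then cut it into 2 clusters.
results = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(data, method=method)
    results[method] = fcluster(Z, t=2, criterion="maxclust")
    print(method, results[method])
```

On well-separated data like this, all four methods agree; on noisier or elongated clusters, single linkage in particular can produce very different (chained) results from Ward's.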
Visualizing the Clusters: Dendrogram
A Dendrogram is a tree-like diagram that records the sequence of merges and the distance at which each merge occurs. Cutting the tree at a chosen height yields a flat clustering, which makes the dendrogram a useful guide for picking a sensible number of clusters.
Implementing Hierarchical Clustering in Python
Example: Customer Segmentation
We’ll use a sample dataset of customer spending habits to illustrate hierarchical clustering.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
# Sample data
data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
# Perform hierarchical clustering
linked = linkage(data, method='ward')
# Plot dendrogram
plt.figure(figsize=(8, 6))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Merge Distance (Ward)")
plt.show()
Determining Clusters
Using the dendrogram, we can decide the number of clusters and assign labels to data points:
# Form flat clusters (cut-off at distance = 7)
clusters = fcluster(linked, t=7, criterion='distance')
print(f"Cluster Labels: {clusters}")
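As an alternative to cutting a scipy dendrogram, the same clustering can be obtained with scikit-learn's `AgglomerativeClustering`, assuming scikit-learn is installed. Here the number of clusters is fixed up front instead of being chosen via a distance cut-off:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Ward linkage, as in the scipy example, but with n_clusters set directly.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(data)
print(f"Cluster Labels: {labels}")
```

The scikit-learn API is convenient when you already know how many clusters you want and just need labels, while the scipy route is better when you want to inspect the dendrogram first.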
Real-World Applications
- Customer Segmentation: Group customers based on purchasing behavior.
- Gene Analysis: Identify groups of similar genes in bioinformatics.
- Image Segmentation: Divide images into regions for analysis.
- Document Clustering: Group similar documents for better organization.
Advantages and Disadvantages
Advantages
- No Fixed Number of Clusters: Works without prior knowledge of cluster count.
- Produces Hierarchical Relationships: Useful for understanding data structure.
Disadvantages
- Computationally Intensive: Naive agglomerative clustering takes O(n³) time and O(n²) memory, so it is not suitable for very large datasets.
- Sensitive to Noise: Outliers can distort the hierarchy.
Exercises
Exercise 1: Dendrogram Interpretation
Generate a dendrogram for the famous Iris dataset. Identify the optimal number of clusters visually.
Exercise 2: Compare Linkage Methods
Experiment with different linkage methods (single, complete, average, Ward) on the Wine dataset. Observe the impact on clustering results.
Exercise 3: Real-World Dataset
Use the Mall Customers dataset to segment customers using hierarchical clustering. Visualize the clusters in 2D or 3D space.
Why Learn at The Coding College?
At The Coding College, we provide practical, easy-to-follow tutorials that simplify complex topics like Hierarchical Clustering. Whether you’re a beginner or an experienced learner, our step-by-step guides help you excel in Machine Learning.
Conclusion
Hierarchical Clustering is a flexible and insightful technique for grouping data. By visualizing relationships with dendrograms, you can uncover the natural structure of your data.