
Agglomerative Clustering

Learn the process, methodologies, and resulting hierarchical structure of agglomerative clustering.

Hierarchical clustering is a machine learning technique that creates nested clusters by iteratively merging or dividing data points. In this lesson, we focus specifically on agglomerative clustering, which is a bottom-up hierarchical approach. We will walk through the core process of agglomerative clustering, explain the main methods used to define similarity or dissimilarity between clusters (known as linkage), and outline the steps of the agglomerative clustering algorithm.

The agglomerative process

Agglomerative clustering starts by treating each data point as its own cluster. At every step, the algorithm finds the two closest clusters and merges them. This process continues until only the desired number of clusters (k) remains, or until all data points belong to a single, large cluster.

[Figure: Agglomerative clustering]
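To make the process concrete, the sketch below (illustrative, not part of this lesson's original code) implements the agglomerative loop directly. The function name agglomerative and its arguments are hypothetical, and the cluster-level dissimilarity is left as a parameter; defining it is exactly the question addressed next.

import numpy as np

def agglomerative(X, k, dissimilarity):
    # start with each data point as its own (singleton) cluster
    clusters = [X[i:i+1, :] for i in range(X.shape[0])]
    while len(clusters) > k:
        # find the pair of clusters with the smallest dissimilarity
        best_pair, best_d = (0, 1), np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dissimilarity(clusters[i], clusters[j])
                if d < best_d:
                    best_d, best_pair = d, (i, j)
        i, j = best_pair
        merged = np.vstack([clusters[i], clusters[j]])  # merge the closest pair
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

This naive version recomputes every pairwise dissimilarity on each iteration, which is fine for illustration but far from efficient; practical implementations update a distance matrix incrementally instead.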

The crucial question in this process is: How do we define the similarity/dissimilarity of a pair of clusters?

Similarity/dissimilarity (linkage) between clusters

There are multiple ways to define the similarity (or dissimilarity) between two clusters, often referred to as linkage methods. A few common methods, assuming Euclidean distance as the base dissimilarity metric, are listed below; a short SciPy-based sketch follows the list:

  • MIN (Single linkage): The dissimilarity between two clusters is the minimum dissimilarity between any pair of points from the two clusters. This method tends to create long, chain-like clusters.
[Figure: MIN dissimilarity]

  • MAX (Complete linkage): The dissimilarity between two clusters is the maximum dissimilarity between any pair of points from the two clusters. This method tends to produce compact, roughly spherical clusters.
[Figure: MAX dissimilarity]

  • Centroid distance: The dissimilarity between two clusters is the Euclidean distance between their centroids (the mean of all points in each cluster).
[Figure: Centroid distance]
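As promised above, these linkage choices map directly onto options in SciPy's hierarchical clustering module. The snippet below is a brief illustration of that mapping (SciPy is an assumption here; it is not used elsewhere in this lesson):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(50, 3)                 # 50 random points in 3 dimensions
for method in ('single', 'complete', 'centroid'):
    Z = linkage(X, method=method)          # full merge history of the hierarchy
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut into k = 3 flat clusters
    print(method, np.bincount(labels)[1:])           # resulting cluster sizes

Here, 'single' and 'complete' correspond to MIN and MAX linkage, while 'centroid' matches the centroid-distance method.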

Computing cluster dissimilarity

The code below computes the dissimilarity between a pair of clusters using the methods defined above:

import numpy as np

def compute_distance_matrix(C1, C2):
    d = np.zeros((C1.shape[0], C2.shape[0]))  # pairwise distance matrix
    for i in range(C1.shape[0]):
        d[i, :] = np.sum((C1[i, :] - C2)**2, axis=1)**0.5  # Euclidean distances
    return d


def cluster_dissimilarity(C1, C2, method='MIN'):
    d = compute_distance_matrix(C1, C2)
    if method == 'MIN':
        diss = np.min(d)   # single linkage
    elif method == 'MAX':
        diss = np.max(d)   # complete linkage
    else:
        diss = np.linalg.norm(np.mean(C1, axis=0) - np.mean(C2, axis=0))  # centroid distance
    return diss

n1, n2, d = 100, 200, 3
C1, C2 = np.random.randn(n1, d), np.random.randn(n2, d)
print(f'Cluster dissimilarity using MIN = {cluster_dissimilarity(C1, C2)}')
print(f'Cluster dissimilarity using MAX = {cluster_dissimilarity(C1, C2, "MAX")}')
print(f'Cluster dissimilarity using centroid distance = {cluster_dissimilarity(C1, C2, None)}')

Here is the explanation for the code above:

  • Lines 3 and 10: The code defines two functions: compute_distance_matrix to compute the distance matrix between all points in two datasets (C1 ...