
Agglomerative Clustering Walk-Through Example

Explore agglomerative clustering by walking through a Python example that uses the MIN linkage method and Euclidean distance. Understand how clusters are merged, how the distance matrix is updated, and how clustering is implemented with scikit-learn. Visualizations help clarify hierarchical cluster formation and the final cluster assignments.

In the previous lesson, we learned the core concepts and linkage methods of agglomerative clustering. This lesson will solidify that knowledge by providing a step-by-step walk-through example. We will track the iterative merging process on a small dataset, visualize the intermediate steps, and finally, implement the full clustering algorithm using Python’s scikit-learn library.

Dry running an example

In this example, we use the MIN linkage method for computing cluster dissimilarity, and Euclidean distance as the base distance measure. We’ll use the following synthetic 2D dataset for clarity:

Actual dataset
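Before tracking the merges, here is a minimal sketch of how the MIN (single) linkage score between two clusters can be computed under Euclidean distance. The helper min_linkage_distance and the example clusters are illustrative and not part of the lesson's code:

Python 3.10.4
# Minimal sketch of MIN (single) linkage with Euclidean distance.
# The function name and example clusters are illustrative only.
import numpy as np

def min_linkage_distance(cluster_a, cluster_b):
    """Smallest pairwise Euclidean distance between two clusters of 2D points."""
    cluster_a = np.asarray(cluster_a, dtype=float)
    cluster_b = np.asarray(cluster_b, dtype=float)
    # All pairwise differences between points of the two clusters
    diffs = cluster_a[:, None, :] - cluster_b[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=-1))
    return pairwise.min()

# Example: two small clusters built from points of the synthetic dataset
print(min_linkage_distance([[1, 2], [2, 1]], [[5, 6], [5, 7]]))

MIN linkage always takes the single closest pair of points across the two clusters, which is why it is also called single linkage.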

Creation of N clusters

The first step is to treat each of the N data points as its own cluster. This is easy enough, and it changes the graph to look like this:

Actual dataset with N clusters

In the code widget below, we’ll use the matplotlib library to plot all data points as distinct clusters:

Python 3.10.4
# Importing required packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Generating synthetic dataset
x = np.array([1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7])
y = np.array([2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5])
data_points = [[x[i], y[i]] for i in range(len(x))]

# Creating a DataFrame to assign unique cluster IDs to each data point
df = pd.DataFrame({'Data points': data_points,
                   'Cluster ID': np.arange(len(data_points))})

# One distinct color per point, since every point starts as its own cluster
colors = ['silver', 'tomato', 'skyblue', 'yellow', 'olive',
          'yellowgreen', 'green', 'red', 'cyan',
          'lightblue', 'plum', 'purple', 'hotpink', 'pink', 'blue']

# Plotting each data point as a separate cluster
plt.scatter(x, y, c=colors, alpha=0.6, s=150)
print(df)
plt.show()

Computing the distance matrix

To begin the merging process, we first calculate the dissimilarity between all pairs of initial clusters. With 15 data points, the resulting distance matrix is fairly large (15 × 15). We show it in the image below, with the smallest distance, which determines the first pair of clusters to merge, highlighted for clarity.

Euclidean distances among all data points
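As a sanity check, the same matrix can be computed and its smallest off-diagonal entry located programmatically. The snippet below is a sketch that reuses the dataset defined above; the variable names are illustrative:

Python 3.10.4
# Sketch: locate the smallest off-diagonal distance in the 15 x 15 matrix
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

x = np.array([1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7])
y = np.array([2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5])
data_points = np.column_stack((x, y))

distances = euclidean_distances(data_points)
np.fill_diagonal(distances, np.inf)  # ignore zero self-distances
i, j = np.unravel_index(np.argmin(distances), distances.shape)
print(f"Closest pair: points {i} and {j}, distance {distances[i, j]:.3f}")

The reported pair should match the highlighted cell in the table above.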

Reading the Euclidean distances among all data points in tabular form makes it hard to understand the whole picture, so let’s visualize them as a heatmap:

Heatmap of the dataset

The heatmap shows the dissimilarity scores (Euclidean distances) between all pairs of points in the synthetic dataset. The color of each cell represents the dissimilarity score between the corresponding pair of points: brighter cells indicate higher dissimilarity scores, while darker cells indicate lower ones. The heatmap makes it easy to identify which data points are close to each other and which are far apart based on their pairwise Euclidean distances.

Python 3.10.4
# Importing required packages
from sklearn.metrics.pairwise import euclidean_distances as dis_score
import matplotlib.pyplot as plt
import numpy as np

# Generating synthetic dataset
x = np.array([1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7])
y = np.array([2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5])
data_points = [[x[i], y[i]] for i in range(len(x))]

# Drawing a heatmap of Euclidean distances among all points
dissimilarity_scores = dis_score(data_points)
plt.xticks(np.arange(len(data_points)), labels=data_points, rotation=40, size=8)
plt.yticks(np.arange(len(data_points)), labels=data_points, size=8)
plt.pcolormesh(dissimilarity_scores)
plt.title("Dissimilarity scores (Euclidean distances)")
plt.colorbar()
plt.show()

Here is the explanation of the code above:

  • Lines 7–9: The x and y coordinates of the points are stored in the x and y arrays, respectively, and the data_points list is created by combining them into coordinate pairs.

  • Line 12: The euclidean_distances function from the sklearn library (imported as dis_score) calculates the pairwise Euclidean distances between the points.

  • Lines 13–14: The x and y tick labels indicate the coordinates of the points.

  • Line 15: The dissimilarity matrix is plotted as a heatmap using the pcolormesh function from the matplotlib library.

  • Lines 17–18: The colorbar and the plt.show() call display the resulting heatmap of dissimilarity scores between all pairs of points.

Creation of N-1 clusters

The algorithm identifies the two clusters with the minimum dissimilarity score in the matrix and merges them. Based on the heatmap, the two nearest points are merged first, creating N-1 clusters.
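To make this step concrete, here is a sketch of one merge iteration: the two closest clusters are combined, and the distance matrix is rebuilt using the MIN linkage update rule, where the merged cluster's distance to any other cluster is the smaller of the two original distances. The bookkeeping below is illustrative, not the lesson's reference implementation:

Python 3.10.4
# Sketch: merge the two closest clusters and update the distance matrix
# with the MIN linkage rule d(merged, k) = min(d(a, k), d(b, k)).
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

x = np.array([1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7])
y = np.array([2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5])
points = np.column_stack((x, y))

distances = euclidean_distances(points)
np.fill_diagonal(distances, np.inf)  # ignore zero self-distances

# The two closest clusters (initially, the two closest points)
a, b = np.unravel_index(np.argmin(distances), distances.shape)

# MIN linkage update: distance from the merged cluster to every other cluster
merged_row = np.minimum(distances[a], distances[b])

# Build the (N - 1) x (N - 1) matrix: drop rows/columns a and b, append the merged cluster
keep = [k for k in range(len(points)) if k not in (a, b)]
new_distances = distances[np.ix_(keep, keep)]
new_distances = np.vstack([new_distances, merged_row[keep]])
new_col = np.append(merged_row[keep], np.inf)  # self-distance stays ignored
new_distances = np.column_stack([new_distances, new_col])

print("Merged points", a, "and", b, "-> new matrix shape:", new_distances.shape)

Repeating this step until a single cluster remains produces the full merge hierarchy.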

In the first merge, the two closest points, ...