Agglomerative Clustering Walk-Through Example
Explore agglomerative clustering by walking through a Python example that uses MIN linkage and Euclidean distance. Understand the cluster-merging process, how the distance matrix is updated, and how to implement clustering with scikit-learn. Visualizations help clarify hierarchical cluster formation and the final assignments.
In the previous lesson, we learned the core concepts and linkage methods of agglomerative clustering. This lesson will solidify that knowledge by providing a step-by-step walk-through example. We will track the iterative merging process on a small dataset, visualize the intermediate steps, and finally, implement the full clustering algorithm using Python’s scikit-learn library.
Dry running an example
In this example, we use the MIN linkage method for computing cluster dissimilarity, and Euclidean distance as the base distance measure. We’ll use the following synthetic 2D dataset for clarity:
Creation of clusters
The first step is to create the initial clusters: each data point starts out as its own singleton cluster. Plotting every point as a separate cluster changes the graph to look like this:
In the code widget below, we’ll use the matplotlib library to plot all data points as distinct clusters:
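The plotting widget itself is not reproduced in this excerpt. A minimal sketch of what it might contain is shown below; note that the 15 coordinates are illustrative placeholders, not necessarily the lesson's actual dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical coordinates for 15 points (the lesson's actual values may differ)
x = np.array([1, 2, 2, 3, 4, 7, 8, 8, 9, 9, 12, 13, 13, 14, 15])
y = np.array([2, 1, 3, 2, 3, 8, 7, 9, 8, 6, 3, 2, 4, 3, 2])

# Each point starts as its own singleton cluster, so give each one a distinct color
plt.scatter(x, y, c=range(len(x)), cmap="tab20")
for i, (xi, yi) in enumerate(zip(x, y)):
    plt.annotate(str(i), (xi, yi), textcoords="offset points", xytext=(5, 5))
plt.xlabel("x")
plt.ylabel("y")
plt.title("Initial singleton clusters")
plt.show()
```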
Computing the distance matrix
To begin the merging process, we first calculate the dissimilarity between all pairs of initial clusters. With 15 data points, the resulting distance matrix is large (15 × 15). We show it in the image below. The smallest distance, which determines the first pair of clusters to merge, is highlighted for clarity.
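To see programmatically which pair the first merge would pick, one can mask the diagonal of the distance matrix and take the minimum. The sketch below assumes illustrative coordinates, since the lesson's exact dataset is not reproduced here:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical coordinates for the 15 points (illustrative, not the lesson's exact data)
x = np.array([1, 2, 2, 3, 4, 7, 8, 8, 9, 9, 12, 13, 13, 14, 15])
y = np.array([2, 1, 3, 2, 3, 8, 7, 9, 8, 6, 3, 2, 4, 3, 2])
data_points = np.column_stack((x, y))

dist_matrix = euclidean_distances(data_points)

# Mask the diagonal (each point's distance to itself is 0) before taking the minimum
masked = dist_matrix + np.diag([np.inf] * len(data_points))
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print(f"First merge: points {i} and {j} at distance {masked[i, j]:.3f}")
```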
Reading the Euclidean distances among all data points in tabular form makes it hard to understand the whole picture, so let’s visualize them as a heatmap:
The heatmap shows the dissimilarity scores (Euclidean distances) between all pairs of points in the synthetic dataset. The color of each cell represents the dissimilarity score between the corresponding pair of points: darker cells indicate higher dissimilarity scores, while lighter cells indicate lower ones. This view makes it easy to see which data points are close together and which are far apart based on their pairwise Euclidean distances.
Here is the explanation of the code above:

- Lines 7–9: The x- and y-coordinates of the points are stored in the `x` and `y` arrays, respectively, and the `data_points` list is created by combining the `x` and `y` arrays into coordinate pairs.
- Line 12: The pairwise Euclidean distances between the points are computed with the `euclidean_distances` function from the `sklearn` library, producing the dissimilarity matrix.
- Lines 13–14: The `x` and `y` tick labels indicate the coordinates of the points.
- Line 15: The dissimilarity matrix is plotted as a heatmap using the `pcolormesh` function from the `matplotlib` library.
- Lines 17–18: The resulting heatmap shows the dissimilarity scores between all pairs of points.
Merging the closest clusters
The algorithm first identifies the two clusters with the minimum dissimilarity score in the matrix and merges them. On the heatmap, this pair corresponds to the lightest off-diagonal cell: the nearest points are merged to form a new cluster.
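SciPy records this whole sequence of merges in a linkage matrix, which makes it easy to inspect each step. The sketch below uses SciPy's `linkage` function with `method="single"` (SciPy's name for MIN linkage) on the same illustrative coordinates assumed elsewhere in this walk-through:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical coordinates (illustrative placeholders for the lesson's 15 points)
x = np.array([1, 2, 2, 3, 4, 7, 8, 8, 9, 9, 12, 13, 13, 14, 15])
y = np.array([2, 1, 3, 2, 3, 8, 7, 9, 8, 6, 3, 2, 4, 3, 2])
data_points = np.column_stack((x, y))

# "single" is SciPy's name for MIN linkage; Euclidean distance is the metric
Z = linkage(data_points, method="single", metric="euclidean")

# Each row of Z records one merge: [cluster_a, cluster_b, distance, new_cluster_size]
a, b, dist, size = Z[0]
print(f"First merge: clusters {int(a)} and {int(b)} at distance {dist:.3f}")
```

With 15 points the algorithm performs 14 merges, so `Z` has 14 rows; the first row is exactly the minimum-distance pair highlighted in the matrix above.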
In the first merge, the two closest points, ...
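The full procedure this lesson builds toward can be sketched with scikit-learn's `AgglomerativeClustering`, using `linkage="single"` to match the MIN linkage used throughout. The coordinates and the choice of three final clusters are illustrative assumptions, not the lesson's exact setup:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Hypothetical coordinates (illustrative, not the lesson's exact dataset)
x = np.array([1, 2, 2, 3, 4, 7, 8, 8, 9, 9, 12, 13, 13, 14, 15])
y = np.array([2, 1, 3, 2, 3, 8, 7, 9, 8, 6, 3, 2, 4, 3, 2])
data_points = np.column_stack((x, y))

# linkage="single" is scikit-learn's MIN linkage; the default metric is Euclidean.
# n_clusters=3 is an assumption chosen for illustration.
model = AgglomerativeClustering(n_clusters=3, linkage="single")
labels = model.fit_predict(data_points)

plt.scatter(x, y, c=labels, cmap="viridis")
plt.title("Final cluster assignments (single linkage)")
plt.show()
```

Setting `n_clusters` stops the merging process once the requested number of clusters remains, which is how the hierarchical process is cut into final assignments.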