Agglomerative Clustering
Learn the process, methodologies, and resulting hierarchical structure of agglomerative clustering.
Hierarchical clustering is a machine learning technique that creates nested clusters by iteratively merging or dividing data points. In this lesson, we focus specifically on agglomerative clustering, which is a bottom-up hierarchical approach. We will walk through the core process of agglomerative clustering, explain the main methods used to define similarity or dissimilarity between clusters (known as linkage), and outline the steps of the agglomerative clustering algorithm.
The agglomerative process
Agglomerative clustering starts by treating each data point as its own cluster. At every step, the algorithm finds the two closest clusters and merges them. This process continues until only the desired number of clusters () remains, or until all data points belong to a single, large cluster.
The crucial question in this process is: How do we define the similarity/dissimilarity of a pair of clusters?
Similarity/dissimilarity (linkage) between clusters
There are multiple ways to define the similarity (or dissimilarity) between two clusters, often referred to as linkage methods. A few common methods, assuming Euclidean distance is used as the base dissimilarity metric, are as follows:
- MIN (Single linkage): The dissimilarity between two clusters is the minimum dissimilarity between any pair of points from the two clusters. This method tends to create long, chain-like clusters.
- MAX (Complete linkage): The dissimilarity between two clusters is the maximum dissimilarity between any pair of points of the two clusters. This method produces compact, spherical clusters.
- Centroid distance: The dissimilarity between two clusters is the Euclidean distance between their centroids (the mean of all points in each cluster).
Computing cluster dissimilarity
The code below computes the dissimilarity between a pair of clusters using the methods defined above:
Here is the explanation for the code above:
-
Lines 3 and 10: The code defines two functions:
compute_distance_matrixto compute the distance matrix between all points in two datasets ( ...