Clustering Algorithms Comparison
Explore the performance differences among the K-means, DBSCAN, and agglomerative clustering algorithms. Learn their advantages, disadvantages, and best use cases to decide the optimal clustering approach for your specific dataset and task requirements.
In previous lessons, we explored three fundamentally different unsupervised techniques: the centroid-based K-means algorithm, the density-based DBSCAN algorithm, and the hierarchical agglomerative clustering algorithm. While all aim to group similar data points into clusters, each operates under unique assumptions and performs best with different types of data structures. This lesson shifts the focus from how the algorithms work to comparing their performance, strengths, and weaknesses, helping determine the most appropriate clustering approach for a given dataset.
K-means
K-means is a centroid-based (or distance-based) algorithm: a point is assigned to a cluster based on its distance to each cluster's centroid. K-means starts by randomly selecting initial centroids and then assigns each data point to the cluster of the nearest centroid. The centroids are then updated to the mean of the points assigned to them, and the process repeats until convergence.
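The loop described above can be sketched with scikit-learn; the blob positions and parameters below are illustrative choices, not part of the lesson's dataset:

```python
# Minimal K-means sketch on three spherical blobs, the setting where
# the algorithm works best. All numbers here are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

# n_init restarts the centroid initialization several times and keeps
# the best run, mitigating sensitivity to the initial placement.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # one interpretable center per cluster
print(kmeans.labels_[:5])       # cluster assignment of the first points
```

Note that `n_clusters` must be chosen up front, which is exactly the disadvantage discussed below.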
Advantages of K-means clustering
- Simple and easy to implement
- Computationally efficient, even for large datasets
- Works well when clusters are spherical and well separated
- Scales effectively with the number of data points
- Produces clear and interpretable cluster centers
Disadvantages of K-means clustering
- Requires the number of clusters (k) to be specified in advance
- Sensitive to the initial placement of centroids
- Can get stuck in local optima
- Performs poorly with non-spherical or unevenly sized clusters
- Highly sensitive to outliers and noise
K-means clustering comparison example
Let’s try a scenario where K-means performs better than DBSCAN and agglomerative clustering:
In this example, we generated synthetic data with three clusters that are separated by a small distance. The data is generated using three normal distributions with different means and standard deviations.
When applied to this data, K-means identifies the three clusters more accurately than DBSCAN and agglomerative clustering, as shown by the lower Minkowski distance score of 1.98. This score (discussed in detail later) measures how tightly clustered the points are around their centroids, and a lower value indicates superior performance for this type of data. DBSCAN struggles because the `eps` constraint (the maximum distance between two points to be considered in the same cluster) isn’t satisfied, and agglomerative clustering struggles to separate the clusters properly because they’re too close together.
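A hedged sketch of this kind of comparison follows. The blob means, standard deviations, and parameters are our own illustrative choices, and the adjusted Rand index (which compares predicted labels to the true generating labels; higher is better) stands in for the lesson's Minkowski-based score:

```python
# Three normal blobs a small distance apart; parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0.0, 0.0), 0.8, (150, 2)),
    rng.normal((3.0, 0.0), 0.6, (150, 2)),
    rng.normal((1.5, 2.5), 0.7, (150, 2)),
])
y_true = np.repeat([0, 1, 2], 150)  # generating labels, for evaluation only

for name, model in [
    ("K-means", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ("DBSCAN", DBSCAN(eps=0.5, min_samples=5)),
    ("Agglomerative", AgglomerativeClustering(n_clusters=3)),
]:
    labels = model.fit_predict(X)
    print(name, adjusted_rand_score(y_true, labels))
```

On overlapping spherical blobs like these, K-means typically scores highest, matching the lesson's conclusion.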
DBSCAN
DBSCAN works by identifying points that have a high density of neighbors and grouping them into clusters. It doesn’t require us to specify the number of clusters beforehand. It also identifies points that don’t belong to any cluster as noise or outliers. DBSCAN requires two user-specified parameters: `eps`, the maximum distance between two points for them to be considered in the same cluster, and `min_samples`, the minimum number of points required to form a cluster.
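The noise-labeling behavior is easy to see in a small sketch; the point coordinates and `eps`/`min_samples` values below are assumptions chosen for illustration:

```python
# DBSCAN groups dense points and labels isolated points as -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cluster = rng.normal((0, 0), 0.3, (100, 2))       # one dense blob
outliers = np.array([[5.0, 5.0], [-5.0, 4.0]])    # far from any dense region
X = np.vstack([cluster, outliers])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_[-2:])  # the two isolated points are labeled -1
```

No number of clusters was specified anywhere; DBSCAN discovered one cluster and flagged the rest as noise.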
Advantages of DBSCAN
- Does not require the number of clusters to be specified in advance
- Can identify arbitrarily shaped clusters
- Naturally detects noise and outliers
- Less sensitive to initialization compared to centroid-based methods
- Works well for datasets with clear density differences
Disadvantages of DBSCAN
- Sensitive to the choice of `eps` and `min_samples` parameters
- Struggles when clusters have varying densities
- Performance can degrade in high-dimensional data
- Choosing optimal parameters can be difficult without domain knowledge
- Not as efficient as K-means for very large datasets
DBSCAN clustering comparison example
One scenario where DBSCAN might perform better than K-means and agglomerative clustering is when the data contains clusters of varying densities.
K-means is sensitive to its initial conditions, and both K-means and agglomerative clustering can be sensitive to the shape and size of the clusters. DBSCAN, on the other hand, doesn’t require the user to specify the number of clusters or the shape of the clusters. It can identify clusters of varying densities and is less sensitive to initial conditions.
Here is an example where DBSCAN outperforms K-means and agglomerative clustering on a synthetic dataset with two clusters of varying densities:
In this example, DBSCAN is able to identify the clusters of varying densities, while K-means and agglomerative clustering fail to separate them cleanly.
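A sketch of this setup, with our own illustrative data: one tight cluster and one much looser cluster, well separated. Note the tension with the disadvantage listed earlier: `eps` must be large enough to connect the sparse cluster yet small enough not to bridge the gap between groups, so the value below is a deliberate assumption:

```python
# Two well-separated blobs with different densities; numbers are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
dense = rng.normal((0, 0), 0.2, (200, 2))    # tight, dense cluster
sparse = rng.normal((6, 0), 1.0, (200, 2))   # looser, sparser cluster
X = np.vstack([dense, sparse])

# eps chosen (by assumption) to span the sparse cluster's gaps
# without crossing the empty region between the two groups.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(sorted(set(labels) - {-1}))  # cluster ids found, noise excluded
```

If `eps` were much smaller, the sparse cluster would shatter into noise; much larger, and the two groups would merge into one.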
Agglomerative clustering
This is a bottom-up hierarchical clustering algorithm that starts by treating each data point as a singleton cluster and then merges the closest clusters until all points are in the same cluster or a stopping criterion is reached. Agglomerative clustering requires us to specify the number of clusters, and it can use different distance measures to define the similarity between clusters.
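The merging process can be sketched as follows; the data and the `ward` linkage choice are illustrative assumptions:

```python
# Bottom-up merging with AgglomerativeClustering on illustrative data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal((0, 0), 0.4, (50, 2)),
    rng.normal((4, 4), 0.4, (50, 2)),
])

# linkage sets how inter-cluster distance is measured when deciding
# which pair to merge next: "ward", "complete", "average", or "single".
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)
print(np.bincount(labels))  # points per final cluster
```

With `scipy.cluster.hierarchy.dendrogram`, the full merge tree can also be visualized, which is the hierarchical structure listed among the advantages below.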
Advantages of agglomerative clustering
- Produces a hierarchical structure of clusters
- Can be visualized using a dendrogram
- Does not rely on random initialization
- Works with different distance metrics and linkage methods
- Can capture complex and nonspherical cluster shapes
Disadvantages of agglomerative clustering
- Computationally expensive, especially for large datasets
- Requires the number of clusters or a cutoff distance to be specified
- Sensitive to noise and outliers
- Early merging decisions cannot be undone
- Not suitable for very high-dimensional or very large datasets
Agglomerative clustering comparison example
Imagine that we have a dataset with a large number of samples that are organized into several distinct, nonspherical clusters. In this case, K-means might struggle to find the correct clusters because it’s designed to find clusters that are spherical in shape. Similarly, DBSCAN might also struggle to find the correct clusters because it’s designed to find clusters that are dense in terms of the number of samples they contain.
However, agglomerative clustering doesn’t assume that the clusters are spherical or dense. Instead, it begins by treating each sample as its own cluster and then iteratively merges the closest clusters together until a stopping condition is reached. This makes it well-suited for finding clusters that aren’t spherical or dense in shape.
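One way to illustrate the nonspherical case is with scikit-learn's `make_moons` (two interleaving crescents); this dataset and the single-linkage choice are our own assumptions, not the lesson's playground data:

```python
# Nonspherical clusters: agglomerative clustering with single linkage
# can follow each crescent's shape, while K-means cuts straight across.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

# Single linkage chains nearby points, so the hierarchy grows along
# each crescent instead of assuming a spherical cluster shape.
agg = AgglomerativeClustering(n_clusters=2, linkage="single")
print("agglomerative:", adjusted_rand_score(y, agg.fit_predict(X)))
print("k-means:      ", adjusted_rand_score(y, KMeans(
    n_clusters=2, n_init=10, random_state=0).fit_predict(X)))
```

The linkage choice matters here: `ward` linkage, which favors compact clusters, would behave much more like K-means on this data.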
You can change the hyperparameters in the following coding playground to observe the performance of each technique:
Performance criteria: The performance of the algorithms might differ depending on the specific characteristics of the data. Trying out different algorithms and evaluating their performance using appropriate evaluation metrics is always a good idea.
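As one concrete evaluation metric, the silhouette score needs no ground-truth labels and rewards tight, well-separated clusters (values range from -1 to 1; higher is better). The data below is an illustrative assumption:

```python
# Comparing two algorithms on the same data with the silhouette score.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal((0, 0), 0.5, (100, 2)),
               rng.normal((4, 0), 0.5, (100, 2))])

for model in (KMeans(n_clusters=2, n_init=10, random_state=0),
              AgglomerativeClustering(n_clusters=2)):
    labels = model.fit_predict(X)
    print(type(model).__name__, silhouette_score(X, labels))
```

Because the silhouette score implicitly favors compact, convex clusters, it is a good fit for blob-like data but can undervalue a correct DBSCAN result on elongated shapes; choose the metric to match the data, just as you choose the algorithm.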
The choice among these algorithms depends heavily on the data's characteristics and your goal. The following table provides a quick reference guide to their key assumptions and trade-offs:
Comparison of clustering algorithms
| Feature | K-means | DBSCAN | Agglomerative clustering |
| --- | --- | --- | --- |
| Type | Centroid-based | Density-based | Hierarchical (linkage-based) |
| Input `k` | Required | Not required | Required (or a cutoff distance) |
| Cluster shape | Assumes spherical | Identifies arbitrary shapes | Flexible; depends on linkage |
| Outlier handling | Poor (outliers distort centroids) | Excellent (outliers labeled as noise) | Poor (outliers force early merges) |
| Efficiency | High (fast for large N) | Medium | Low (computationally expensive) |
| Primary output | Final cluster labels | Final cluster labels + noise labels | Dendrogram or cluster labels |
Conclusion
This comparison demonstrates that no single clustering algorithm is universally superior. K-means is the go-to choice for large datasets with expected spherical clusters.
DBSCAN excels in identifying arbitrarily shaped clusters and managing noise effectively.
Agglomerative clustering is best utilized when the hierarchical relationship between data points is important. Selecting the appropriate algorithm always requires knowledge of the data’s underlying structure and the goal of the clustering task.