Group the Similar Stuff
Learn how to uncover hidden patterns in unlabeled data.
As data scientists, we often work with datasets that don’t come with neat, pre-defined categories. Previously, we explored classification, where our models learned to predict a flower’s species because we already had labeled examples.
But what happens when we’re faced with a large volume of data, perhaps customer transaction histories or complex biological measurements, and no one has told us what the underlying groups are or how many distinct groups might exist? We don’t have labels, just raw information.
This is where we transition from supervised learning to unsupervised learning. Our goal here isn’t to predict a known outcome but to discover inherent patterns, structures, and natural groupings within the data. The primary technique we use for this type of exploratory analysis is clustering.
What is clustering?
Clustering is an unsupervised learning technique that partitions a dataset into distinct groups. Unlike classification algorithms, which learn from labeled data to predict predefined categories, clustering algorithms learn from unlabeled data to uncover underlying patterns and structures without prior information. They identify these groups by analyzing the similarity, or distance, between individual data points.
Common types of clustering
Understanding the different clustering approaches is key for data scientists, as each excels in different scenarios. Let’s look at the primary types of clustering, along with their common algorithms and use cases.
Centroid-based clustering
Centroid-based clustering is a straightforward and efficient method that organizes data into distinct, non-overlapping groups. Imagine we have a bunch of scattered fruits, and we want to put them into different baskets. Centroid-based clustering works similarly by finding the center (or centroid) of each group and assigning fruits to the basket whose center is closest. A centroid is essentially the average position of all data points within a cluster.
The most popular algorithm in this category is K-means clustering. It works iteratively:

1. Randomly initialize K centroids.
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of its assigned points.
4. Repeat the assignment and update steps until the centroids stabilize (no significant movement).
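To make these steps concrete, here is a minimal NumPy sketch of the K-means loop. The kmeans function and the toy data are purely illustrative (and the sketch skips edge cases such as empty clusters); in practice, you would typically reach for a tested implementation like scikit-learn’s KMeans.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize k centroids by sampling actual data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centroids no longer move significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: two well-separated blobs in 2-D.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g., [0 0 0 1 1 1] (cluster ids may be swapped)
print(centroids)  # approximate centers of the two blobs
```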
Other alternative algorithms include K-medoids and fuzzy C-means:
K-medoids: Uses actual data points as the cluster centers (medoids) instead of means.
Fuzzy C-means: Allows each data point to belong to multiple clusters with membership degrees rather than a single hard assignment.
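As a rough illustration of how such a variant is used, the sketch below assumes the third-party scikit-learn-extra package, which provides a KMedoids estimator with the familiar scikit-learn fit interface (fuzzy C-means lives in separate packages such as scikit-fuzzy and is not shown here).

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

# Same toy data as before: two well-separated blobs in 2-D.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

km = KMedoids(n_clusters=2, random_state=0).fit(X)
print(km.labels_)           # hard cluster assignment for each point
print(km.cluster_centers_)  # the medoids: actual rows of X, not computed means
```

Because medoids are real observations rather than averages, K-medoids tends to be more robust to outliers than K-means, whose means can be pulled toward extreme values.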
Centroid-based clustering is widely used due to its simplicity and effectiveness, especially with large datasets. It's like a trusty Swiss Army knife for many data grouping tasks. Let's look at some of the common use cases:
Marketing: Customer segmentation for targeted campaigns (e.g., by purchasing behavior, ...