Group the Similar Stuff
Learn how to uncover hidden patterns in unlabeled data.
As data scientists, we often work with datasets that don’t come with neat, pre-defined categories. Previously, we explored classification, where our models learned to predict a flower’s species because we already had labeled examples.
But what happens when we’re faced with a large volume of data, perhaps customer transaction histories or complex biological measurements, and no one has told us what the underlying groups are or how many distinct groups might exist? We don’t have labels, just raw information.
This is where we transition from supervised learning to unsupervised learning. Our goal here isn’t to predict a known outcome but to discover inherent patterns, structures, and natural groupings within the data. The primary technique we use for this type of exploratory analysis is clustering.
What is clustering?
Clustering is an unsupervised learning technique that partitions a dataset into distinct groups. Unlike classification algorithms, which learn from labeled data to predict predefined categories, clustering algorithms learn from unlabeled data to uncover underlying patterns and structures without prior information. They identify these groups by analyzing the similarity, or distance, between individual data points.
Common types of clustering
Understanding the different clustering approaches is key for data scientists, as each excels in different scenarios. Let’s look at the primary types of clustering, along with their common algorithms and use cases.
Centroid-based clustering
Centroid-based clustering is a straightforward and efficient method that organizes data into distinct, non-overlapping groups. Imagine we have a bunch of scattered fruits, and we want to put them into different baskets. Centroid-based clustering works similarly by finding the center (or centroid) of each group and assigning fruits to the basket whose center is closest. A centroid is essentially the average position of all data points within a cluster.
The most popular algorithm in this category is K-means clustering. It works iteratively:

1. Randomly initialize K centroids.
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of its assigned points.
4. Repeat the assignment and update steps until the centroids stabilize (no significant movement).
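To make these steps concrete, here is a minimal NumPy sketch of the K-means loop. The kmeans function and the toy data are purely illustrative (and the sketch skips edge cases such as empty clusters); in practice, you would typically reach for a tested implementation like scikit-learn’s KMeans.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize k centroids by sampling actual data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centroids no longer move significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: two well-separated blobs in 2-D.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g., [0 0 0 1 1 1] (cluster ids may be swapped)
print(centroids)  # approximate centers of the two blobs
```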
Other alternative algorithms include K-medoids and fuzzy C-means:
K-medoids: Uses actual data points as the cluster centers (medoids) instead of means.
Fuzzy C-means: Allows each data point to belong to multiple clusters with membership degrees rather than a single hard assignment.
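As a rough illustration of how such a variant is used, the sketch below assumes the third-party scikit-learn-extra package, which provides a KMedoids estimator with the familiar scikit-learn fit interface (fuzzy C-means lives in separate packages such as scikit-fuzzy and is not shown here).

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

# Same toy data as before: two well-separated blobs in 2-D.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

km = KMedoids(n_clusters=2, random_state=0).fit(X)
print(km.labels_)           # hard cluster assignment for each point
print(km.cluster_centers_)  # the medoids: actual rows of X, not computed means
```

Because medoids are real observations rather than averages, K-medoids tends to be more robust to outliers than K-means, whose means can be pulled toward extreme values.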
Centroid-based clustering is widely used due to its simplicity and effectiveness, especially with large datasets. It's like a trusty Swiss Army knife for many data grouping tasks. Let's look at some of the common use cases:
Marketing: Customer segmentation for targeted campaigns (e.g., by purchasing behavior, ...