K-Means Clustering

Explore K-means clustering, a key method for partitioning data into clusters by iteratively updating centroids. Understand the role of K-means++ initialization and how mini-batch clustering speeds up the processing of large datasets with minimal loss. Learn to implement these techniques using scikit-learn objects and customize parameters like cluster count and batch size for practical machine learning applications.

Chapter Goals:

  • Learn about K-means clustering and how it works
  • Understand why mini-batch clustering is used for large datasets

A. K-means algorithm

The idea behind clustering data is simple: partition a dataset into groups of similar data observations. Finding these clusters is more complex, since there are a number of different methods for clustering datasets.

The best-known clustering method is K-means clustering. The K-means clustering algorithm separates the data into K clusters (the number of clusters, K, is chosen by the user) using cluster means, also known as centroids.

These centroids represent the "centers" of each cluster. Specifically, a cluster's centroid is equal to the average of all the data observations within the cluster. ...
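To make the definition of a centroid concrete, here is a minimal sketch in plain Python. The `centroid` helper and the sample points are hypothetical and exist only for illustration. The sketch also assumes the observations have already been grouped into clusters, so it shows only the averaging step, not the full K-means algorithm.

```python
from statistics import fmean


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    # zip(*points) groups the values of each dimension together,
    # so each dimension can be averaged independently.
    return tuple(fmean(dim) for dim in zip(*points))


# Hypothetical 2-D observations, already assigned to two clusters.
cluster_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 0.0)]
cluster_b = [(10.0, 10.0), (12.0, 14.0)]

centroid_a = centroid(cluster_a)
centroid_b = centroid(cluster_b)

print(centroid_a)  # (2.0, 2.0)
print(centroid_b)  # (11.0, 12.0)
```

Note that a centroid does not have to coincide with any observation: `(2.0, 2.0)` is the average of the three points in `cluster_a`, but it is not one of them.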