Search⌘ K

K-Means Clustering

Learn about the k-means algorithm, its initialization, NP-hardness, and variance computation with examples.

Traditionally, in machine learning, we start with the popular partitional clustering algorithm called kk-means clustering. This algorithm divides the data into kk clusters based on a similarity score. The objective is to minimize the total variance of the kk clusters. The number of clusters, kk, must be specified.

Note: The choice of similarity metric is a hyperparameter.

Objective

K-means clustering aims to partition a dataset into kk clusters such that the total variance of the clusters is minimized. This means we want the data points within each cluster to be as close to each other as possible.

Given a set of nn data points D={x1,x2,,xn}D=\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\} in Rd\mathbf{R}^d (a dd-dimensional Euclidean space), the goal is to partition DD into k2k \ge 2We require at least two partitions because clustering a dataset into 1 cluster would simply be the original dataset itself, which is trivial. non-empty partitions, ...