K-Means Clustering

Learn about the k-means algorithm, its initialization, NP-hardness, and variance computation with examples.

Traditionally, in machine learning, we start with the popular partitional clustering algorithm called $k$-means clustering. This algorithm divides the data into $k$ clusters based on a similarity score. The objective is to minimize the total variance of the $k$ clusters. The number of clusters, $k$, must be specified in advance.

Note: The choice of similarity score is a hyperparameter.
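As a quick illustration of the algorithm in use, here is a minimal sketch with scikit-learn's `KMeans` (which uses Euclidean distance); the small two-group dataset is purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Small 2-D dataset with two well-separated groups (illustrative data).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# k must be specified up front; n_init restarts mitigate bad initializations.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # centroid of each cluster
print(km.inertia_)          # total within-cluster variance (the objective)
```

The `inertia_` attribute is exactly the objective described above: the sum of squared Euclidean distances of each point to its assigned centroid.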


Given a set $D=\{\bold x_1, \bold x_2, \dots, \bold x_n\}$ of $n$ data points in $\R^d$, the goal is to partition $D$ into the given $k \ge 2$ partitions, say, $C_1, C_2, \dots, C_k$, such that $\sum_{j=1}^{k} var(C_j)$ is minimized. The variance $var(C_j)$ using Euclidean distance can be defined as follows:

var(Cj)=xCjxμj22var(C_j) = \sum_{\bold x \in C_j}\|\bold x-\bold \mu_j\|_2^2

Here, $\bold \mu_j$ is the centroid of the partition $C_j$. (The centroid of a set of points is the mean of the data points in the set.)

Note: The variance of a partition $C_j$ is defined in terms of Euclidean distance here; however, other distance/similarity measures can also be used.

Variance computation code example

Let’s compute the variance of a set of points using numpy:
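A direct translation of the formula above into numpy might look like this (the three-point cluster is illustrative):

```python
import numpy as np

def variance(points):
    """Sum of squared Euclidean distances from the centroid, i.e., var(C_j)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)  # mu_j: the mean of the points
    return float(np.sum(np.linalg.norm(points - centroid, axis=1) ** 2))

# Example cluster of three 2-D points (illustrative data).
C = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(variance(C))  # → 16.0 (centroid is (3, 4); squared distances 8 + 0 + 8)
```

Note that `points.mean(axis=0)` computes the centroid componentwise, and `np.linalg.norm(..., axis=1)` gives the Euclidean distance of each point to it.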
