K-means++ algorithm

In data mining, K-means++ is an unsupervised learning approach for selecting the initial values (seeds) for the K-means clustering algorithm. David Arthur and Sergei Vassilvitskii first proposed it in 2007.

Background

Lloyd's K-means clustering algorithm uses randomization to find initial centroids: the first K centroids are chosen uniformly at random from the data points. This random choice makes the algorithm sensitive to initialization, and a poor initial pick can degrade the quality of the final clusters.
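As a point of comparison, the naive random initialization described above can be sketched in a few lines of Python. The point coordinates here are made up for illustration:

```python
import random

def random_init(points, k, seed=None):
    """Pick k initial centroids uniformly at random (Lloyd-style init)."""
    rng = random.Random(seed)
    return rng.sample(points, k)

# Hypothetical 2-D data points.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (0.5, 1.2)]
centroids = random_init(points, 2, seed=0)
```

Because `random_init` ignores the geometry of the data, two nearby points can both be chosen as centroids, which is exactly the sensitivity K-means++ is designed to avoid.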

We can use either of the following two approaches to mitigate this initialization sensitivity:

  • Repeated K-means (run K-means several times with different random initializations and keep the best result)

  • K-means++

However, K-means++ is generally the more effective choice: its seeding guarantees a solution whose expected cost is within an O(log K) factor of the optimal K-means cost, whereas repeated random restarts offer no such guarantee.

Intuition

K-means++ is a smart centroid initialization approach. The intuition behind this technique is to spread out the K initial cluster centers. The first cluster center is picked uniformly at random from the data points being clustered, and each successive cluster center is chosen from the remaining data points with a probability proportional to the point's squared distance from the nearest existing cluster center.

Algorithm

The exact algorithm is as follows:

  • Choose the first centroid uniformly at random from the data points.

  • Compute the distance between each data point and the nearest previously chosen centroid.

  • Choose the next centroid from the data points so that the probability of selecting a point is proportional to its squared distance from the nearest previously chosen centroid.

  • Repeat the second and the third step until all K centroids have been selected.

  • Once the initial centers are determined, proceed with ordinary K-means clustering.
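The seeding steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation; function names and the sample coordinates are made up:

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=None):
    """Choose k initial centroids using the k-means++ seeding rule."""
    rng = random.Random(seed)
    # Step 1: pick the first centroid uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step 2: squared distance from each point to its nearest centroid.
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        total = sum(d2)
        if total == 0.0:  # all remaining points coincide with a centroid
            centroids.append(rng.choice(points))
            continue
        # Step 3: sample the next centroid with probability proportional
        # to the squared distance (weighted roulette-wheel selection).
        r = rng.random() * total
        cumulative = 0.0
        for p, w in zip(points, d2):
            cumulative += w
            if cumulative > r:
                centroids.append(p)
                break
    return centroids
```

The centroids returned by `kmeans_pp_init` would then be handed to an ordinary Lloyd-style K-means loop as its starting point.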

Example

Let's consider the following example. Suppose we want to form two clusters from a small set of 2-D points.

The initial step is to choose a data point uniformly at random to serve as the first cluster centroid. Assume the red point is chosen. Now compute the squared distance between every other data point and this centroid.

For illustration, take the next centroid to be the point with the greatest squared distance from the current centroid. (In the actual algorithm the choice is probabilistic, so the farthest point is only the most likely to be selected, not guaranteed.)

The blue point will be chosen as the next centroid in this case. After initializing the centroids, we can proceed with the K-Means algorithm.
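The example above can be reproduced with a few lines of Python. Since the article's figure is not shown here, the coordinates below are made up, with `(1, 1)` standing in for the red point:

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Hypothetical 2-D points forming two loose groups.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 9), (8, 9)]

first = points[0]  # suppose (1, 1) plays the role of the red point
# The example's deterministic simplification: pick the point with the
# greatest squared distance from the current centroid. Real k-means++
# samples proportionally to these squared distances instead.
second = max(points, key=lambda p: squared_distance(p, first))
print(second)  # (9, 9), playing the role of the blue point
```

With `first` and `second` as the two seeds, the standard K-means iterations would then refine the cluster assignments.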

Conclusion

K-means is a popular clustering method that aims to minimize the average squared distance between points in the same cluster. Although it does not guarantee an optimal clustering, its simplicity and speed are highly appealing in practice. The K-means++ seeding step adds some computational cost, but in exchange it guarantees, in expectation, a solution competitive with the optimal K-means solution. Trials demonstrate that the augmentation significantly improves both the speed and accuracy of K-means.


Copyright ©2025 Educative, Inc. All rights reserved