K-means++ algorithm

In data mining, K-means++ is an unsupervised learning approach for selecting the initial values (seeds) for the K-means clustering algorithm. David Arthur and Sergei Vassilvitskii first proposed it in 2007.

Background

Lloyd's K-means clustering algorithm uses randomization to find initial centroids: the first K centroids are chosen uniformly at random from the data points. This random choice makes the algorithm sensitive to initialization, and a poor initial pick can degrade the quality of the final clusters.
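As a point of comparison, the naive random initialization described above can be sketched in a few lines of Python. The point coordinates here are made up for illustration:

```python
import random

def random_init(points, k, seed=None):
    """Pick k initial centroids uniformly at random (Lloyd-style init)."""
    rng = random.Random(seed)
    return rng.sample(points, k)

# Hypothetical 2-D data points.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (0.5, 1.2)]
centroids = random_init(points, 2, seed=0)
```

Because `random_init` ignores the geometry of the data, two nearby points can both be chosen as centroids, which is exactly the sensitivity K-means++ is designed to avoid.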

We can use either of the following two approaches to mitigate this initialization sensitivity:

  • Repeated K-means (run K-means several times with different random initializations and keep the best result)

  • K-means++

However, K-means++ is generally the more effective choice: its seeding guarantees a solution whose expected cost is within an O(log K) factor of the optimal K-means cost, whereas repeated random restarts offer no such guarantee.

Intuition

K-means++ is a smart centroid initialization approach. The intuition behind this technique is to spread out the K initial cluster centers. The first cluster center is picked uniformly at random from the data points being clustered, and each successive cluster center is chosen from the remaining data points with a probability proportional to the point's squared distance from the nearest existing cluster center.

Algorithm

The exact algorithm is as follows:

  • Choose the first centroid uniformly at random from the data points.

  • Compute the distance between each data point and the nearest previously chosen centroid.

  • Choose the next centroid from the data points so that the probability of selecting a point is proportional to its squared distance from the nearest previously chosen centroid.

  • Repeat the second and the third step until all K centroids have been selected.

  • Once the initial centers are determined, proceed with ordinary K-means clustering.
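The seeding steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation; function names and the sample coordinates are made up:

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=None):
    """Choose k initial centroids using the k-means++ seeding rule."""
    rng = random.Random(seed)
    # Step 1: pick the first centroid uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step 2: squared distance from each point to its nearest centroid.
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        total = sum(d2)
        if total == 0.0:  # all remaining points coincide with a centroid
            centroids.append(rng.choice(points))
            continue
        # Step 3: sample the next centroid with probability proportional
        # to the squared distance (weighted roulette-wheel selection).
        r = rng.random() * total
        cumulative = 0.0
        for p, w in zip(points, d2):
            cumulative += w
            if cumulative > r:
                centroids.append(p)
                break
    return centroids
```

The centroids returned by `kmeans_pp_init` would then be handed to an ordinary Lloyd-style K-means loop as its starting point.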

Example

Let's consider the following example. Suppose we want to form two clusters from a small set of 2-D points.

The initial step is to choose a data point uniformly at random to serve as the first cluster centroid. Assume the red point is chosen. Now compute the squared distance between every other data point and this centroid.

For illustration, take the next centroid to be the point with the greatest squared distance from the current centroid. (In the actual algorithm the choice is probabilistic, so the farthest point is only the most likely to be selected, not guaranteed.)

The blue point will be chosen as the next centroid in this case. After initializing the centroids, we can proceed with the K-Means algorithm.
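The example above can be reproduced with a few lines of Python. Since the article's figure is not shown here, the coordinates below are made up, with `(1, 1)` standing in for the red point:

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Hypothetical 2-D points forming two loose groups.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 9), (8, 9)]

first = points[0]  # suppose (1, 1) plays the role of the red point
# The example's deterministic simplification: pick the point with the
# greatest squared distance from the current centroid. Real k-means++
# samples proportionally to these squared distances instead.
second = max(points, key=lambda p: squared_distance(p, first))
print(second)  # (9, 9), playing the role of the blue point
```

With `first` and `second` as the two seeds, the standard K-means iterations would then refine the cluster assignments.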

Conclusion

K-means is a popular clustering method that aims to minimize the average squared distance between points in the same cluster. Although it does not guarantee an optimal clustering, its simplicity and speed are highly appealing in practice. The K-means++ seeding step adds some computational cost, but in exchange it guarantees, in expectation, a solution competitive with the optimal K-means solution. Trials demonstrate that the augmentation significantly improves both the speed and accuracy of K-means.


Copyright ©2025 Educative, Inc. All rights reserved