K-Means Clustering

In this lesson, you’ll learn about clustering algorithms, which form the core of unsupervised learning and involve grouping similar items together.


Clustering is a famous unsupervised learning technique. It involves making clusters or groups of items such that the items in the same cluster are more similar to each other than items in the other cluster. In this lesson, we will be looking into K-means clustering.

K-means clustering

K-Means clustering as the name suggests, looks for a fixed number of clusters (KK) in the dataset. The mean or center of the KthK^{th} cluster is represented by μk\mu_{k}, which is also called the cluster centroid or average point. K-means relies on the idea of similarity/dissimilarity when assigning instances to respective clusters.

Similarity can also be thought of as proximity. It is a numerical measure of how alike two data instances are. Cosine Similarity is one of the most commonly used similarity measures. It takes values between 0 and 1, where a higher value indicates more similar instances. Cosine similarity between two feature vectors of x=(x1,x2,...,xn)x = (x_1, x_2, ... , x_n) and y=(y1,y2,...,yn)y = (y_1, y_2, ..., y_n) is given as sim(x,y)=x.yxysim(x, y) = \frac{x.y}{||x|||y||}

Where x||x|| and y||y|| are the Euclidean norm of the feature vectors xx and yy respectively. It is given as

x=x12+x22+...+xn2||x|| = \sqrt{x_1^2+x_2^2+ ... + x_n^2}
y=y12+y22+...+yn2||y|| = \sqrt{y_1^2+y_2^2+ ... + y_n^2}

The notion of dissimilarity can be understood using the concept of distance between two points. It is a numerical measure of how far apart or different two data instances are. The Euclidean distance is one of the most commonly used measures of dissimilarity between instances. Lower values indicate more similar instances. The Euclidean distance between two instances can be given as x=(x1,x2,...,xn)x = (x_1, x_2, ... , x_n) and y=(y1,y2,...,yn)y = (y_1, y_2, ..., y_n) is given as

EuclideanDistance=(y1x1)2+(y2x2)2+...+(ynxn)2EuclideanDistance =\sqrt{(y_1-x_1)^2 + (y_2 - x_2)^2 + ... + (y_n-x_n)^2}

In the case of Categorical Features, if two feature values are same then their similarity is one and vice versa.

How does the K-means algorithm work

In K-means clustering we repeat the below steps iteratively.

  1. Randomly pick KK centroids in the dataset.
  2. Add each instance to the closest centroid by computing the similarity or distance measures.
  3. Recompute the centroids. The centroid is the center (average) point of the cluster.
  4. Add each instance to the closest centroid.
  5. Recompute the centroids.
  6. Repeat until no changes occur.

Get hands-on with 1200+ tech skills courses.