
K-Means Walk-Through Example

Explore the K-means clustering algorithm through a detailed walk-through using Python and sklearn. Understand how to initialize centroids, calculate distances, assign points to clusters, and iteratively update centroids until convergence. This lesson offers practical coding insights and visualizations to solidify your grasp of clustering.

In the previous lesson, we discussed that K-means clustering is an algorithm designed to partition a dataset into k distinct clusters by minimizing the total variance within each cluster. In this lesson, we will move from theory to practice by performing a dry run of the K-means algorithm on a small synthetic dataset. We will follow Lloyd’s algorithm step-by-step, using Python and the sklearn library to visualize how data points are assigned to clusters and how centroids are iteratively updated.

K-means algorithm

For a given dataset and value of k, k-means clustering has the following steps:

  1. Choose k: Select the number of clusters, k, such that k ≥ 2.
  2. Initialize centroids: Choose k centroids, typically at random, as the initial cluster centers.
  3. Calculate dissimilarity: Compute the dissimilarity score (a distance, e.g., Euclidean) of each data point with respect to each centroid.
  4. Assign clusters: Based on the dissimilarity score, assign each data point to the cluster whose centroid is the closest (least distant).
  5. Update centroids: From these new groupings, recompute each centroid as the mean of all data points assigned to its cluster.
  6. Repeat: Repeat steps 3 to 5 until the difference between old and new centroids is negligible (convergence).

If the steps above seem unclear, don’t worry. We will illustrate each step with an example.
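Before we walk through the dry run, the steps above can be sketched as a minimal NumPy implementation of Lloyd’s algorithm. This is an illustrative sketch, not the lesson’s own code; the function name `kmeans` and its parameters are our own choices:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: returns final centroids and cluster labels."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 3-4: distance from every point to every centroid, then
        # assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its assigned points
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(k)]
        )
        # Step 6: stop once the centroids barely move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Note that this sketch assumes every cluster keeps at least one point; production implementations such as sklearn’s also handle empty clusters and multiple random restarts.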

Dry running the example

Let’s say we have the following dataset:

Actual dataset

Step 1: Plotting the data

Run the following code widget to plot the data. Here, x and y contain the x- and y-coordinates of our synthetic data.

Python 3.10.4
import matplotlib.pyplot as plt
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Plotting the actual datapoints
plt.scatter(x, y, color = 'white', edgecolor = 'black')
plt.show()

Let’s start with the first step of k-means clustering and decide how many clusters we want if the number isn’t given already. Let the number of clusters be three, which means k = 3.

Step 2: Assigning values to centroids

The second step is to assign random values to the k centroids. Since k is 3, we’ll get three centroids μ1, μ2, and μ3. Assigning them random values yields:

Centroids assignment

In the following code, Cx and Cy represent x and y coordinates of the centroids:

Python 3.10.4
import matplotlib.pyplot as plt
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5]
colors = ['red', 'blue', 'green']
# Plotting the actual datapoints
plt.scatter(x, y, color = 'white', edgecolor = 'black')
# Plotting centroids
for ctr, clr in zip(range(len(Cx)), colors):
    plt.plot(Cx[ctr], Cy[ctr], color = clr, marker = 's', markersize=10, alpha = 0.2)
plt.show()

Step 3: Calculating the dissimilarity score

The third step is to find the dissimilarity score of each data point (15 total) with each centroid. We’ll be using the Euclidean distance as the dissimilarity score. The function euclidean_distances takes two arrays, where each array is an array of points. Let’s see how to calculate the dissimilarity score using sklearn:

Python 3.10.4
from sklearn.metrics.pairwise import euclidean_distances as dis_score
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5]
# Converting data points and centers into array of arrays for using s_score
data_points = [[x[i], y[i]] for i in range(len(x))]
centers = [[Cx[i], Cy[i]] for i in range(len(Cx))]
print(dis_score(data_points, centers))

Here is the explanation for the code above:

  • Lines 2–5: We define two lists, x and y, representing the x- and y-coordinates of the data points. Similarly, Cx and Cy represent the x- and y-coordinates of the initial centroids.

  • Lines 7–8: We convert the data points and the centroid coordinates into arrays of 2-D points because euclidean_distances expects its inputs to be matrices, where each row represents one point. The function computes pairwise distances between every point in the first matrix and every point in the second. By restructuring the data into lists of [x, y] pairs, we ensure the function receives properly formatted inputs and can correctly compute the distances.

  • Line 9: We use the euclidean_distances function to calculate the Euclidean distances between data points and centroids and print the resulting array.
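As a quick sanity check (this snippet is not part of the lesson’s widget), we can confirm the output shape on the first two data points and verify one entry against Python’s built-in math.dist:

```python
from math import dist
from sklearn.metrics.pairwise import euclidean_distances

data_points = [[1, 2], [2, 1]]        # the first two points of the dataset
centers = [[1, 1], [7, 2], [5, 6.5]]  # the three initial centroids

scores = euclidean_distances(data_points, centers)
# One row per data point, one column per centroid
print(scores.shape)  # (2, 3)
# The (0, 0) entry is the distance between point (1, 2) and centroid (1, 1)
print(scores[0][0])  # 1.0, matching dist([1, 2], [1, 1])
```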

The code output will be a 2D array where each row represents a data point, and each column represents a centroid. The value at position (i, j) in the array represents the Euclidean distance between the i-th data point and the j-th centroid.
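Once we have this distance matrix, steps 4 and 5 follow directly: each point joins the cluster of its nearest centroid (the column with the smallest value in its row), and each centroid is recomputed as the mean of its cluster. The following sketch, which assumes NumPy and is not the lesson’s own widget code, shows one such iteration:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
Cx = [1, 7, 5]; Cy = [1, 2, 6.5]

data_points = np.array(list(zip(x, y)))
centers = np.array(list(zip(Cx, Cy)))

# Step 4: each point is assigned to the cluster of its nearest centroid
scores = euclidean_distances(data_points, centers)
labels = scores.argmin(axis=1)   # index of the closest centroid per point

# Step 5: each new centroid is the mean of the points in its cluster
new_centers = np.array([data_points[labels == j].mean(axis=0) for j in range(3)])
print(labels)
print(new_centers)
```

Repeating these two steps until `new_centers` stops changing completes the algorithm.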