
K-Means Walk-Through Example

Understand how K-means clustering partitions data by iteratively assigning points to clusters and updating centroids. Learn step-by-step with a Python example using synthetic data and sklearn. Gain hands-on experience visualizing clusters and how the algorithm converges to minimize variance within clusters.

In the previous lesson, we discussed that K-means clustering is an algorithm designed to partition a dataset into k distinct clusters by minimizing the total variance within each cluster. In this lesson, we will move from theory to practice by performing a dry run of the K-means algorithm on a small synthetic dataset. We will follow Lloyd’s algorithm step-by-step, using Python and the sklearn library to visualize how data points are assigned to clusters and how centroids are iteratively updated.

K-means algorithm

For a given dataset and value of k, k-means clustering has the following steps:

  1. Choose k: Select the number of clusters, k, such that k ≥ 2.
  2. Initialize centroids: Choose k centroids, typically at random, as initial cluster centers.
  3. Calculate dissimilarity: Find the dissimilarity score (a distance, e.g., Euclidean) of each data point with respect to each centroid.
  4. Assign clusters: Based on the dissimilarity score, assign each data point to the cluster whose centroid is the closest (least distant).
  5. Update centroids: From these new groupings, find new centroids by taking the mean of all data points assigned to each cluster.
  6. Repeat: Repeat steps 3 to 5 until the difference between old and new centroids is negligible (convergence).

If the steps above seem unclear, don’t worry. We will illustrate each step with an example.
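The steps above can also be sketched directly in NumPy before we walk through them by hand. This is a minimal illustration only; the helper name kmeans_lloyd and the initialization scheme (sampling k data points as the starting centroids) are our own choices, not part of the lesson's code:

```python
import numpy as np

def kmeans_lloyd(points, k, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 3-4: distance of every point to every centroid, then nearest wins
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

points = np.array([[1, 2], [2, 1], [2, 1.5], [2.5, 3.5], [3, 4], [4, 3.5], [4, 7.5],
                   [5, 6], [5, 7], [5.5, 2], [6, 1.5], [6, 3], [6, 5.5], [6.5, 5], [7, 2.5]])
centroids, labels = kmeans_lloyd(points, k=3)
```

The dry run below performs exactly these iterations, one step at a time.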

Dry running the example

Let’s say we have the following dataset:

Actual dataset

Step 1: Plotting the data

Run the following code widget to plot the data. Here, x and y contain the x and y-coordinates of our synthetic data points.

Python 3.10.4
import matplotlib.pyplot as plt
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Plotting the actual datapoints
plt.scatter(x, y, color = 'white', edgecolor = 'black')
plt.show()

Let’s start with the first step of k-means clustering and decide how many clusters we want, if the number isn’t given already. Let the number of clusters be three, that is, k = 3.

Step 2: Assigning values to centroids

The second step is to initialize k centroids with random values. Since k is 3, we get three centroids μ_1, μ_2, and μ_3. Assigning them random values yields:

Centroids assignment

In the following code, Cx and Cy represent x and y coordinates of the centroids:

Python 3.10.4
import matplotlib.pyplot as plt
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5];
colors = ['red', 'blue', 'green']
# Plotting the actual datapoints
plt.scatter(x, y, color = 'white', edgecolor = 'black')
# Plotting centroids
for ctr, clr in zip(range(len(Cx)), colors):
    plt.plot(Cx[ctr], Cy[ctr], color = clr, marker = 's', markersize=10, alpha = 0.2)
plt.show()

Step 3: Calculating the dissimilarity score

The third step is to find the dissimilarity score of each data point (15 total) with each centroid. We’ll be using the Euclidean distance as the dissimilarity score. The function euclidean_distances takes two arrays, where each array is an array of points. Let’s see how to calculate the dissimilarity score using sklearn:

Python 3.10.4
from sklearn.metrics.pairwise import euclidean_distances as dis_score
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5];
# Converting data points and centers into array of arrays for using dis_score
data_points = [[x[i], y[i]] for i in range(len(x))]
centers = [[Cx[i], Cy[i]] for i in range(len(Cx))]
print(dis_score(data_points, centers))

Here is the explanation for the code above:

  • Lines 3–7: We define two lists x and y representing the x and y-coordinates of the data points. Similarly, Cx and Cy represent the x and y-coordinates of initial centroids.

  • Lines 10–11: We convert the data points and the centroid coordinates into arrays of 2-D points because euclidean_distances expects its inputs to be matrices, where each row represents one point. The function computes pairwise distances between every point in the first matrix and every point in the second. By restructuring the data into lists of [x, y] pairs, we ensure the function receives properly formatted inputs and can correctly compute the distances.

  • Line 13: We use the euclidean_distances function to calculate the Euclidean distances between data points and centroids and print the resulting array.

The code output will be a 2D array where each row represents a data point, and each column represents a centroid. The value at position (i, j) in the array is the Euclidean distance between the i-th data point and the j-th centroid.
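As a quick sanity check on this layout, we can recompute one entry by hand. The snippet below (our own illustration, using just the first three data points) confirms that entry (0, 0) is the distance from (1, 2) to the centroid (1, 1):

```python
from sklearn.metrics.pairwise import euclidean_distances

data_points = [[1, 2], [2, 1], [2, 1.5]]   # first three points of the lesson's dataset
centers = [[1, 1], [7, 2], [5, 6.5]]       # the three initial centroids

D = euclidean_distances(data_points, centers)
print(D.shape)   # (3, 3): one row per data point, one column per centroid
print(D[0, 0])   # sqrt((1-1)^2 + (1-2)^2) = 1.0
```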

The dissimilarity scores were calculated using sklearn and are also given below:

Dissimilarity Scores

| Data Points | Centroid_1 (1, 1) | Centroid_2 (7, 2) | Centroid_3 (5, 6.5) |
| --- | --- | --- | --- |
| 1, 2 | 1 | 6 | 6.020797289 |
| 2, 1 | 1 | 5.099019514 | 6.264982043 |
| 2, 1.5 | 1.118033989 | 5.024937811 | 5.830951895 |
| 2.5, 3.5 | 2.915475947 | 4.74341649 | 3.905124838 |
| 3, 4 | 3.605551275 | 4.472135955 | 3.201562119 |
| 4, 3.5 | 3.905124838 | 3.354101966 | 3.16227766 |
| 4, 7.5 | 7.158910532 | 6.264982043 | 1.414213562 |
| 5, 6 | 6.403124237 | 4.472135955 | 0.5 |
| 5, 7 | 7.211102551 | 5.385164807 | 0.5 |
| 5.5, 2 | 4.609772229 | 1.5 | 4.527692569 |
| 6, 1.5 | 5.024937811 | 1.118033989 | 5.099019514 |
| 6, 3 | 5.385164807 | 1.414213562 | 3.640054945 |
| 6, 5.5 | 6.726812024 | 3.640054945 | 1.414213562 |
| 6.5, 5 | 6.800735254 | 3.041381265 | 2.121320344 |
| 7, 2.5 | 6.184658438 | 0.5 | 4.472135955 |
Now, don’t panic seeing the giant table. All this table tells us is the distance of each data point from each centroid. For example, let’s compute the distance of the first data point (1, 2) (call it D_1) from the first centroid μ_1 = (1, 1):

$D_1\mu_1 = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} = \sqrt{(1 - 1)^2 + (1 - 2)^2} = 1$

Step 4: Assigning the clusters

After calculating the distances of each point from the centroids, the fourth step is to assign each point to a cluster. This is done by selecting, for each data point, the centroid that is least distant and assigning the point to that centroid’s cluster.

Python 3.10.4
from sklearn.metrics.pairwise import euclidean_distances as dis_score
import pandas as pd
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5];
# Converting data points and centers into array of arrays for using dis_score
data_points = [[x[i], y[i]] for i in range(len(x))]
centers = [[Cx[i], Cy[i]] for i in range(len(Cx))]
# Computing dissimilarity scores using sklearn
dissimilarity_scores = dis_score(data_points, centers)
# Creating a dataframe to represent clusters in tabular form
points_4_index = [f'{pt[0]},{pt[1]}' for pt in data_points]
df = pd.DataFrame(dissimilarity_scores,
                  index = points_4_index,
                  columns = ['mu_1(1,1)', 'mu_2(7,2)', 'mu_3(5,6.5)'])
df.index.name = 'Data Points'
# Assigning clusters to each data point by selecting the centroid which is least distant from it.
df['Clusters'] = df.idxmin(axis=1)
print(df)

This code creates a pandas DataFrame (a data structure that organizes data into a 2-dimensional table of rows and columns) to represent the clusters in tabular form.

Here is the explanation for the code above:

  • Line 15: Computes the Euclidean distance between every data point and every centroid.
  • Line 18: Creates string labels such as “1,2” and “2,1” to use as DataFrame row names.
  • Lines 19–21: Stores the distance matrix in a clean table format:
    • Rows = data points
    • Columns = distance to each centroid
  • Line 22: Labels the index column for readability.
  • Line 25: Assigns each point to the nearest centroid. idxmin(axis=1) finds the column with the smallest distance for each row.
  • Line 26: The DataFrame is printed to the console using the print statement.
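For comparison, the same nearest-centroid selection can also be done without pandas by calling argmin on the raw distance matrix. This is an equivalent NumPy sketch of the assignment step, not the lesson's widget code:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
data_points = np.column_stack([x, y])
centers = np.array([[1, 1], [7, 2], [5, 6.5]])

# For each row (data point), the column index of the smallest distance
# is the nearest centroid: 0 -> C1, 1 -> C2, 2 -> C3
labels = euclidean_distances(data_points, centers).argmin(axis=1)
print(labels)  # [0 0 0 0 2 2 2 2 2 1 1 1 2 2 1]
```

These labels match the Cluster column of the assignment table that follows.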

The df DataFrame can also be visualized in tabular form as seen below:

Cluster Assignment

| Data Points | Centroid_1 (1, 1) | Centroid_2 (7, 2) | Centroid_3 (5, 6.5) | Cluster |
| --- | --- | --- | --- | --- |
| 1, 2 | 1 | 6 | 6.020797289 | C1 |
| 2, 1 | 1 | 5.099019514 | 6.264982043 | C1 |
| 2, 1.5 | 1.118033989 | 5.024937811 | 5.830951895 | C1 |
| 2.5, 3.5 | 2.915475947 | 4.74341649 | 3.905124838 | C1 |
| 3, 4 | 3.605551275 | 4.472135955 | 3.201562119 | C3 |
| 4, 3.5 | 3.905124838 | 3.354101966 | 3.16227766 | C3 |
| 4, 7.5 | 7.158910532 | 6.264982043 | 1.414213562 | C3 |
| 5, 6 | 6.403124237 | 4.472135955 | 0.5 | C3 |
| 5, 7 | 7.211102551 | 5.385164807 | 0.5 | C3 |
| 5.5, 2 | 4.609772229 | 1.5 | 4.527692569 | C2 |
| 6, 1.5 | 5.024937811 | 1.118033989 | 5.099019514 | C2 |
| 6, 3 | 5.385164807 | 1.414213562 | 3.640054945 | C2 |
| 6, 5.5 | 6.726812024 | 3.640054945 | 1.414213562 | C3 |
| 6.5, 5 | 6.800735254 | 3.041381265 | 2.121320344 | C3 |
| 7, 2.5 | 6.184658438 | 0.5 | 4.472135955 | C2 |

The table above shows us which data point got assigned to which cluster. For example, the data point D_1 (1, 2) got assigned to the C_1 cluster. This means that D_1 was closest to the centroid of C_1. Let’s see this visually:

Cluster assignment

Let’s code this step in Python now:

Python 3.10.4
from sklearn.metrics.pairwise import euclidean_distances as dis_score
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
import pandas as pd
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5];
# Converting data points and centers into array of arrays for using dis_score
data_points = [[x[i], y[i]] for i in range(len(x))]
centers = [[Cx[i], Cy[i]] for i in range(len(Cx))]
# Computing dissimilarity scores using sklearn
dissimilarity_scores = dis_score(data_points, centers)
# Making a list of tuples for index column of DataFrame df
points_4_index = [(pt[0], pt[1]) for pt in data_points]
# Creating a DataFrame to represent clusters in tabular form
df = pd.DataFrame(dissimilarity_scores,
                  index = points_4_index,
                  columns = ['C1(1,1)', 'C2(7,2)', 'C3(5,6.5)'])
df.index.name = 'Data Points'
# Assigning clusters to each data point by selecting the centroid which is least distant from it
df['Clusters'] = df.idxmin(axis=1)
# Adding an encoded column based on df['Clusters'] so that all data points within a cluster
# get the same color, as plt.scatter takes a parameter "c" which is an array-like of colors
encoder = OrdinalEncoder()
df['Clusters_encoded'] = encoder.fit_transform(df[['Clusters']])
df = df.astype({'Clusters_encoded':'int'})
# Defining a color mapping function to visualize the clusters properly
def map_color(val):
    if val==0:
        return 'red'
    elif val==1:
        return 'blue'
    elif val==2:
        return 'green'
colors = map(map_color, df['Clusters_encoded'].values)
# Plotting datapoints in the form of clusters
plt.scatter(x=x, y=y, c=list(colors), alpha = 0.4)
plt.scatter(x=Cx, y=Cy, marker='s', c=['red', 'blue', 'green'], alpha = 0.4, s = 75)
plt.show()

Here is the explanation for the code above:

  • Lines 33–35: Encodes cluster labels ('C1(1,1)', 'C2(7,2)', 'C3(5,6.5)') as integers: 0, 1, 2 and converts the encoded column to integers for easy use in plotting (plt.scatter(c=...)). This ensures that each cluster can be mapped consistently to a color.

  • Lines 38–44: Defines a function to assign a color to each cluster based on the encoded integer value.

  • Line 46: Applies the color mapping to all data points to create a list of colors for plotting.

  • Lines 49–51: The code plots the data points with colors corresponding to their assigned clusters and overlays the centroids as squares using the same cluster colors. The alpha=0.4 parameter makes the points semi-transparent, while s=75 ensures that the centroid markers are clearly visible.

Step 5: Recomputing the centroids

Alright, now we’re getting somewhere. The plot above already shows rough cluster structure. Here comes our fifth step, which recomputes the centroid of each cluster.

The cluster C_1 consists of four data points (1, 2), (2, 1), (2, 1.5), and (2.5, 3.5), that is, C_1 = {(1, 2), (2, 1), (2, 1.5), (2.5, 3.5)}. The centroid μ_1 of C_1 can be calculated as follows:

$\mu_1 = \left(\dfrac{1+2+2+2.5}{4}, \dfrac{2+1+1.5+3.5}{4}\right) = (1.875, 2)$

Similarly, the centroid μ_2 of the cluster C_2 = {(5.5, 2), (6, 1.5), (6, 3), (7, 2.5)} can be calculated as follows:

$\mu_2 = \left(\dfrac{5.5+6+6+7}{4}, \dfrac{2+1.5+3+2.5}{4}\right) = (6.125, 2.25)$

Finally, the centroid μ_3 of the cluster C_3 = {(3, 4), (4, 3.5), (4, 7.5), (5, 6), (5, 7), (6, 5.5), (6.5, 5)} can be calculated as follows:

$\mu_3 = \left(\dfrac{3+4+4+5+5+6+6.5}{7}, \dfrac{4+3.5+7.5+6+7+5.5+5}{7}\right) \approx (4.786, 5.5)$
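These hand calculations can be verified with NumPy's column-wise mean over each cluster's points. This is a small check of our own; the arrays below simply restate the cluster memberships listed above:

```python
import numpy as np

C1 = np.array([[1, 2], [2, 1], [2, 1.5], [2.5, 3.5]])
C2 = np.array([[5.5, 2], [6, 1.5], [6, 3], [7, 2.5]])
C3 = np.array([[3, 4], [4, 3.5], [4, 7.5], [5, 6], [5, 7], [6, 5.5], [6.5, 5]])

# Column-wise means give the updated centroids:
# mu_1 = (1.875, 2), mu_2 = (6.125, 2.25), mu_3 ≈ (4.786, 5.5)
for name, cluster in [('mu_1', C1), ('mu_2', C2), ('mu_3', C3)]:
    print(name, cluster.mean(axis=0))
```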

Now, let’s see how our centroids have moved.

Updating the centroids

The above illustration shows the new positions of the updated centroids. This looks promising, as these updated centroid locations truly represent the centers of their clusters.

Step 6: Repeating the steps

Lastly, moving on to the sixth step. This step says that if the old and updated centroids differ, we must perform steps 3–5 again. For simplicity’s sake, we’ll fast-forward this process via Python code.

Let’s put this all together. In the following coding widget, update_clusters() is responsible for assigning clusters to each data point, and update_centroids() will update the centroid of each cluster by taking the mean of the data points within that cluster. Furthermore, we can control the number of iterations performed by k-means by updating iterations in line 67.

Python 3.10.4
from sklearn.metrics.pairwise import euclidean_distances as dis_score
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5];
def update_clusters(x, y, Cx, Cy):
    # Converting data points and centers into array of arrays for using dis_score
    data_points = [[x[i], y[i]] for i in range(len(x))]
    centers = [[Cx[i], Cy[i]] for i in range(len(Cx))]
    # Computing dissimilarity scores using sklearn
    dissimilarity_scores = dis_score(data_points, centers)
    # Making a list of tuples for index column of DataFrame df
    points_4_index = [(pt[0], pt[1]) for pt in data_points]
    # Creating a DataFrame to represent clusters in tabular form
    df = pd.DataFrame(dissimilarity_scores,
                      index = points_4_index,
                      columns = ['C1(1,1)', 'C2(7,2)', 'C3(5,6.5)'])
    df.index.name = 'Data Points'
    # Assigning clusters to each data point by selecting the centroid that is least distant from it
    df['Clusters'] = df.idxmin(axis=1)
    # Adding an encoded column based on df['Clusters'] so that all data points within a cluster
    # get the same color, as plt.scatter takes a parameter "c" which is an array-like of colors
    encoder = OrdinalEncoder()
    df['Clusters_encoded'] = encoder.fit_transform(df[['Clusters']])
    df = df.astype({'Clusters_encoded':'int'})
    return df
# Updating means for centroids
def update_centroids(df):
    means = np.zeros([len(df['Clusters_encoded'].unique()), 2])
    for cluster_id in df['Clusters_encoded'].unique():
        count = 0
        for pt in df.index.where(df['Clusters_encoded']==cluster_id):
            if pt == pt: # Filtering out 'nan'
                means[cluster_id][0] += pt[0]
                means[cluster_id][1] += pt[1]
                count += 1
        means[cluster_id][0] /= count
        means[cluster_id][1] /= count
    return means
# Defining a color mapping function to visualize the clusters properly
def map_color(val):
    if val==0:
        return 'red'
    elif val==1:
        return 'blue'
    elif val==2:
        return 'green'
# Change the no. of iterations to observe the change in clusters
iterations = 4
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
axs = axs.ravel()
for i in range(iterations):
    df = update_clusters(x, y, Cx, Cy)
    colors = map(map_color, df['Clusters_encoded'].values)
    axs[i].scatter(x=x, y=y, c=list(colors), alpha=0.5)
    axs[i].scatter(x=Cx, y=Cy, marker='s', c=['red', 'blue', 'green'], alpha=0.5, s=75)
    axs[i].title.set_text('Iterations ' + str(i+1))
    means = update_centroids(df)
    Cx, Cy = means[:, 0], means[:, 1]
plt.show()

Following is the explanation for the code above:

  • Lines 13–39: The first function, update_clusters, takes the dataset and centroid positions as inputs, computes the dissimilarity scores (using the Euclidean distance measure) between each data point and each centroid, assigns each data point to the cluster whose centroid is the closest, and encodes the clusters with colors.

  • Lines 42–55: The second function, update_centroids, is responsible for updating the positions of the centroids.

    • Line 42: Defines a function update_centroids that takes a DataFrame df as input. This DataFrame contains the data points and their assigned cluster labels (encoded as integers in Clusters_encoded).
    • Line 43: Initializes a NumPy array means with zeros to store the updated centroid positions. The shape is [number of clusters, 2], because each centroid has an x and y coordinate. len(df['Clusters_encoded'].unique()) gives the number of unique clusters.
    • Line 44: Loops over each cluster ID to compute the mean position of points assigned to that cluster.
    • Line 45: Initializes a counter count to track the number of data points belonging to the current cluster.
    • Line 46: Iterates over the index of df (which are tuples representing the x, y coordinates of data points). df.index.where(df['Clusters_encoded']==cluster_id) selects points that belong to the current cluster.
    • Line 47: Checks if the point is not NaN. This is necessary because where() fills non-matching entries with NaN.
    • Lines 48–49: Adds the x-coordinate (pt[0]) and y-coordinate (pt[1]) of the point to the running sum for the current cluster centroid.
    • Line 50: Increments the counter for the number of points in the current cluster.
    • Lines 52–53: Divides the summed coordinates by count to calculate the mean x and y positions of all points in the cluster. This gives the updated centroid location for the current cluster.
    • Line 55: Returns the array means containing the updated centroid coordinates for all clusters.
  • Lines 58–64: The third function, map_color, is used to map an integer value to a color to visualize the clusters correctly.

  • Lines 67–77: Finally, the code runs a loop for a specified number of iterations. It updates the clusters and the centroids and plots the results at each iteration using the matplotlib library. The output is a set of subplots showing how the clusters evolve over the iterations.
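As an aside, the summation loop inside update_centroids can be written more compactly with a pandas groupby. The sketch below is our own, using three toy points, and assumes the same DataFrame layout: an index of (x, y) tuples and an integer Clusters_encoded column:

```python
import pandas as pd

# Toy stand-in for the df produced by update_clusters
df = pd.DataFrame({'Clusters_encoded': [0, 0, 1]},
                  index=[(1.0, 2.0), (2.0, 1.0), (6.0, 3.0)])

# Unpack the (x, y) index into columns, then average per cluster
coords = pd.DataFrame(df.index.tolist(), columns=['x', 'y'])
coords['cluster'] = df['Clusters_encoded'].values
means = coords.groupby('cluster')[['x', 'y']].mean().to_numpy()
print(means)  # row j is the updated centroid of cluster j
```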

Visual comparison of actual and clustered data using k-means

Now, we’ll see how sklearn performs the same job.

Python 3.10.4
# Importing required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = np.array([1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7])
y = np.array([2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5])
# Transforming dimensions to perform KMeans.fit()
x = np.expand_dims(x, axis = 1)
y = np.expand_dims(y, axis = 1)
X = np.concatenate((x, y), axis=1)
# Partitioning the dataset into 3 clusters using sklearn.cluster.KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# Defining a color mapping function to visualize the clusters properly
def map_color(val):
    if val==0:
        return 'red'
    elif val==1:
        return 'blue'
    elif val==2:
        return 'green'
    else:
        return 'hotpink'
# Plotting the actual dataset along with the one where clusters
# have been highlighted side by side
fig, axs = plt.subplots(1, 2, figsize = (16, 8))
axs[0].scatter(x=X[:,0], y=X[:,1])
axs[0].set_title("Actual dataset", size = 20)
colors = map(map_color, y_kmeans)
axs[1].scatter(x=X[:, 0], y=X[:, 1], c=list(colors), alpha = 0.5)
axs[1].set_title('Clustering using k-means', size = 20)
centers = np.array(kmeans.cluster_centers_)
axs[1].scatter(centers[:,0], centers[:,1], marker="s", c=['red', 'blue', 'green'], s=50, alpha=0.4)
plt.show()

Following is the explanation for the code above:

  • Lines 6–12: The code first defines the dataset by creating arrays x and y and concatenating them to form the 2D array X.

  • Line 15: Next, the code instantiates the KMeans class from the sklearn.cluster package to partition the dataset into three clusters.

  • Line 16: The fit() method of the KMeans class is called on the data X, which performs the clustering and assigns each data point to one of the three clusters.

  • Line 17: The predict() method is called on the fitted model to obtain the cluster label assigned to each data point.

  • Lines 32–40: To visualize the clustering results, the code defines a color mapping function map_color() that maps the cluster assignments to colors. Then, it plots the actual dataset and the dataset with the clusters highlighted side by side using plt.subplots(). The actual dataset and clustered data are plotted using the scatter() function with the colors defined by map_color(). The cluster_centers_ attribute of the KMeans object is used to plot the cluster centers as squares on the clustered dataset plot. Finally, the code uses plt.show() to display the plot.
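Once fit() has run, the KMeans object also exposes attributes that summarize the result, which is handy for checking convergence. These are standard sklearn attributes; X is the same array as in the widget above:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 1], [2, 1.5], [2.5, 3.5], [3, 4], [4, 3.5], [4, 7.5],
              [5, 6], [5, 7], [5.5, 2], [6, 1.5], [6, 3], [6, 5.5], [6.5, 5], [7, 2.5]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.n_iter_)           # Lloyd iterations the best run took to converge
print(kmeans.inertia_)          # total within-cluster sum of squared distances
print(kmeans.cluster_centers_)  # final centroid positions, shape (3, 2)
```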

Conclusion

This walk-through demonstrated the core mechanism of the K-means algorithm using Lloyd’s iterative approach.

We observed that by iteratively calculating the distance (or dissimilarity), assigning points to the closest centroid, and then recomputing the centroid as the mean of the new cluster members, the algorithm minimizes the total within-cluster variance.

This process enables the initial, randomly selected cluster centers to converge into stable positions that represent the true means of the data groups.