K-Means Walk-Through Example
Explore the K-means clustering algorithm through a detailed walk-through using Python and sklearn. Understand how to initialize centroids, calculate distances, assign points to clusters, and iteratively update centroids until convergence. This lesson offers practical coding insights and visualizations to solidify your grasp of clustering.
In the previous lesson, we discussed that K-means clustering is an algorithm designed to partition a dataset into distinct clusters by minimizing the total variance within each cluster. In this lesson, we will move from theory to practice by performing a dry run of the K-means algorithm on a small synthetic dataset. We will follow Lloyd’s algorithm step-by-step, using Python and the sklearn library to visualize how data points are assigned to clusters and how centroids are iteratively updated.
The K-means algorithm
For a given dataset and value of k, K-means clustering has the following steps:
- Choose k: Select the number of clusters, k, such that 1 ≤ k ≤ n, where n is the number of data points.
- Initialize centroids: Choose k centroids, typically at random, as the initial cluster centers.
- Calculate dissimilarity: Find the dissimilarity score (a distance, e.g., the Euclidean distance) of each data point with respect to each centroid.
- Assign clusters: Based on the dissimilarity score, assign each data point to the cluster whose centroid is the closest (least distant).
- Update centroids: From these new groupings, compute new centroids by taking the mean of all data points assigned to each cluster.
- Repeat: Repeat steps 3 to 5 until the difference between old and new centroids is negligible (convergence).
If the steps above seem unclear, don’t worry. We will illustrate each step with an example.
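Before the dry run, here is a compact NumPy sketch of the whole loop for reference. The function name, the convergence tolerance, and the initialization scheme are illustrative choices of ours, not code from the lesson's widgets:

```python
import numpy as np

def kmeans(points, k, max_iters=100, tol=1e-6, seed=None):
    """Minimal sketch of Lloyd's algorithm; points is an (n, 2) array.
    Empty clusters are not handled, to keep the sketch short."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # Step 5: move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids barely move (convergence)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```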
Dry running the example
Let’s say we have the following dataset:
Step 1: Plotting the data
Run the following code widget to plot the data. Here, x and y contain the x and y-coordinates of our synthetic data points.
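The interactive widget itself isn't reproduced here, and since the dataset's actual values aren't shown above, the coordinates in this matplotlib sketch are illustrative stand-ins:

```python
import matplotlib.pyplot as plt

# illustrative coordinates for the 15 synthetic data points
x = [1, 2, 2, 3, 4, 6, 7, 7, 8, 9, 8, 9, 9, 10, 10]
y = [1, 1, 3, 2, 3, 8, 7, 9, 8, 9, 1, 2, 1, 2, 1]

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Synthetic dataset")
plt.show()
```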
Let’s start with the first step of K-means clustering and decide how many clusters we want if the number isn’t given already. Let the number of clusters be three, which means k = 3.
Step 2: Assigning values to centroids
The second step is to initialize k centroids with random values. Since k is 3, we get three centroids: C1, C2, and C3. Assigning them random values yields:
In the following code, Cx and Cy represent x and y coordinates of the centroids:
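A sketch of that widget's code, reusing the illustrative data points from step 1 together with assumed random centroid values, could look like this:

```python
import matplotlib.pyplot as plt

# same illustrative data points as in step 1
x = [1, 2, 2, 3, 4, 6, 7, 7, 8, 9, 8, 9, 9, 10, 10]
y = [1, 1, 3, 2, 3, 8, 7, 9, 8, 9, 1, 2, 1, 2, 1]

# assumed random initial centroids C1, C2, C3
Cx = [2, 6, 10]
Cy = [2, 9, 1]

plt.scatter(x, y, label="data points")
plt.scatter(Cx, Cy, marker="x", s=100, c="red", label="initial centroids")
plt.legend()
plt.show()
```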
Step 3: Calculating the dissimilarity score
The third step is to find the dissimilarity score of each data point (15 total) with each centroid. We’ll be using the Euclidean distance as the dissimilarity score. The function euclidean_distances takes two arrays, where each array is an array of points. Let’s see how to calculate the dissimilarity score using sklearn:
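The widget's source isn't shown here; the following sketch uses the same illustrative values as above and is laid out so that the line references in the explanation below match up:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
x = [1, 2, 2, 3, 4, 6, 7, 7, 8, 9, 8, 9, 9, 10, 10]  # data point x-coordinates (illustrative)
y = [1, 1, 3, 2, 3, 8, 7, 9, 8, 9, 1, 2, 1, 2, 1]    # data point y-coordinates

Cx = [2, 6, 10]  # initial centroid x-coordinates (assumed)
Cy = [2, 9, 1]   # initial centroid y-coordinates

# each row must be one [x, y] point
points = np.array(list(zip(x, y)))
centroids = np.array(list(zip(Cx, Cy)))

print(euclidean_distances(points, centroids))
```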
Here is the explanation for the code above:
- Lines 3–7: We define two lists, x and y, representing the x and y-coordinates of the data points. Similarly, Cx and Cy represent the x and y-coordinates of the initial centroids.
- Lines 10–11: We convert the data points and the centroid coordinates into arrays of 2-D points because euclidean_distances expects its inputs to be matrices, where each row represents one point. The function computes pairwise distances between every point in the first matrix and every point in the second. By restructuring the data into lists of [x, y] pairs, we ensure the function receives properly formatted inputs and can correctly compute the distances.
- Line 13: We use the euclidean_distances function to calculate the Euclidean distances between the data points and the centroids and print the resulting array.
The code output is a 2D array where each row represents a data point and each column represents a centroid. The value at position (i, j) in the array is the Euclidean distance between the i-th data point and the j-th centroid.
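Continuing the sketch above, a quick way to confirm that structure:

```python
dist = euclidean_distances(points, centroids)
print(dist.shape)  # (15, 3): 15 data points by 3 centroids
print(dist[0, 2])  # distance between the 1st data point and the 3rd centroid
```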