K-Means Walk-Through Example
Understand how K-means clustering partitions data by iteratively assigning points to clusters and updating centroids. Learn step by step with a Python example using synthetic data and sklearn. Gain hands-on experience visualizing clusters and observing how the algorithm converges to minimize within-cluster variance.
In the previous lesson, we discussed that K-means clustering is an algorithm designed to partition a dataset into distinct clusters by minimizing the total variance within each cluster. In this lesson, we will move from theory to practice by performing a dry run of the K-means algorithm on a small synthetic dataset. We will follow Lloyd’s algorithm step-by-step, using Python and the sklearn library to visualize how data points are assigned to clusters and how centroids are iteratively updated.
The K-means algorithm
For a given dataset and value of K, K-means clustering has the following steps:
- Choose K: Select the number of clusters, K, such that K ≤ n, where n is the number of data points.
- Initialize centroids: Choose K centroids, typically at random, as initial cluster centers.
- Calculate dissimilarity: Compute the dissimilarity score (a distance measure, e.g., Euclidean) of each data point with respect to each centroid.
- Assign clusters: Based on the dissimilarity score, assign each data point to the cluster whose centroid is the closest (least distant).
- Update centroids: From these new groupings, compute new centroids by taking the mean of all data points assigned to each cluster.
- Repeat: Repeat steps 3 to 5 until the difference between the old and new centroids is negligible (convergence).
If the steps above seem unclear, don’t worry. We will illustrate each step with an example.
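The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of Lloyd's algorithm, not the lesson's widget code; the function name, tolerance, and seeded initialization are our own choices:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal Lloyd's algorithm: assign points to nearest centroid, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3-4: Euclidean distance of every point to every centroid,
        # then assign each point to the closest one.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its old centroid for this iteration).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels
```

Note the empty-cluster guard: if no points land in a cluster, its centroid simply stays put for that iteration.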
Dry running the example
Let’s say we have the following dataset of 15 two-dimensional points:
Step 1: Plotting the data
Run the following code widget to plot the data. Here, `x` and `y` contain the x and y-coordinates of our synthetic data.
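The code widget isn't reproduced here, but a minimal plotting sketch using the 15 points from the dissimilarity table below would look something like this (the Agg backend and file output are our additions so the script runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to view interactively
import matplotlib.pyplot as plt

# x and y coordinates of the 15 synthetic data points.
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Synthetic dataset")
plt.savefig("data.png")
```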
Let’s start with the first step of K-means clustering and decide how many clusters we want, if the number isn’t given already. Let the number of clusters be three, which means K = 3.
Step 2: Assigning values to centroids
The second step is to initialize K centroids with random values. Since K is 3, we’ll get three centroids: C1, C2, and C3. Assigning them random values yields C1 = (1, 1), C2 = (7, 2), and C3 = (5, 6.5).
In the following code, `Cx` and `Cy` represent the x and y coordinates of the centroids:
Step 3: Calculating the dissimilarity score
The third step is to find the dissimilarity score of each data point (15 in total) with respect to each centroid. We’ll use the Euclidean distance as the dissimilarity score. The `euclidean_distances` function takes two arrays, where each array is an array of points. Let’s see how to calculate the dissimilarity score using sklearn:
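Since the widget isn't rendered here, the following is a sketch of this step, with variable names matching the explanation that follows (line numbers may differ from the original widget):

```python
from sklearn.metrics.pairwise import euclidean_distances

# x and y coordinates of the 15 data points.
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# x and y coordinates of the three initial centroids.
Cx = [1, 7, 5]
Cy = [1, 2, 6.5]

# Restructure into matrices of [x, y] points, as euclidean_distances expects.
points = [[px, py] for px, py in zip(x, y)]
centroids = [[cx, cy] for cx, cy in zip(Cx, Cy)]

# Each row is a data point, each column its distance to one centroid.
distances = euclidean_distances(points, centroids)
print(distances)
```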
Here is the explanation for the code above:
- Lines 3–7: We define two lists `x` and `y` representing the x and y-coordinates of the data points. Similarly, `Cx` and `Cy` represent the x and y-coordinates of the initial centroids.
- Lines 10–11: We convert the data points and the centroid coordinates into arrays of 2-D points because `euclidean_distances` expects its inputs to be matrices, where each row represents one point. The function computes pairwise distances between every point in the first matrix and every point in the second. By restructuring the data into lists of `[x, y]` pairs, we ensure the function receives properly formatted inputs and can correctly compute the distances.
- Line 13: We use the `euclidean_distances` function to calculate the Euclidean distances between the data points and the centroids and print the resulting array.
The code output will be a 2D array where each row represents a data point and each column represents a centroid. The value at position (i, j) in the array is the Euclidean distance between the i-th data point and the j-th centroid.
The dissimilarity scores were calculated using sklearn and are also given below:
Dissimilarity Scores
| Data Points | Centroid_1 (1, 1) | Centroid_2 (7, 2) | Centroid_3 (5, 6.5) |
| --- | --- | --- | --- |
| 1, 2 | 1 | 6 | 6.020797289 |
| 2, 1 | 1 | 5.099019514 | 6.264982043 |
| 2, 1.5 | 1.118033989 | 5.024937811 | 5.830951895 |
| 2.5, 3.5 | 2.915475947 | 4.74341649 | 3.905124838 |
| 3, 4 | 3.605551275 | 4.472135955 | 3.201562119 |
| 4, 3.5 | 3.905124838 | 3.354101966 | 3.16227766 |
| 4, 7.5 | 7.158910532 | 6.264982043 | 1.414213562 |
| 5, 6 | 6.403124237 | 4.472135955 | 0.5 |
| 5, 7 | 7.211102551 | 5.385164807 | 0.5 |
| 5.5, 2 | 4.609772229 | 1.5 | 4.527692569 |
| 6, 1.5 | 5.024937811 | 1.118033989 | 5.099019514 |
| 6, 3 | 5.385164807 | 1.414213562 | 3.640054945 |
| 6, 5.5 | 6.726812024 | 3.640054945 | 1.414213562 |
| 6.5, 5 | 6.800735254 | 3.041381265 | 2.121320344 |
| 7, 2.5 | 6.184658438 | 0.5 | 4.472135955 |
Now, don’t panic seeing the giant table. All this table tells us is the distance of each data point from each centroid. For example, let’s look at the first data point (1, 2) (let’s call it P1) with respect to the first centroid C1 = (1, 1):

distance(P1, C1) = √((1 − 1)² + (2 − 1)²) = √1 = 1

This matches the first entry in the table above.
Step 4: Assigning the clusters
After calculating the distances of each point from the centroids, the fourth step is to assign each point to the relevant cluster. This is done by selecting the centroid that is least distant from the data point and assigning the point to that centroid’s cluster.
This code creates a pandas `DataFrame` that stores each point’s distance to every centroid along with its assigned cluster:
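As the widget isn't shown here, the following sketch reproduces the idea (the short column names `C1`, `C2`, `C3` are illustrative, and line numbers differ from the original widget):

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
points = list(zip(x, y))
centroids = [(1, 1), (7, 2), (5, 6.5)]

# Distance of every point to every centroid.
distances = euclidean_distances(points, centroids)

# Rows = data points (labeled "x,y"), columns = centroids.
labels = [f"{px},{py}" for px, py in points]
df = pd.DataFrame(distances, index=labels, columns=["C1", "C2", "C3"])
df.index.name = "Data Points"

# Each point joins the cluster of its nearest centroid.
df["Cluster"] = df[["C1", "C2", "C3"]].idxmin(axis=1)
print(df)
```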
Here is the explanation for the code above:
- Line 15: Computes the Euclidean distance between every data point and every centroid.
- Line 18: Creates string labels such as “1,2” and “2,1” to use as DataFrame row names.
- Lines 19–21: Stores the distance matrix in a clean table format:
- Rows = data points
- Columns = distance to each centroid
- Line 22: Labels the index column for readability.
- Line 25: Assigns each point to the nearest centroid. `idxmin(axis=1)` finds the column with the smallest distance for each row.
- Line 26: The `DataFrame` is printed to the console using the `print` statement.
The `df` DataFrame can also be visualized in tabular form, as seen below:
Cluster Assignment
| Data Points | Centroid_1 (1, 1) | Centroid_2 (7, 2) | Centroid_3 (5, 6.5) | Cluster |
| --- | --- | --- | --- | --- |
| 1, 2 | 1 | 6 | 6.020797289 | C1 |
| 2, 1 | 1 | 5.099019514 | 6.264982043 | C1 |
| 2, 1.5 | 1.118033989 | 5.024937811 | 5.830951895 | C1 |
| 2.5, 3.5 | 2.915475947 | 4.74341649 | 3.905124838 | C1 |
| 3, 4 | 3.605551275 | 4.472135955 | 3.201562119 | C3 |
| 4, 3.5 | 3.905124838 | 3.354101966 | 3.16227766 | C3 |
| 4, 7.5 | 7.158910532 | 6.264982043 | 1.414213562 | C3 |
| 5, 6 | 6.403124237 | 4.472135955 | 0.5 | C3 |
| 5, 7 | 7.211102551 | 5.385164807 | 0.5 | C3 |
| 5.5, 2 | 4.609772229 | 1.5 | 4.527692569 | C2 |
| 6, 1.5 | 5.024937811 | 1.118033989 | 5.099019514 | C2 |
| 6, 3 | 5.385164807 | 1.414213562 | 3.640054945 | C2 |
| 6, 5.5 | 6.726812024 | 3.640054945 | 1.414213562 | C3 |
| 6.5, 5 | 6.800735254 | 3.041381265 | 2.121320344 | C3 |
| 7, 2.5 | 6.184658438 | 0.5 | 4.472135955 | C2 |
The table above shows us which data point got assigned to which cluster. For example, the data point (5, 6) got assigned to the C3 cluster. This means that (5, 6) was closer to the centroid of C3 than to either of the other centroids. Let’s see this visually:
Let’s code this step in Python now:
Here is the explanation for the code above:
- Lines 33–35: Encodes the cluster labels (`'C1(1,1)'`, `'C2(7,2)'`, `'C3(5,6.5)'`) as the integers 0, 1, and 2, and converts the encoded column to integers for easy use in plotting (`plt.scatter(c=...)`). This ensures that each cluster can be mapped consistently to a color.
- Lines 38–44: Defines a function to assign a color to each cluster based on the encoded integer value.
- Line 46: Applies the color mapping to all data points to create a list of colors for plotting.
- Lines 49–51: Plots the data points with colors corresponding to their assigned clusters and overlays the centroids as squares using the same cluster colors. The `alpha=0.4` parameter makes the points semi-transparent, while `s=75` ensures that the centroid markers are clearly visible.
Step 5: Recomputing the centroids
Alright, now we’re getting somewhere. The image above is somewhat clustered. Here comes our fifth step, which will recompute the centroids of each cluster.
The C1 cluster consists of four data points: (1, 2), (2, 1), (2, 1.5), and (2.5, 3.5); that is, |C1| = 4. The new centroid of C1 can be calculated by averaging these points:

C1 = ((1 + 2 + 2 + 2.5) / 4, (2 + 1 + 1.5 + 3.5) / 4) = (1.875, 2)
Similarly, the centroid of the C2 cluster, whose members are (5.5, 2), (6, 1.5), (6, 3), and (7, 2.5), can be calculated as follows:

C2 = ((5.5 + 6 + 6 + 7) / 4, (2 + 1.5 + 3 + 2.5) / 4) = (6.125, 2.25)
Finally, the centroid of the C3 cluster, whose members are (3, 4), (4, 3.5), (4, 7.5), (5, 6), (5, 7), (6, 5.5), and (6.5, 5), can be calculated as follows:

C3 = ((3 + 4 + 4 + 5 + 5 + 6 + 6.5) / 7, (4 + 3.5 + 7.5 + 6 + 7 + 5.5 + 5) / 7) ≈ (4.79, 5.5)
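These three means can be verified with a quick NumPy check, using the cluster memberships from the table above:

```python
import numpy as np

# Cluster memberships after the first assignment step (from the table above).
c1_points = np.array([(1, 2), (2, 1), (2, 1.5), (2.5, 3.5)])
c2_points = np.array([(5.5, 2), (6, 1.5), (6, 3), (7, 2.5)])
c3_points = np.array([(3, 4), (4, 3.5), (4, 7.5), (5, 6), (5, 7), (6, 5.5), (6.5, 5)])

# Each updated centroid is the mean of its cluster's member points.
for name, pts in [("C1", c1_points), ("C2", c2_points), ("C3", c3_points)]:
    print(name, pts.mean(axis=0))
```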
Now, let’s see how our centroids have moved.
The above illustration shows the new position of the updated centroids. This looks promising as these updated centroid locations truly represent the center of their clusters.
Step 6: Repeating the steps
Lastly, moving on to the sixth step. This step says that if the old centroids and the updated centroids differ, we must perform steps 3–5 again. For simplicity’s sake, we’ll fast-forward this process via Python code.
Let’s put this all together. In the following coding widget, `update_clusters()` is responsible for assigning clusters to each data point, and `update_centroids()` will update the centroids for each cluster by taking the mean of the data points within that cluster. Furthermore, we can control the number of iterations performed by K-means by updating `iterations` in line 67.
Following is the explanation for the code above:
- Lines 13–39: The first function, `update_clusters`, takes the dataset and centroid positions as inputs, computes the dissimilarity scores (using the Euclidean distance measure) between each data point and each centroid, assigns each data point to the cluster whose centroid is the closest, and encodes the clusters with colors.
- Lines 42–55: The second function, `update_centroids`, is responsible for updating the positions of the centroids.
  - Line 42: Defines the function `update_centroids`, which takes a DataFrame `df` as input. This DataFrame contains the data points and their assigned cluster labels (encoded as integers in `Clusters_encoded`).
  - Line 43: Initializes a NumPy array `means` with zeros to store the updated centroid positions. Its shape is `[number of clusters, 2]` because each centroid has an x and a y coordinate; `len(df['Clusters_encoded'].unique())` gives the number of unique clusters.
  - Line 44: Loops over each cluster ID to compute the mean position of the points assigned to that cluster.
  - Line 45: Initializes a counter `count` to track the number of data points belonging to the current cluster.
  - Line 46: Iterates over the index of `df` (tuples representing the x, y coordinates of the data points). `df.index.where(df['Clusters_encoded']==cluster_id)` selects the points that belong to the current cluster.
  - Line 47: Checks that the point is not `NaN`. This is necessary because `where()` fills non-matching entries with `NaN`.
  - Lines 48–49: Adds the x-coordinate (`pt[0]`) and y-coordinate (`pt[1]`) of the point to the running sum for the current cluster centroid.
  - Line 50: Increments the counter for the number of points in the current cluster.
  - Lines 52–53: Divides the summed coordinates by `count` to calculate the mean x and y positions of all points in the cluster. This gives the updated centroid location for the current cluster.
  - Line 55: Returns the array `means` containing the updated centroid coordinates for all clusters.
- Lines 58–64: The third function, `map_color`, maps an integer value to a color so the clusters can be visualized consistently.
- Lines 67–77: Finally, the code runs a loop for the specified number of iterations. It updates the clusters and the centroids and plots the result at each iteration using the `matplotlib` library. The output is a set of subplots showing how the clusters evolve over the iterations.
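The loop-and-counter bookkeeping in `update_centroids` can also be expressed with a pandas `groupby`. The following compact alternative is our own sketch, not the widget's code; it assumes the DataFrame stores coordinates in columns `x` and `y` alongside `Clusters_encoded`:

```python
import numpy as np
import pandas as pd

def update_centroids_compact(df):
    """Mean x, y per cluster; df has columns 'x', 'y', 'Clusters_encoded'."""
    # groupby sorts by cluster id, so row 0 is cluster 0, row 1 is cluster 1, ...
    return df.groupby("Clusters_encoded")[["x", "y"]].mean().to_numpy()

# Example with the dry run's assignment: ids 0, 1, 2 stand for C1, C2, C3.
df = pd.DataFrame({
    "x": [1, 2, 2, 2.5, 5.5, 6, 6, 7, 3, 4, 4, 5, 5, 6, 6.5],
    "y": [2, 1, 1.5, 3.5, 2, 1.5, 3, 2.5, 4, 3.5, 7.5, 6, 7, 5.5, 5],
    "Clusters_encoded": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
})
print(update_centroids_compact(df))
```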
Now, we’ll see how sklearn performs the same job.
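The sklearn widget isn't reproduced here; a minimal version of the same workflow might look like this (the `n_init` and `random_state` values are our own choices, and the plotting portion is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

x = np.array([1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7])
y = np.array([2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5])
X = np.column_stack((x, y))  # 15 points as rows of [x, y]

# Partition into three clusters; fit() runs Lloyd's algorithm internally.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
labels = kmeans.predict(X)

print(labels)
print(kmeans.cluster_centers_)
```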
Following is the explanation for the code above:
- Lines 6–12: The code first defines the dataset by creating arrays `x` and `y` and concatenating them to form the 2D array `X`.
- Line 15: Next, the code uses the `KMeans()` class from the `sklearn.cluster` package to partition the dataset into three clusters.
- Line 16: The `fit()` method of the `KMeans` class is called on the data `X`, which performs the clustering and assigns each data point to one of the three clusters.
- Line 17: The `predict()` method is called on the fitted model to obtain the predicted cluster assignment for each point in the dataset.
- Lines 32–40: To visualize the clustering results, the code defines a color mapping function `map_color()` that maps the cluster assignments to colors. It then plots the actual dataset and the clustered dataset side by side using `plt.subplots()`, drawing both with `scatter()` and the colors produced by `map_color()`. The `cluster_centers_` attribute of the `KMeans` object is used to plot the cluster centers as squares on the clustered plot. Finally, `plt.show()` displays the figure.
Conclusion
This walk-through demonstrated the core mechanism of the K-means algorithm using Lloyd’s iterative approach.
We observed that by iteratively calculating the distance (or dissimilarity), assigning points to the closest centroid, and then recomputing the centroid as the mean of the new cluster members, the algorithm minimizes the total within-cluster variance.
This process enables the initial, randomly selected cluster centers to converge into stable positions that represent the true means of the data groups.