Trusted answers to developer questions

Shanzay Gauhar

Grokking Modern System Design Interview for Engineers & Managers

Ace your System Design Interview and take your career to the next level. Learn to handle the design of applications like Netflix, Quora, Facebook, Uber, and many more in a 45-min interview. Learn the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process.

** K-means** is an unsupervised Machine Learning algorithm. In order to determine the optimal numbers of clusters (

`k`

), the `Elbow method`

is most commonly used.The two methods to determine optimal clusters include `distortion`

and `inertia`

, the first of which uses Euclidean distance to calculate the average of the squared distances between the cluster centers of the respective clusters, while the latter is the sum of the squared distance of sample cluster points from their closest centroid. Now, there are two ways to implement the Elbow method using either distortions or inertia. We will discuss both.

The simple yet practical demonstration of the Elbow method is given below. First, we will import the required libraries and visualize the data.

import numpy as npfrom sklearn.cluster import KMeansfrom sklearn import metricsimport matplotlib.pyplot as pltfrom scipy.spatial.distance import cdistxaxis = np.array([1, 4, 5, 6, 1, 4, 5, 6, 1, 2, 4, 6, 5, 5, 5, 1, 3, 4, 2, 6, 4, 5, 5])yaxis = np.array([5, 4, 5, 6, 5, 3, 5, 4, 4, 4, 8, 1, 3, 2, 1, 5, 1, 8, 7, 6, 9, 1, 10])X = np.array(list(zip(xaxis, yaxis))).reshape(len(xaxis), 2)plt.plot()plt.title('Data Visualization')plt.xlabel('x')plt.ylabel('y')plt.xlim([0, 10])plt.ylim([0, 15])plt.scatter(xaxis, yaxis)plt.show()

Looking at the scatter plot, it is safe to say that it can be quite difficult to determine the optimal number of clusters just by visualizing the data. Hence, the Elbow method would help us to make a better decision about the number of clusters.

Two implementations for the elbow method are displayed below, in different tabs, for the above data.

# K-means using Euclidean distance and distortion conceptdistortions = []#total number of clustersK = range(1,10)#for every cluster value we calculate distortionfor k in K:kmeanModel = KMeans(n_clusters=k).fit(X)kmeanModel.fit(X)distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])# Now that we have all the distortions we will plot the graphplt.plot(K, distortions, 'bx-')plt.xlabel('k-clusters')plt.ylabel('Distortion')plt.xlim([0, 10])# plt.ylim([0, 3]) can set the y limit accordingly if and when needed otherwise it sets default value depending on graphplt.title('Elbow method with Euclidean Distance')plt.show()

As is evident from the graph, as `k`

`k = 4`

is optimal where `k`

is the number of clusters. We will always choose the value of `k`

at the elbow.

RELATED TAGS

python

communitycreator

CONTRIBUTOR

Shanzay Gauhar

Grokking Modern System Design Interview for Engineers & Managers

Ace your System Design Interview and take your career to the next level. Learn to handle the design of applications like Netflix, Quora, Facebook, Uber, and many more in a 45-min interview. Learn the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process.

Keep Exploring

Related Courses