Demystifying sklearn.cluster.KMeans

Getting started

sklearn.cluster.KMeans is scikit-learn's implementation of K-means, a really cool algorithm that groups data points according to how similar or close to each other they are. But how does it work?

Sklearn? cluster? K-means?

Scikit-learn (sklearn) is an open-source Python library that is mainly used in data analysis and machine learning. In machine learning, there are (arguably) two main categories:

Supervised machine learning, which has both the predictor variables (also called features or independent variables) and the target (also called the label or dependent variable), and is used to make predictions.
Unsupervised machine learning, which has only independent variables, and is used for pattern recognition.

The example below highlights the difference:

As seen in the above image, the supervised learning method uses a classification algorithm to predict whether an animal is a duck or not. The data has labels/targets (“Duck” vs “Not Duck”) which the algorithm learns from in order to make predictions. On the other hand, the unsupervised learning algorithm uses clustering to group the animals into categories or clusters based on their similarities. The three birds are in one cluster, the rabbit is in another, and the hedgehog is in yet another cluster.

K-means clustering is a clustering algorithm that divides data points into groups or clusters based on how similar or close to each other they are. Each cluster has a centroid, which is a real or imaginary data point that is at the center of the cluster. The aim of k-means clustering is to minimize the total squared distance between the points in each cluster and that cluster's centroid.
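To make "centroid" and "distance to the centroid" concrete, here is a minimal sketch in plain Python. The toy points and the helper names (centroid, within_cluster_ss) are made up for illustration and are not part of scikit-learn. The quantity computed is the sum of squared Euclidean distances, which is how the k-means objective is usually stated.

```python
# Hypothetical toy data: one cluster of three 2-D points.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_ss(points, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, center))
        for point in points
    )


center = centroid(cluster)
print(center)                              # (3.0, 4.0)
print(within_cluster_ss(cluster, center))  # 16.0
```

Here the centroid happens to coincide with a real data point, but that is not required. Moving the center anywhere else, for example to (0.0, 0.0), makes the sum of squared distances larger.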

Putting it together: sklearn.cluster.KMeans is the class that implements the K-means algorithm, and it lives in the cluster module of the Sklearn library.

How does K-means clustering work?

Choose a value for the number of clusters you wish to have. For example, k=3 will set up 3 clusters.
Randomly select k data points to act as initial centroids for those clusters.
Measure the distance between each point and each centroid, and assign each point to the cluster whose centroid it is closest to.
Calculate the mean of the data points in each cluster and set it as that cluster's new centroid (cluster center).
Repeat the assignment and update steps until the centroids and cluster assignments stop changing or until you reach the maximum number of iterations.
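The steps above can be sketched in a few lines of plain Python. This is an illustrative from-scratch version, not scikit-learn's actual implementation. The function name kmeans, its parameters (max_iter, seed) and the toy data are assumptions made for the example, and keeping the old centroid when a cluster ends up empty is just one simple choice among several.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of equal-length numeric tuples (a sketch)."""
    rng = random.Random(seed)
    # Randomly pick k data points to act as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign each point to the cluster with the nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group gets label 0 and which gets label 1 depends on the random starting centroids, so only the grouping is meaningful, not the label numbers. Note also that scikit-learn's KMeans uses k-means++ initialization by default instead of purely random starting points.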

Code implementation