Imagine an algorithm that is very fast and efficient in categorizing data points into sub-groups with little information. The whole reason for classification tasks in unsupervised machine learning is to correctly distinguish which data points belong to which group, and K-Means Clustering Algorithm does just that.
Clustering is a method used in exploratory data analysis to unravel similarities and insights about the structure of the data. It can be said to be the method for pinpointing where data points can be categorized into smaller groups in the data. The smaller groups, known as clusters, have data points that are alike in specific respects, while data points that belong to different clusters are dissimilar.
Clustering is considered an unsupervised learning method because the target classes aren’t provided in order to compare the output of the clusters against, to evaluate its performance. This algorithm can be used to monitor students’ academic performance, for example. Based on the students’ score, they would be grouped into different clusters.
Clustering can also be used in market segmentation, where the clusters are created based on the spending or behavioral attributes of customers. Among other uses, clustering can be using for drug activity predictions, spam filtering, identifying fake news, etc.
The k-means algorithm is an iterative procedure aimed at segregating datasets into K predefined, separate, and non-overlapping subsets, known as clusters, ensuring that each data point exclusively falls within a single group.
#K Means cluster creationfrom sklearn import KMeans#creating an instance of a K Means model with 3 clustersKmeans=KMeans(n_Clusters=2)kmeans.fit(x)Kmeans.cluster_centers_
Choose a number of clusters ‘K’.
Randomly assign each data point to a cluster until the clusters stop adjusting.
Take the mean vector of the points and find the cluster centroid,
Assign each data point to the cluster for which the centroid is the closest.
The approach that K-means follows to solve the problem is called Expectation-Maximization. The assignment of the data points to the closest cluster is the Expectation step.
The Maximization step is to compute the centroid of each cluster.
In the illustration above: • The box labelled 1 is the plane data – the observations.
• Each data point is then randomly assigned to a cluster, as seen in box 2.
• The box on the top-right, labelled 3, is the stage where the cluster centroids are computed. The centroids (the larger circles) are not distinctly placed because the initial cluster assignment for the data points were chosen at random.
• In box 4, each observation is assigned to the nearest centroid.
• The centroid is then readjusted and step 3 is repeated.
• These steps are done over again until there are no new centroids, as seen in the last box.
When choosing a K-value, it is important to remember that there is no best way to choose a K-value. Choosing this value comes with experience and domain knowledge. Regardless, there are a number of ways this can be done, one of these is the ‘Elbow method.’
The Elbow method provides us with insights into the appropriate number of clusters, denoted as “k”, by considering the sum of squared distances (SSE) between data points and the centroids to which they belong. We select the value of “k” where SSE exhibits a sharp decline, resulting in the distinctive visual pattern known as the elbow effect on the graph.
K-Means clustering algorithm is a good algorithm mostly used by machine learning engineers when dealing with clustering problems. It does a decent job in blindly recognizing patterns and effectively classifying data points.