K-means and k-medoids are two algorithms that are frequently used for clustering in the field of machine learning. Before we discuss their differences in detail, let's first understand the basics of clustering and these algorithms.
Suppose we have been given unlabelled data and wish to make it meaningful by grouping it into sets of items with similar properties, i.e., clusters. This process is known as clustering. K-means and k-medoids are examples of unsupervised clustering algorithms. Unsupervised clustering refers to the grouping of comparable or similar entities without the need for human-provided labels.
K-means is a simple-to-implement clustering algorithm that works by initially selecting k centroids, where k serves as the input to the algorithm and is defined as the number of clusters required. The centroids serve as the center of each new cluster. We first assign each data point of the given data to the nearest centroid. Next, we calculate the mean of each cluster, and the means then serve as the new centroids. These steps are repeated until the positions of the centroids no longer change.
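To make these steps concrete, here is a minimal NumPy sketch of that loop. The function name kmeans_sketch and the random initialization are our own illustration, and for brevity it does not handle clusters that become empty:

import numpy as np

def kmeans_sketch(data, k, n_iters=100, seed=0):
    # choose k distinct data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids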
The goal of k-means is to minimize the sum of the squared distances between each data point and its centroid. Let $Z_i$ represent a data point and $C_j$ the centroid of its cluster; the objective then becomes:

$$\text{SSE} = \sum_{j=1}^{k} \sum_{Z_i \in \text{cluster } j} \lVert Z_i - C_j \rVert^2$$
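As a quick illustration, this objective can be computed directly from the labels and centroids produced by the algorithm (the helper name sse is our own):

import numpy as np

def sse(data, labels, centroids):
    # sum of squared Euclidean distances from each point to its assigned centroid
    return sum(np.sum((data[labels == j] - c) ** 2) for j, c in enumerate(centroids))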
K-medoids is another clustering method, but unlike k-means, rather than minimizing the sum of squared distances, it works on minimizing the sum of dissimilarities between each data point and the medoid of its cluster. A medoid, unlike a centroid, is always an actual point from the dataset.
Let $Z_i$ represent a data point and $M_j$ the medoid of its cluster; the objective then becomes:

$$\text{Cost} = \sum_{j=1}^{k} \sum_{Z_i \in \text{cluster } j} d(Z_i, M_j)$$

Here, $d$ is the chosen dissimilarity measure, commonly the Manhattan distance.
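Analogously, here is a small sketch of this objective, assuming Manhattan distance as the dissimilarity (the helper name medoid_cost is our own):

import numpy as np

def medoid_cost(data, labels, medoid_indices):
    # sum of Manhattan distances from each point to its cluster's medoid
    return sum(np.abs(data[labels == j] - data[m]).sum()
               for j, m in enumerate(medoid_indices))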
Let's take a look at how their implementations vary. We will be using Python's scikit-learn library, one of the most widely used libraries for machine learning.
Here's how we can implement k-means in Python using scikit-learn:
# required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# unorganized data
data = np.array([[1, 1], [2, 1], [1, 2], [10, 10], [11, 10], [10, 11], [2, 10], [9, 3], [8, 4], [9, 4]])

# number of clusters
k = 3

# k-means clustering
kmeans = KMeans(n_clusters=k)
kmeans.fit(data)

# getting labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# plotting data
plt.scatter(data[:, 0], data[:, 1], c=labels)

# plotting centroids
plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', c='black')

# setting plot attributes
plt.title('K-Means Algorithm')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')

# displaying the plot
plt.show()
Lines 2–4: We import the necessary packages to run the code.

Line 7: We set up the initial unorganized data and store it in the data array.

Line 10: The variable k is set to 3, indicating the desired number of clusters.

Line 13: An instance of the KMeans class is created, where we specify the number of clusters through the n_clusters=k argument.

Line 14: We call the fit method on the kmeans object, which performs k-means clustering on the data array.

Lines 17–18: We retrieve the labels and centroids computed by the algorithm.

Line 21: We use the plt.scatter function to plot the data points visually.

Line 24: We then plot the centroids. We use the marker='o' argument to depict the centroids as circles and c='black' to set their color to black.

Lines 27–32: We set certain attributes of the plot, including the title, xlabel, and ylabel. We finally display the plot using plt.show().
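As a side note, KMeans picks its initial centroids randomly, so repeated runs may assign cluster labels differently. If reproducible results matter, we can pin the initialization; the values 42 and 10 below are arbitrary illustrative choices:

kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(data)
print(kmeans.inertia_)  # the final SSE that k-means minimized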
We can reuse the above code simply by replacing the k-means algorithm with the k-medoids clustering algorithm from the sklearn_extra module.
# required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids

# unorganized data
data = np.array([[1, 1], [2, 1], [1, 2], [10, 10], [11, 10], [10, 11], [2, 10], [9, 3], [8, 4], [9, 4]])

# number of clusters
k = 3

# k-medoids clustering
kmedoids = KMedoids(n_clusters=k)
kmedoids.fit(data)

# getting labels and medoids
labels = kmedoids.labels_
medoids = kmedoids.medoid_indices_

# plotting data
plt.scatter(data[:, 0], data[:, 1], c=labels)

# plotting medoids
plt.scatter(data[medoids][:, 0], data[medoids][:, 1], marker='o', c='black')

# setting attributes
plt.title('K-Medoids Algorithm')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')

# displaying the plot
plt.show()
Here, instead of KMeans, an instance of the KMedoids class is created, and we fit this clustering technique over the data array. Similarly, instead of cluster_centers_, we make use of medoid_indices_, which returns the indices of the data points chosen as medoids; indexing data with these indices gives us the medoid coordinates.
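One quick way to confirm that the medoids are actual rows of the dataset is to print them; this snippet assumes the kmedoids object from the code above has already been fitted:

print(kmedoids.medoid_indices_)        # row indices of the chosen medoids
print(data[kmedoids.medoid_indices_])  # their coordinates, taken directly from data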
We can see that k-means creates new points to serve as centroids, while k-medoids uses existing data points as medoids.
K-means | K-medoids |
K-means takes the mean of the data points in a cluster to create a new point called a centroid. | K-medoids selects existing points from the data to serve as cluster centers called medoids. |
Centroids are new points that were previously not present in the data. | Medoids are existing points from the data. |
K-means can only be used for numerical data. | K-medoids can be used for both numerical and categorical data. |
K-means focuses on reducing the sum of squared distances, also known as the sum of squared error (SSE). | K-medoids focuses on reducing the sum of dissimilarities between each point and its cluster's medoid. |
K-means uses Euclidean distance. | K-medoids commonly uses Manhattan distance, though any dissimilarity measure can be used. |
K-means is sensitive to outliers within the data. | K-medoids is outlier resistant and can reduce the effect of outliers. |
K-means does not handle noise in the data well. | K-medoids effectively reduces the effect of noise in the data. |
K-means is less costly to implement. | K-medoids is more costly to implement. |
K-means is faster. | K-medoids is comparatively slower. |
K-means is generally faster than the k-medoids algorithm, since computing a cluster mean is much cheaper than searching for the best medoid within each cluster.
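A rough way to see this difference ourselves is to time both algorithms on the same data. Exact numbers will vary by machine and dataset, so treat this purely as a sketch:

import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

# random two-dimensional data, used only for timing purposes
X = np.random.default_rng(0).random((2000, 2))

start = time.perf_counter()
KMeans(n_clusters=3, n_init=10).fit(X)
print('k-means:  ', round(time.perf_counter() - start, 3), 'seconds')

start = time.perf_counter()
KMedoids(n_clusters=3).fit(X)
print('k-medoids:', round(time.perf_counter() - start, 3), 'seconds')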
Let's take a look at a few scenarios:
If noise is negatively affecting our analysis, k-medoids is the way to go.
Since k-medoids is an outlier-resistant algorithm, it is more useful when the clusters have irregular shapes or different sizes.
If our dataset consists only of numerical values, we can simply apply k-means.
Ultimately, the algorithm we choose depends on our data and our specific requirements.
Which algorithm should be applied in this scenario?
We want every point of the cluster to be from within the existing dataset, and noise is a concern.
K-means
K-medoids