What is the difference between k-means and k-medoids algorithms?

K-means and k-medoids are two algorithms that are frequently used for clustering in the field of machine learning. Before we discuss their differences in detail, let's first understand the basics of clustering and these algorithms.

An overview of unsupervised clustering

Suppose we have been given unlabelled data and wish to make it meaningful by grouping it into sets of points with similar properties, i.e., clusters. This process is known as clustering. K-means and k-medoids are examples of unsupervised clustering algorithms. Unsupervised clustering refers to the grouping of comparable or similar entities without the need for human-provided labels.

K-means algorithm

K-means is a simple-to-implement clustering algorithm. It begins by selecting k initial centroids, where k is an input to the algorithm specifying the number of clusters required. The centroids serve as the centers of the clusters. We first assign each data point to the nearest centroid. Next, we calculate the mean of each cluster, and these means serve as the new centroids. These steps are repeated until the positions of the centroids no longer change.

The goal of k-means is to minimize the sum of the squared distances between each data point and its centroid. Let $Z_i$ represent a data point and $C_j$ the centroid of its cluster; the objective then becomes:

$$\min \sum_{j=1}^{k} \sum_{Z_i \in \text{cluster } j} \lVert Z_i - C_j \rVert^2$$
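To make these steps concrete, here is a minimal NumPy sketch of the loop described above (a simplified version of Lloyd's algorithm; the function name and convergence check are our own illustration, not a library API):

import numpy as np

def kmeans_sketch(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k distinct data points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        # (for brevity, this sketch assumes no cluster ends up empty)
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids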

K-medoids algorithm

K-medoids is another clustering method, but unlike k-means, rather than minimizing the sum of squared distances, it minimizes the sum of pairwise dissimilarities (the distances between individual data points within a cluster). This is useful because k-medoids aims to form clusters whose objects are more similar to each other and dissimilar to objects in other clusters. Instead of centroids, this approach uses medoids. Medoids are points in the dataset whose sum of distances to the other points in their cluster is minimal.
Let $Z_i$ represent a data point and $M_j$ the medoid of its cluster; the objective then becomes:

$$\min \sum_{j=1}^{k} \sum_{Z_i \in \text{cluster } j} d(Z_i, M_j)$$

where $d$ is the chosen dissimilarity measure.
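As a quick illustration of what a medoid is, the following sketch (our own helper, not a library function) selects the medoid of a single cluster as the point whose total distance to the other points is smallest, here using Manhattan distance:

import numpy as np

def medoid_of(cluster):
    # pairwise Manhattan distances between all points in the cluster
    dists = np.abs(cluster[:, None, :] - cluster[None, :, :]).sum(axis=2)
    # the medoid minimizes the total distance to the other points
    return cluster[dists.sum(axis=1).argmin()]

cluster = np.array([[1, 1], [2, 1], [1, 2], [5, 5]])
print(medoid_of(cluster))  # prints an existing point from the cluster: [1 1]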

Coding examples

Let's take a look at how their implementations vary. We will use Python's scikit-learn library, one of the most widely used libraries for machine learning.

K-means code

Here's how we can implement k-means in Python using scikit-learn:

# required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# unorganized data
data = np.array([[1, 1], [2, 1], [1, 2], [10, 10], [11, 10], [10, 11], [2, 10], [9, 3], [8, 4], [9, 4]])
# number of clusters
k = 3
# k-means clustering
kmeans = KMeans(n_clusters=k)
kmeans.fit(data)
# getting labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# plotting data
plt.scatter(data[:, 0], data[:, 1], c=labels)
# plotting centroids
plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', c='black')
# setting plot attributes
plt.title('K-Means Algorithm')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
# displaying the plot
plt.show()


Code explanation

Lines 2–4: We import the necessary packages to run the code.

Line 6: We set up the initial unorganized data and store it in the data array.

Line 8: The variable k is set to 3, indicating the desired number of clusters.

Line 10: An instance of the KMeans class is created, where we specify the number of clusters with the argument n_clusters=k.

Line 11: We call the fit method on the kmeans object, which performs k-means clustering on the data array.

Lines 13–14: We retrieve the cluster labels and centroid coordinates computed by the algorithm.

Line 16: We use the plt.scatter function to plot the data points, colored by cluster label.

Line 18: We then plot the centroids. The marker='o' argument depicts centroids as circles, and c='black' sets their color to black.

Lines 20–24: We set certain attributes of the plot, including the title, xlabel, and ylabel, and finally display the plot using plt.show().
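Since k-means minimizes the SSE, scikit-learn exposes the final value of this objective through the fitted model's inertia_ attribute, and predict assigns new points to the learned clusters (the sample points below are our own):

# sum of squared distances of points to their closest centroid (the SSE)
print(kmeans.inertia_)
# assign previously unseen points to the nearest learned centroid
print(kmeans.predict(np.array([[0, 0], [12, 11]])))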

K-medoids code

We can reuse the above code simply by replacing the k-means algorithm with a k-medoids clustering algorithm from the sklearn_extra module (installed via the scikit-learn-extra package).

# required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids
# unorganized data
data = np.array([[1, 1], [2, 1], [1, 2], [10, 10], [11, 10], [10, 11], [2, 10], [9, 3], [8, 4], [9, 4]])
# number of clusters
k = 3
# k-medoids clustering
kmedoids = KMedoids(n_clusters=k)
kmedoids.fit(data)
# getting labels and medoids
labels = kmedoids.labels_
medoids = kmedoids.medoid_indices_
# plotting data
plt.scatter(data[:, 0], data[:, 1], c=labels)
# plotting medoids
plt.scatter(data[medoids][:, 0], data[medoids][:, 1], marker='o', c='black')
# setting attributes
plt.title('K-Medoids Algorithm')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
# displaying the plot
plt.show()

Here, instead of KMeans, an instance of the KMedoids class is created, and we fit this clustering technique over the data array. Similarly, instead of cluster_centers_, we make use of medoid_indices_, which gives the indices of the medoid points within the data array, so data[medoids] retrieves the medoids themselves.
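Because medoids are actual rows of the dataset, we can confirm this directly. KMedoids also accepts a metric argument if we want a dissimilarity other than the default Euclidean distance, for example Manhattan distance:

# the medoids are existing rows of the data array
print(data[kmedoids.medoid_indices_])
# k-medoids with Manhattan (L1) distance instead of the default
kmedoids_l1 = KMedoids(n_clusters=k, metric='manhattan').fit(data)
print(data[kmedoids_l1.medoid_indices_])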

We can see that k-means creates new points (the means) to serve as centroids, while k-medoids selects existing data points to serve as medoids.

Summarized differences between k-means and k-medoids

K-means | K-medoids
--- | ---
Takes the mean of data points to create new points called centroids. | Uses existing points from the data to serve as medoids.
Centroids are new points not previously found in the data. | Medoids are existing points from the data.
Can only be used for numerical data. | Can be used for both numerical and categorical data.
Focuses on reducing the sum of squared distances, also known as the sum of squared error (SSE). | Focuses on reducing the pairwise dissimilarities within clusters.
Typically uses Euclidean distance. | Can use any dissimilarity measure, commonly Manhattan distance.
Sensitive to outliers within the data. | Outlier-resistant, reducing the effect of outliers.
Does not cater to noise in the data. | Effectively reduces the effect of noise in the data.
Less costly to implement. | More costly to implement.
Faster. | Comparatively slower.

Computational complexity

K-means is generally faster than k-medoids since computing a mean is cheap: each k-means iteration costs roughly O(nkd) for n points in d dimensions. The classic PAM formulation of k-medoids evaluates swaps between medoids and non-medoid points, costing O(k(n−k)²) per iteration, which is why it scales poorly to large datasets.
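As a rough illustration (not a rigorous benchmark), we can time both implementations on a larger random dataset; exact numbers will vary by machine and library version:

import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

big_data = np.random.default_rng(0).random((2000, 2))

start = time.perf_counter()
KMeans(n_clusters=3, n_init=10).fit(big_data)
print(f"k-means:   {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
KMedoids(n_clusters=3).fit(big_data)
print(f"k-medoids: {time.perf_counter() - start:.3f}s")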

Which one should we use?

Let's take a look at a few scenarios:

  1. If noise is negatively affecting our analysis, k-medoids is the way to go.

  2. Since k-medoids is an outlier-resistant algorithm, it is also more useful when the clusters have irregular shapes or different sizes (the sketch below demonstrates the effect of a single outlier).

  3. If our dataset consists only of numerical data and outliers are not a concern, we can simply apply k-means.

Ultimately, the algorithm we choose depends on our data and our specific requirements.
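To see the outlier effect concretely, here is a small sketch (our own toy example) in which a single extreme point drags the k-means centroid away from the cluster core, while the medoid stays put:

import numpy as np
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

# a tight cluster plus one extreme outlier
points = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [50, 50]])

km = KMeans(n_clusters=1, n_init=10).fit(points)
kmed = KMedoids(n_clusters=1).fit(points)

print(km.cluster_centers_[0])           # the mean, pulled toward the outlier: [11.2 11.2]
print(points[kmed.medoid_indices_][0])  # an actual point near the cluster core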

Which algorithm should be applied in this scenario?

Q: We want every point of the cluster to be from within the existing dataset, and noise is a concern.

A) K-means

B) K-medoids