**K-means** and **k-medoids** are two algorithms that are frequently used for clustering in the field of machine learning. Before we discuss their differences in detail, let's first understand the basics of clustering and these algorithms.

Suppose we have been given unlabelled data and wish to make it meaningful by grouping it into sets, i.e., clusters, with similar properties. This process is known as **clustering**. K-means and k-medoids are examples of unsupervised clustering algorithms. **Unsupervised clustering** refers to the clustering of comparable or similar entities without the need for human intervention.

K-means is a simple-to-implement clustering algorithm that works by initially selecting *k* centroids, where *k* serves as the *input* to the algorithm and can be defined as the *number of clusters required*. The centroids serve as the center of each new cluster. We first assign each data point of the given data to the nearest centroid. Next, we calculate the mean of each cluster, and the means then serve as the new centroids. This step is repeated until the positions of the centroids no longer change.
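The steps above can be sketched in plain NumPy. This is an illustrative implementation, not the library version; the function name `kmeans` and the sample data are our own:

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Plain k-means sketch following the steps described above."""
    rng = np.random.default_rng(seed)
    # step 1: pick k initial centroids at random from the data
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data = np.array([[1.0, 1.0], [1.5, 2.0], [10.0, 10.0], [10.5, 9.5]])
labels, centroids = kmeans(data, k=2)
```

With two well-separated groups, the loop settles on one centroid per group after a few iterations.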

The goal of k-means is to minimize the sum of the squared distances between each data point and its centroid. Let *Z*_{i} represent a data point and *C*_{j} represent its cluster centroid; the objective then becomes:

$$\min \sum_{j=1}^{k} \; \sum_{Z_i \in \text{cluster } j} \lVert Z_i - C_j \rVert^2$$
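As a quick sanity check, we can evaluate this objective by hand for a single cluster (the three sample points below are our own):

```python
import numpy as np

# three points assigned to one cluster
Z = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
# the centroid C_j is the mean of the cluster
C = Z.mean(axis=0)  # (4/3, 4/3)
# sum of squared Euclidean distances from each Z_i to C
sse = np.sum(np.linalg.norm(Z - C, axis=1) ** 2)  # 2/9 + 5/9 + 5/9 = 4/3
```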

K-medoids is another clustering method, but unlike k-means, rather than minimizing the sum of squared distances, k-medoids minimizes the sum of distances between each data point and its cluster's medoid. **Medoids** are points in the dataset whose sum of distances to the other points in the cluster is minimal.
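A medoid can be found directly from this definition. Here is a small sketch using Manhattan distances; the helper `medoid` and the sample cluster are our own:

```python
import numpy as np

def medoid(points):
    # pairwise Manhattan distances between all points in the cluster
    dists = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)
    # the medoid is the point with the smallest total distance to the rest
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [10, 0]])
m = medoid(cluster)  # [2, 0] has the smallest total distance
```

Note that the far-away point `[10, 0]` shifts the cluster's mean but not its medoid, which is the property the comparison below builds on.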

Let *Z*_{i} represent a data point and *M*_{j} represent its cluster medoid; the objective then becomes:

$$\min \sum_{j=1}^{k} \; \sum_{Z_i \in \text{cluster } j} \lvert Z_i - M_j \rvert$$

Let's take a look at how their implementations differ. We will be using Python's scikit-learn library, one of the most commonly used libraries for machine learning.

Here's how we can implement k-means in Python using scikit-learn:

```python
# required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# unorganized data
data = np.array([[1, 1], [2, 1], [1, 2], [10, 10], [11, 10], [10, 11], [2, 10], [9, 3], [8, 4], [9, 4]])

# number of clusters
k = 3

# k-means clustering
kmeans = KMeans(n_clusters=k)
kmeans.fit(data)

# getting labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# plotting data
plt.scatter(data[:, 0], data[:, 1], c=labels)

# plotting centroids
plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', c='black')

# setting plot attributes
plt.title('K-Means Algorithm')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')

# displaying the plot
plt.show()
```


**Lines 2–4:** We import the necessary packages to run the code.

**Line 7:** We set up the initial unorganized data and store it in the `data` array.

**Line 10:** The variable `k` is set to `3`, indicating the desired number of clusters.

**Line 13:** An instance of the `KMeans` class is created, where we specify the number of clusters with the `n_clusters=k` argument.

**Line 14:** We call the `fit` method on the `kmeans` object, which performs k-means clustering on the `data` array.

**Lines 17–18:** We retrieve the computed `labels` and `centroids` from the fitted model.

**Line 21:** We use the `plt.scatter` function to plot the data points.

**Line 24:** We then plot the centroids, using the `marker='o'` argument to depict them as circles and `c='black'` to set their color to black.

**Lines 27–32:** We set the plot's `title`, `xlabel`, and `ylabel`, and finally display the plot using `plt.show()`.

We can reuse the above code simply by replacing the k-means algorithm with the k-medoids clustering algorithm from the `sklearn_extra` module.

```python
# required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids

# unorganized data
data = np.array([[1, 1], [2, 1], [1, 2], [10, 10], [11, 10], [10, 11], [2, 10], [9, 3], [8, 4], [9, 4]])

# number of clusters
k = 3

# k-medoids clustering
kmedoids = KMedoids(n_clusters=k)
kmedoids.fit(data)

# getting labels and medoids
labels = kmedoids.labels_
medoids = kmedoids.medoid_indices_

# plotting data
plt.scatter(data[:, 0], data[:, 1], c=labels)

# plotting medoids
plt.scatter(data[medoids][:, 0], data[medoids][:, 1], marker='o', c='black')

# setting attributes
plt.title('K-Medoids Algorithm')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')

# displaying the plot
plt.show()
```

Here, instead of `KMeans`, an instance of the `KMedoids` class is created, and we fit this clustering technique over the `data` array. Similarly, instead of `cluster_centers_`, we use `medoid_indices_` to obtain the indices of the medoid points within the data.

We can see that k-means creates new points to serve as centroids, while k-medoids selects existing data points as medoids.
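We can verify this difference directly, without any clustering library, on a tiny cluster (the sample values and the `in_data` helper are our own):

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])

# k-means style center: the mean, a brand-new point
centroid = cluster.mean(axis=0)  # (4/3, 4/3) -- not a member of the data

# k-medoids style center: the member with the smallest total distance
dists = np.abs(cluster[:, None, :] - cluster[None, :, :]).sum(axis=2)
medoid = cluster[dists.sum(axis=1).argmin()]

def in_data(p):
    return any(np.array_equal(p, q) for q in cluster)
```

Checking membership shows that `centroid` is a new point while `medoid` is one of the original three.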

| K-means | K-medoids |
| --- | --- |
| Takes the mean of the data points in a cluster to create new points called centroids. | Uses existing points from the data to serve as cluster centers called medoids. |
| Centroids are new points not previously found in the data. | Medoids are existing points from the data. |
| Can only be used for numerical data. | Can be used for both numerical and categorical data. |
| Minimizes the sum of squared distances, also known as the sum of squared errors (SSE). | Minimizes the sum of dissimilarities between each point and its cluster's medoid. |
| Typically uses Euclidean distance. | Typically uses Manhattan distance, though any dissimilarity measure can be used. |
| Sensitive to outliers in the data. | Outlier-resistant; reduces the effect of outliers. |
| Does not handle noise in the data well. | Handles noisy data more effectively. |
| Less costly to implement. | More costly to implement. |
| Faster. | Comparatively slower. |
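To make the distance row concrete, the two metrics give different values for the same pair of points (the example points are our own):

```python
import numpy as np

p = np.array([1, 1])
q = np.array([4, 5])

euclidean = np.linalg.norm(p - q)  # sqrt(3**2 + 4**2) = 5.0
manhattan = np.abs(p - q).sum()    # |3| + |4| = 7
```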

K-means is generally faster than k-medoids, since computing cluster means is simpler than evaluating the pairwise distances that k-medoids requires.

Let's take a look at a few scenarios:

- If noise is negatively affecting our analysis, k-medoids is the way to go.
- Since k-medoids is an outlier-resistant algorithm, it is more useful when the clusters have irregular shapes or different sizes.
- If our dataset consists only of numerical data, we can simply apply k-means.

Ultimately, the algorithm we choose depends on our data and our requirements.

Which algorithm should be applied in this scenario?

**Q:** We want every point of the cluster to be from within the existing dataset, and noise is a concern.

A) K-means

B) K-medoids
