What is hierarchical clustering?
Overview
Clustering is a data modeling technique used to group objects based on features they have in common within a data set. We can then analyze these clusters and extract meaningful information from them.
Hierarchical clustering is one of several types of clustering algorithms.
How does hierarchical clustering work?
Types of hierarchical clustering algorithms
There are two types of hierarchical clustering algorithms:
- Agglomerative (bottom-up approach)
- Divisive (top-down approach)
Agglomerative (bottom-up)
Agglomerative Nesting (AGNES) is a convergent, bottom-up approach. It starts by assigning each data point to a cluster of its own. The dissimilarity between every pair of clusters is then calculated, and the two clusters with the least dissimilarity are merged. This repeats until all the points end up in a single cluster.
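For intuition, here is a minimal sketch of the merge sequence using SciPy's linkage function (single link distance is assumed; the sample data is the same array used in the sklearn example later in this article):

# a minimal sketch of AGNES with SciPy (single link distance assumed);
# each row of Z records one merge: [cluster_a, cluster_b, distance, new_size]
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])
Z = linkage(X, method='single')
for a, b, dist, size in Z:
    print(f'merge clusters {int(a)} and {int(b)} '
          f'at distance {dist:.2f} (new size {int(size)})')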
Divisive (top-down)
Inverting the order of agglomerative hierarchical analysis gives Divisive Analysis (DIANA), which is a divergent approach. The algorithm starts with all the points in a single cluster and repeatedly splits it until each point ends up in a cluster of its own, as sketched below.
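Below is a simplified sketch of DIANA's first split step, assuming Euclidean distance: the point farthest on average from the rest seeds a splinter group, and any point that is closer on average to the splinter group than to the remaining points moves over. A full DIANA would repeat this on the cluster with the largest diameter until every point stands alone.

# a simplified sketch of one DIANA split step (Euclidean distance assumed)
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]], dtype=float)
D = cdist(X, X)                      # pairwise Euclidean distances

remainder = list(range(len(X)))      # start: one cluster holding everything
# seed the splinter group with the point farthest (on average) from the rest
seed = max(remainder, key=lambda i: D[i, remainder].mean())
splinter = [seed]
remainder.remove(seed)

moved = True
while moved and len(remainder) > 1:
    moved = False
    for i in list(remainder):
        others = [j for j in remainder if j != i]
        # move point i if it is closer on average to the splinter group
        if D[i, splinter].mean() < D[i, others].mean():
            remainder.remove(i)
            splinter.append(i)
            moved = True

print('splinter group:', splinter)
print('remainder:    ', remainder)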
Computing dissimilarity
The following measures can be used to compute the dissimilarity between two clusters when deciding which pair to merge (each measure is sketched in code after this list):
- Single link distance: The smallest distance between any points x and y, where x belongs to cluster 0 and y belongs to cluster 1.
- Complete link distance: The largest distance between any points x and y, where x belongs to cluster 0 and y belongs to cluster 1.
- Average distance: The average distance over all pairs of points x and y, where x belongs to cluster 0 and y belongs to cluster 1.
- Centroid: The distance between the centroids of the two clusters, where the centroid is the point that represents the center of a cluster.
- Medoid: The distance between the medoids of the two clusters, where the medoid is the cluster member whose dissimilarity to all other elements in its cluster is minimal.
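As a minimal sketch of these measures, assuming Euclidean distance and two small illustrative clusters:

# a minimal sketch of the five linkage measures (Euclidean distance assumed)
import numpy as np
from scipy.spatial.distance import cdist

cluster0 = np.array([[1, 4], [1, 5], [1, 6]], dtype=float)
cluster1 = np.array([[6, 3], [9, 2]], dtype=float)

D = cdist(cluster0, cluster1)        # all pairwise distances between clusters

print('single link  :', D.min())     # smallest pairwise distance
print('complete link:', D.max())     # largest pairwise distance
print('average link :', D.mean())    # mean of all pairwise distances

# centroid distance: distance between the clusters' mean points
c0, c1 = cluster0.mean(axis=0), cluster1.mean(axis=0)
print('centroid     :', np.linalg.norm(c0 - c1))

# medoid of a cluster: the member with minimal total distance to its peers
def medoid(pts):
    return pts[cdist(pts, pts).sum(axis=0).argmin()]

print('medoid       :', np.linalg.norm(medoid(cluster0) - medoid(cluster1)))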
Hierarchical clustering using sklearn
# importing libraries
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# initializing sample data
X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])
print('Dataset: ')
print(X)

# loading the clustering algorithm
# (in scikit-learn versions before 1.2, the metric parameter was named affinity)
model = AgglomerativeClustering(n_clusters=2, metric='euclidean')

# fitting the data
model = model.fit(X)

# printing the assigned cluster labels
print('\nThe labels assigned to the train data are: ')
print(model.labels_)

# clustering new data; AgglomerativeClustering has no separate predict method,
# so fit_predict re-runs the clustering on the new points
Y = np.array([[1, 2], [2, 3], [5, 5], [6, 0]])
print('\nTest data: ')
print(Y)
print('\nThe labels assigned to test data are: ')
print(model.fit_predict(Y))
Explanation
In the code above, we cluster the NumPy array X, a 2D array of size 6x2 that represents our data.
We use the scikit-learn machine learning library for Python to perform the clustering.
First, we import the required Python modules.
Next, we load the clustering model and specify its arguments:
- n_clusters: The number of clusters to form, and hence the number of centroids to produce.
- metric: The distance metric used to compute dissimilarity between points (this parameter was named affinity in scikit-learn versions before 1.2).
After the model has been set up, it is fit on the training data, which assigns a cluster label to each item in the training set.
Finally, the model is run on the test data using the fit_predict function. Because AgglomerativeClustering cannot predict labels for unseen points with a previously fitted model, fit_predict re-runs the clustering on the new data and assigns each point to a cluster. We can print the resulting labels to see which cluster each point was assigned to.
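Hierarchical clusterings are often inspected with a dendrogram, which shows the entire merge hierarchy rather than a single flat cut into two clusters. Below is a minimal sketch using SciPy (Ward linkage is assumed here, matching sklearn's default):

# a minimal sketch of a dendrogram for the same data (Ward linkage assumed)
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])
Z = linkage(X, method='ward')
dendrogram(Z)
plt.xlabel('Sample index')
plt.ylabel('Merge distance')
plt.show()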