What is hierarchical clustering?
Overview
Clustering is a data modeling technique used to group objects based on features they have in common within a data set. We can then analyze these clusters and extract meaningful information from them.
Hierarchical clustering is one of several types of clustering algorithms.
How does hierarchical clustering work?
Types of hierarchical clustering algorithms
There are two types of hierarchical clustering algorithms:
- Agglomerative (bottom-up approach)
- Divisive (top-down approach)
Agglomerative (bottom-up)
Agglomerative Nesting (AGNES) is a convergent, bottom-up approach. It starts by assigning each data point to a cluster of its own. The dissimilarity between every pair of clusters is then calculated, and the two clusters with the least dissimilarity are merged. This repeats until all the points end up in a single cluster.
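For intuition, here is a minimal sketch of the merge sequence using SciPy's linkage function (single link distance is assumed; the sample data is the same array used in the sklearn example later in this article):

# a minimal sketch of AGNES with SciPy (single link distance assumed);
# each row of Z records one merge: [cluster_a, cluster_b, distance, new_size]
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])
Z = linkage(X, method='single')
for a, b, dist, size in Z:
    print(f'merge clusters {int(a)} and {int(b)} '
          f'at distance {dist:.2f} (new size {int(size)})')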
Divisive (top-down)
Inverting the order of agglomerative hierarchical analysis gives Divisive Analysis (DIANA), which is a divergent approach. The algorithm starts with all the points in a single cluster and repeatedly splits it until each point ends up in a cluster of its own, as sketched below.
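Below is a simplified sketch of DIANA's first split step, assuming Euclidean distance: the point farthest on average from the rest seeds a splinter group, and any point that is closer on average to the splinter group than to the remaining points moves over. A full DIANA would repeat this on the cluster with the largest diameter until every point stands alone.

# a simplified sketch of one DIANA split step (Euclidean distance assumed)
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]], dtype=float)
D = cdist(X, X)                      # pairwise Euclidean distances

remainder = list(range(len(X)))      # start: one cluster holding everything
# seed the splinter group with the point farthest (on average) from the rest
seed = max(remainder, key=lambda i: D[i, remainder].mean())
splinter = [seed]
remainder.remove(seed)

moved = True
while moved and len(remainder) > 1:
    moved = False
    for i in list(remainder):
        others = [j for j in remainder if j != i]
        # move point i if it is closer on average to the splinter group
        if D[i, splinter].mean() < D[i, others].mean():
            remainder.remove(i)
            splinter.append(i)
            moved = True

print('splinter group:', splinter)
print('remainder:    ', remainder)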
Computing dissimilarity
The following measures can be used to compute the dissimilarity between two clusters when deciding which pair to merge (each measure is sketched in code after this list):
- Single link distance: The smallest distance between any points x and y, where x belongs to cluster 0 and y belongs to cluster 1.
- Complete link distance: The largest distance between any points x and y, where x belongs to cluster 0 and y belongs to cluster 1.
- Average distance: The average distance over all pairs of points x and y, where x belongs to cluster 0 and y belongs to cluster 1.
- Centroid: The distance between the centroids of the two clusters, where the centroid is the point that represents the center of a cluster.
- Medoid: The distance between the medoids of the two clusters, where the medoid is the cluster member whose dissimilarity to all other elements in its cluster is minimal.
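As a minimal sketch of these measures, assuming Euclidean distance and two small illustrative clusters:

# a minimal sketch of the five linkage measures (Euclidean distance assumed)
import numpy as np
from scipy.spatial.distance import cdist

cluster0 = np.array([[1, 4], [1, 5], [1, 6]], dtype=float)
cluster1 = np.array([[6, 3], [9, 2]], dtype=float)

D = cdist(cluster0, cluster1)        # all pairwise distances between clusters

print('single link  :', D.min())     # smallest pairwise distance
print('complete link:', D.max())     # largest pairwise distance
print('average link :', D.mean())    # mean of all pairwise distances

# centroid distance: distance between the clusters' mean points
c0, c1 = cluster0.mean(axis=0), cluster1.mean(axis=0)
print('centroid     :', np.linalg.norm(c0 - c1))

# medoid of a cluster: the member with minimal total distance to its peers
def medoid(pts):
    return pts[cdist(pts, pts).sum(axis=0).argmin()]

print('medoid       :', np.linalg.norm(medoid(cluster0) - medoid(cluster1)))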
Hierarchical clustering using sklearn
# importing libraries
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# initializing sample data
X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])
print('Dataset: ')
print(X)

# loading the clustering algorithm
# (in scikit-learn versions before 1.2, the metric parameter was named affinity)
model = AgglomerativeClustering(n_clusters=2, metric='euclidean')

# fitting the data
model = model.fit(X)

# printing the assigned cluster labels
print('\nThe labels assigned to the train data are: ')
print(model.labels_)

# clustering new data; AgglomerativeClustering has no separate predict method,
# so fit_predict re-runs the clustering on the new points
Y = np.array([[1, 2], [2, 3], [5, 5], [6, 0]])
print('\nTest data: ')
print(Y)
print('\nThe labels assigned to test data are: ')
print(model.fit_predict(Y))
Explanation
In the code above, we cluster the NumPy array X, a 2D array of size 6x2 that represents our data.
We use the scikit-learn machine learning library for Python to perform the clustering.
First, we import the required Python modules.
Next, we load the clustering model and specify its arguments:
- n_clusters: The number of clusters to form, and hence the number of centroids to produce.
- metric: The distance metric used to compute dissimilarity between points (this parameter was named affinity in scikit-learn versions before 1.2).
After the model has been set up, it is fit on the training data, which assigns a cluster label to each item in the training set.
Finally, the model is run on the test data using the fit_predict function. Because AgglomerativeClustering cannot predict labels for unseen points with a previously fitted model, fit_predict re-runs the clustering on the new data and assigns each point to a cluster. We can print the resulting labels to see which cluster each point was assigned to.
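Hierarchical clusterings are often inspected with a dendrogram, which shows the entire merge hierarchy rather than a single flat cut into two clusters. Below is a minimal sketch using SciPy (Ward linkage is assumed here, matching sklearn's default):

# a minimal sketch of a dendrogram for the same data (Ward linkage assumed)
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])
Z = linkage(X, method='ward')
dendrogram(Z)
plt.xlabel('Sample index')
plt.ylabel('Merge distance')
plt.show()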