Similarity and Dissimilarity Measures

Similarity and dissimilarity measures are core components of clustering algorithms: points that are similar under the chosen measure are grouped into the same cluster, while dissimilar or distant points are placed into different clusters. Although the choice of measure is task-dependent, it’s good to know the common ones.

Note: The measures involve two data points, say $\bold x$ and $\bold y$, in $\R^d$.

Minkowski distance

The Minkowski distance $d_{mink}$ between points $\bold x$ and $\bold y$ is defined as follows:

$$d_{mink}(\bold x, \bold y, p)=\bigg(\sum_{i=1}^d|x_i - y_i|^p \bigg)^\frac{1}{p}$$

Here, $p \in \Z^+$, that is, $p$ is a positive integer. The code below implements Minkowski distance given two points x and y for a given value of p:

Python 3.10.4
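```python
import numpy as np

def Minkowski_distance(x, y, p=2):
    # Sum the p-th powers of the coordinate-wise absolute differences,
    # then take the p-th root of that sum
    return np.sum(np.abs(x - y)**p)**(1./p)

d, p = 20, 3  # dimension of the points and order of the distance
x, y = np.random.rand(d), np.random.rand(d)  # two random points in [0, 1)^d
print(f'The Minkowski distance between x and y is {Minkowski_distance(x, y, p=p)}')
```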
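
As a quick sanity check on the formula, the sketch below evaluates the distance on a pair of points that is easy to verify by hand. With p = 1 the formula reduces to the Manhattan distance, and with p = 2 to the Euclidean distance. The helper name `minkowski` and the sample points are illustrative assumptions rather than part of the lesson, and the last line compares the result against NumPy's `np.linalg.norm` applied to the difference vector.

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance of order p between 1-D arrays (hypothetical helper)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Illustrative points chosen so the results can be checked by hand
x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

manhattan = minkowski(x, y, p=1)  # |3| + |4|
euclidean = minkowski(x, y, p=2)  # sqrt(3^2 + 4^2)

print(manhattan)  # 7.0
print(euclidean)  # 5.0

# Cross-check p = 3 against NumPy's vector norm of the difference
matches_numpy = np.isclose(minkowski(x, y, p=3), np.linalg.norm(x - y, ord=3))
print(matches_numpy)  # True
```

Raising p puts more weight on the largest coordinate difference, so for these points the distance shrinks from 7 toward 4 as p grows.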

The p-norm

The p-norm of a vector $\bold x \in \R^d$, denoted by xp ...