Mahalanobis distance is defined as the distance between a distribution and a point. It was introduced in 1936 by P. C. Mahalanobis.
As you might know, the Euclidean distance is simple and the most commonly used as metric to find distance. We use the Mahalanobis distance because there are several limitations of the Euclidean distance. These limitations are as follows:
The Euclidean distance is very sensitive to the dataset's scales of features or variables. The Mahalanobis distance normalizes the variables, as large-scale variables may lead to inappropriate results.
The result of the Euclidean distance is affected by the outliers, while on the other hand, the Mahalanobis distance finds the distribution of the dataset. It can ignore the outliers as well.
The Mahalanobis distance takes account of the covariance and structure of the variables. The equation for calculating the Mahalanobis distance is given below.
In this equation:
The term
The below figure illustrates the Mahalanobis distance between the point and the distribution.
There are several applications of the Mahalanobis distance, which are discussed below.
Mahalanobis distance is used to figure out the outliers in the data set. The points with large Mahalanobis distances are far from the distribution and considered outliers.
Based on the characteristics of the variables, it is used to determine the similarity and dissimilarity of the points with the clusters.
In terms of deep learning, it is used to find anomaly detection with the help of outliers.
In this Answer, we discussed the mathematical intuition and applications of the Mahalanobis distance. We also looked at the limitations of the Euclidean distance and how the Mahalanobis distance is better than it.
Note: Here is the link, if you want to learn about Manhattan distance.
To test your understanding of the Mahalanobis distance, try to solve this quiz.
What does Mahalanobis distance measure?
Distance between two points in a multivariate dataset.
Distance between a point and the mean of a distribution, accounting for variable correlations and scales.
Distance between two distributions in a multivariate dataset.
Distance between two variables in a dataset.