Proximity measures are essential tools in data analysis, particularly for ordinal data (ranked or rated data). They let us quantify relationships between data points through measures such as Spearman’s rank correlation or Goodman and Kruskal’s gamma coefficient. Crucial for tasks like clustering and classification, these measures reveal valuable patterns and structures by quantifying how similar or dissimilar data points are.
Let's understand how to calculate the proximity measure for ordinal attributes using the example below.
Suppose we have a table with five ranks, i.e., Excellent, Very Good, Good, Fair, and Poor. For these ranks, we have an ordinal attribute named Test, as given below:
Object Identifier | Test |
1 | Excellent |
2 | Good |
3 | Poor |
4 | Good |
5 | Very Good |
6 | Fair |
7 | Poor |
8 | Good |
9 | Fair |
10 | Very Good |
For each data point in the dataset, let’s determine its numeric rank based on each value of the Test attribute. We are doing this because it helps maintain the order of the attributes, making it easier to accurately measure the distance or similarity between them. These ranks are assigned in ascending order, starting from 1 as the lowest and incrementing to 5 as the highest. Let’s start doing it:
Object 1 has the rank Excellent and obtains the numeric rank value 5, since Excellent is the highest of the five states.
Object 2 has the rank Good and obtains the numeric rank value 3, since Good is the third-highest state.
The updated table after assigning ranks looks like this:
Object Identifier | Test |
1 | 5 |
2 | 3 |
3 | 1 |
4 | 3 |
5 | 4 |
6 | 2 |
7 | 1 |
8 | 3 |
9 | 2 |
10 | 4 |
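The rank-assignment step above can be sketched in Python; the mapping dictionary and variable names below are illustrative, not part of the original example:

```python
# Map each ordinal state of the Test attribute to its numeric rank
# (1 = lowest = Poor, 5 = highest = Excellent), then apply it to the ten objects.
RANKS = {"Poor": 1, "Fair": 2, "Good": 3, "Very Good": 4, "Excellent": 5}

tests = ["Excellent", "Good", "Poor", "Good", "Very Good",
         "Fair", "Poor", "Good", "Fair", "Very Good"]

numeric_ranks = [RANKS[t] for t in tests]
print(numeric_ranks)  # [5, 3, 1, 3, 4, 2, 1, 3, 2, 4]
```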
Now that we have assigned ranks to each data point, the next step is to normalize these ranks so that they fall in the range 0.0 to 1.0.
We can map each rank with the help of the following formula:
z = (r - 1) / (M - 1)
where r is the numeric rank of the object and M is the number of ordinal states (here, M = 5). For example, the rank 3 (Good) maps to (3 - 1) / (5 - 1) = 0.5.
Now, using these normalized values for each rank, let’s replace the value of the Test attribute with the normalized ones.
Object Identifier | Test |
1 | 1 |
2 | 0.5 |
3 | 0 |
4 | 0.5 |
5 | 0.75 |
6 | 0.25 |
7 | 0 |
8 | 0.5 |
9 | 0.25 |
10 | 0.75 |
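The normalization step can be verified with a short Python sketch (variable names are illustrative):

```python
# Normalize each rank r to the range [0.0, 1.0] using z = (r - 1) / (M - 1),
# where M is the number of ordinal states (here M = 5).
M = 5
numeric_ranks = [5, 3, 1, 3, 4, 2, 1, 3, 2, 4]

normalized = [(r - 1) / (M - 1) for r in numeric_ranks]
print(normalized)  # [1.0, 0.5, 0.0, 0.5, 0.75, 0.25, 0.0, 0.5, 0.25, 0.75]
```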
With the normalized ranks, let’s calculate the dissimilarity between pairs of data points using the Euclidean distance formula. For two points x1 and x2 in a 1D space, it reduces to the absolute difference:
d(x1, x2) = sqrt((x1 - x2)^2) = |x1 - x2|
In our case:
Distance between Objects 1 and 2: |1 - 0.5| = 0.5
Distance between Objects 1 and 3: |1 - 0| = 1
Distance between Objects 1 and 4: |1 - 0.5| = 0.5
Distance between Objects 1 and 5: |1 - 0.75| = 0.25
Distance between Objects 1 and 6: |1 - 0.25| = 0.75
Distance between Objects 1 and 7: |1 - 0| = 1
Distance between Objects 1 and 8: |1 - 0.5| = 0.5
Distance between Objects 1 and 9: |1 - 0.25| = 0.75
Distance between Objects 1 and 10: |1 - 0.75| = 0.25
Note: There’s no need to calculate the upper-right triangle separately once the lower-left triangle of the dissimilarity matrix is filled in, as the matrix is symmetric.
Similarly, calculate this for the rest of the pairs. The dissimilarity matrix would look like:
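The full matrix can also be computed programmatically; a minimal Python sketch using the normalized values from the table above:

```python
# Build the full 10x10 dissimilarity matrix from the normalized ranks;
# in 1D the Euclidean distance reduces to the absolute difference.
normalized = [1.0, 0.5, 0.0, 0.5, 0.75, 0.25, 0.0, 0.5, 0.25, 0.75]

matrix = [[abs(a - b) for b in normalized] for a in normalized]

# Print only the lower-left triangle; the upper-right mirrors it.
for i, row in enumerate(matrix):
    print("  ".join(f"{row[j]:.2f}" for j in range(i + 1)))
```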
As a result, we can observe that:
Objects 1 and 3 are the most dissimilar, with a dissimilarity score of 1.00.
Objects 3 and 7 are identical under this measure, with a dissimilarity score of 0.00.
Objects 5 and 10 are likewise identical, with a dissimilarity score of 0.00.
Objects 1 and 6 are highly dissimilar, with a dissimilarity score of 0.75.
Objects 3 and 9 are moderately similar, with a dissimilarity score of 0.25.
Objects 1 and 2 are moderately dissimilar, with a dissimilarity score of 0.50.
Objects 2 and 4 are also identical, with a dissimilarity score of 0.00.
In conclusion, calculating dissimilarity matrices using appropriate proximity measures for ordinal attributes is instrumental in revealing patterns within ranked data.