A z-score is a numeric measure of how far a data point is from the mean, expressed in units of standard deviation.
The z-score is calculated as follows:
$Z = \frac{\text{data point} - \text{mean}}{\text{standard deviation}}$
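The formula translates almost directly into code. The following is a minimal sketch: the function name `z_score` and its parameter names are chosen here for illustration, and the guard against a zero standard deviation is an added assumption, since the formula is undefined in that case.

```python
def z_score(x: float, mean: float, std: float) -> float:
    """Return how many standard deviations x lies from the mean."""
    if std == 0:
        # The formula divides by the standard deviation, so zero is rejected.
        raise ValueError("standard deviation must be non-zero")
    return (x - mean) / std


# A point equal to the mean has a z-score of 0.
print(z_score(10.0, 10.0, 2.0))
# A point two standard deviations above the mean has a z-score of 2.
print(z_score(14.0, 10.0, 2.0))
```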
Z-scores are a useful tool for statisticians and developers alike.
With z-scores, we can easily detect anomalies within a dataset.
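Both the sign and the magnitude of a z-score carry information: a positive z-score means the data point lies above the mean, a negative z-score means it lies below the mean, and a z-score close to 0 means it is close to the mean.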
The z-score of a data point tells us how close or far the point is from the mean. By setting a range of acceptable z-scores (e.g., $\pm1$), we can flag every point that lies outside that range as an anomaly.
A range of $\pm1$ means that points within one standard deviation of the mean are considered acceptable. All other points are treated as anomalies, or outliers.
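The thresholding idea described above can be sketched in a few lines of Python. This is only an illustration: the helper name `find_outliers`, the use of the population standard deviation (`statistics.pstdev`), and the sample readings are all assumptions made for this example.

```python
from statistics import fmean, pstdev


def find_outliers(values, threshold=1.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs((v - mean) / std) > threshold]


# Hypothetical readings, made up for illustration.
readings = [10, 12, 11, 13, 12, 40]
print(find_outliers(readings, threshold=1.0))
```

A wider range, such as a threshold of 2 or 3, flags fewer points. The right choice depends on how strict the anomaly detection needs to be.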
Let’s consider the following dataset:
[2, 3, 5, 4, 7, 19, 6, 4, 3, 6]
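First, we will calculate the mean and standard deviation of our dataset. Rounded to one decimal place, these come out as:
Mean: 5.9
Standard deviation: 4.6 (population standard deviation, which divides by the number of points)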
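These two statistics can be checked with Python's standard `statistics` module. The sketch below computes both the population and the sample standard deviation. Which one the article intends is not stated, so treating 4.6 as the population value is an inference from the numbers.

```python
from statistics import fmean, pstdev, stdev

data = [2, 3, 5, 4, 7, 19, 6, 4, 3, 6]

mean = fmean(data)
population_std = pstdev(data)  # divides by n
sample_std = stdev(data)       # divides by n - 1

print(f"mean = {mean:.1f}")
print(f"population standard deviation = {population_std:.1f}")
print(f"sample standard deviation = {sample_std:.1f}")
```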
Now, we will proceed to calculate the z-scores using the formula above.
| Data point | z-score |
| --- | --- |
| 2 | -0.8 |
| 3 | -0.6 |
| 5 | -0.2 |
| 4 | -0.4 |
| 7 | 0.2 |
| 19 | 2.8 |
| 6 | 0.02 |
| 4 | -0.4 |
| 3 | -0.6 |
| 6 | 0.02 |
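The z-scores for the whole dataset can be reproduced with a short script. This is a sketch that assumes the population standard deviation. Because it rounds to two decimal places, its values may differ slightly from figures rounded by hand.

```python
from statistics import fmean, pstdev

data = [2, 3, 5, 4, 7, 19, 6, 4, 3, 6]

mean = fmean(data)
std = pstdev(data)  # population standard deviation (an assumption)

# Apply the z-score formula to every data point.
z_scores = [(x - mean) / std for x in data]

for x, z in zip(data, z_scores):
    print(f"{x:>2} | {z:+.2f}")

# The point with the largest absolute z-score is the strongest outlier candidate.
most_extreme = max(data, key=lambda x: abs((x - mean) / std))
print("most extreme point:", most_extreme)
```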
From the table, we can see that the data point 19 has the highest z-score, 2.8. The point lies 2.8 standard deviations above the mean, well outside a $\pm1$ range, so it can be considered an anomaly.
Note: The z-score is also known as the standard score.