Hassaan Waqar

The **z-score** is a concept widely used in probability and statistics. It is used when data is normally distributed. To understand the `z-score` better, we first need to know what a normal distribution is. The figure below shows a normal distribution curve:

A Normal Distribution Curve

In normally distributed data, values are distributed symmetrically above and below the mean. The resulting curve is bell-shaped, and its center denotes the mean.

The mean, mode, and median are all equal.

The area under the curve is 1. The curve is symmetrical about the mean.
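These properties can be checked numerically. The snippet below is a minimal sketch using Python's standard-library `statistics.NormalDist`; the mean of 50 and standard deviation of 10 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical distribution: mean 50, standard deviation 10 (illustrative values only).
dist = NormalDist(mu=50, sigma=10)

# Mean, median, and mode coincide for a normal distribution.
center = (dist.mean, dist.median, dist.mode)

# Symmetry: the density is the same at equal distances on either side of the mean.
left_density = dist.pdf(50 - 7)
right_density = dist.pdf(50 + 7)

# The total area under the curve is 1, so half of it lies to the left of the mean.
area_left_of_mean = dist.cdf(50)

print(center)
print(left_density, right_density)
print(area_left_of_mean)
```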

Oftentimes, we need to compare values from different datasets. Let’s suppose a university accepts ACT and SAT scores for admissions. These tests use different scales and cumulative scores, and hence have different means. How can the university compare results from each test and decide which student performed better than the other? In such situations, we need to standardize the scores to compare them. The resulting **standardized normal variable** for each score is called `Z`.

A random normal variable `X` is standardized to have a mean of 0 and a standard deviation of 1.

The `z-score` is used when the data is normally distributed.

The `z-score` tells us how many standard deviations above or below the mean a value lies.

Let’s familiarize ourselves with some terminology before we craft a formula:

Symbol | Name | Purpose |
---|---|---|
`z` | Standard Normal Variable | Standardized score |
$X$ | Random Normal Variable | Actual value |
$\mu$ | mu | Mean of the data |
$\sigma$ | sigma | Standard deviation of the data |

To standardize a random normal variable, we need to carry out the following steps:

- Subtract the mean ($\mu$) from the random normal variable ($X$).
- Divide the result by the standard deviation ($\sigma$).

The final formula is as follows:

`z` = $\frac{X - \mu}{\sigma}$
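The formula translates directly into code. The function below is a minimal sketch; the name `z_score` and the sample values (a score of 85 with a mean of 70 and a standard deviation of 10) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x: subtract the mean, then divide by the standard deviation."""
    if sigma <= 0:
        # A standard deviation of zero would mean every value equals the mean.
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical values: 85 is 1.5 standard deviations above a mean of 70.
print(z_score(85, 70, 10))  # 1.5
```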

The illustration below summarizes the procedure:

Mean is the average of all values in the data. It is calculated as follows:

- Take the sum of all values in the dataset.
- Divide by the total number of values.

The final formula is as follows:

$\mu$ = $\frac{\sum_{i=1}^{N} X_i}{N}$

where $X_i$ is the $i$-th value in the dataset and $N$ is the number of values.
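As a small illustration of the two steps, the sketch below sums the values and divides by their count. The function name `mean` and the sample list are hypothetical.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


# Hypothetical data: (2 + 4 + 6 + 8) / 4
print(mean([2, 4, 6, 8]))  # 5.0
```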

Standard deviation measures how far the values in a dataset typically lie from the mean. It is calculated as follows:

- Subtract the mean from each value $X_i$.
- Square each of the resulting differences.
- Add all these squares together.
- Divide by the number of values in the dataset.
- Take the square root of the result of the previous step.

The final formula is as follows:

$\sigma$ = $\sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
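The steps above can be sketched in Python as follows. The function name `population_std` and the sample data are hypothetical. Note that this formula divides by $N$ (the population standard deviation); the sample standard deviation, which divides by $N - 1$, is a different quantity.

```python
import math


def population_std(values):
    """Population standard deviation, following the steps listed above."""
    n = len(values)
    if n == 0:
        raise ValueError("population_std requires at least one value")
    mu = sum(values) / n                              # mean of the data
    squared_diffs = [(x - mu) ** 2 for x in values]   # subtract the mean, then square
    variance = sum(squared_diffs) / n                 # add the squares, divide by N
    return math.sqrt(variance)                        # take the square root


# Hypothetical data with mean 5; the squared differences sum to 32, and 32 / 8 = 4.
print(population_std([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```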

We have gathered all the bits of information we need to work with the `z-score`. Let’s work through a simple example:

Suppose 15 students in a class took a test. The professor wants to ensure that he grades them realistically. Therefore, he decides that whoever scores more than 1 standard deviation below the mean will fail while others will pass. The table below shows the summary of scores:

Student | Test Scores (out of 100) |
---|---|
Jack | 72 |
Jim | 86 |
Gabe | 56 |
Bill | 92 |
Alice | 78 |
Veronica | 94 |
Angelica | 32 |
Matt | 44 |
Thomas | 66 |
Dice | 100 |
Donald | 28 |
Rice | 42 |
Jones | 88 |
Chris | 79 |
Liam | 73 |

In order to discuss these scores in terms of standard deviation, we need to standardize them. To do so, we will calculate the `z-score` for each.

Remember! Standardized scores have a mean of 0 and standard deviation of 1.
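As a brief justification, sketched under the assumption that $X$ has mean $\mu$ and standard deviation $\sigma$, applying the formula for `z` gives:

$E[z] = \frac{E[X] - \mu}{\sigma} = \frac{\mu - \mu}{\sigma} = 0$

$\mathrm{Var}(z) = \frac{\mathrm{Var}(X)}{\sigma^2} = \frac{\sigma^2}{\sigma^2} = 1$

Here $E[\cdot]$ and $\mathrm{Var}(\cdot)$ denote expectation and variance, a notation introduced only for this derivation.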

The total number of values is 15. Therefore, $N = 15$.

Sum $= 72 + 86 + 56 + 92 + 78 + 94 + 32 + 44 + 66 + 100 + 28 + 42 + 88 + 79 + 73 = 1030$

$\mu = 1030/N = 1030/15 \approx 68.7$

The mean is **68.7**.

Follow the steps discussed above to calculate the standard deviation.

It will look something like this:

$\sigma$ = $\sqrt{\frac{(72 - 68.7)^2 + (86 - 68.7)^2 + \dots + (73 - 68.7)^2}{15}} \approx 22.4$

The standard deviation is **22.4**.
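These two figures can be double-checked with Python's standard-library `statistics` module. This is only a verification sketch; `pstdev` is used because the article's formula divides by $N$, whereas `stdev` would divide by $N - 1$ and give a different result.

```python
from statistics import mean, pstdev

# Test scores from the table above, in the same order.
scores = [72, 86, 56, 92, 78, 94, 32, 44, 66, 100, 28, 42, 88, 79, 73]

mu = mean(scores)       # sum of the scores divided by 15
sigma = pstdev(scores)  # population standard deviation (divides by N)

print(round(mu, 1), round(sigma, 1))  # 68.7 22.4
```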

We can now plug these values into the formula for the `z-score`.

`z` = $\frac{X - \mu}{\sigma}$

For Jack:

`z` = $\frac{72 - 68.7}{22.4} = 0.147$

In simpler words, Jack is 0.147 standard deviations above the mean.

We can repeat the process for all the students. The updated table below shows the `z-score` of each student as well:

Student | Test Scores (out of 100) | z-score |
---|---|---|
Jack | 72 | 0.147 |
Jim | 86 | 0.772 |
Gabe | 56 | -0.567 |
Bill | 92 | 1.040 |
Alice | 78 | 0.415 |
Veronica | 94 | 1.129 |
Angelica | 32 | -1.638 |
Matt | 44 | -1.103 |
Thomas | 66 | -0.121 |
Dice | 100 | 1.397 |
Donald | 28 | -1.817 |
Rice | 42 | -1.192 |
Jones | 88 | 0.862 |
Chris | 79 | 0.460 |
Liam | 73 | 0.192 |

As the table above shows, Angelica, Matt, Donald, and Rice scored more than 1 standard deviation below the mean. Hence, they failed the test.
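The whole example can be reproduced in a few lines of Python. The sketch below is one possible way to do it, using the standard-library `statistics` module. It works with the unrounded mean and standard deviation, so individual z-scores may differ from the table in the second or third decimal place, but the same four students fall below the threshold.

```python
from statistics import mean, pstdev

# Names and scores from the table above.
scores = {
    "Jack": 72, "Jim": 86, "Gabe": 56, "Bill": 92, "Alice": 78,
    "Veronica": 94, "Angelica": 32, "Matt": 44, "Thomas": 66, "Dice": 100,
    "Donald": 28, "Rice": 42, "Jones": 88, "Chris": 79, "Liam": 73,
}

values = list(scores.values())
mu = mean(values)
sigma = pstdev(values)  # population standard deviation, as in the article's formula

# Standardize every score: z = (X - mu) / sigma.
z_scores = {name: (score - mu) / sigma for name, score in scores.items()}

# A student fails when the score is more than 1 standard deviation below the mean.
failed = [name for name, z in z_scores.items() if z < -1]
print(failed)  # ['Angelica', 'Matt', 'Donald', 'Rice']

# Sanity check: the standardized scores have mean 0 and standard deviation 1
# (up to floating-point error).
z_values = list(z_scores.values())
print(abs(mean(z_values)) < 1e-9, abs(pstdev(z_values) - 1) < 1e-9)  # True True
```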

The `z-score` follows the same pattern of calculation in **statistical inference** as well. In statistical inference, we need to validate whether a hypothesis generalizes to the entire population or is only applicable to the sample data. For such purposes, statisticians carry out hypothesis testing, which requires standardizing data and calculating `z-scores`.

Similarly, when comparing two datasets with different metrics of calculations, we can use the `z-score` as a standardized metric.
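To make the ACT versus SAT comparison from earlier concrete, here is a minimal sketch. All means, standard deviations, and student scores below are hypothetical placeholders, not real test statistics.

```python
def z_score(x, mu, sigma):
    """Standardize x relative to a distribution with mean mu and standard deviation sigma."""
    return (x - mu) / sigma


# Hypothetical test statistics (NOT real figures): SAT mean 1050, standard deviation 200.
sat_z = z_score(1300, 1050, 200)

# Hypothetical test statistics (NOT real figures): ACT mean 21, standard deviation 5.
act_z = z_score(28, 21, 5)

# The higher z-score marks the stronger result relative to that test's own distribution.
better = "ACT student" if act_z > sat_z else "SAT student"
print(sat_z, act_z, better)  # 1.25 1.4 ACT student
```

Under these assumed numbers, the ACT score sits further above its own mean (in units of standard deviation) than the SAT score does, even though the raw scores cannot be compared directly.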

Copyright ©2022 Educative, Inc. All rights reserved
