Trusted answers to developer questions

Related Tags

datascience
statistics
statisticalmethod
communitycreator

What are different statistical measures for analysis?

Fatima Hasan

In a world full of data, we must learn how to analyze it and extract meaningful information. Let’s look at a few basic statistical methods that can be used to discover patterns and trends from raw data.

1. Mean

The first measure is the mean, more commonly known as the average. To calculate the mean, we add all the values in the data set and divide the sum by the number of values.

The mean gives us a single central value around which the data lie. However, it is sensitive to outliers: one extreme value can pull it far away from where most of the values sit.

Example

Data: 5, 9, 3, 6

Mean = ( 5 + 9 + 3 + 6 ) / 4 = 5.75

This shows that the values in the data set center around 5.75. However, if the same data also contained the value 1200, the average would come out to be:

( 5 + 9 + 3 + 6 + 1200 ) / 5 = 244.6

As you can see, a single outlier shifted the mean by a large amount. This is why the mean is most useful when the data has no extreme outliers, or after they have been removed.
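To see the effect of an outlier in code, here is a minimal sketch that uses Python's built-in `statistics` module on the listed data (5, 9, 3, 6). The variable names are our own choices for illustration.

```python
from statistics import mean

data = [5, 9, 3, 6]
with_outlier = data + [1200]  # the same data plus one extreme value

mean_plain = mean(data)
mean_outlier = mean(with_outlier)

print(mean_plain)    # 5.75
print(mean_outlier)  # 244.6
```

Adding a single value moves the mean from a number inside the original data's range to one far outside it.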

2. Median

The median is the value that lies in the center of the data set. To calculate the median, we first sort the data in ascending order and then choose the middle value. If the number of entries is odd, the median is simply the center value. If the number of entries is even, we take the average of the two central values to get the median.

Example

For the data given above, if we wish to find the median, first we sort the data as follows:

3, 5, 6, 9

Then, since the number of entries is four, an even number with no single value in the middle, we take the average of the two central values:

( 5 + 6 ) / 2 = 5.5

The benefit of the median is that it is barely affected by outliers, so it gives a more reliable center when the data contains extreme values.
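The same check can be sketched in Python with `statistics.median`, which sorts the data for us. We reuse the data 5, 9, 3, 6 and, as an illustration of robustness, add the outlier 1200 from the mean section.

```python
from statistics import median

data = [5, 9, 3, 6]
with_outlier = data + [1200]

median_plain = median(data)            # even count: average of 5 and 6
median_outlier = median(with_outlier)  # odd count: middle of 3, 5, 6, 9, 1200

print(median_plain)    # 5.5
print(median_outlier)  # 6
```

The outlier moves the median only from 5.5 to 6, while it moves the mean by hundreds.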

3. Mode

The mode is the most frequent value in a data set. If no value is repeated in the data, then there is no mode.

Example

In the data above, there is no mode because every value appears only once. However, if we have the following data:

5, 5, 7, 8, 9, 1, 2, 5, 8

Then the mode is 5, since it appears three times, more often than any other value.
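A small sketch of this rule in Python is shown below. The helper name `modes` is hypothetical, and we assume a non-empty list. It returns an empty list when no value repeats, matching the "no mode" case described above.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if no value repeats.

    Assumes `values` is non-empty.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once, so there is no mode
    return [value for value, count in counts.items() if count == highest]

print(modes([5, 5, 7, 8, 9, 1, 2, 5, 8]))  # [5]
print(modes([5, 9, 3, 6]))                 # []
```

Returning a list also covers data in which two or more values tie for the highest count.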

4. Standard deviation

The next important measure is the standard deviation, which describes the spread of data around the mean. A larger standard deviation means the values are more spread out from the mean, while a smaller one means they cluster closely around it.

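The formula to calculate standard deviation is given below:

SD = sqrt( ( ( x1 - mean )^2 + ( x2 - mean )^2 + ... + ( xN - mean )^2 ) / N )

Here, x1 to xN are the values and N is the total number of entries. We sum the squares of the differences of each value from the mean, divide the sum by N, and take the square root.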

Example

Data:
5, 6, 3, 2, 9, 10

Mean = ( 5 + 6 + 3 + 2 + 9 + 10 ) / 6 ≈ 5.83

SD = sqrt( ( ( 5 - 5.83 )^2 + ( 6 - 5.83 )^2 + ... + ( 10 - 5.83 )^2 ) / 6 )

≈ 2.9107
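The calculation above can be checked with a short Python sketch. It follows the steps as described (divide by the number of entries) and compares the result with `statistics.pstdev`, the standard library's population standard deviation. The variable names are ours.

```python
from math import sqrt
from statistics import pstdev

data = [5, 6, 3, 2, 9, 10]

# Step by step: mean, squared differences, average of those, square root
mean_value = sum(data) / len(data)
squared_diffs = [(x - mean_value) ** 2 for x in data]
sd_manual = sqrt(sum(squared_diffs) / len(data))

# The same quantity from the standard library
sd_library = pstdev(data)

print(round(sd_manual, 4))   # 2.9107
print(round(sd_library, 4))  # 2.9107
```

Note that `statistics.stdev` divides by N - 1 instead of N (the sample standard deviation), so it would give a slightly larger value for the same data.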

5. Range

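The range is the difference between the highest and lowest values in the data. It gives us a quick idea of how widely the data is spread, but because it depends only on the two extreme values, a single outlier can change it a lot. For example, for the data 5, 6, 3, 2, 9, 10, the range is 10 - 2 = 8.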
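As a quick illustration, the range can be computed in Python with the built-in `max` and `min` functions. We reuse the data from the standard deviation example.

```python
data = [5, 6, 3, 2, 9, 10]

# Range = highest value minus lowest value
data_range = max(data) - min(data)

print(data_range)  # 8
```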

6. Percentiles

A percentile is a value or a score below which a given percentage of the data falls. For example, if you have 10 mangoes and the second heaviest mango weighs 150 g, then 8 of the 10 mangoes, or 80%, weigh less. So 150 g is the 80th percentile weight.

Formula

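To get this “80,” we use the following equation:
( ( 10 - 2 ) / 10 ) * 100 = 80

Here, 10 is the total number of mangoes and 2 is the number of mangoes that are not lighter than the chosen one (the chosen mango itself and the heaviest one).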
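The mango example can be sketched in Python as follows. The weights are made up for illustration (only the 150 for the second heaviest mango comes from the example above), and `percent_below` is a hypothetical helper that counts the values strictly below a given value.

```python
def percent_below(values, x):
    """Percentage of entries in `values` that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical weights in grams; 150 is the second heaviest
weights = [90, 100, 105, 110, 120, 125, 130, 140, 150, 160]

print(percent_below(weights, 150))  # 80.0
```

Statistical libraries use several slightly different percentile definitions, so their results can differ from this simple count on small data sets.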

7. Regression

Regression models the relationship between a dependent variable and an independent variable. It estimates how the dependent variable changes when the independent variable changes. See the formula and example graph for regression below.

Formula

Graph example

[Figure: regression graph relating rainfall to umbrellas sold]

Here, rainfall is the independent variable, and umbrellas sold are the dependent variable.

Regression can help us find out whether the variables have a strong or weak relationship. It can also help us forecast future values.
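To make the idea concrete, here is a minimal sketch of one common form of regression, simple linear regression fitted by least squares, which is our assumption about the kind of regression meant here. The rainfall and umbrella figures, the millimeter unit, and the helper name `fit_line` are all invented for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical observations
rainfall_mm = [10, 20, 30, 40, 50]     # independent variable
umbrellas_sold = [15, 25, 30, 45, 50]  # dependent variable

slope, intercept = fit_line(rainfall_mm, umbrellas_sold)

# Forecast umbrella sales for 60 mm of rainfall
forecast = intercept + slope * 60

print(slope, intercept, forecast)  # roughly 0.9, 6 and 60
```

In this made-up data, each extra millimeter of rainfall is associated with about 0.9 more umbrellas sold, and the fitted line lets us estimate sales for a rainfall value we have not observed.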
