
Fatima Hasan

In a world full of data, we must learn how to analyze it and extract meaningful information. Let’s look at a few basic statistical methods that can be used to discover patterns and trends from raw data.

The first method is to find the **mean** of the data, more commonly known as the average. To calculate the mean, we add all the values in the data set and divide the sum by the number of entries.

The mean gives us a general idea of where the values are centered. However, it is sensitive to outliers: a single extreme value can pull it far away from the rest of the data.

Data |
---|
5 |
9 |
4 |
6 |

Mean = (5 + 9 + 4 + 6) / 4 = 6

This shows that the values in the data set lie close to `6`. However, if the same data also included the value `1200`, the mean would come out to be:

(5 + 9 + 4 + 6 + 1200)/5 = 244.8

As you can see, a single outlier moved the mean from 6 to 244.8, a value that is far from every entry in the data. This is why the mean is most useful when the data has no extreme outliers, or after the outliers have been removed.
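The effect of an outlier on the mean is easy to check in code. The following is a minimal Python sketch using the standard library's `statistics.mean`; the variable names are illustrative and reuse the values from the calculation above.

```python
from statistics import mean

# Values used in the mean calculation above
data = [5, 9, 4, 6]

# The same values plus one extreme outlier
data_with_outlier = data + [1200]

mean_clean = mean(data)
mean_outlier = mean(data_with_outlier)

print("Mean without outlier:", mean_clean)
print("Mean with outlier:", mean_outlier)
```

Running both calls side by side makes it clear how much a single extreme value can dominate the result.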

**Median** is the value that lies in the center of the data set. To calculate the median, we first sort the data in ascending order and then choose the middle value. If the number of entries is odd, then the median is simply the center value. If this number is even, we take the average of the two central values to get the median.

For the data given above, if we wish to find the median, first we sort the data as follows:

4, 5, 6, 9

Then, since the number of entries is even, we take the average of the two central values:

(5 + 6) / 2 = 5.5

The benefit of the median is that it is barely affected by outliers, so it often gives a more representative center when the data contains extreme values.
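To see how little an outlier moves the median, here is a small Python sketch using `statistics.median` from the standard library. It reuses the four values from the mean calculation and the `1200` outlier; the variable names are illustrative.

```python
from statistics import median

data = [5, 9, 4, 6]
data_with_outlier = data + [1200]

# Even number of entries: average of the two middle values after sorting
median_clean = median(data)

# Odd number of entries: the single middle value after sorting
median_outlier = median(data_with_outlier)

print("Median without outlier:", median_clean)
print("Median with outlier:", median_outlier)
```

The outlier shifts the median by only half a unit here, while the same outlier moved the mean by more than 200.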

**Mode** represents the most frequent value of a data set. If no values are repeated in the data, then there is no mode.

For example, in the data above, there is no mode. However, if we have the following data:

5, 5, 7, 8, 9, 1, 2, 5, 8

Then the mode would be 5, since it is repeated three times.
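One way to compute the mode under the convention used here (no repeated values means no mode) is sketched below. The helper name `modes` is hypothetical, and it returns a list so that ties between equally frequent values are also covered. Python's built-in `statistics.mode` follows a different convention when nothing repeats, so this sketch counts values directly with `collections.Counter`.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value appears exactly once, so there is no mode
        return []
    return [value for value, count in counts.items() if count == highest]

print(modes([5, 5, 7, 8, 9, 1, 2, 5, 8]))
print(modes([5, 9, 4, 6]))
```

Returning a list rather than a single value is a design choice: it keeps the function usable when two or more values share the highest count.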

The next important measure is the **standard deviation**, which describes the spread of the data around the mean. A larger standard deviation means that the values lie farther from the mean on average, that is, the data is more variable.

The formula to calculate standard deviation is given below:

SD = sqrt( ( (x1 - Mean)^2 + (x2 - Mean)^2 + .... + (xN - Mean)^2 ) / N )

Here, we sum the squared differences of each value from the mean, divide the sum by the total number of entries N, and take the square root.

Data: 5, 6, 3, 2, 9, 10

Mean = (5 + 6 + 3 + 2 + 9 + 10) / 6 = 5.83

SD = sqrt( ( (5 - 5.83)^2 + (6 - 5.83)^2 + .... + (10 - 5.83)^2 ) / 6 ) = 2.9107
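The calculation above can be reproduced step by step in Python. This is a minimal sketch: it computes the standard deviation by hand, dividing by the number of entries as described, and compares the result with `statistics.pstdev` (the population standard deviation) from the standard library. Variable names are illustrative.

```python
from math import sqrt
from statistics import pstdev

data = [5, 6, 3, 2, 9, 10]

# Step 1: the mean
mean_value = sum(data) / len(data)

# Step 2: squared difference of each value from the mean
squared_diffs = [(x - mean_value) ** 2 for x in data]

# Step 3: divide by the number of entries and take the square root
sd_manual = sqrt(sum(squared_diffs) / len(data))

# The library's population standard deviation should agree
sd_library = pstdev(data)

print(round(sd_manual, 4), round(sd_library, 4))
```

Note that `statistics.stdev` divides by one less than the number of entries (the sample standard deviation), so it gives a slightly larger value than the formula used here.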

The **range** is the difference between the highest and lowest values in the data. It gives us a quick, rough idea of how widely the data is spread.
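As a quick illustration, the range takes one line in Python. The sketch below reuses the six values from the standard deviation example; the name `value_range` is chosen only to avoid shadowing Python's built-in `range`.

```python
data = [5, 6, 3, 2, 9, 10]

# Highest value minus lowest value
value_range = max(data) - min(data)

print("Range:", value_range)
```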

A **percentile** is a value or a score below which a given percentage of the data falls. For example, if you have 10 mangoes and the second heaviest mango weighs 150 g, then 8 of the 10 mangoes (80%) weigh less than it. So, 150 g is the 80th percentile weight.

To get this “80,” we use the following equation:

`((10 - 2) / 10) * 100 = 80`
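The same idea can be written as a small function that counts how many entries fall below a given score. This is only a sketch: the helper name `percentile_rank` and the mango weights are hypothetical, chosen so that the second heaviest mango weighs 150 g as in the example. Statistical libraries often use more refined definitions that interpolate between values.

```python
def percentile_rank(values, score):
    """Percentage of entries in `values` that are strictly below `score`."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

# Hypothetical weights in grams; the second heaviest is 150
mango_weights_g = [90, 100, 105, 110, 120, 125, 130, 140, 150, 160]

rank = percentile_rank(mango_weights_g, 150)
print("Percentile of the 150 g mango:", rank)
```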

**Regression** describes the relationship between a dependent variable and an independent variable. It explains how changes in one variable affect the other.

For example, suppose we want to know how rainfall affects umbrella sales. Here, rainfall is the *independent variable*, and umbrellas sold are the *dependent variable*.

Regression can help us find out whether the variables have a strong or weak relationship. It can also help us to forecast values in the future.
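To make the rainfall example concrete, here is a minimal sketch that assumes the simplest form of regression: a straight line `y = intercept + slope * x` fitted by least squares. The function name `fit_line` and the rainfall and sales figures are hypothetical and serve only as an illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: rainfall in mm and umbrellas sold that day
rainfall_mm = [0, 10, 20, 30, 40]
umbrellas_sold = [2, 8, 11, 17, 22]

slope, intercept = fit_line(rainfall_mm, umbrellas_sold)

# Forecast sales for a day with 50 mm of rain
forecast = intercept + slope * 50
print(slope, intercept, forecast)
```

A positive slope indicates that umbrella sales tend to rise with rainfall, and the fitted line can be used to forecast sales for rainfall values that were not observed.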
