Finding Outliers in Data
This lesson explains what outliers in data are and how to detect them.
What is an outlier?
An outlier is a data point that lies far from the bulk of the values in a dataset. Suppose a list has these elements: [32, 30, 39, 35, 31, 4, 37]. Here 4 is clearly the outlier, because the other six elements cluster around their mean of 34. More generally, any data point that behaves very differently from the rest of the set is known as an outlier.
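To make the example concrete, here is a minimal sketch in Python that measures how far each element sits from the center of the other values. The variable names (`suspect`, `rest`, `center`, `distances`) are illustrative choices, and setting the suspected value aside before averaging is only an assumption made for this demonstration, not a general detection method.

```python
data = [32, 30, 39, 35, 31, 4, 37]
suspect = 4  # the value we believe is the outlier (assumption for this demo)

# The remaining values, with the suspected outlier set aside.
rest = [x for x in data if x != suspect]

# Mean of the remaining values: 204 / 6 = 34.0
center = sum(rest) / len(rest)

# Absolute distance of every element from that center.
distances = {x: abs(x - center) for x in data}

for value, distance in distances.items():
    print(f"{value:>2} is {distance:.0f} away from the center")
```

Every value except 4 lies within 5 units of the center, while 4 is 30 units away, which is why it stands out.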
Why do outliers exist?
Outliers usually appear in a dataset for one of two reasons:
- Variance in data: Data can contain genuine anomalies whose values differ markedly from the rest of the distribution.
- Entry error: These outliers come mainly from human error while preparing the dataset or entering values.
Identifying outliers
There are two main methods used to identify outliers in any dataset:
- Visualization plots: The outliers are clearly visible if we plot the data in a scatter, box, or histogram plot, as they are away from the center of the data. More about this will be