How to identify outliers in a dataset using SciPy in Python

Outliers are data points that differ markedly from the rest of the dataset and can significantly influence the analysis. In this answer, we'll show how to identify outliers in a dataset using SciPy. We'll use two methods for this purpose:

Z-score
IQR (Interquartile range)

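Using Z-score method

This method measures how many standard deviations each data point lies from the mean and considers the points whose absolute Z-score exceeds a chosen threshold as outliers.

Coding example

Here is the coding example to identify outliers in the dataset using the Z-score method: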
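A minimal sketch consistent with the line-by-line explanation that follows. It assumes that the Z-scores come from scipy.stats.zscore and that the threshold is 3; both are assumptions, as are the variable names:

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # fix the seed so the run is reproducible
data = np.random.normal(0, 1, 1000)  # 1000 draws with mean 0 and std 1
data[900:] += 5  # shift the last 100 values to introduce outliers

z_scores = stats.zscore(data)  # standard deviations from the mean (assumed call)

threshold = 3  # assumed cutoff, a common convention
outliers = data[np.abs(z_scores) > threshold]  # keep values beyond the cutoff

print("Identified outliers:", outliers)
```

Because the shifted values also raise the mean and standard deviation of the whole array, a Z-score cutoff may flag only part of the shifted block.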

Explanation

Lines 1–2: We import the numpy and scipy libraries.
Line 4: We set the seed value to 42 with NumPy's random.seed() function so that the generated random numbers are reproducible.
Line 5: We generate an array of 1000 normally distributed random numbers with mean 0 and standard deviation 1.
Line 6: We add 5 to the values of data from index 900 onwards to introduce outliers.
Line 8: We compute the Z-score of each data point, that is, the number of standard deviations by which it deviates from the mean.
Line 10: We set the threshold value for identifying potential outliers.
Line 11: We identify the values whose absolute Z-score exceeds the threshold.
Line 13: We print the identified outliers.
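
To make the Z-score in the explanation above concrete, the following sketch computes it by hand and compares the result with scipy.stats.zscore. The data setup mirrors the description above; the names manual_z and scipy_z are illustrative assumptions:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5

# Z-score by hand: subtract the mean, then divide by the population standard deviation
manual_z = (data - data.mean()) / data.std()

# scipy.stats.zscore uses ddof=0 by default, which matches NumPy's default std()
scipy_z = stats.zscore(data)

print("Results agree:", np.allclose(manual_z, scipy_z))
```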

Using IQR method

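This method computes the interquartile range (IQR), the difference between the third and first quartiles, and derives a lower and an upper limit from it, commonly 1.5 times the IQR below the first quartile and above the third quartile. Data points that fall outside these limits are considered outliers.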

Coding example

Here is the coding example to identify outliers in the dataset using the IQR method:
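
A minimal sketch consistent with the line-by-line explanation that follows. The 1.5 multiplier, the use of np.percentile for the quartiles, and the variable names are assumptions; scipy.stats.iqr is the SciPy function the explanation refers to:

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # fix the seed so the run is reproducible
data = np.random.normal(0, 1, 1000)  # 1000 draws with mean 0 and std 1
data[900:] += 5  # shift the last 100 values to introduce outliers

q1 = np.percentile(data, 25)  # first quartile
q3 = np.percentile(data, 75)  # third quartile
iqr = q3 - q1  # interquartile range computed by hand

lower_limit = q1 - 1.5 * iqr  # 1.5 is the conventional multiplier (assumed)
upper_limit = q3 + 1.5 * iqr

iqr_scipy = stats.iqr(data)  # the same quantity via SciPy

outliers = data[(data < lower_limit) | (data > upper_limit)]
print("Identified outliers:", outliers)
```

With default settings, stats.iqr(data) and q3 - q1 should agree, so either value can be used to build the limits.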

Explanation

Line 8: We calculate the first quartile (25th percentile) of the data.
Line 9: We calculate the third quartile (75th percentile) of the data.
Line 10: We compute the IQR as the difference between the third and first quartiles.
Lines 12 and 13: We define the lower and upper limits used to identify potential outliers with the IQR method.
Line 15: We also compute the IQR using the SciPy library's iqr function.
Line 17: We determine the outliers using a logical condition that checks whether each data point lies outside the calculated limits.
Line 18: We print the identified outliers.

Visualization of identified outliers

A box plot is commonly used to visualize the presence of outliers in a dataset. Here is the visualization of the identified outliers using the IQR method:
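
A minimal sketch of such a box plot, assuming Matplotlib is installed (it is not among the libraries named earlier) and using the same synthetic data as described above. The output filename and the figure labels are illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5  # shift the last 100 values to introduce outliers

fig, ax = plt.subplots(figsize=(4, 6))

# whis=1.5 (Matplotlib's default) extends each whisker to the most extreme
# point within 1.5 * IQR of the box; points beyond are drawn individually
box = ax.boxplot(data, whis=1.5)
ax.set_title("Box plot of the data (IQR rule)")
ax.set_ylabel("Value")

fig.savefig("outliers_boxplot.png")  # hypothetical output filename

# The individually drawn points ("fliers") are the plotted outliers
flier_values = box["fliers"][0].get_ydata()
print("Number of points drawn as outliers:", len(flier_values))
```

The points drawn beyond the whiskers are the ones the 1.5 times IQR rule treats as outliers, which is why the box plot pairs naturally with the IQR method.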
