How to identify outliers in a dataset using SciPy in Python
Outliers are data points that differ markedly from the rest of the dataset and can significantly influence the analysis. In this answer, we'll show how to identify outliers in a dataset using SciPy. We'll use two methods for this purpose:
Z-score
IQR (Interquartile range)
First, we import the required libraries to identify outliers in the dataset:
import numpy as np
from scipy import stats
Using Z-score method
This approach quantifies how far a data point deviates from the mean in terms of standard deviations. We set a threshold (usually 2 or 3) and treat data points whose absolute z-score exceeds it as potential outliers.
Coding example
Here is the coding example to identify outliers in the dataset using the z-score method:
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5

z_scores = np.abs(stats.zscore(data))

threshold = 3
outliers_z = np.where(z_scores > threshold)[0]

print("Outliers detected using z-score:", outliers_z)
Explanation
Lines 1–2: We import the numpy and scipy libraries.
Line 4: We call np.random.seed() with the seed value 42 so that the generated random numbers are reproducible.
Line 5: We generate an array of 1000 random numbers drawn from a normal distribution with mean 0 and standard deviation 1.
Line 6: We add 5 to the values of data from index 900 onwards, introducing outliers.
Line 8: We compute the absolute z-score of each data point, that is, the number of standard deviations it lies from the mean.
Line 10: We set the threshold value for identifying potential outliers.
Line 11: We find the indices of the values whose z-score exceeds the threshold.
Line 13: We print the indices of the identified outliers.
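To make the z-score step less of a black box, here is a minimal sketch that computes z-scores by hand and compares them with stats.zscore. The sample array, the seed 0, and the names sample, manual_z, and scipy_z are assumptions made only for this illustration. It relies on stats.zscore using the population standard deviation (ddof=0) by default, which matches NumPy's default for std().

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # hypothetical seed, used only in this sketch
sample = rng.normal(0, 1, 200)
sample[:5] += 8                  # shift five points far away from the rest

# z = (x - mean) / std, with the population standard deviation (ddof=0)
manual_z = (sample - sample.mean()) / sample.std()
scipy_z = stats.zscore(sample)   # SciPy's default is also ddof=0

# Indices whose absolute z-score exceeds the threshold of 3
flagged = np.where(np.abs(manual_z) > 3)[0]

print("Same z-scores:", np.allclose(manual_z, scipy_z))
print("Indices flagged:", flagged)
```

If a sample standard deviation is preferred, both sides would need ddof=1: sample.std(ddof=1) and stats.zscore(sample, ddof=1).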
Using IQR method
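This method computes the interquartile range (IQR), which is the distance between the first and third quartiles, and sets limits at 1.5 times the IQR below the first quartile and above the third quartile. Data points that fall outside these limits are considered outliers.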
Coding example
Here is the coding example to identify outliers in the dataset using the IQR method:
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5

first_quar = np.percentile(data, 25)
third_quar = np.percentile(data, 75)
IQR = third_quar - first_quar

lower_limit = first_quar - 1.5 * IQR
upper_limit = third_quar + 1.5 * IQR

outliers_iqr = stats.iqr(data, nan_policy='omit', axis=0, rng=(25, 75)) * 1.5

outliers = np.where((data < lower_limit) | (data > upper_limit))[0]
print("Outliers detected using IQR:", outliers)
Explanation
Line 8: We calculate the first quartile (25th percentile) of the data.
Line 9: We calculate the third quartile (75th percentile) of the data.
Line 10: We compute the range between the first and third quartiles.
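Lines 12 and 13: We define the lower and upper limits used to identify potential outliers with the IQR method.
Line 15: We compute the IQR with the SciPy library's iqr function and multiply it by 1.5. This value equals the 1.5 * IQR term used in lines 12 and 13. It is stored in outliers_iqr for comparison and isn't used in the later lines.
Line 17: We determine the outliers using a logical condition that checks whether each data point lies outside the calculated limits.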
Line 18: We print the identified outliers.
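The steps above can also be wrapped in a small reusable function. The following is a sketch under stated assumptions, not a SciPy API: the function name iqr_outlier_indices, its parameter k, and the array small are hypothetical names introduced for illustration. It uses stats.iqr for the spread and np.percentile for the quartiles, both with their default settings.

```python
import numpy as np
from scipy import stats

def iqr_outlier_indices(values, k=1.5):
    """Return indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    spread = stats.iqr(values)   # equals q3 - q1 with default settings
    lower, upper = q1 - k * spread, q3 + k * spread
    return np.where((values < lower) | (values > upper))[0]

# Hypothetical toy data: seven similar values and one extreme value
small = np.array([10, 12, 11, 13, 12, 11, 10, 95])
print(iqr_outlier_indices(small))   # prints [7]
```

For this toy array, the quartiles are 10.75 and 12.25, so the limits are 8.5 and 14.5, and only the value 95 at index 7 falls outside them.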
Visualization of identified outliers
A box plot is commonly used to visualize the presence of outliers in a dataset. Here is a visualization of the outliers identified using the IQR method:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.boxplot(data)
plt.title("Box Plot of Data with Outliers (IQR Method)")
plt.show()
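The box plot matches the IQR method because matplotlib's boxplot extends its whiskers to 1.5 times the IQR by default (the whis parameter) and draws everything beyond them as individual points, called fliers. The sketch below reads those points back from the plot and counts them against the IQR rule. The Agg backend and the names box, flier_values, and mask are assumptions of this example. Agg is chosen only so the code runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt

np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5

fig, ax = plt.subplots(figsize=(10, 6))
box = ax.boxplot(data)                        # default whis=1.5
flier_values = box["fliers"][0].get_ydata()   # points drawn beyond the whiskers

# The same rule written out by hand
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)

print("Points drawn as fliers:", len(flier_values))
print("Points flagged by the IQR rule:", int(mask.sum()))
plt.close(fig)
```

To view the figure interactively, drop the matplotlib.use("Agg") line and call plt.show() before closing the figure.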
Note: The Z-score and interquartile range (IQR) methods are two different approaches for identifying outliers in a dataset, and they can produce different results due to the differences in their underlying principles and calculations.
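To see how the two methods can disagree, the following sketch applies both rules to the same synthetic data used in this answer and compares the flagged indices as sets. The names z_idx and iqr_idx are introduced here for illustration. One likely reason for a gap on data like this is that the 100 shifted points inflate the mean and standard deviation that the z-score depends on, while the quartiles are less affected.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5

# Z-score rule with a threshold of 3
z_idx = set(np.where(np.abs(stats.zscore(data)) > 3)[0])

# IQR rule with the usual 1.5 multiplier
q1, q3 = np.percentile(data, [25, 75])
iqr = stats.iqr(data)
iqr_idx = set(np.where((data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr))[0])

print("Flagged by both:", len(z_idx & iqr_idx))
print("Only z-score:", len(z_idx - iqr_idx))
print("Only IQR:", len(iqr_idx - z_idx))
```

Changing the z-score threshold or the IQR multiplier changes how much the two sets overlap, so it can be worth trying a few values before settling on a rule.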