Robust covariance is used to detect anomalies in datasets that follow a Gaussian distribution. In this model, data points lying beyond the third standard deviation from the center are likely to be considered anomalies. The one-class SVM, on the other hand, learns a boundary that encloses the bulk of the data points in a transformed feature space. The aim is to separate the anomalies from the clusters of normal data points.
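As a quick, standalone illustration of this behavior (the toy data, variable names, and parameter values below are made up for this sketch and are not part of the lesson's dataset), both detectors can be fit on a small Gaussian blob with a few injected outliers; predict() returns +1 for inliers and -1 for anomalies:

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(95, 2)),  # Gaussian "normal" points
    rng.uniform(low=6.0, high=8.0, size=(5, 2)),   # a few obvious outliers
])

detectors = {
    "Robust covariance": EllipticEnvelope(contamination=0.05),
    "One-class SVM": OneClassSVM(nu=0.05, gamma="scale"),
}
for name, detector in detectors.items():
    labels = detector.fit(X).predict(X)  # +1 = inlier, -1 = anomaly
    print(name, "flagged", int((labels == -1).sum()), "points as anomalies")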
Let's apply three classifiers to a dataset and observe the results to analyze which classifier provides better precision. In this case, we will use the following classifiers:
Empirical covariance
Robust covariance
One-class SVM
We define an elliptical envelope that covers most of the data points; hence, we interpret the data points lying inside this boundary as normal and the data points lying far outside it as outliers. To do this, we apply the different classifier models to the dataset and analyze the decision boundary created by each one of them. The goal will be to see which classifier provides the most precise and accurate decision boundary.
Let's write the code step by step: it uses a pre-existing sample dataset, applies the different classifiers to it, and then creates a scatter plot that visualizes the results of each.
Before starting the code, let's understand the modules we must import and how they are used.
We import the following from the numpy, matplotlib, and sklearn libraries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM
numpy: To handle data arrays and perform numerical operations.
matplotlib.pyplot: To create and customize data visuals, including various types of plots.
sklearn.covariance: To access functionalities for robust covariance estimation. We import EllipticEnvelope to detect outliers.
sklearn.datasets: To access the pre-existing datasets, i.e., load_wine, which loads the Wine dataset.
sklearn.svm: To access the support vector machine, i.e., OneClassSVM, which performs one-class SVM outlier detection.
Import the wine dataset and select two of its columns to depict the relationship between them and to analyze the anomalies in the dataset corresponding to those columns.
We import the wine dataset and use its columns at indexes 1 and 2 to create a plot in 2D space and analyze the anomalies lying in the dataset depending on these two variables (the short sketch after this list shows how to look the column names up):
Column 1: malic_acid
Column 2: ash
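If you want to confirm which feature names these column indexes map to, a minimal sketch like the following works; it simply reads the feature_names attribute returned by load_wine():

from sklearn.datasets import load_wine

wine = load_wine()
# Columns 1 and 2 of wine.data correspond to these feature names
print(wine.feature_names[1], wine.feature_names[2])  # malic_acid ash
X1 = wine.data[:, [1, 2]]
print(X1.shape)  # (178, 2): 178 samples, 2 selected features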
In this code, we apply the three classifiers to these two columns of the wine dataset and plot the results to identify the anomalous data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

colors = ["m", "g", "r"]
legend1 = {}

# Get data
X1 = load_wine()["data"][:, [1, 2]]

# Learn a frontier for outlier detection with several classifiers
xx1, yy1 = np.meshgrid(np.linspace(0, 6, 500), np.linspace(1, 4.5, 500))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(1)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i])

legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())

# Plot the results
plt.figure(1)
plt.title("Outlier detection on the dataset")
plt.scatter(X1[:, 0], X1[:, 1], color="blue")

plt.xlim((xx1.min(), xx1.max()))
plt.ylim((yy1.min(), yy1.max()))

plt.legend(
    (
        legend1_values_list[0].collections[0],
        legend1_values_list[1].collections[0],
        legend1_values_list[2].collections[0],
    ),
    (legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
    loc="upper center",
)

plt.ylabel("ash")
plt.xlabel("malic_acid")

plt.show()
It displays a scatter plot showing the decision boundaries of all three classifiers, which helps to identify the inliers and the outliers. Notice that most of the data points are concentrated in one region; these are the inliers, while the points lying far from that region are potential outliers.
It can be interpreted from the results that:
Empirical covariance: The magenta decision boundary covers a significant portion of the data points, but because it estimates the covariance from every observation (support_fraction=1.0), it is influenced by the diverse and dissimilar patterns in the dataset.
Robust covariance: The green decision boundary covers the main cluster of data points, but it assumes that the data is Gaussian distributed, and the results are influenced by that assumption.
One-class SVM: The red decision boundary covers most of the data points because it does not assume any parametric form of the data distribution.
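If you also want explicit inlier/outlier labels rather than just the plotted boundaries, a short follow-up sketch like this one can be appended to the code above (it reuses the classifiers dictionary and X1 from that code; the printed counts depend on the fitted models):

# Count how many samples each classifier labels as outliers (-1) vs. inliers (+1)
for clf_name, clf in classifiers.items():
    labels = clf.fit_predict(X1)
    n_outliers = int((labels == -1).sum())
    print(f"{clf_name}: {n_outliers} outliers out of {len(X1)} samples")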
We import the wine dataset and use its columns at indexes 5 and 9 to create a plot in 2D space and analyze the anomalies lying in the dataset depending on these two variables:
Column 5: total_phenols
Column 9: color_intensity
In this code, we apply the three classifiers to these two columns of the wine dataset and plot the results to identify the anomalous data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

colors = ["m", "g", "r"]
legend2 = {}

# Get data
X2 = load_wine()["data"][:, [5, 9]]

# Learn a frontier for outlier detection with several classifiers
xx2, yy2 = np.meshgrid(np.linspace(-1, 5.5, 500), np.linspace(-2.5, 16, 500))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(2)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    legend2[clf_name] = plt.contour(xx2, yy2, Z2, levels=[0], linewidths=2, colors=colors[i])

legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())

# Plot the results
plt.figure(2)
plt.title("Outlier detection on a dataset")
plt.scatter(X2[:, 0], X2[:, 1], color="blue")

plt.xlim((xx2.min(), xx2.max()))
plt.ylim((yy2.min(), yy2.max()))

plt.legend(
    (
        legend2_values_list[0].collections[0],
        legend2_values_list[1].collections[0],
        legend2_values_list[2].collections[0],
    ),
    (legend2_keys_list[0], legend2_keys_list[1], legend2_keys_list[2]),
    loc="upper center",
)

plt.ylabel("color_intensity")
plt.xlabel("total_phenols")

plt.show()
It displays a scatter plot showing the decision boundaries of all three classifiers, which helps to identify the inliers and the outliers. Notice that most of the data points are concentrated in one region; these are the inliers, while the points lying far from that region are potential outliers.
It can be interpreted from the results that:
Empirical covariance: The magenta decision boundary covers a significant portion of the data points, but because it estimates the covariance from every observation (support_fraction=1.0), it is influenced by the diverse and dissimilar patterns in the dataset.
Robust covariance: The green decision boundary covers the main cluster of data points, but it assumes that the data is Gaussian distributed, and the results are influenced by that assumption.
One-class SVM: The red decision boundary covers most of the data points because it does not assume any parametric form of the data distribution.
Both code examples follow the same structure, so the following explanation covers both of them.
Note: The variable names used in the explanation may differ from the code variables because the suffixes 1 and 2 are appended to the variables in examples 1 and 2, respectively.
Lines 1–5: Import the required libraries and methods.
Lines 7–10: Create a classifiers dictionary that contains the outlier detection models.
Empirical Covariance: Contains an instance of EllipticEnvelope with support_fraction set to 1.0 and contamination set to 0.25.
Robust Covariance: Contains an instance of EllipticEnvelope with contamination set to 0.25.
OCSVM: Contains an instance of OneClassSVM with nu set to 0.25 and gamma set to 0.35.
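As a side note on these values (a standalone sketch, not part of the lesson's code): contamination in EllipticEnvelope and nu in OneClassSVM both roughly control the fraction of training points that end up flagged as outliers, which can be checked on synthetic data:

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.normal(size=(400, 2))  # purely Gaussian toy data

for model in (EllipticEnvelope(contamination=0.25), OneClassSVM(nu=0.25, gamma=0.35)):
    labels = model.fit_predict(X)
    # Fraction of training points labelled -1 (outliers); expected to be roughly 0.25
    print(type(model).__name__, round(float((labels == -1).mean()), 2))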
Lines 13–14: Create an array for the decision boundary colors, i.e., magenta, green, and red, and an empty dictionary for the legend.
Line 17: Load the wine dataset and select the two columns to create a 2D matrix.
Line 20: Create a 2D grid with the ranges 0 to 6 and 1 to 4.5 and store the coordinates in the xx and yy variables.
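To make the shape of this grid concrete, here is a small standalone sketch of what these meshgrid arguments produce:

import numpy as np

xx, yy = np.meshgrid(np.linspace(0, 6, 500), np.linspace(1, 4.5, 500))
print(xx.shape, yy.shape)  # (500, 500) (500, 500)

# np.c_ stacks the flattened coordinates into one (x, y) pair per grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)  # (250000, 2)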
Lines 22–27: Create a loop that iterates through the classifiers dictionary and performs the following tasks for each classifier:
Activate figure 1 using figure() so that each classifier's decision boundary is drawn on the same figure.
Fit the model on the data X using fit().
Obtain the anomaly score for each data point in the grid using decision_function().
Reshape the Z array containing the anomaly scores according to the xx grid shape.
Create a contour plot using contour() to visualize the decision boundary, i.e., the contour along which the anomaly score equals 0.
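For intuition about what decision_function() returns (again a standalone sketch): positive scores mean a point lies inside the fitted frontier and negative scores mean it lies outside, which is why the boundary is drawn at levels=[0]:

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
clf = EllipticEnvelope(contamination=0.1).fit(X)

scores = clf.decision_function(np.array([[0.0, 0.0], [10.0, 10.0]]))
print(scores)  # positive for the central point (inlier), negative for the far point (outlier)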
Lines 29–30: Create two lists from legend1: one holding its values (the contour sets) and the other holding its keys (the classifier names).
Lines 33–35: Select the figure, set its title, and create a scatter plot of the data points.
Lines 37–38: Set the maximum and minimum limits of the x-axis and the y-axis, respectively.
Lines 40–47: Add the legend to the plot, using the contour collections as the legend handles and the classifier names as the labels. Use loc to define the position of the legend.
Lines 50–51: Label the y-axis and the x-axis. In this case, we label them according to the wine dataset columns used in each example.
Line 53: Use show() to display the created plot.
We can use different unsupervised techniques to detect anomalies on the same dataset and present them in a scatter plot to get a better understanding of the normal and anomalous data points. We can implement it using the following steps:
Generate or import a sample dataset.
Apply different classifiers to it and record their results.
Plot the figure for each classifier to visualize the results. The data points that lie outside the decision boundaries are considered anomalies, as shown in the sketch below.
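Here is a minimal, self-contained sketch of these three steps on synthetic data (the make_blobs parameters, detector settings, and plotting layout are illustrative choices, not taken from the lesson):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM

# Step 1: generate a sample dataset
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=1.0, random_state=0)

# Step 2: apply different classifiers and record their results
detectors = {
    "Robust Covariance": EllipticEnvelope(contamination=0.1),
    "OCSVM": OneClassSVM(nu=0.1, gamma="scale"),
}
results = {name: det.fit_predict(X) for name, det in detectors.items()}

# Step 3: plot the results for each classifier; points labelled -1 are anomalies
fig, axes = plt.subplots(1, len(results), figsize=(10, 4))
for ax, (name, labels) in zip(axes, results.items()):
    ax.scatter(X[:, 0], X[:, 1], c=np.where(labels == -1, "red", "blue"), s=15)
    ax.set_title(name)
plt.show()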
In this code, we applied empirical covariance, robust covariance, and one-class SVM to the wine dataset to compare their results. It can be observed that, overall, the one-class SVM produces comparatively more precise results because it does not assume any parametric form of the data distribution.