Outlier detection with covariance vs. OCSVM
Robust covariance is used to detect anomalies in datasets with a Gaussian distribution. In this model, data points lying beyond the third standard deviation from the center are likely to be considered anomalies. The one-class SVM, on the other hand, learns a frontier around the bulk of the data without assuming any particular distribution. In both cases, the aim is to separate the anomalies from the clusters of normal data points.
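To see the practical difference before diving into the full example, here is a minimal sketch on synthetic data. The dataset and hyperparameters here are illustrative assumptions, not part of the wine example below:

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

# Synthetic data: a Gaussian cluster plus a few scattered outliers
rng = np.random.RandomState(42)
inliers = rng.randn(100, 2)
outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([inliers, outliers])

for name, clf in {
    "Robust Covariance": EllipticEnvelope(contamination=0.1),
    "One-Class SVM": OneClassSVM(nu=0.1, gamma=0.35),
}.items():
    labels = clf.fit(X).predict(X)  # predict() returns +1 for inliers, -1 for outliers
    print(name, "flags", (labels == -1).sum(), "points as outliers")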
Let's apply three classifiers to a dataset and observe the produced results to analyze which classifier provides better precision. In this case, we will use the following classifiers:
Empirical covariance
Robust covariance
One class SVM
How does it work?
We define an elliptical envelope that covers most of the data points; hence we interpret the data points inside the envelope as normal and the data points lying far outside it as outliers. To do this, we apply the different models to the dataset and analyze the decision boundary created by each one. The goal is to see which model provides the most precise and accurate decision boundary.
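The boundary comes from the sign convention of decision_function() in scikit-learn's outlier detectors: scores are positive inside the fitted frontier and negative outside, so the contour at level 0 is the decision boundary. A tiny sketch, using arbitrary test points of our own choosing, illustrates this:

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = rng.randn(200, 2)  # standard Gaussian cluster around the origin
clf = EllipticEnvelope(contamination=0.05).fit(X)

print(clf.decision_function([[0.0, 0.0]]))  # near the center: positive score (inlier side)
print(clf.decision_function([[8.0, 8.0]]))  # far from the cluster: negative score (outlier side)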
How to implement this understanding?
Let's write the code step by step: it uses a pre-existing sample dataset, applies the different classifiers to it, and then creates a scatter plot that visualizes the results of each.
Before starting the code, let's understand the modules we must import and how they are used.
Required imports
We import the following from numpy, matplotlib and sklearn libraries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM
numpy: To handle data arrays and perform numerical operations.
matplotlib.pyplot: To create and customize data visuals, including various types of plots.
sklearn.covariance: To access functionalities for robust covariance estimation. We import EllipticEnvelope to detect outliers.
sklearn.datasets: To access the pre-existing datasets, i.e., load_wine, which loads the Wine dataset.
sklearn.svm: To access the support vector machines, i.e., OneClassSVM, which gives us the one-class SVM to detect outliers.
Implementation
Import the wine dataset and select two different columns from it to depict the relationship between the two and analyze the anomalies in the dataset corresponding to those columns.
Using columns 1 and 2
We import the wine dataset and use its columns at indices 1 and 2 to create a plot in 2D space and analyze the anomalies lying in the dataset depending on these two variables:
Column 1: malic_acid
Column 2: ash
A quick check of the feature names, shown below, confirms this mapping.
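This is an optional sanity check, not part of the main example; it assumes only that scikit-learn's load_wine is available:

from sklearn.datasets import load_wine

names = load_wine().feature_names
print(names[1], names[2])  # prints: malic_acid ash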
Example code
In this code, we apply the three classifiers to columns 1 and 2 of the wine dataset and plot the results to identify the anomalous data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

colors = ["m", "g", "r"]
legend1 = {}

# Get data
X1 = load_wine()["data"][:, [1, 2]]

# Learn a frontier for outlier detection with several classifiers
xx1, yy1 = np.meshgrid(np.linspace(0, 6, 500), np.linspace(1, 4.5, 500))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(1)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i])

legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())

# Plot the results
plt.figure(1)
plt.title("Outlier detection on the dataset")
plt.scatter(X1[:, 0], X1[:, 1], color="blue")

plt.xlim((xx1.min(), xx1.max()))
plt.ylim((yy1.min(), yy1.max()))

plt.legend(
    (
        legend1_values_list[0].collections[0],
        legend1_values_list[1].collections[0],
        legend1_values_list[2].collections[0],
    ),
    (legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
    loc="upper center",
)

plt.ylabel("ash")
plt.xlabel("malic_acid")

plt.show()
Code output
It displays a scatter plot showing the decision boundaries of all three classifiers, which helps to identify the inliers and the outliers. Notice that the data points are concentrated in one region, which marks the inliers.
It can be interpreted from the results that:
Empirical covariance: The magenta decision boundary covers most of the data points, but its shape is pulled by the outliers, because every point, including the anomalous ones, contributes to the covariance estimate.
Robust covariance: The green decision boundary covers the main cluster of data points, but it assumes the data is Gaussian distributed, and the results are influenced by that assumption.
One-class SVM: The red decision boundary follows the shape of the data closely, as it does not assume any parametric form of the data distribution.
A quick numeric comparison of the three frontiers is sketched after this list.
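These lines are an addition of ours, not part of the plot above; they can be appended to the example code, reusing classifiers and X1, which are already defined and fitted there:

# Count how many wine samples each fitted model labels as outliers (-1)
for clf_name, clf in classifiers.items():
    n_outliers = (clf.predict(X1) == -1).sum()
    print(f"{clf_name}: {n_outliers} of {len(X1)} points flagged as outliers")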
Using columns 5 and 9
We import the wine dataset and use its columns at indices 5 and 9 to create a plot in 2D space and analyze the anomalies lying in the dataset depending on these two variables:
Column 5: total_phenols
Column 9: color_intensity
Again, the feature names can be confirmed with the quick check below.
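As before, this is an optional sanity check on the column indices, assuming scikit-learn's load_wine:

from sklearn.datasets import load_wine

names = load_wine().feature_names
print(names[5], names[9])  # prints: total_phenols color_intensity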
Example code
In this code, we apply the three classifiers to columns 5 and 9 of the wine dataset and plot the results to identify the anomalous data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

colors = ["m", "g", "r"]
legend2 = {}

# Get data
X2 = load_wine()["data"][:, [5, 9]]

# Learn a frontier for outlier detection with several classifiers
xx2, yy2 = np.meshgrid(np.linspace(-1, 5.5, 500), np.linspace(-2.5, 16, 500))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(2)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    legend2[clf_name] = plt.contour(xx2, yy2, Z2, levels=[0], linewidths=2, colors=colors[i])

legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())

# Plot the results
plt.figure(2)
plt.title("Outlier detection on a dataset")
plt.scatter(X2[:, 0], X2[:, 1], color="blue")

plt.xlim((xx2.min(), xx2.max()))
plt.ylim((yy2.min(), yy2.max()))

plt.legend(
    (
        legend2_values_list[0].collections[0],
        legend2_values_list[1].collections[0],
        legend2_values_list[2].collections[0],
    ),
    (legend2_keys_list[0], legend2_keys_list[1], legend2_keys_list[2]),
    loc="upper center",
)

plt.ylabel("color_intensity")
plt.xlabel("total_phenols")

plt.show()
Code output
It displays a scatter plot showing the decision boundaries of all three classifiers, which helps to identify the inliers and the outliers. Notice that the data points are concentrated in one region, which marks the inliers.
It can be interpreted from the results that:
Empirical covariance: The magenta decision boundary covers most of the data points, but its shape is pulled by the outliers, because every point, including the anomalous ones, contributes to the covariance estimate.
Robust covariance: The green decision boundary covers the main cluster of data points, but it assumes the data is Gaussian distributed, and the results are influenced by that assumption.
One-class SVM: The red decision boundary follows the shape of the data closely, as it does not assume any parametric form of the data distribution.
A scaling variation for the one-class SVM on these columns is sketched after this list.
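Because the RBF kernel in OneClassSVM is sensitive to feature scale, and these two columns span quite different ranges, standardizing them first can produce a tighter frontier. This is an optional variation of our own, not part of the original example:

from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X2 = load_wine()["data"][:, [5, 9]]

# Standardize both columns before fitting the one-class SVM
ocsvm = make_pipeline(StandardScaler(), OneClassSVM(nu=0.25, gamma=0.35))
labels = ocsvm.fit(X2).predict(X2)  # +1 = inlier, -1 = outlier
print((labels == -1).sum(), "points flagged as outliers")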
Code explanation
Both codes follow the same structure, so here is a single explanation that covers both.
Note: The variable names used in the explanation may differ from those in the code, as 1 and 2 are appended to the variables in examples 1 and 2, respectively.
Lines 1–5: Import the required methods and libraries.
Lines 7–10: Create a classifiers dictionary that contains the outlier detection models.
Empirical Covariance: Contains an instance of EllipticEnvelope with support_fraction 1.0 and contamination 0.25.
Robust Covariance: Contains an instance of EllipticEnvelope with contamination 0.25.
OCSVM: Contains an instance of OneClassSVM with nu 0.25 and gamma 0.35.
Lines 13–14: Create an array for the decision boundary colors, i.e., magenta, green, and red, and an empty dictionary for the legend.
Line 17: Load the wine dataset and select the two columns to create a 2D matrix.
Line 20: Create a 2D grid with the ranges 0 to 6 and 1 to 4.5 and store the coordinates in the xx and yy variables.
Lines 22–27: Create a loop that iterates through the classifiers dictionary and performs the following tasks for each classifier:
Select the figure using figure() so that every classifier's decision boundary is drawn on the same plot.
Fit the model on the data X using fit().
Obtain the anomaly score for each data point in the grid using decision_function().
Reshape the Z array containing the anomaly scores according to the xx grid shape.
Create a contour plot using contour() to visualize the decision boundary where the anomaly score is zero.
Lines 29–30: Create two lists, one for the keys and the other for the corresponding values from legend1.
Lines 33–35: Select the figure, set its title, and draw the data points as a scatter plot.
Lines 37–38: Set the minimum and maximum limits of the x-axis and the y-axis, respectively.
Lines 40–47: Add the legend to the plot, using the first collection of each stored contour set as its handles. Use loc to define the position of the legend.
Lines 50–51: Label the y-axis and the x-axis according to the wine dataset columns used.
Line 53: Use show() to display the created plot. A small standalone illustration of the grid-scoring pattern from lines 22–27 follows.
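To make the grid-scoring pattern concrete, here is a miniature with made-up numbers and a stand-in scoring function in place of decision_function():

import numpy as np

# Build a tiny 3x2 grid of (x, y) coordinates
xx, yy = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 2))
grid = np.c_[xx.ravel(), yy.ravel()]  # shape (6, 2): one row per grid point

scores = grid.sum(axis=1)  # stand-in for decision_function(grid)
print(scores.reshape(xx.shape).shape)  # (2, 3): scores laid back out on the grid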
Summary
We can use different unsupervised techniques to detect anomalies in the same dataset and present them in a scatter plot to better understand the normal and anomalous data points. We can implement this using the following steps:
Generate or import a sample dataset.
Apply different classifiers to it and record their results.
Plot the figure for each classifier to visualize the results. The data points lying outside the decision boundaries are considered anomalies.
In these examples, we applied empirical covariance, robust covariance, and one-class SVM to the wine dataset to compare the results. It can be observed that, overall, the one-class SVM produces comparatively more precise results because it does not assume any parametric form of the data distribution.