Outlier detection with covariance vs. OCSVM

Robust covariance is used to detect anomalies in datasets that follow a Gaussian distribution. In this model, data points lying beyond the third standard deviation are likely to be considered anomalies. The one-class SVM, on the other hand, learns a decision boundary that encloses the bulk of the data points; its aim is to separate the anomalies from the clusters of normal data points.
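
To make the third-standard-deviation idea concrete, here is a minimal sketch (separate from the lesson's main example) that flags points of a synthetic one-dimensional Gaussian sample as anomalies when they lie more than three standard deviations from the mean:

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1000)  # synthetic Gaussian sample
z_scores = (data - data.mean()) / data.std()  # standardize the sample
anomalies = data[np.abs(z_scores) > 3]  # points beyond the third standard deviation
print(len(anomalies), "of", len(data), "points flagged as anomalies")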

Let's apply three classifiers to a dataset and observe the results to analyze which classifier provides better precision. In this case, we will use the following classifiers:

  • Empirical covariance

  • Robust covariance

  • One-class SVM

How does it work?

We define an elliptical envelope that covers most of the data points; hence, we interpret the data points inside the envelope as normal and the data points lying far away from it as outliers. To do this, we apply different classifier models to the dataset and analyze the decision boundary created by each one of them. The goal is to see which classifier provides a precise and accurate decision boundary.

Decision boundaries for each classifier.
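
Before building the full visualization, a minimal sketch of the fit/predict workflow may help. In both estimators, predict() returns +1 for inliers and -1 for outliers; the synthetic data and the contamination and nu values below are illustrative assumptions, not the lesson's dataset:

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # mostly inliers around the origin
X[:5] += 6  # shift a few points far away to act as outliers

for model in (EllipticEnvelope(contamination=0.05), OneClassSVM(nu=0.05)):
    labels = model.fit(X).predict(X)  # +1 = inlier, -1 = outlier
    print(type(model).__name__, "flagged", (labels == -1).sum(), "outliers")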

How do we implement this?

Let's write the code step by step: we use a pre-existing sample dataset, apply different classifiers to it, and then create a scatter plot that visualizes the results of each.

Before starting the code, let's understand the modules we must import and how they are used.

Required imports

We import the following from the numpy, matplotlib, and sklearn libraries.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM
  • numpy: To handle data arrays and perform numerical operations.

  • matplotlib.pyplot: To create and customize data visuals, including various types of plots.

  • sklearn.covariance: To access functionalities for robust covariance estimation. We import EllipticEnvelope to detect outliers.

  • sklearn.datasets: To access pre-existing datasets. We import load_wine, which loads the Wine dataset.

  • sklearn.svm: To access support vector machine models. We import OneClassSVM to detect outliers with a one-class SVM.

Implementation

Import the wine dataset and select two different columns from it to depict the relationship between them and to analyze the anomalies in the dataset corresponding to those columns.

Using columns 1 and 2

We import the wine dataset and use the columns at indices 1 and 2 to create a plot in 2D space and analyze the anomalies lying in the dataset depending on these two variables:

  • Column 1: malic_acid

  • Column 2: ash
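
Note that the column indices in the code are zero-based. If in doubt, a quick check like the following (separate from the main example) prints the dataset's feature names for the indices used in both examples:

from sklearn.datasets import load_wine

features = load_wine().feature_names
print(features[1], features[2])  # columns used in example 1: malic_acid, ash
print(features[5], features[9])  # columns used in example 2: total_phenols, color_intensity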

Example code

In this code, we apply three classifiers to columns 1 and 2 of the wine dataset and plot the results to identify the anomalous data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

colors = ["m", "g", "r"]
legend1 = {}

# Get data
X1 = load_wine()["data"][:, [1, 2]]  # malic_acid (x-axis) and ash (y-axis)

# Learn a frontier for outlier detection with several classifiers
xx1, yy1 = np.meshgrid(np.linspace(0, 6, 500), np.linspace(1, 4.5, 500))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(1)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i])

legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())

# Plot the results
plt.figure(1) 
plt.title("Outlier detection on the dataset")
plt.scatter(X1[:, 0], X1[:, 1], color="blue")

plt.xlim((xx1.min(), xx1.max()))
plt.ylim((yy1.min(), yy1.max()))

plt.legend(
    (
        legend1_values_list[0].legend_elements()[0][0],  # proxy artist for each contour
        legend1_values_list[1].legend_elements()[0][0],
        legend1_values_list[2].legend_elements()[0][0],
    ),
    (legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
    loc="upper center",
)

plt.ylabel("alcohol")
plt.xlabel("malic_acid")

plt.show()

Code output

The code displays a scatter plot showing the decision boundaries of all three classifiers, which helps identify the inliers and the outliers. Notice that the data points concentrated in one region are the inliers.

The scatter plot for columns 1 and 2.

It can be interpreted from the results that:

  • Empirical covariance: The magenta decision boundary covers a significant portion of the data points, but because the covariance is estimated from every point (support_fraction=1.0), it is skewed by the diverse and dissimilar patterns in the dataset.

  • Robust covariance: The green decision boundary covers the main cluster of data points, but it assumes that the data is Gaussian distributed, and the results are influenced by that assumption.

  • One-class SVM: The red decision boundary covers most of the data points, as it does not assume any parametric form of the data distribution.
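
To go beyond visual inspection, one option is to count how many points each model labels as outliers; with contamination and nu both set to 0.25, each model should flag roughly a quarter of the points. A minimal sketch, reusing the same estimators and columns:

from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

X1 = load_wine()["data"][:, [1, 2]]  # malic_acid and ash

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

for name, clf in classifiers.items():
    labels = clf.fit(X1).predict(X1)  # +1 = inlier, -1 = outlier
    print(name, ":", (labels == -1).sum(), "of", len(X1), "points flagged")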

Using columns 5 and 9

We import the wine dataset and use the columns at indices 5 and 9 to create a plot in 2D space and analyze the anomalies lying in the dataset depending on the two variables:

  • Column 5: total_phenols

  • Column 9: color_intensity

Example code

In this code, we apply three classifiers to columns 5 and 9 of the wine dataset and plot the results to identify the anomalous data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

colors = ["m", "g", "r"]
legend2 = {}

X2 = load_wine()["data"][:, [5, 9]]  # total_phenols (x-axis) and color_intensity (y-axis)

# Learn a frontier for outlier detection with several classifiers
xx2, yy2 = np.meshgrid(np.linspace(-1, 5.5, 500), np.linspace(-2.5, 16, 500))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(2)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    legend2[clf_name] = plt.contour(
        xx2, yy2, Z2, levels=[0], linewidths=2, colors=colors[i]
    )

legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())

# Plot the results
plt.figure(2) 
plt.title("Outlier detection on a dataset")
plt.scatter(X2[:, 0], X2[:, 1], color="blue")

plt.xlim((xx2.min(), xx2.max()))
plt.ylim((yy2.min(), yy2.max()))

plt.legend(
    (
        legend2_values_list[0].legend_elements()[0][0],  # proxy artist for each contour
        legend2_values_list[1].legend_elements()[0][0],
        legend2_values_list[2].legend_elements()[0][0],
    ),
    (legend2_keys_list[0], legend2_keys_list[1], legend2_keys_list[2]),
    loc="upper center",
)

plt.ylabel("magnesium")
plt.xlabel("proanthocyanins")

plt.show()

Code output

The code displays a scatter plot showing the decision boundaries of all three classifiers, which helps identify the inliers and the outliers. Notice that the data points concentrated in one region are the inliers.

The scatter plot for columns 5 and 9.

It can be interpreted from the results that:

  • Empirical covariance: The magenta decision boundary covers a significant portion of the data points, but because the covariance is estimated from every point (support_fraction=1.0), it is skewed by the diverse and dissimilar patterns in the dataset.

  • Robust covariance: The green decision boundary covers the main cluster of data points, but it assumes that the data is Gaussian distributed, and the results are influenced by that assumption.

  • One-class SVM: The red decision boundary covers most of the data points, as it does not assume any parametric form of the data distribution.

Code explanation

Both code examples follow the same structure, so here is a single explanation that covers both.

Note: The variable names used in the explanation may differ from the code variables, as 1 and 2 are appended to the variable names in examples 1 and 2, respectively.

  • Lines 1–5: Import the required method and libraries.

  • Lines 7–10: Create a classifiers dictionary that contains the outlier detection models.

    • Empirical Covariance: Contains an instance of EllipticEnvelope with support_fraction=1.0 and contamination=0.25.

    • Robust Covariance: Contains an instance of EllipticEnvelope with contamination=0.25.

    • OCSVM: Contains an instance of OneClassSVM with nu=0.25 and gamma=0.35.

  • Lines 13–14: Create a list of decision boundary colors, i.e., magenta, green, and red, and an empty dictionary for the legend entries.

  • Line 17: Load the wine dataset and select the two columns (by zero-based index) to create a 2D matrix.

  • Line 20: Create a 2D grid and store the coordinates in the xx and yy variables; example 1 uses the ranges 0 to 6 and 1 to 4.5, while example 2 uses -1 to 5.5 and -2.5 to 16. The grid mechanics are isolated in a short sketch after this list.

  • Lines 22–27: Create a loop that iterates through the classifiers dictionary and performs the following tasks for each classifier:

    • Select figure 1 using figure(1) so that every classifier's decision boundary is drawn on the same figure.

    • Fit the model on the data X using fit().

    • Obtain the anomaly score for each data point in the grid using decision_function().

    • Reshape the Z array containing the anomaly scores according to the xx grid shape.

    • Create a contour plot using contour() to visualize the decision boundary, i.e., where the anomaly score is zero.

  • Lines 29–30: Create two lists: one for the values and the other for the keys of the legend dictionary.

  • Lines 33–35: Select the figure again, set its title, and draw a scatter plot of the data points.

  • Lines 37–38: Set the limits of the x-axis and the y-axis to the grid ranges, respectively.

  • Lines 40–48: Add the legend to the plot, using a proxy artist for each contour set obtained via legend_elements(). Use loc to define the position of the legend.

  • Lines 50–51: Label the x-axis and the y-axis according to the wine dataset columns used.

  • Line 53: Use show() to display the created plot.
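
The meshgrid/ravel/reshape pattern used in lines 20, 25, and 26 is worth isolating. Here is a minimal sketch on a tiny grid, with a stand-in scoring function in place of a fitted classifier, showing how the grid coordinates are flattened into (x, y) pairs for scoring and then folded back into the grid shape for contour():

import numpy as np

xx, yy = np.meshgrid(np.linspace(0, 2, 3), np.linspace(0, 1, 2))  # a 2x3 grid
points = np.c_[xx.ravel(), yy.ravel()]  # flatten into a (6, 2) array of (x, y) pairs
scores = points.sum(axis=1)  # stand-in for clf.decision_function(points)
print(scores.reshape(xx.shape))  # back to the 2x3 grid shape, ready for contour()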

Summary

We can use different unsupervised techniques to detect anomalies in the same dataset and present them in a scatter plot to get a better understanding of the normal and anomalous data points. We can implement this using the following steps:

  • Generate or import a sample dataset.

  • Apply different classifiers to it and record their results.

  • Plot the figure for each classifier to visualize the results. The data points lying outside the decision boundaries are considered anomalies.

In this code, we applied empirical covariance, robust covariance, and one-class SVM to the wine dataset to compare the results. It can be observed that, overall, the one-class SVM produces comparatively more precise results because it does not assume any parametric form of the data distribution.
