**Robust covariance** is used to detect anomalies in datasets with a Gaussian distribution. In this model, data points lying more than three standard deviations from the mean are likely to be considered anomalies. On the other hand, the one-class SVM learns a frontier that encloses the normal data points; anything falling outside this frontier is treated as an anomaly. The aim is to separate the anomalies from the clusters of data points.
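The three-standard-deviation rule can be sketched with plain NumPy before bringing in any estimator. This is a minimal illustration on synthetic data; the variable names and the planted outlier value are our own:

```python
import numpy as np

# Draw 1,000 samples from a standard normal distribution and plant one
# obvious anomaly far outside the bulk of the data.
rng = np.random.default_rng(42)
data = np.append(rng.normal(loc=0.0, scale=1.0, size=1000), 8.0)

# Flag every point lying more than three standard deviations from the mean.
mean, std = data.mean(), data.std()
outliers = data[np.abs(data - mean) > 3 * std]
print(outliers)  # the planted 8.0 is among the flagged points
```

Robust covariance generalizes this idea to multiple dimensions by estimating a center and covariance that are resistant to the outliers themselves.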

Let's apply three classifiers to a dataset and observe the produced results to analyze which classifier provides better precision. In this case, we will use the following classifiers:

Empirical covariance

Robust covariance

One class SVM

We define an elliptical boundary that covers most of the data points; hence, we interpret the data points inside it as normal and the data points lying far outside it as outliers. To do this, we apply different classifier models to the dataset and analyze the decision boundaries created by each of them. The goal is to see which classifier provides the most precise and accurate decision boundary.

Let's write a code step-by-step that uses the pre-existing sample dataset, applies different classifiers to it, and then creates a scatter plot that visualizes the results of each.

Before starting the code, let's understand the modules we must import and how they are used.

We import the following from the `numpy`, `matplotlib`, and `sklearn` libraries:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM
```

- `numpy`: To handle data arrays and perform numerical operations.
- `matplotlib.pyplot`: To create and customize data visuals, including various types of plots.
- `sklearn.covariance`: To access functionalities for robust covariance estimation. We import `EllipticEnvelope` to detect outliers.
- `sklearn.datasets`: To access the pre-existing datasets, i.e., `load_wine`, which loads the Wine dataset.
- `sklearn.svm`: To access the support vector machines, i.e., `OneClassSVM`, which provides a one-class SVM to detect outliers.

Import the wine dataset and select two different columns from it to depict the relationship between the two and analyze the anomalies in the dataset corresponding to those columns.

We import the wine dataset and use the columns at indices 1 and 2 to create a plot in 2D space and analyze the anomalies lying in the dataset depending on these two variables:

Column 1: malic_acid

Column 2: ash

In this code, we apply three classifiers to columns 1 and 2 of the wine dataset and plot the results to identify the anomalous data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

colors = ["m", "g", "r"]
legend1 = {}

# Get data
X1 = load_wine()["data"][:, [1, 2]]

# Learn a frontier for outlier detection with several classifiers
xx1, yy1 = np.meshgrid(np.linspace(0, 6, 500), np.linspace(1, 4.5, 500))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(1)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(
        xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i]
    )

legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())

# Plot the results
plt.figure(1)
plt.title("Outlier detection on the dataset")
plt.scatter(X1[:, 0], X1[:, 1], color="blue")
plt.xlim((xx1.min(), xx1.max()))
plt.ylim((yy1.min(), yy1.max()))
plt.legend(
    (
        legend1_values_list[0].collections[0],
        legend1_values_list[1].collections[0],
        legend1_values_list[2].collections[0],
    ),
    (legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
    loc="upper center",
)
plt.ylabel("ash")
plt.xlabel("malic_acid")
plt.show()
```
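Besides drawing the frontiers, each fitted model can label the training points directly with `predict()`, which returns `+1` for inliers and `-1` for outliers. The sketch below reuses the same data and classifier settings; the counts are approximate and governed by the `contamination` and `nu` parameters:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

# Same two columns and the same three models as in the plotting code.
X = load_wine()["data"][:, [1, 2]]
classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

# predict() labels each point: +1 for inliers, -1 for outliers.
n_flagged = {}
for name, clf in classifiers.items():
    labels = clf.fit(X).predict(X)
    n_flagged[name] = int((labels == -1).sum())
print(n_flagged)  # each model flags roughly a quarter of the 178 points
```

This numeric view complements the plot: a point outside a model's zero-level contour is exactly a point that model labels `-1`.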

Detecting outliers using the robust covariance technique

The output displays a scatter plot showing the decision boundaries of all three classifiers, which helps identify the inliers and the outliers. Notice that the data points concentrated in one region represent the inliers.

It can be interpreted from the results that:

Empirical covariance: The magenta decision boundary covers most of the data points, but because the empirical estimate is sensitive to extreme values, its shape is stretched by the diverse and dissimilar patterns in the dataset.

Robust covariance: The green decision boundary covers the main cluster of data points, but it assumes the data is Gaussian distributed, and the results are influenced by that assumption.

One-class SVM: The red decision boundary covers most of the data points, as it does not assume any parametric form of the data distribution.
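The difference between the empirical and robust estimates can also be checked numerically. Below is a minimal sketch on synthetic data using `MinCovDet`, the robust estimator that `EllipticEnvelope` builds on, alongside `EmpiricalCovariance`; the data and contamination scheme are our own:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

# 100 Gaussian points centered at the origin, with five shifted far away.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:5] += 10.0

emp = EmpiricalCovariance().fit(X)
robust = MinCovDet(random_state=0).fit(X)

# The empirical mean is dragged toward the contaminated points, while the
# robust (MCD) location stays near the true center (0, 0).
print(emp.location_, robust.location_)
```

This is why the empirical boundary gets stretched by extreme values while the robust boundary stays on the main cluster.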

We import the wine dataset and use the columns at indices 5 and 9 to create a plot in 2D space and analyze the anomalies lying in the dataset depending on these two variables:

Column 5: total_phenols

Column 9: color_intensity

In this code, we apply three classifiers to columns 5 and 9 of the wine dataset and plot the results to identify the anomalous data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.svm import OneClassSVM

classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.0, contamination=0.25),
    "Robust Covariance": EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}

colors = ["m", "g", "r"]
legend2 = {}

# Get data
X2 = load_wine()["data"][:, [5, 9]]

# Learn a frontier for outlier detection with several classifiers
xx2, yy2 = np.meshgrid(np.linspace(-1, 5.5, 500), np.linspace(-2.5, 16, 500))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    plt.figure(2)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    legend2[clf_name] = plt.contour(
        xx2, yy2, Z2, levels=[0], linewidths=2, colors=colors[i]
    )

legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())

# Plot the results
plt.figure(2)
plt.title("Outlier detection on a dataset")
plt.scatter(X2[:, 0], X2[:, 1], color="blue")
plt.xlim((xx2.min(), xx2.max()))
plt.ylim((yy2.min(), yy2.max()))
plt.legend(
    (
        legend2_values_list[0].collections[0],
        legend2_values_list[1].collections[0],
        legend2_values_list[2].collections[0],
    ),
    (legend2_keys_list[0], legend2_keys_list[1], legend2_keys_list[2]),
    loc="upper center",
)
plt.ylabel("color_intensity")
plt.xlabel("total_phenols")
plt.show()
```

Detecting outliers using the robust covariance technique

The output displays a scatter plot showing the decision boundaries of all three classifiers, which helps identify the inliers and the outliers. Notice that the data points concentrated in one region represent the inliers.

It can be interpreted from the results that:

Empirical covariance: The magenta decision boundary covers most of the data points, but because the empirical estimate is sensitive to extreme values, its shape is stretched by the diverse and dissimilar patterns in the dataset.

Robust covariance: The green decision boundary covers the main cluster of data points, but it assumes the data is Gaussian distributed, and the results are influenced by that assumption.

One-class SVM: The red decision boundary covers most of the data points, as it does not assume any parametric form of the data distribution.

Both codes follow the same structure, so the following explanation covers both.

Note: The variable names used in the explanation may differ from those in the code, as the suffixes 1 and 2 are appended to the variables in examples 1 and 2, respectively.

**Lines 1–5:** Import the required libraries and methods.

**Lines 7–10:** Create a `classifiers` dictionary that contains the outlier detection models:

- **Empirical Covariance**: An instance of `EllipticEnvelope` with `support_fraction` set to `1.0` and `contamination` set to `0.25`.
- **Robust Covariance**: An instance of `EllipticEnvelope` with `contamination` set to `0.25`.
- **OCSVM**: An instance of `OneClassSVM` with `nu` set to `0.25` and `gamma` set to `0.35`.

**Lines 13–14:** Create a `colors` array for the decision boundary colors, i.e., magenta, green, and red, and an empty dictionary for the `legend`.

**Line 17:** Load the wine dataset and select the two columns to create a 2D matrix.

**Line 20:** Create a 2D grid with the ranges `0` to `6` and `1` to `4.5` and store the coordinates in the `xx` and `yy` variables.

**Lines 22–27:** Create a loop that iterates through the `classifiers` dictionary and performs the following tasks for each classifier:

- Select the figure using `figure()` so that all three decision boundaries are drawn on the same plot.
- Fit the model on the data `X` using `fit()`.
- Obtain the anomaly score for each data point in the grid using `decision_function()`.
- Reshape the `Z` array containing the anomaly scores according to the `xx` grid shape.
- Create a contour plot using `contour()` to visualize the decision boundary, i.e., where the anomaly score is zero.

**Lines 29–30:** Create two lists: one for the keys and the other for the corresponding values from `legend1`.

**Lines 33–35:** Create a plot, set its title, and add the data points as a scatter plot.

**Lines 37–38:** Set the minimum and maximum limits of the x-axis and the y-axis, respectively.

**Lines 40–47:** Add the `legend` to the plot and set the values from the contour collections. Use `loc` to define the position of the legend.

**Lines 50–51:** Label the y-axis and the x-axis according to the wine dataset columns used.

**Line 53:** Use `show()` to display the created plot.
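The meshgrid-and-`decision_function()` pattern from the loop can be checked in isolation. This is a small sketch on synthetic data with our own grid ranges; it shows the array shapes involved and the sign convention that makes `contour(..., levels=[0])` draw the frontier:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Fit an envelope on 200 synthetic 2D Gaussian points.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
clf = EllipticEnvelope(contamination=0.1).fit(X)

# Build a 50x50 grid, score every grid point, and reshape back to the grid.
xx, yy = np.meshgrid(np.linspace(-4, 4, 50), np.linspace(-4, 4, 50))
scores = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])  # shape (2500,)
Z = scores.reshape(xx.shape)                                   # shape (50, 50)

# decision_function() is positive inside the fitted frontier and negative
# outside it, so the zero-level contour is exactly the decision boundary.
print(Z.shape)
```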

We can use different unsupervised techniques to detect anomalies on the same dataset and present them in a scatter plot to get a better understanding of the normal and anomalous data points. We can implement it using the following steps:

Generate or import a sample dataset.

Apply different classifiers to it and record their results.

Plot a figure for each classifier to visualize the results. The data points that lie outside the decision boundaries are considered anomalies.

In this code, we applied empirical covariance, robust covariance, and one-class SVM to the wine dataset to compare the results. It can be observed that, overall, the one-class SVM produces comparatively more precise results because it does not assume any parametric form of the data distribution.

Copyright ©2024 Educative, Inc. All rights reserved
