Outlier detection with the local outlier factor
The local outlier factor is a density-based technique that identifies outliers based on a point's neighborhood. Data points lying in areas of lower density than their neighbors are considered anomalous.
Algorithm
This technique calculates an anomaly score for each data point and uses it to categorize the points and find the anomalous ones. The score compares a point's local density with the density of its neighborhood:

LOF(xi) = ( sum of lrd(xj) for all xj in N(xi) ) / ( |N(xi)| * lrd(xi) )

Where:

a: The numerator sums the local reachability density (lrd) of the data points in the neighborhood N(xi).

b: |N(xi)| is the total number of elements present in the neighborhood of xi.

c: lrd(xi) is the local reachability density of xi itself.

A score close to 1 means xi is roughly as dense as its neighbors; a score well above 1 marks xi as an outlier.
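To make the score concrete, here is a small sketch (with illustrative data, not from this lesson) that computes LOF scores with scikit-learn for a tight cluster plus one isolated point. The isolated point receives a much larger score than the cluster:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four points in a tight cluster plus one isolated point (illustrative data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

clf = LocalOutlierFactor(n_neighbors=2)
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier

# negative_outlier_factor_ stores -LOF: near -1 for points whose local
# density matches their neighbors', strongly negative in sparse regions.
lof = -clf.negative_outlier_factor_
print(labels)  # the isolated point is labeled -1
print(lof)     # its LOF is far above 1; the cluster's scores stay near 1
```

The cluster points score near 1 because their local reachability density matches that of their neighbors; the isolated point's density is tiny compared to its neighbors', so its ratio is large.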
How does it work?
We create or import a dataset and then calculate the anomaly score for each data point; the score determines the size of the circular marker drawn around the point. Data points in high-density regions are considered normal, while data points in sparse regions, away from their neighbors, are considered outliers.
How to implement this understanding?
Let's write a code step-by-step that generates sample data, fits the model to it, and then creates a scatter plot to visualize the results obtained after applying the algorithm.
While generating the dataset and assigning values, keep in mind that the number of neighbors considered should be:
Greater than the minimum number of samples a cluster has to contain so that other samples can be local outliers relative to this cluster.
Smaller than the maximum number of close-by samples that can potentially be local outliers.
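The two bounds above can be sketched with illustrative numbers (not from this lesson): with a smallest cluster of 20 samples and at most 5 potential local outliers, any n_neighbors strictly between 5 and 20 satisfies both conditions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
cluster = 0.1 * rng.randn(20, 2)  # smallest cluster: 20 samples
stragglers = np.array(            # at most 5 potential local outliers
    [[2.5, 2.5], [-2.5, 2.0], [2.0, -2.5], [-2.0, -2.0], [0.0, 2.5]]
)
X = np.r_[cluster, stragglers]

# Choose n_neighbors between both bounds: 5 < n_neighbors < 20.
clf = LocalOutlierFactor(n_neighbors=10)
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier
lof = -clf.negative_outlier_factor_
print(labels[20:])  # all five stragglers are flagged as outliers
```

With n_neighbors above the straggler count, each straggler's neighborhood reaches into the dense cluster, so its density ratio is large; with n_neighbors below the cluster size, the cluster points are compared only against fellow cluster members.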
Before starting the code, let's understand the modules we must import and how they are used.
Required imports
We import the following from numpy, matplotlib and sklearn libraries.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerPathCollection
from sklearn.neighbors import LocalOutlierFactor
numpy handles data arrays and performs numerical operations.
matplotlib.pyplot creates and customizes data visuals, including various types of plots.
matplotlib.legend_handler customizes how legend entries are drawn. We import HandlerPathCollection to control the size of the legend markers.
sklearn.neighbors provides neighbors-based learning methods. We import LocalOutlierFactor to detect outliers.
Step 1: Generate data
We generate the random sample data using random from numpy.
random.randn is used to create clusters for the inliers.
random.uniform is used to create the outliers.
Once the random data is generated, we count the outlying data points and assign them a ground_truth label of -1. Since the inlying points are at the start of the array and the outlying points at the end, make sure the label is assigned to the correct slice.
import numpy as np
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(140, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(30, 2))
dataArr = np.r_[X_inliers, X_outliers]
n_outliers = len(X_outliers)
ground_truth = np.ones(len(dataArr), dtype=int)
ground_truth[-n_outliers:] = -1
Step 2: Fit the model
We create an instance of the LocalOutlierFactor that we import from the neighbors module. Once the instance is created, we use fit_predict to fit the model on our dataset and predict labels for it. Along with the predictions, we also store the number of prediction errors identified during the process as well as the outlier scores for the data points.
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(n_neighbors=24, contamination=0.1)
y_pred = clf.fit_predict(dataArr)
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_
Step 3: Plot the results
Once the data is generated and the model is successfully fitted to get the predictions and necessary statistical information, we plot our results. A scatter plot is created over a defined axis range: each data point is drawn at its coordinates, and around it, a circular marker whose radius represents the outlier score obtained from the algorithm.
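The marker radius is simply the outlier score rescaled to the range [0, 1]. A minimal sketch of that min-max normalization, using hypothetical scores:

```python
import numpy as np

# Hypothetical scores, as stored by clf.negative_outlier_factor_
# (more negative means more anomalous).
X_scores = np.array([-1.0, -1.2, -3.5])

# Min-max rescaling: the most anomalous point gets radius 1, the least gets 0.
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
print(radius)  # [0.   0.08 1.  ]
```

The plotting code below uses exactly this expression to size the circles, multiplying the result by 1000 so the largest circle is clearly visible.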
Example code
In this example, we create a plot for a randomly generated dataset and show the results using a scatter plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from matplotlib.legend_handler import HandlerPathCollection
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(140, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(30, 2))
dataArr = np.r_[X_inliers, X_outliers]
n_outliers = len(X_outliers)
ground_truth = np.ones(len(dataArr), dtype=int)
ground_truth[-n_outliers:] = -1
clf = LocalOutlierFactor(n_neighbors=24, contamination=0.1)
y_pred = clf.fit_predict(dataArr)
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_
def update_legend_marker_size(handle, orig):
"Customize size of the legend marker"
handle.update_from(orig)
handle.set_sizes([20])
plt.scatter(dataArr[:, 0], dataArr[:, 1], color="blue", s=3.0, label="Data points")
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
scatter = plt.scatter(
dataArr[:, 0],
dataArr[:, 1],
s=1000 * radius,
edgecolors="purple",
facecolors="none",
label="Outlier scores",
)
plt.axis("tight")
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.xlabel("prediction errors: %d" % (n_errors))
plt.legend(
handler_map={scatter: HandlerPathCollection(update_func=update_legend_marker_size)}
)
plt.title("Plot Local Outlier Factor Results")
plt.show()

Code explanation
Lines 1–4: Import the required methods and libraries.
Line 6: Set a random seed so that every execution of the code produces the same result.
Lines 8–11: Generate the inlier and outlier data points and store them in dataArr. This dataset has 280 inlying points and 30 outlying data points.
Lines 13–15: Set the ground_truth labels of the outlying data points to -1.
Line 17: Create a LocalOutlierFactor instance and save it in clf. We specify n_neighbors as 24 and contamination as 0.1.
Lines 18–20: Fit the LocalOutlierFactor algorithm to the data inside dataArr and obtain the predictions, error count, and outlier scores.
Lines 22–25: Customize the legend marker as per requirement and set its size.
Line 28: Create a scatter plot using scatter() and pass the color, size, and label as parameters.
Lines 32–38: Plot a circular representation of the outlier scores and specify its properties as per requirement.
Lines 41–44: Define the plot limits for the x-axis and the y-axis and set the axis label.
Lines 46–47: Create a legend for the plot and set the size of its markers using update_legend_marker_size.
Line 51: Set a suitable title for the plot.
Line 52: Use show() to display the created plot.
Code output
A scatter plot is created that shows the data points in blue. A purple circular marker is drawn around each data point, representing its outlier score.
Summary
The local outlier factor is an effective unsupervised technique to detect anomalies in a dataset and present them in a scatter plot. We can implement it using the following steps:
Generate random data for the inliers and outliers and store it in an array.
Fit the model on the data points and obtain the results using the algorithm.
Plot the obtained results in a scatter plot that can be used to identify the inliers and outliers.
Test your understanding

Match each code snippet with its purpose:

np.random.seed(42) — ensures the same plot is produced every time the code is executed.
np.random.uniform() — generates random data for the outliers.
HandlerPathCollection() — controls the properties of the legend markers.