Outlier detection with the local outlier factor
The local outlier factor is a density-based technique that identifies outliers based on a point's neighborhood. Data points lying in areas of lower density than their neighbors are considered anomalous.
Algorithm
This technique calculates an anomaly score for each data point and uses it to categorize the points and find the anomalous ones. The score compares a point's local density with the density of its neighborhood:

LOF(xi) = ( sum of lrd(xj) for all xj in N(xi) ) / ( |N(xi)| * lrd(xi) )

Where:

a: The numerator sums the local reachability density (lrd) of the data points in the neighborhood N(xi).

b: |N(xi)| is the total number of elements present in the neighborhood of xi.

c: lrd(xi) is the local reachability density of xi itself.

A score close to 1 means xi is roughly as dense as its neighbors; a score well above 1 marks xi as an outlier.
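To make the score concrete, here is a small sketch (with illustrative data, not from this lesson) that computes LOF scores with scikit-learn for a tight cluster plus one isolated point. The isolated point receives a much larger score than the cluster:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four points in a tight cluster plus one isolated point (illustrative data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

clf = LocalOutlierFactor(n_neighbors=2)
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier

# negative_outlier_factor_ stores -LOF: near -1 for points whose local
# density matches their neighbors', strongly negative in sparse regions.
lof = -clf.negative_outlier_factor_
print(labels)  # the isolated point is labeled -1
print(lof)     # its LOF is far above 1; the cluster's scores stay near 1
```

The cluster points score near 1 because their local reachability density matches that of their neighbors; the isolated point's density is tiny compared to its neighbors', so its ratio is large.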
How does it work?
We create or import a dataset and then calculate the anomaly score for each data point; the score determines the size of the circular marker drawn around the point. Data points in high-density regions are considered normal, while data points in sparse regions, away from their neighbors, are considered outliers.
How to implement this understanding?
Let's write a code step-by-step that generates sample data, fits the model to it, and then creates a scatter plot to visualize the results obtained after applying the algorithm.
While generating the dataset and assigning values, keep in mind that the number of neighbors considered should be:
Greater than the minimum number of samples a cluster has to contain so that other samples can be local outliers relative to this cluster.
Smaller than the maximum number of close-by samples that can potentially be local outliers.
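The two bounds above can be sketched with illustrative numbers (not from this lesson): with a smallest cluster of 20 samples and at most 5 potential local outliers, any n_neighbors strictly between 5 and 20 satisfies both conditions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
cluster = 0.1 * rng.randn(20, 2)  # smallest cluster: 20 samples
stragglers = np.array(            # at most 5 potential local outliers
    [[2.5, 2.5], [-2.5, 2.0], [2.0, -2.5], [-2.0, -2.0], [0.0, 2.5]]
)
X = np.r_[cluster, stragglers]

# Choose n_neighbors between both bounds: 5 < n_neighbors < 20.
clf = LocalOutlierFactor(n_neighbors=10)
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier
lof = -clf.negative_outlier_factor_
print(labels[20:])  # all five stragglers are flagged as outliers
```

With n_neighbors above the straggler count, each straggler's neighborhood reaches into the dense cluster, so its density ratio is large; with n_neighbors below the cluster size, the cluster points are compared only against fellow cluster members.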
Before starting the code, let's understand the modules we must import and how they are used.
Required imports
We import the following from numpy, matplotlib and sklearn libraries.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerPathCollection
from sklearn.neighbors import LocalOutlierFactor
numpy handles data arrays and performs numerical operations.
matplotlib.pyplot creates and customizes data visuals, including various types of plots.
matplotlib.legend_handler customizes how legend entries are drawn. We import HandlerPathCollection to control the size of the legend markers.
sklearn.neighbors provides neighbors-based learning methods. We import LocalOutlierFactor to detect outliers.
Step 1: Generate data
We generate the random sample data using random from numpy.
random.randn is used to create clusters for the inliers.
random.uniform is used to create the outliers.
Once the random data is generated, we count the outlying data points and assign them a ground_truth label of -1. Since the inlying points are at the start of the array and the outlying points at the end, make sure the label is assigned to the correct slice.
import numpy as np
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(140, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(30, 2))
dataArr = np.r_[X_inliers, X_outliers]
n_outliers = len(X_outliers)
ground_truth = np.ones(len(dataArr), dtype=int)
ground_truth[-n_outliers:] = -1
Step 2: Fit the model
We create an instance of the LocalOutlierFactor that we import from the neighbors module. Once the instance is created, we use fit_predict to fit the model on our dataset and predict labels for it. Along with the predictions, we also store the number of prediction errors identified during the process as well as the outlier scores for the data points.
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(n_neighbors=24, contamination=0.1)
y_pred = clf.fit_predict(dataArr)
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_
Step 3: Plot the results
Once the data is generated and the model is successfully fitted to get the predictions and necessary statistical information, we plot our results. A scatter plot is created over a defined axis range: each data point is drawn at its coordinates, and around it, a circular marker whose radius represents the outlier score obtained from the algorithm.
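The marker radius is simply the outlier score rescaled to the range [0, 1]. A minimal sketch of that min-max normalization, using hypothetical scores:

```python
import numpy as np

# Hypothetical scores, as stored by clf.negative_outlier_factor_
# (more negative means more anomalous).
X_scores = np.array([-1.0, -1.2, -3.5])

# Min-max rescaling: the most anomalous point gets radius 1, the least gets 0.
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
print(radius)  # [0.   0.08 1.  ]
```

The plotting code below uses exactly this expression to size the circles, multiplying the result by 1000 so the largest circle is clearly visible.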
Example code
In this example, we create a plot for a randomly generated dataset and show the results using a scatter plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from matplotlib.legend_handler import HandlerPathCollection
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(140, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(30, 2))
dataArr = np.r_[X_inliers, X_outliers]
n_outliers = len(X_outliers)
ground_truth = np.ones(len(dataArr), dtype=int)
ground_truth[-n_outliers:] = -1
clf = LocalOutlierFactor(n_neighbors=24, contamination=0.1)
y_pred = clf.fit_predict(dataArr)
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_
def update_legend_marker_size(handle, orig):
"Customize size of the legend marker"
handle.update_from(orig)
handle.set_sizes([20])
plt.scatter(dataArr[:, 0], dataArr[:, 1], color="blue", s=3.0, label="Data points")
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
scatter = plt.scatter(
dataArr[:, 0],
dataArr[:, 1],
s=1000 * radius,
edgecolors="purple",
facecolors="none",
label="Outlier scores",
)
plt.axis("tight")
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.xlabel("prediction errors: %d" % (n_errors))
plt.legend(
handler_map={scatter: HandlerPathCollection(update_func=update_legend_marker_size)}
)
plt.title("Plot Local Outlier Factor Results")
plt.show()

Code explanation
Lines 1–4: Import the required methods and libraries.
Line 6: Set a random seed so that every execution of the code produces the same result.
Lines 8–11: Generate the inlier and outlier data points and store them in dataArr. This dataset has 280 inlying points and 30 outlying data points.
Lines 13–15: Set the ground_truth labels of the outlying data points to -1.
Line 17: Create a LocalOutlierFactor instance and save it in clf. We specify n_neighbors as 24 and contamination as 0.1.
Lines 18–20: Fit the LocalOutlierFactor algorithm to the data inside dataArr and obtain the predictions, error count, and outlier scores.
Lines 22–25: Customize the legend marker as per requirement and set its size.
Line 28: Create a scatter plot using scatter() and pass the color, size, and label as parameters.
Lines 32–38: Plot a circular representation of the outlier scores and specify its properties as per requirement.
Lines 41–44: Define the plot limits for the x-axis and the y-axis and set the axis label.
Lines 46–47: Create a legend for the plot and set the size of its markers using update_legend_marker_size.
Line 51: Set a suitable title for the plot.
Line 52: Use show() to display the created plot.
Code output
A scatter plot is created that shows the data points in blue. A purple circular marker is drawn around each data point, representing its outlier score.
Summary
The local outlier factor is an effective unsupervised technique to detect anomalies in a dataset and present them in a scatter plot. We can implement it using the following steps:
Generate random data for the inliers and outliers and store it in an array.
Fit the model on the data points and obtain the results using the algorithm.
Plot the obtained results in a scatter plot that can be used to identify the inliers and outliers.
Test your understanding

Match each code snippet with its purpose:

np.random.seed(42) — ensures the same plot is produced every time the code is executed.
np.random.uniform() — generates random data for the outliers.
HandlerPathCollection() — controls the properties of the legend markers.