An anomaly is defined as a persisting deviation in a physical quantity from its expected value.
For a given dataset, anomalies are synonymous with outliers. Outliers are data objects that stand out amongst other data objects within the dataset and do not conform to expected behavior.
Anomalies can often be detected using data visualization techniques that allow us to plot the dataset and identify the outliers.
To quantify which values constitute expected behavior, the mean and standard deviation of the dataset may be used.
The mean is the average value of the dataset: the sum of all values divided by the total number of items.
The standard deviation is used to define an acceptable deviation from the mean. It is the square root of the variance.
Together, these values are used to identify a range of values that will be considered reasonable for the dataset.
Values below or above this range will be considered anomalies.
Let’s consider the following list of numbers as our initial dataset:
[2, 3, 5, 7, 4, 6, 29, 1, 4, 3]
First, we will plot the data using a line graph to visually identify the anomaly.
From the above graph, it is clear that the value 29 is an outlier.
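As a rough stand-in for a line graph, the sketch below draws the dataset as a text bar chart using only the standard library. The helper name `text_chart` is hypothetical and not part of the original walkthrough. A plotting library such as matplotlib would give a proper line graph.

```python
# Dataset from the text.
data = [2, 3, 5, 7, 4, 6, 29, 1, 4, 3]

def text_chart(values):
    """Return one line per value: its index, a bar of '#' marks, and the value."""
    return [f"{i:>2} | {'#' * v} {v}" for i, v in enumerate(values)]

chart = text_chart(data)
print("\n".join(chart))
```

The bar at index 6 is far longer than every other bar, which is the same visual cue a line graph gives for the value 29.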
Now, we will try to define a range of acceptable values for our dataset using the mean and standard deviation.
The mean is the sum of all the values (64) divided by the total number of items (10).
Mean = 6.4
Standard Deviation = 8.1 (sample standard deviation, which divides by n - 1)
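These two figures can be checked with Python's standard `statistics` module. This is a minimal sketch, not necessarily how the values were originally produced. The variable names are illustrative. Note that 8.1 matches the sample standard deviation (`stdev`), while the population version (`pstdev`) gives a slightly smaller value.

```python
import statistics

# Dataset from the text.
data = [2, 3, 5, 7, 4, 6, 29, 1, 4, 3]

mean = statistics.mean(data)               # sum of values / number of values
sample_std = statistics.stdev(data)        # divides the squared deviations by n - 1
population_std = statistics.pstdev(data)   # divides the squared deviations by n

print(round(mean, 1), round(sample_std, 1), round(population_std, 1))
# prints: 6.4 8.1 7.7
```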
To calculate the acceptable range, we find the lower bound by subtracting the standard deviation from the mean, and the upper bound by adding the standard deviation to the mean.
Lower bound = 6.4 - 8.1 = -1.7
Upper bound = 6.4 + 8.1 = 14.5
Any value not within the above range is an anomaly.
We can see that the element 29 does not reside within the specified range and will thus be considered an anomaly.
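The whole procedure can be sketched as a small function. The name `find_anomalies` and the multiplier `k` are assumptions added for illustration. The text uses a range of one standard deviation around the mean, which corresponds to `k=1.0`, and the sample standard deviation is assumed because it matches the value 8.1.

```python
import statistics

def find_anomalies(values, k=1.0):
    """Return (lower, upper, anomalies) for the range mean +/- k * sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    lower, upper = mean - k * std, mean + k * std
    # Keep every value that falls outside the acceptable range.
    anomalies = [v for v in values if v < lower or v > upper]
    return lower, upper, anomalies

# Dataset from the text.
data = [2, 3, 5, 7, 4, 6, 29, 1, 4, 3]
lower, upper, anomalies = find_anomalies(data)
print(round(lower, 1), round(upper, 1), anomalies)
# prints: -1.7 14.5 [29]
```

A one-standard-deviation band is fairly narrow. On larger or noisier datasets a wider band, such as `k=2.0` or `k=3.0`, is often chosen to avoid flagging ordinary values.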