
How to identify and handle anomalies in data

Hassaan Waqar

Large amounts of data are generated and collected every day. Whether it comes from smart devices, social media, or the internet, we are surrounded by data. But not all data is perfect: errors can occur while data is being generated or collected.

The illustration below shows some prominent sources of data:

Sources of Data

The rule of five

When analyzing data, we need to validate its reliability. We can break data quality down into five categories:

  • Accuracy: Recorded data falls within the acceptable range.
  • Validity: Data meets the required standards and is fit for its intended use.
  • Completeness: Data is complete, with no portions missing.
  • Consistency: Data within a single dataset does not contradict itself.
  • Uniformity: The units and metrics used to record data are uniform and consistent.

Anomalies in data

We will now discuss some of the weaknesses that might exist within a dataset:

Missing values

Probably the easiest to detect, missing values are values that are simply not present in the dataset. They might never have been generated, or something may have gone wrong while collecting and recording them. The missing information is therefore unavailable and must be estimated from related fields. In Python, missing values are represented by NA, meaning ‘Not Available’, or NaN, meaning ‘Not a Number’; NaN appears when an entry in a numeric column does not hold a valid number.

The illustration below gives an example of missing data:

Missing Values in Data
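In pandas, for example, missing entries show up as NaN and can be counted per column with isna() (the small DataFrame below is made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Made-up records with one missing weight
df = pd.DataFrame({
    "student": ["Bob", "Alice", "Jim"],
    "weight_kg": [60.0, np.nan, 62.0],
})

# Count missing values in each column
print(df.isna().sum())
```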

Missing values can be filled with the mean, median, or mode of the column (whichever suits the data best). They can also be filled based on data from other columns.

The scikit-learn library in Python provides the KNNImputer class, which fills in a missing value based on the rows that are most similar according to the other columns.
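A minimal sketch of both options, using made-up height and weight columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy dataset with one missing weight (NaN)
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, 180.0, 175.0, 160.0],
    "weight_kg": [60.0, 55.0, np.nan, 62.0, 50.0],
})

# Option 1: fill with a simple statistic, such as the column mean
mean_filled = df.fillna({"weight_kg": df["weight_kg"].mean()})

# Option 2: estimate the missing value from the two most similar rows
# (similarity is measured on the other columns)
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df),
    columns=df.columns,
)

print(mean_filled)
print(knn_filled)
```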

Outliers

Outliers are data values that fall far outside the acceptable range. They can drastically affect the mean of the dataset. Imagine the following scenario:

The weights of five students are recorded manually. The table below shows their actual weights:

Student    Weight (kg)
Bob        60
Alice      55
Jim        62
Jill       44
Jack       50

The average weight of the students is 54.2 kg.

However, while entering the data into the records, the gym instructor accidentally recorded Jack’s weight as 500 kg. This is an outlier, since a student is unlikely to weigh that much. The average weight also jumps to 144.2 kg.

Outliers can be observed using a scatterplot or boxplot.

Illustrations below show both these plots:

Scatterplot and Boxplot showing outliers

Outliers are difficult to correct directly. They can be removed by dropping the entire row that contains them. However, if the row carries important information, the outlier can instead be treated as missing data and imputed accordingly, as sketched below.
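One common way to flag outliers is an IQR fence (used here only as an illustration, not something this article prescribes); the flagged rows can then be dropped or converted to missing values:

```python
import numpy as np
import pandas as pd

weights = pd.DataFrame({
    "student": ["Bob", "Alice", "Jim", "Jill", "Jack"],
    "weight_kg": [60.0, 55.0, 62.0, 44.0, 500.0],  # Jack's weight was mistyped
})

# Flag values far outside the interquartile range (IQR)
q1, q3 = weights["weight_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ~weights["weight_kg"].between(q1 - 3 * iqr, q3 + 3 * iqr)

# Option 1: drop the rows that contain outliers
cleaned = weights[~is_outlier]

# Option 2: keep the rows but treat the outliers as missing data
weights.loc[is_outlier, "weight_kg"] = np.nan

print(cleaned)
print(weights)
```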

Duplicates

Duplicates are records that appear more than once in a dataset. They might occur when the same data has been entered more than once. Oftentimes, we also need to merge data from multiple sources before it can be processed, and merges are another point where duplicates can arise. Duplicates are handled by removing the extra rows.

The illustration below serves as a good example of a duplicate:

Duplicate information
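In pandas, duplicate rows can be dropped with drop_duplicates(). A minimal sketch with made-up records:

```python
import pandas as pd

# Made-up records in which Alice was entered twice
df = pd.DataFrame({
    "student": ["Bob", "Alice", "Alice", "Jim"],
    "weight_kg": [60, 55, 55, 62],
})

# Keep the first occurrence of each row and drop the rest
deduplicated = df.drop_duplicates()
print(deduplicated)
```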

Irrelevant data

Sometimes, the data we are analyzing has columns that we do not need for our particular research question. This is common in secondary research, where we analyze data that has already been collected for another purpose and may therefore contain information that is irrelevant to our work. We can get rid of such information by dropping the columns we do not require.
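In pandas, this amounts to a single drop() call (the column names below are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Bob", "Alice"],
    "weight_kg": [60, 55],
    "favorite_color": ["blue", "green"],  # not needed for our analysis
})

# Remove the irrelevant column from the dataset
df = df.drop(columns=["favorite_color"])
print(df)
```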

Non-uniform data

Data entry through drop-downs and checklists offers a safer way to input data, but these controls might not cover all possible values. Inconsistencies can occur when users are allowed to enter values manually. This can introduce errors such as variable spellings of the same word, different ways of representing the same data, or qualitative information that cannot be scaled.

The illustration below shows examples of non-uniform data:

Examples of Non-Uniform data

Non-uniform data needs to be identified first. This can be done by scanning the dataset manually or using relevant functions.

The pandas library in Python provides the value_counts() method, which lists the unique values in a particular column of the dataset along with the number of times each occurs.
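A quick sketch of spotting inconsistent spellings this way (the city column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "new york", "NYC", "Boston", "Boston"],
})

# Show each distinct spelling and how often it appears
print(df["city"].value_counts())
```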

Non-uniform data can then be standardized by mapping the variants onto a single consistent value, or by encoding them into values that carry quantifiable information.
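One simple way to do this in pandas, assuming we have chosen a canonical spelling for each value, is to replace the known variants:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "new york", "NYC", "Boston", "Boston"],
})

# Map every known variant onto one canonical spelling
canonical = {"new york": "New York", "NYC": "New York"}
df["city"] = df["city"].replace(canonical)

print(df["city"].value_counts())
```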

Summary

Data generation and collection are prone to errors. These errors need to be identified and fixed before the data can be used further. They include missing values, outliers, duplicates, irrelevant data, and non-uniform data.
