Large amounts of data are being generated and collected every day. Whether it be data from smart devices, social media, or the internet, we are surrounded by data. But not all data is perfect. Errors can occur during the generation or collection of data.
The illustration below shows some prominent sources of data:
When analyzing data, we need to validate its reliability. We can break data-quality analysis down into five categories:
We will now discuss some of the weaknesses that might exist within a dataset:
Probably the easiest to detect, missing values are values that are absent from the dataset. They may never have been generated, or something may have gone wrong while they were being collected and recorded. Either way, some information is unavailable and must be estimated from related fields. In Python, missing values are represented by NA, meaning ‘Not Available’, or NaN, meaning ‘Not a Number’. NaN is the placeholder used when a numerical value is missing or cannot be represented as a number.
The illustration below gives an example of missing data:
Missing values can be filled by taking the mean, median, or mode of the data (depending on what suits best). They can also be filled based on data from other columns.
The Scikit-learn library in Python provides KNNImputer, which fills in missing values by looking at the rows most similar to the incomplete one, based on the other columns.
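Both approaches can be sketched as follows. The column names and values here are hypothetical, chosen only to illustrate the two techniques:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical dataset with one missing height and one missing weight
df = pd.DataFrame({
    "height": [150.0, np.nan, 165.0, 170.0],
    "weight": [50.0, 55.0, np.nan, 65.0],
})

# Option 1: fill each column with its own mean
df_mean = df.fillna(df.mean())

# Option 2: impute from the k most similar rows (here k=2),
# using the other columns to judge similarity
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Median or mode can be substituted for the mean in the same way (`df.median()`, or `df.mode().iloc[0]` for categorical columns).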
Outliers are data values that are not within the acceptable range. They can affect the mean of the dataset drastically. Imagine the following scenario:
Weights of 5 students are recorded manually. The table below shows the actual weights of students:
The average weight of all students is 54.2 kg.
However, while entering the data into the records, the gym instructor accidentally entered Jack’s weight as 500 kg. This is an outlier, since a student is unlikely to weigh that much, and it drags the average up to 144.2 kg.
Outliers can be observed using a scatterplot or boxplot.
Illustrations below show both these plots:
Outliers are difficult to adjust. The simplest fix is to remove the entire row that contains one. However, if the row carries important information, the outlier can instead be treated as a missing value and imputed accordingly.
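The detection and treatment steps can be sketched with the 1.5 × IQR rule that boxplots use. The individual weights below are hypothetical, chosen only so that the averages match those in the example (54.2 kg without the error, 144.2 kg with it):

```python
import pandas as pd

# Hypothetical weights (kg); Jack's true weight of 50 was
# mis-entered as 500
weights = pd.Series([58, 54, 55, 54, 500])

# Flag outliers with the 1.5 * IQR rule used by boxplots
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]

print(outliers.tolist())   # [500] -- the bad entry is flagged
print(weights.mean())      # 144.2 -- dragged up by the outlier

# Treat the outlier as missing data, then fill with the median
cleaned = weights.mask(weights.isin(outliers)).fillna(weights.median())
```

Dropping the flagged rows instead would be `weights[~weights.isin(outliers)]`.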
Duplicates are identical records that appear more than once, typically because the same data was entered twice. Oftentimes, we need to merge data from multiple sources before it can be processed, and merges are another common point where duplicates arise. Duplicates can be removed by dropping the extra rows.
The illustration below serves as a good example of a duplicate:
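In Pandas, detecting and dropping repeat rows is a one-liner each. The records below are hypothetical:

```python
import pandas as pd

# Hypothetical records in which Sara was entered twice
df = pd.DataFrame({
    "name": ["Jack", "Sara", "Sara", "Mike"],
    "score": [80, 91, 91, 75],
})

# duplicated() flags repeat rows; drop_duplicates() removes them,
# keeping the first occurrence
print(df.duplicated())
deduped = df.drop_duplicates()
```

Passing `subset=["name"]` would instead treat rows as duplicates whenever the name alone repeats.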
Sometimes, the dataset we are analyzing contains columns that are irrelevant to our research question. This is common in secondary research, where we analyze data that was originally collected for another purpose and so may carry extra information we do not need. We can get rid of such information by dropping the columns we do not require.
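Dropping an unneeded column is straightforward in Pandas. The dataset and the irrelevant column here are hypothetical:

```python
import pandas as pd

# Hypothetical secondary dataset; "fax" is irrelevant to our question
df = pd.DataFrame({
    "name": ["Jack", "Sara"],
    "age": [21, 22],
    "fax": ["555-0101", "555-0102"],
})

# Drop the entire column we do not need
df = df.drop(columns=["fax"])
print(df.columns.tolist())  # ['name', 'age']
```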
Data entry through drop-downs and checklists offers a safer way to input data, but such controls may not cover every possible value. Inconsistencies arise when users are allowed to type values manually, introducing errors such as variable spellings of the same word, different representations of the same data, or qualitative information that cannot be scaled.
The illustration below shows examples of non-uniform data:
Non-uniform data needs to be identified first. This can be done by scanning the dataset manually or using relevant functions.
The Pandas library in Python provides the function value_counts, which lists the unique values in a column along with the number of times each occurs.
Non-uniform data can then be standardized by mapping each variant to a single canonical value that carries consistent, quantifiable information.
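As a sketch, using a hypothetical column with variable spellings of the same city:

```python
import pandas as pd

# Hypothetical entries typed by hand, with inconsistent spellings
cities = pd.Series(["New York", "new york", "NY", "Boston", "boston"])

# value_counts reveals the inconsistent variants
print(cities.value_counts())

# Standardize: lowercase everything, then map known aliases
# to one canonical form
canonical = cities.str.lower().replace({"ny": "new york"})
print(canonical.value_counts())
```

After standardization, every variant of a city collapses into a single value, so counts and groupings become reliable.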
Data generation and collection are prone to introducing errors into the data. These errors need to be identified and fixed before the data can be used further. They include missing values, outliers, duplicates, irrelevant columns, and non-uniform data.