Missing data
Missing data is one of the most common problems in datasets. The purpose of this lesson is to explore this problem: how missing data occurs and the mechanisms by which data goes missing.
Common issues in datasets
Datasets in machine learning often contain many features, and those features can raise a variety of issues that must be addressed before they are used to train a model. The most common issues encountered when working with datasets are:
- Missing data.
- Categorical variable — cardinality.
- Categorical variable — rare labels.
- Linear model assumptions.
- Variable distribution.
- Outliers.
- Feature magnitude.
This section is devoted to learning how to identify these issues and how they can significantly affect machine learning models.
Missing data
Missing data, or missing values, occur when no value is stored for a variable in a particular observation. Missing data is common in most datasets and can significantly affect the conclusions drawn from the data.
Here is an example showing a dataset with some missing values:
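As a minimal sketch, assuming pandas is available, the following builds a small table with gaps and counts the missing values per column. The column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; np.nan and None mark values that were never recorded.
df = pd.DataFrame(
    {
        "age": [25, np.nan, 47, 31],
        "income": [52000, 61000, np.nan, np.nan],
        "city": ["Paris", "Lyon", None, "Nice"],
    }
)

# isna() flags each missing cell; summing the flags counts them per column.
missing_counts = df.isna().sum()
print(missing_counts)
```

Here `age` and `city` each have one missing value and `income` has two.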
Why can data be missing?
There are many reasons for missing data:
- A value can be missing because it was forgotten, omitted, lost, or not stored properly.
- A variable is created by dividing one variable by another, and the denominator is 0, which leaves the derived variable with an undefined (missing) value.
- Many fields are optional at the time of data collection. Their values may therefore be missing if a user chooses not to record them.
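The division case can be sketched in pandas as follows. The column names are hypothetical. Note that plain division in pandas yields `NaN` only for 0/0 and `inf` for a nonzero numerator, so this sketch masks zero denominators to make every undefined ratio an explicit missing value; that masking step is one possible convention, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: total amount spent and number of purchases.
df = pd.DataFrame({"total_spent": [120.0, 0.0, 80.0], "n_purchases": [4, 0, 0]})

# Plain division: 0.0 / 0 gives NaN, while 80.0 / 0 gives inf.
raw_ratio = df["total_spent"] / df["n_purchases"]

# Replacing zero denominators with NaN turns every undefined ratio into a missing value.
df["avg_purchase"] = df["total_spent"] / df["n_purchases"].replace(0, np.nan)
print(df)
```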
Solution?
This problem can be addressed with missing data imputation techniques, but these techniques may distort the original distribution of the variables and alter the relationships between them. Since some models make assumptions about how variables are distributed, these distortions can directly influence the machine learning model’s performance. Therefore, we need to choose the imputation technique carefully.
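To illustrate the kind of distortion meant here, the sketch below applies mean imputation, one simple technique used only as an example, to simulated data. The variable and its parameters are assumptions for illustration. Filling every gap with the same value leaves the mean unchanged but shrinks the spread of the variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical variable: 1,000 draws from a normal distribution.
values = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Remove roughly 30% of the values at random.
mask = rng.random(1000) < 0.3
observed = values.mask(mask)

# Mean imputation: fill every gap with the mean of the observed values.
imputed = observed.fillna(observed.mean())

# pandas skips NaN when computing std, so both lines are comparable.
print("std of observed values:", observed.std())
print("std after imputation:  ", imputed.std())
```

The standard deviation after imputation is smaller than that of the observed values, because the imputed points add no deviation from the mean while increasing the sample size.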
Mechanisms that lead to missing data
Before deciding which method suits a particular case of missing data, we need to understand the three main mechanisms that lead to missing data:
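- Missing completely at random (MCAR): The probability that a value is missing is unrelated to any other value, observed or missing. The missing data points are a random subset of the data. Nothing systematic makes some values more likely to be missing than others, so disregarding those cases does not bias the inferences drawn, although it reduces the sample size.
- Missing at random (MAR): The probability that a value is missing depends on other observed variables, but not on the missing value itself. The missingness can therefore be predicted from information available in the dataset. For example, if older respondents are less likely to report their income, income is MAR given age.
- Missing not at random (MNAR): The probability that a value is missing depends on the missing value itself or on factors that were not recorded. There is a systematic reason for the absence that the observed data cannot explain. For example, people with high incomes may decline to report them.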
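The three mechanisms can be contrasted with a small simulation. This is a sketch under stated assumptions: the variables `age` and `income`, the relationship between them, and the missingness probabilities are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data in which income increases with age.
age = rng.integers(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income has the same 20% chance of being missing.
mcar = rng.random(n) < 0.2

# MAR: missingness depends only on the observed variable age.
mar = rng.random(n) < np.where(df["age"] > 50, 0.4, 0.1)

# MNAR: missingness depends on the income value itself.
mnar = rng.random(n) < np.where(df["income"] > df["income"].median(), 0.4, 0.1)

true_mean = df["income"].mean()
for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    observed_mean = df.loc[~mask, "income"].mean()
    print(f"{name}: observed mean minus true mean = {observed_mean - true_mean:.0f}")
```

Under MCAR the observed mean stays close to the true mean. Under MNAR it is pulled downward, because high incomes are missing more often. In this simulation the naive mean is also shifted under MAR, but since age is observed, the shift can be accounted for by conditioning on age, which is not possible under MNAR.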