Data imputation definition and techniques
In this lesson, we will explore a new machine learning concept called data imputation: what it is, what its purpose is, and how to install the library needed for it. Data imputation is the standard solution to the missing data problem, and the appropriate technique depends on the type of variable. In this lesson we will cover the different methods we can use to impute both numerical and categorical variables.
Introduction
This lesson aims to teach the different imputation techniques and their impact on our variables and machine learning models, so that you can better handle the missing data issue, along with a few code snippets you can use directly in your machine learning projects.
Data Imputation
Data imputation is the process of replacing missing data with substituted values to produce a complete dataset to use for training machine learning models.
In order to use data imputation techniques, we need to use a library called feature-engine that can simplify the process of imputing missing values.
It is already installed in our environment, but if you are working on your own machine, you can install it with pip:
$ pip install feature-engine
The imputation techniques
We are going to go through the different techniques that can be useful for numerical and categorical variables and a few methods that apply to both:
Numerical variables techniques
- Mean or median imputation
- Arbitrary value imputation
- End of tail imputation
Categorical variables techniques
- Frequent category imputation
- Add a missing category
Both types techniques
- Complete case analysis
- Add a missing indicator
- Random sample imputation
To check if the data is null in a given pandas data frame, you can use the following code:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the Titanic dataset.
titanic = sns.load_dataset("titanic")

# Check if the data has null values.
print(titanic.isnull().sum())