Data imputation definition and techniques
In this lesson, we will explore a new machine learning concept called data imputation: what it is, what its purpose is, and how to install the library needed for it. Data imputation is the standard solution to the missing data problem, and the appropriate technique depends on the type of variable. In this lesson we will cover the different methods we can use to impute both numerical and categorical variables.
Introduction
This lesson aims to teach the different imputation techniques and their impact on our variables and machine learning models, so that you can better handle the missing data issue, along with a few code snippets you can use directly in your machine learning projects.
Data Imputation
Data imputation is the process of replacing missing data with substituted values to produce a complete dataset to use for training machine learning models.
In order to use data imputation techniques, we need to use a library called feature-engine that can simplify the process of imputing missing values.
It is already installed in our environment, but if you are working on your own machine, you can install it with pip:
$ pip install feature-engine
The imputation techniques
We are going to go through the different techniques that can be useful for numerical and categorical variables and a few methods that apply to both:
Numerical variables techniques
- Mean or median imputation
- Arbitrary value imputation
- End of tail imputation
Categorical variables techniques
- Frequent category imputation
- Add a missing category
Both types techniques
- Complete case analysis
- Add a missing indicator
- Random sample imputation
To check if the data is null in a given pandas data frame, you can use the following code:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the Titanic dataset.
titanic = sns.load_dataset("titanic")

# Check if the data has null values.
print(titanic.isnull().sum())