Categorical encoding
In this lesson, we will explore categorical encoding, an important concept in feature engineering: what it is, why it matters, the main encoding techniques, and how to install a library that provides them.
Definition
Most machine learning algorithms and deep neural networks accept only numerical inputs. Before we can build a working model or network, any categorical data must therefore be encoded as numbers.
The goal of categorical encoding is to turn the categorical variables in our dataset into informative, predictive numerical variables that we can use to build, train, and evaluate a machine learning model and to improve its performance.
Several techniques exist for this kind of data transformation; to name a few:
Traditional Techniques
- One-hot encoding
- Count or frequency encoding
- Ordinal or label encoding
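As a rough sketch of how the three traditional techniques behave, here is a toy example in plain Python (the `colors` column is made up for illustration; later in the lesson we will use category_encoders instead of hand-rolled code):

```python
from collections import Counter

# Toy categorical column (made-up sample values).
colors = ["red", "blue", "red", "green", "blue", "red"]

# One-hot encoding: one binary column per category.
categories = sorted(set(colors))               # ['blue', 'green', 'red']
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Count (frequency) encoding: replace each label with how often it occurs.
counts = Counter(colors)                       # {'red': 3, 'blue': 2, 'green': 1}
count_encoded = [counts[c] for c in colors]

# Ordinal (label) encoding: map each category to an arbitrary integer.
mapping = {cat: i for i, cat in enumerate(categories)}
ordinal_encoded = [mapping[c] for c in colors]
```

Note that one-hot encoding adds one column per category, while count and ordinal encoding keep a single column; this trade-off between dimensionality and information loss is a recurring theme in the techniques below.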
Monotonic Relationship
- Ordered label encoding
- Mean encoding
- Probability ratio encoding
- Weight of evidence
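The monotonic techniques use the target variable. A minimal sketch of two of them, mean encoding and weight of evidence, on a made-up binary-target sample (plain Python, not the category_encoders implementation):

```python
import math
from collections import defaultdict

# Toy data: one categorical feature and a binary target (made up).
feature = ["A", "A", "B", "B", "B", "C"]
target  = [ 1,   0,   1,   1,   0,   0 ]

# Mean (target) encoding: replace each category with the mean of the target.
sums, cnts = defaultdict(float), defaultdict(int)
for f, t in zip(feature, target):
    sums[f] += t
    cnts[f] += 1
means = {f: sums[f] / cnts[f] for f in cnts}
mean_encoded = [means[f] for f in feature]

# Weight of evidence: ln(share of positives / share of negatives) per
# category. Categories with zero positives or zero negatives are skipped
# here, since the ratio would be undefined.
pos = {f: sum(t for g, t in zip(feature, target) if g == f) for f in cnts}
neg = {f: cnts[f] - pos[f] for f in cnts}
total_pos, total_neg = sum(pos.values()), sum(neg.values())
woe = {f: math.log((pos[f] / total_pos) / (neg[f] / total_neg))
       for f in cnts if pos[f] and neg[f]}
```

Because these encodings are derived from the target, they create a monotonic relationship between the encoded feature and the target, but they must be fitted on training data only to avoid target leakage.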
Alternative Techniques
- Rare labels encoding
- Binary encoding
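The alternative techniques can also be sketched in a few lines. Below, rare-label encoding groups infrequent categories under a single label, and binary encoding writes each category's integer index in binary across a handful of columns (the sample data and the `min_count` threshold are assumptions for illustration):

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "red", "pink"]

# Rare-label encoding: categories seen fewer than min_count times
# are collapsed into a single "rare" label.
counts = Counter(colors)
min_count = 2                                   # threshold (an assumption)
rare_encoded = [c if counts[c] >= min_count else "rare" for c in colors]

# Binary encoding: assign each category an integer index starting at 1,
# then spell that index out in binary across a few columns.
categories = sorted(set(colors))
index = {cat: i + 1 for i, cat in enumerate(categories)}
width = max(index.values()).bit_length()
binary_encoded = [[(index[c] >> b) & 1 for b in reversed(range(width))]
                  for c in colors]
```

Binary encoding needs only about log2(k) columns for k categories, which makes it a compact alternative to one-hot encoding for high-cardinality variables.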
To encode categorical variables, we will use a library called category_encoders, which implements many basic and advanced encoding methods. You can install it with either of the following commands:
# using pip
pip install category_encoders
# using conda
conda install -c conda-forge category_encoders
We will use the following data sample for our encodings: