Categorical encoding
In this lesson, we will explore categorical encoding, an important concept in feature engineering: what it is, why it matters, the main encoding techniques, and how to install a library that provides them.
Definition
Most machine learning algorithms and deep neural networks accept only numerical inputs. Before we can build a working model or network, any categorical data must therefore be encoded as numbers.
The goal of categorical encoding is to turn the categorical variables in our dataset into informative, predictive numerical variables that we can use to build, train, and evaluate a machine learning model and to improve its performance.
Several techniques exist for this kind of data transformation; to name a few:
Traditional Techniques
- One-hot encoding
- Count or frequency encoding
- Ordinal or label encoding
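As a rough sketch of how the three traditional techniques behave, here is a toy example in plain Python (the `colors` column is made up for illustration; later in the lesson we will use category_encoders instead of hand-rolled code):

```python
from collections import Counter

# Toy categorical column (made-up sample values).
colors = ["red", "blue", "red", "green", "blue", "red"]

# One-hot encoding: one binary column per category.
categories = sorted(set(colors))               # ['blue', 'green', 'red']
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Count (frequency) encoding: replace each label with how often it occurs.
counts = Counter(colors)                       # {'red': 3, 'blue': 2, 'green': 1}
count_encoded = [counts[c] for c in colors]

# Ordinal (label) encoding: map each category to an arbitrary integer.
mapping = {cat: i for i, cat in enumerate(categories)}
ordinal_encoded = [mapping[c] for c in colors]
```

Note that one-hot encoding adds one column per category, while count and ordinal encoding keep a single column; this trade-off between dimensionality and information loss is a recurring theme in the techniques below.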
Monotonic Relationship
- Ordered label encoding
- Mean encoding
- Probability ratio encoding
- Weight of evidence
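The monotonic techniques use the target variable. A minimal sketch of two of them, mean encoding and weight of evidence, on a made-up binary-target sample (plain Python, not the category_encoders implementation):

```python
import math
from collections import defaultdict

# Toy data: one categorical feature and a binary target (made up).
feature = ["A", "A", "B", "B", "B", "C"]
target  = [ 1,   0,   1,   1,   0,   0 ]

# Mean (target) encoding: replace each category with the mean of the target.
sums, cnts = defaultdict(float), defaultdict(int)
for f, t in zip(feature, target):
    sums[f] += t
    cnts[f] += 1
means = {f: sums[f] / cnts[f] for f in cnts}
mean_encoded = [means[f] for f in feature]

# Weight of evidence: ln(share of positives / share of negatives) per
# category. Categories with zero positives or zero negatives are skipped
# here, since the ratio would be undefined.
pos = {f: sum(t for g, t in zip(feature, target) if g == f) for f in cnts}
neg = {f: cnts[f] - pos[f] for f in cnts}
total_pos, total_neg = sum(pos.values()), sum(neg.values())
woe = {f: math.log((pos[f] / total_pos) / (neg[f] / total_neg))
       for f in cnts if pos[f] and neg[f]}
```

Because these encodings are derived from the target, they create a monotonic relationship between the encoded feature and the target, but they must be fitted on training data only to avoid target leakage.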
Alternative Techniques
- Rare labels encoding
- Binary encoding
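The alternative techniques can also be sketched in a few lines. Below, rare-label encoding groups infrequent categories under a single label, and binary encoding writes each category's integer index in binary across a handful of columns (the sample data and the `min_count` threshold are assumptions for illustration):

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "red", "pink"]

# Rare-label encoding: categories seen fewer than min_count times
# are collapsed into a single "rare" label.
counts = Counter(colors)
min_count = 2                                   # threshold (an assumption)
rare_encoded = [c if counts[c] >= min_count else "rare" for c in colors]

# Binary encoding: assign each category an integer index starting at 1,
# then spell that index out in binary across a few columns.
categories = sorted(set(colors))
index = {cat: i + 1 for i, cat in enumerate(categories)}
width = max(index.values()).bit_length()
binary_encoded = [[(index[c] >> b) & 1 for b in reversed(range(width))]
                  for c in colors]
```

Binary encoding needs only about log2(k) columns for k categories, which makes it a compact alternative to one-hot encoding for high-cardinality variables.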
To encode categorical variables, we will use a library called category_encoders, which implements many basic and advanced encoding methods. You can install it with either of the following commands:
# using pip
pip install category_encoders
# using conda
conda install -c conda-forge category_encoders
We will use the following data sample for our encodings: