Advanced categorical encoding

In this lesson, we will look at advanced techniques for categorical encoding, such as the CatBoost and leave-one-out (LOO) encoders. For each encoder, we will go through its definition, the idea behind how it works, and how to implement it in your projects.

Advanced feature engineering techniques

We have learned many techniques and methods for feature engineering so far. In this last part of the course, we will learn some advanced feature engineering approaches. Specifically, we will talk about advanced categorical encoding, advanced outlier detection, automated feature engineering, and more.

Advanced categorical encoding

Beyond the basic methods for encoding categorical variables, there are more advanced techniques that are often more effective. We will cover them in the next section.

For that, we need a new library called category_encoders that will give us access to these new methods.

Use one of the following commands to install it, depending on whether you use pip or conda:

pip install category_encoders
conda install -c conda-forge category_encoders

CatBoost encoder

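A CatBoost encoder is very similar to target encoding, and it borrows a principle from time series validation: only past observations are used. The idea is to replace each category with the mean target value for that category, but the target statistic for a row relies on the observed history, i.e., it is calculated only from the rows (observations) before it. Because a row's own target never contributes to its encoded value, this ordering reduces the target leakage and overfitting that plain target encoding is prone to. The encoded values depend on the row order, so the training data should be randomly shuffled before fitting if it does not already have a meaningful order. For the same reason, the CatBoost algorithm itself works with several randomly shuffled copies of the dataset.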

Run the following code:


import pandas as pd
from tabulate import tabulate
from category_encoders import CatBoostEncoder

# A small example dataset: one categorical feature and a binary target.
data = pd.DataFrame({'color': ['Yellow', 'Yellow', 'Blue', 'Yellow', 'Red', 'Yellow', 'Red', 'Red', 'Yellow', 'Blue'],
                     'target': [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]})

# Create the encoder; return_df=True returns a pandas DataFrame.
encoder = CatBoostEncoder(return_df=True)

# Fit the encoder on the feature and target, then transform the feature.
new_data = encoder.fit_transform(data['color'], data['target'])

# Print the encoded column as a formatted table.
print(tabulate(new_data, headers='keys', tablefmt='psql'))
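
To make the "only the rows before it" idea concrete, here is a minimal sketch of an ordered target statistic written in plain pandas. It is an illustration, not the library's code: the function name ordered_target_encode is hypothetical, and using the global target mean as the prior with a smoothing weight a = 1 is an assumption about how the encoder handles rows that have no history yet.

```python
import pandas as pd

def ordered_target_encode(categories, target, a=1.0):
    """Hypothetical helper: encode each row using only the rows before it.

    Assumption: the prior is the global target mean, weighted by `a`.
    """
    prior = target.mean()
    grouped = target.groupby(categories)
    # Sum of targets of earlier rows in the same category (exclude the current row).
    prev_sum = grouped.cumsum() - target
    # Number of earlier rows in the same category.
    prev_count = grouped.cumcount()
    return (prev_sum + prior * a) / (prev_count + a)

data = pd.DataFrame({
    'color': ['Yellow', 'Yellow', 'Blue', 'Yellow', 'Red',
              'Yellow', 'Red', 'Red', 'Yellow', 'Blue'],
    'target': [0, 1, 1, 1, 1, 0, 1, 0, 1, 0],
})

encoded = ordered_target_encode(data['color'], data['target'])
print(data.assign(color_encoded=encoded.round(4)))
```

Under these assumptions, the first Yellow row has no history and receives the prior (the global mean, 0.6), while the second Yellow row has seen a single earlier Yellow with target 0 and receives (0 + 0.6) / (1 + 1) = 0.3. Reordering the rows changes the encoded values, which is why shuffling the training data matters.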
