Advanced categorical encoding
In this lesson, we look at a range of advanced techniques for categorical encoding, such as the CatBoost and leave-one-out (LOO) encoders. For each encoder, we go through its definition, the idea behind how it works, and how to implement it in your projects.
Advanced feature engineering techniques
We have learned many techniques and methods for feature engineering so far. In this last part of the course, we will learn some advanced feature engineering approaches. Specifically, we will talk about advanced categorical encoding, advanced outlier detection, automated feature engineering, and more.
Advanced categorical encoding
Beyond the basic methods for encoding categorical variables, there are more advanced and often more effective techniques, which we cover in this section.
For that, we need a new library called category_encoders, which gives us access to these methods.
Use one of the following commands to install it:
pip install category_encoders
conda install -c conda-forge category_encoders
CatBoost encoder
A CatBoost encoder is very similar to target encoding, and it uses a principle similar to time-series validation. The idea is to replace each category with the mean target value for that category, but the target statistic relies only on the observed history: the value for the current row is calculated only from the rows (observations) that come before it. Because the result depends on the order of the rows, a single pass can still overfit, especially for early rows that have little history. To reduce this, the CatBoost approach repeats the encoding over several randomly shuffled copies of the dataset.
Run the following code:
import pandas as pd
import numpy as np
from tabulate import tabulate
from category_encoders import CatBoostEncoder

data = pd.DataFrame({
    'color': ['Yellow', 'Yellow', 'Blue', 'Yellow', 'Red', 'Yellow', 'Red', 'Red', 'Yellow', 'Blue'],
    'target': [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]
})

# Create the encoder.
encoder = CatBoostEncoder(return_df=True)

# Fit and transform the data.
new_data = encoder.fit_transform(data['color'], data['target'])
print(tabulate(new_data, headers='keys', tablefmt='psql'))
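To make the "history only" idea concrete, here is a minimal pure-Python sketch of an ordered target statistic on the same toy data. The function name ordered_target_encode, the use of the overall target mean as a prior, and the smoothing weight a=1 are assumptions made for illustration; the values printed by CatBoostEncoder may differ in detail depending on the library version and its settings.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, a=1.0):
    """Encode each row using only the targets of earlier rows in its category.

    Hypothetical helper for illustration. `a` is an assumed smoothing weight
    that controls how strongly the prior (overall target mean) is mixed in.
    """
    prior = sum(targets) / len(targets)
    running_sum = defaultdict(float)   # sum of targets seen so far, per category
    running_count = defaultdict(int)   # number of rows seen so far, per category
    encoded = []
    for cat, y in zip(categories, targets):
        # Use the history *before* this row, then add the row to the history.
        encoded.append((running_sum[cat] + prior * a) / (running_count[cat] + a))
        running_sum[cat] += y
        running_count[cat] += 1
    return encoded


colors = ['Yellow', 'Yellow', 'Blue', 'Yellow', 'Red',
          'Yellow', 'Red', 'Red', 'Yellow', 'Blue']
targets = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

encoded = ordered_target_encode(colors, targets)
for color, value in zip(colors, encoded):
    print(f"{color:<8}{value:.4f}")
```

In this sketch, the first row of each category has no history, so it falls back to the prior (0.6 here, the overall target mean). Later rows of the same category get different values as the history grows, which is why the row order matters.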
Leave-one-out Encoder (LOO)
Leave-one-out Encoding is another example of target-based encoders; this encoder essentially calculates the mean of the target variables for all the records containing the same value for the categorical feature variable in question, but ...
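As a rough illustration of the leave-one-out idea on training data, the pure-Python sketch below computes, for each row, the mean target of the other rows that share its category. The function name leave_one_out_encode and the fallback to the overall target mean for categories that appear only once are assumptions made for this example, not a description of the library's internals.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Mean target of all *other* rows in the same category.

    Hypothetical helper for illustration. Categories with a single row
    fall back to the overall target mean (an assumed convention).
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for cat, y in zip(categories, targets):
        total[cat] += y
        count[cat] += 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for cat, y in zip(categories, targets):
        if count[cat] > 1:
            # Remove the current row's own target before averaging.
            encoded.append((total[cat] - y) / (count[cat] - 1))
        else:
            encoded.append(global_mean)
    return encoded


loo_colors = ['Yellow', 'Yellow', 'Blue', 'Yellow', 'Red',
              'Yellow', 'Red', 'Red', 'Yellow', 'Blue']
loo_targets = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

loo_encoded = leave_one_out_encode(loo_colors, loo_targets)
for color, value in zip(loo_colors, loo_encoded):
    print(f"{color:<8}{value:.4f}")
```

Unlike the ordered statistic used by the CatBoost encoder, this calculation does not depend on row order: every row in a category sees all of the other rows, before and after it.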