One-hot encoding

One-hot encoding is one of the most widely used techniques for encoding categorical variables. This lesson covers one-hot encoding and its variants, its advantages and disadvantages, and sample code that implements it.

Definition

One-hot encoding replaces a categorical variable with a set of Boolean variables. Each takes the value 0 or 1 to indicate whether a particular category (label) of the variable is present in that observation.

There are multiple variants of this method:

One-hot encoding into `k-1` variables

One-hot encoding into k-1 binary variables, where k is the number of unique categories (labels), relies on the fact that one fewer dimension still represents the complete information. If an observation is 0 in all k-1 binary variables, it must be 1 in the removed one.

For example, take a binary variable such as a COVID-19 test result. Here k = 2 (positive/negative), so we need to create only one (k - 1 = 1) binary variable.

Most machine learning algorithms look at all the variables in the dataset together during training. For that reason, encoding a categorical variable into k-1 binary variables is usually preferable, as it avoids adding a redundant variable.
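As a small sketch of the k-1 idea, the snippet below encodes a made-up binary test-result column with pandas. The column name `test_result` and the sample values are assumptions chosen for illustration. It shows that with k columns every row sums to 1, which is the redundancy that k-1 encoding removes, and that the dropped column can be rebuilt from the one that is kept.

```python
import pandas as pd

# Hypothetical sample data: a binary test result (k = 2 categories).
df = pd.DataFrame({"test_result": ["positive", "negative", "negative", "positive"]})

# k variables: one 0/1 column per category.
k_encoded = pd.get_dummies(df["test_result"], dtype=int)

# Each row has exactly one 1, so the k columns always sum to 1 (redundant information).
row_sums = k_encoded.sum(axis=1)

# k-1 variables: drop_first=True drops the first category in sorted order ("negative").
k_minus_1 = pd.get_dummies(df["test_result"], drop_first=True, dtype=int)

# The dropped column can be recovered: it is 1 exactly where all kept columns are 0.
recovered_negative = 1 - k_minus_1.sum(axis=1)

print(k_minus_1)
print(recovered_negative.tolist())
```

Only the `positive` column remains after dropping the first category, and a 0 in that column means the result was negative.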

One-hot encoding into `k` variables

In some scenarios, encoding into k variables is the better choice, for example:

When working with tree-based algorithms.
When performing feature selection with recursive algorithms.
When it is essential to determine the importance of every single category.

One-hot encoding into k variables creates one binary variable for each of the k categories, so every observation has a 1 in exactly one of them.

One-hot encoding of most frequent categories

This variant of one-hot encoding considers only the most frequent categories, that is, the categories that appear in the largest number of observations. This keeps the feature space from expanding too much when a variable has high cardinality.
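One possible way to implement this variant is sketched below with pandas. The helper name `one_hot_top_categories`, its `top_n` parameter, and the sample data are assumptions made for illustration, not part of any library.

```python
import pandas as pd

# Hypothetical sample data; in practice this would be a high-cardinality column.
df = pd.DataFrame({"country": ["Algeria", "USA", "Germany", "Egypt", "Germany", "Palestine",
                               "Egypt", "Algeria", "Germany", "USA", "USA", "Palestine"]})


def one_hot_top_categories(frame, column, top_n):
    """Return 0/1 columns for the top_n most frequent categories of `column`."""
    # value_counts() sorts categories by frequency, most frequent first.
    top_categories = frame[column].value_counts().head(top_n).index.tolist()
    encoded = pd.DataFrame(index=frame.index)
    for category in top_categories:
        # 1 where the row holds this category, 0 otherwise.
        encoded[f"{column}_{category}"] = (frame[column] == category).astype(int)
    return encoded


encoded = one_hot_top_categories(df, "country", top_n=2)
print(encoded.head())
```

Rows that belong to a less frequent category get 0 in every new column, so all the remaining categories are implicitly grouped together.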

Advantages of one-hot encoding

It is easy to implement.
It makes no assumptions about the distribution of the variable's categories.
It retains all the information of the categorical variable.
It is suitable for linear models.

Disadvantages of one-hot encoding

If the variable has many categories, it will dramatically increase the feature space.
It adds no new information during encoding.
It can result in information redundancy, since many dummy variables may be identical.

Run this Python code, which uses the pandas library, to perform one-hot encoding into k and k-1 variables:

Python 3.8

```python
import pandas as pd
from tabulate import tabulate

# Create a data frame, or read the data from your own CSV file.
data = pd.DataFrame({'country': ['Algeria', 'USA', 'Germany', 'Egypt', 'Germany', 'Palestine',
                                 'Egypt', 'Algeria', 'Germany', 'USA', 'USA', 'Palestine']})

# One-hot encoding into k variables: one 0/1 column per category.
# dtype=int keeps the output as 0/1 (recent pandas versions default to True/False).
data_with_k = pd.get_dummies(data, dtype=int)
print("######## K variables #####")
print(tabulate(data_with_k.head(), headers='keys', tablefmt='psql'))

# One-hot encoding into k - 1 variables: drop_first=True drops the first category's column.
data_with_k_one = pd.get_dummies(data, drop_first=True, dtype=int)
print("######## K - 1 variables #####")
print(tabulate(data_with_k_one.head(), headers='keys', tablefmt='psql'))
```
