One-hot encoding

One-hot encoding is one of the most widely used techniques for encoding categorical variables. This lesson covers one-hot encoding and its variants, its advantages and disadvantages, and sample code that implements it.

Definition

One-hot encoding replaces a categorical variable with a set of binary variables, each taking the value 0 or 1 to indicate whether a particular category (label) of the variable is present for that observation.

There are multiple variants of this method:

One-hot encoding into k-1 variables

One-hot encoding into k-1 binary variables, where k is the number of unique categories, relies on the fact that one fewer dimension still represents the complete information: if an observation is 0 in all k-1 binary variables, it must be 1 in the final (removed) binary variable.

For example, take a binary variable such as a COVID-19 test result. Here k = 2 (positive/negative), so we need to create only one (k-1 = 1) binary variable.

Most machine learning algorithms consider all features at once during training. For that reason, encoding categorical variables into k-1 binary variables is usually preferable: the k-th variable adds no information and, in linear models with an intercept, is perfectly collinear with the others.
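
To make the redundancy argument concrete, here is a minimal sketch using pandas. The `tests` data frame and its values are made up for illustration; it encodes the two-category test result into a single binary variable and then recovers the dropped category from it.

```python
import pandas as pd

# Hypothetical COVID-19 test results (k = 2 categories).
tests = pd.DataFrame({"result": ["positive", "negative", "negative", "positive"]})

# k - 1 = 1 binary variable; the first category in sorted order
# ("negative") is the one that gets dropped.
encoded = pd.get_dummies(tests, drop_first=True, dtype=int)
print(encoded)

# The dropped variable is 1 exactly when all kept variables are 0,
# so it can be rebuilt from the remaining columns.
recovered_negative = 1 - encoded.sum(axis=1)
print(recovered_negative.tolist())
```

Because the removed column can always be rebuilt this way, dropping it loses no information.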

One-hot encoding into k variables

In some scenarios, encoding a variable into k binary variables is the better choice, for example:

  • When working with tree-based algorithms.
  • When performing feature selection with recursive algorithms.
  • When it is essential to determine the importance of every single category.

With one-hot encoding into k variables, every category gets its own binary variable, so each observation has exactly one of the k variables set to 1.
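
As a rough sketch of encoding into k variables, the snippet below uses a made-up `colors` data frame with k = 3 categories. The data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical variable with k = 3 categories.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category; columns appear in sorted category order.
encoded_k = pd.get_dummies(colors, dtype=int)
print(encoded_k)

# Each observation belongs to exactly one category, so every row sums to 1.
row_sums = encoded_k.sum(axis=1)
print(row_sums.tolist())
```

Every category keeps its own column, which is what makes per-category importance scores possible.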

One-hot encoding of most frequent categories

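This variant of one-hot encoding creates binary variables only for the most frequent categories, that is, the categories that appear in the largest number of observations. It is useful for variables with high cardinality (many distinct categories), because it keeps the feature space from expanding too much.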
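
One possible way to implement this variant with pandas is sketched below. The data, the cutoff `top_n`, and the column naming scheme are assumptions chosen for illustration, not a prescribed method.

```python
import pandas as pd

# Hypothetical data: USA appears 3 times, Germany 2, Egypt 1, Algeria 1.
data = pd.DataFrame(
    {"country": ["USA", "Germany", "USA", "Egypt", "Germany", "USA", "Algeria"]}
)

top_n = 2  # illustrative cutoff: keep only the 2 most frequent categories

# value_counts() sorts categories by frequency in descending order.
top_categories = data["country"].value_counts().head(top_n).index.tolist()

# Create one binary column per frequent category.
# Rows from infrequent categories end up with 0 in every column.
encoded = pd.DataFrame(
    {f"country_{cat}": (data["country"] == cat).astype(int) for cat in top_categories}
)
print(encoded)
```

With this approach, all infrequent categories share the same all-zero representation, so they can no longer be told apart.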

Advantages of one-hot encoding

  • It is easy to implement.
  • It makes no assumptions about the distribution of the variable's categories.
  • It retains all the information of the categorical variable.
  • It is suitable for linear models.

Disadvantages of one-hot encoding

  • If the variable has many categories, it dramatically increases the feature space.
  • It adds no new information during encoding.
  • It can introduce redundant information, since many dummy variables may be identical.
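
The growth of the feature space can be seen in a small sketch. The `postal_code` variable and its 200 distinct values are synthetic and only meant to illustrate the effect.

```python
import pandas as pd

# Hypothetical high-cardinality variable: 1,000 rows, 200 distinct codes.
codes = pd.DataFrame({"postal_code": [f"P{i % 200:03d}" for i in range(1000)]})

# One-hot encoding into k variables turns 1 column into 200.
encoded = pd.get_dummies(codes, dtype=int)
print(codes.shape, "->", encoded.shape)
```

A single column becomes as many columns as there are distinct categories.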

The following Python code uses the pandas library to perform one-hot encoding into k and k-1 variables:

```python
import pandas as pd
from tabulate import tabulate  # third-party package, used only for pretty-printing

# Create a data frame (or read the data from a CSV file instead).
data = pd.DataFrame({'country': ['Algeria', 'USA', 'Germany', 'Egypt', 'Germany', 'Palestine',
                                 'Egypt', 'Algeria', 'Germany', 'USA', 'USA', 'Palestine']})

# One-hot encoding into k variables: one 0/1 column per category.
# dtype=int keeps the output as 0/1; recent pandas versions default to booleans.
data_with_k = pd.get_dummies(data, dtype=int)
print("######## K variables #####")
print(tabulate(data_with_k.head(), headers='keys', tablefmt='psql'))

# One-hot encoding into k - 1 variables: drop_first=True drops the first
# category in sorted order (here 'Algeria').
data_with_k_one = pd.get_dummies(data, drop_first=True, dtype=int)
print("######## K - 1 variables #####")
print(tabulate(data_with_k_one.head(), headers='keys', tablefmt='psql'))
```
