One-hot encoding
One-hot encoding is one of the most widely used techniques for encoding categorical variables. This lesson covers one-hot encoding and its variants, its advantages and disadvantages, and sample code that implements it.
Definition
One-hot encoding replaces a categorical variable with a set of boolean variables, each of which takes the value 0 or 1 to indicate whether a particular category/label of the variable was present for that observation.
There are multiple variants of this method:
One-hot encoding into k-1 variables
One-hot encoding into k-1 binary variables, where k is the number of unique labels/categories, relies on the fact that we can use one less dimension and still represent the complete information: if an observation is 0 in all k-1 binary variables, it must be 1 in the final (removed) binary variable.
For example, take a binary variable such as a COVID-19 test result. Here k = 2 (positive/negative), so we need to create only one (k - 1 = 1) binary variable.
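Most machine learning algorithms consider all the features together during training, so a k-th binary variable that is fully determined by the other k-1 adds nothing new. For that reason, encoding categorical variables into k-1 binary variables is usually preferable, as it reduces redundant information.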
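The claim that the removed variable carries no extra information can be checked with a minimal sketch. The `colour` column and its values are hypothetical toy data, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data: a categorical variable with k = 3 labels.
data = pd.DataFrame({"colour": ["red", "green", "blue", "green", "red"]})

# Encode into k binary variables, then drop one to keep k - 1 of them.
full = pd.get_dummies(data["colour"], dtype=int)
reduced = full.drop(columns="red")

# An observation that is 0 in every kept column must belong to the
# dropped category, so the removed column can be rebuilt exactly.
recovered_red = 1 - reduced.sum(axis=1)

print(reduced.assign(red_recovered=recovered_red))
print((recovered_red == full["red"]).all())
```

The rebuilt column matches the dropped one for every row, which is why the k-th variable is redundant.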
One-hot encoding into k variables
In some scenarios, encoding into k variables is a better choice, for example:
- When working with tree-based algorithms.
- When performing feature selection with recursive algorithms.
- When it is essential to determine the importance of every single category.
The way to achieve one-hot encoding into k variables is illustrated below:
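As a minimal sketch of the idea, the k binary variables can be built by hand, one per label. The toy `country` values and the `country_` column prefix are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with k = 3 unique labels.
data = pd.DataFrame({"country": ["Egypt", "USA", "Germany", "USA", "Egypt"]})

# Build one 0/1 indicator column per label by comparing each
# observation against that label.
encoded = pd.DataFrame({
    f"country_{label}": (data["country"] == label).astype(int)
    for label in sorted(data["country"].unique())
})

print(encoded)
```

Every row contains exactly one 1, in the column that matches the observation's label.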
One-hot encoding of most frequent categories
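This variant of one-hot encoding creates binary variables only for the most frequent categories, that is, the labels that appear in the largest number of observations. Observations in any other category are 0 in all of the binary variables. This keeps the feature space from expanding too much when the variable has many categories.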
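One possible way to sketch this variant with pandas is to count the labels and encode only the top few. The cutoff `top_n = 2` and the toy data are assumptions made for the example, and a suitable cutoff depends on the dataset.

```python
import pandas as pd

# Hypothetical toy data: USA appears 4 times, Germany 3 times,
# and the remaining labels once each.
data = pd.DataFrame({"country": ["USA", "Germany", "USA", "Egypt", "Germany",
                                 "USA", "Algeria", "Palestine", "Germany", "USA"]})

top_n = 2  # assumed cutoff: how many frequent labels to encode

# value_counts() sorts labels by frequency, most frequent first.
top_labels = data["country"].value_counts().head(top_n).index.tolist()

# Create an indicator column only for the frequent labels;
# rows with any other label are 0 in every column.
encoded = pd.DataFrame({
    f"country_{label}": (data["country"] == label).astype(int)
    for label in top_labels
})

print(top_labels)
print(encoded)
```

Here five distinct labels produce only two new columns, at the cost of no longer distinguishing between the rare labels.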
Advantages of one-hot encoding
- Easy implementation
- It makes no assumptions about the distribution of the categories of the variable.
- It retains all the information of the categorical variable.
- Suitable for linear models.
Disadvantages of one-hot encoding
- If the variable has many categories, it dramatically increases the feature space.
- It adds no new information during encoding.
- It can introduce redundant information, since many dummy variables may be identical.
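The growth of the feature space can be seen in a small sketch. The `user_id` column with 500 distinct labels is a hypothetical high-cardinality variable invented for the example.

```python
import pandas as pd

# Hypothetical high-cardinality variable: 1,000 rows, 500 distinct labels.
data = pd.DataFrame({"user_id": [f"user_{i % 500}" for i in range(1000)]})

# One-hot encoding into k variables creates one column per distinct label.
encoded = pd.get_dummies(data, dtype=int)

print(data.shape)     # shape before encoding
print(encoded.shape)  # shape after encoding
```

A single column becomes 500 columns, and each row still holds only one non-zero value.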
Run this Python code, which uses the pandas and tabulate libraries, to perform one-hot encoding into k and k-1 variables:
import pandas as pd
from tabulate import tabulate

# Create a data frame, or read the data from your own CSV file.
data = pd.DataFrame({'country': ['Algeria', 'USA', 'Germany', 'Egypt', 'Germany', 'Palestine',
                                 'Egypt', 'Algeria', 'Germany', 'USA', 'USA', 'Palestine']})

# Perform one-hot encoding into k variables.
data_with_k = pd.get_dummies(data)
print("######## K variables #####")
print(tabulate(data_with_k.head(), headers='keys', tablefmt='psql'))

# Perform one-hot encoding into k - 1 variables; drop_first=True drops the first category.
data_with_k_one = pd.get_dummies(data, drop_first=True)
print("######## K - 1 variables #####")
print(tabulate(data_with_k_one.head(), headers='keys', tablefmt='psql'))