Most machine learning algorithms cannot operate directly on categorical data; the categorical data must first be converted to numerical data. One-hot encoding is one of the techniques used to perform this conversion. It is especially common when deep learning techniques are applied to sequence classification problems.
One-hot encoding represents categorical variables as binary vectors. The categorical values are first mapped to integer values, and each integer value is then represented as a binary vector that is all 0s except at the index of that integer, which is set to 1. For example, if the categories are red, green, and blue, then green might map to the integer 1 and be encoded as the vector [0, 1, 0].
Have a look at the example below, which manually converts a categorical list of colors to a numerical list using one-hot encoding:
import numpy as np

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

### Universal list of colors
total_colors = ["red", "green", "blue", "black", "yellow"]

### Map each color to an integer
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x

one_hot_encode = []
for c in colors:
    arr = list(np.zeros(len(total_colors), dtype=int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)

print(one_hot_encode)
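With the mapping above (red → 0, green → 1, blue → 2, black → 3, yellow → 4), each color in colors becomes a five-element vector with a single 1 at its mapped index, so the printed list should contain:

[1, 0, 0, 0, 0]   # red
[0, 1, 0, 0, 0]   # green
[0, 0, 0, 0, 1]   # yellow
[1, 0, 0, 0, 0]   # red
[0, 0, 1, 0, 0]   # blue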
Take a look at the example below, which uses the scikit-learn library to perform one-hot encoding:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

### Integer mapping using LabelEncoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(colors)
print(integer_encoded)

### Reshape to a 2D array: one row per sample, one column per feature
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

### One-hot encoding
### Note: on scikit-learn 1.2 and later, use sparse_output=False instead of sparse=False
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
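On recent scikit-learn versions (0.20 and later), OneHotEncoder can encode string categories directly, so the LabelEncoder step can be skipped. Below is a minimal sketch of that approach, assuming scikit-learn 1.2 or later for the sparse_output parameter:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = ["red", "green", "yellow", "red", "blue"]

### OneHotEncoder expects a 2D array: one row per sample, one column per feature
colors_2d = np.array(colors).reshape(-1, 1)

### sparse_output=False returns a dense array instead of a sparse matrix
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(colors_2d)

print(onehot_encoder.categories_)  # the categories learned from the data, sorted alphabetically
print(onehot_encoded)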