One-hot encoding in Python

Most machine learning algorithms cannot operate directly on categorical data, so categorical values must first be converted to numerical ones. One-hot encoding is one of the techniques used to perform this conversion. It is commonly used when deep learning techniques are applied to sequential classification problems.

One-hot encoding represents categorical variables as binary vectors. Each categorical value is first mapped to an integer. Each integer is then represented as a binary vector that is all 0s except at the index of that integer, which is set to 1. For example, if red, green, and blue are mapped to 0, 1, and 2, then green becomes [0, 1, 0].

Manual one-hot encoding

Have a look at the example below, which manually converts a categorical list of colors to a numerical list using one-hot encoding:

import numpy as np

# Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

# Universal list of colors
total_colors = ["red", "green", "blue", "black", "yellow"]

# Map each color to an integer (its index in total_colors)
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x

# Build one binary vector per color: all 0s, with a 1 at the mapped index
one_hot_encode = []
for c in colors:
    # tolist() yields plain Python ints, so the printed result stays readable
    arr = np.zeros(len(total_colors), dtype=int).tolist()
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)

print(one_hot_encode)
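
The same idea can be wrapped in small reusable helpers and also run in reverse. The sketch below uses only the standard library; the function names `build_mapping`, `one_hot`, and `decode` are hypothetical and chosen for illustration. It assumes every value to encode appears in the universal list of categories.

```python
def build_mapping(categories):
    """Map each category to its integer index, in the order given."""
    return {category: index for index, category in enumerate(categories)}


def one_hot(value, mapping):
    """Return a list of 0s with a single 1 at the index assigned to value."""
    vector = [0] * len(mapping)
    vector[mapping[value]] = 1  # raises KeyError for a category not in mapping
    return vector


def decode(vector, categories):
    """Recover the category by finding the position of the single 1."""
    return categories[vector.index(1)]


total_colors = ["red", "green", "blue", "black", "yellow"]
mapping = build_mapping(total_colors)

encoded = [one_hot(c, mapping) for c in ["red", "green", "yellow", "red", "blue"]]
print(encoded)

decoded = [decode(v, total_colors) for v in encoded]
print(decoded)
```

Because each vector contains exactly one 1, decoding is just a lookup of that position, so no information is lost by the encoding.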

One-hot encoding using scikit-learn

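Take a look at the example below. It uses the scikit-learn library to perform one-hot encoding: LabelEncoder first maps each color to an integer, and OneHotEncoder then converts those integers to binary vectors: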

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

# Integer mapping using LabelEncoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(colors)
print(integer_encoded)

# OneHotEncoder expects a 2-D array: one row per sample, one column
integer_encoded = integer_encoded.reshape(-1, 1)

# One-hot encoding; sparse_output=False returns a dense NumPy array
# (scikit-learn 1.2 renamed this argument from `sparse`)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
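
In recent scikit-learn releases, OneHotEncoder also accepts string categories directly, so the LabelEncoder step can be skipped. The following is a minimal sketch of that variant, assuming a scikit-learn version with this behavior is installed; it calls `toarray()` on the default sparse result instead of passing a dense-output argument, to avoid depending on that argument's name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = ["red", "green", "yellow", "red", "blue"]

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature
X = np.array(colors).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense
encoded = encoder.fit_transform(X).toarray()

# Categories learned from the data, in sorted order
print(encoder.categories_[0])
print(encoded)

# inverse_transform maps the binary rows back to the original labels
decoded = encoder.inverse_transform(encoded).ravel().tolist()
print(decoded)
```

Note that the column order follows the sorted categories learned from the data, not the order in which the colors first appear.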

Copyright ©2024 Educative, Inc. All rights reserved