Label encoding in Python

Label encoding is a data preprocessing technique used in machine learning projects that converts categorical columns into numerical values. It plays a significant role at times when we need to fitTraining a machine learning model on a dataset. our data into a machine-learning model that only takes numerical values.

In this Answer, we will explore the implementation of converting categorical data present in strings into numerical values using scikit-learn LaberEncoder class.

Example

Before getting into the coding part and using scikit-learn, let us first understand the result of performing label encoding on a dataset. For that, let us consider an example dataset of fruits along with their prices. The dataset is shown below:

Once we click the "Run" button, the data set's column gets encoded, which is now perfect to be fitted on a machine learning model that only takes numerical values.

Note: Sklearn's label encoding module encodes only a single column at a time.

The explanation of the above code is explained below:

Line 1: We import pandas library, which is used to create the DataFrame.
Line 2: We import the LabelEncoder from the sklearn.preprocessing package.
Lines 4–7: We create a DataFrame with the example data we have created in the above sections.
Line 8: We create an instance of the LabelEncoder class and store it in encoder variable.
Line 9: We use the fit_transform method of the encoder object and pass the 1-dimensional array which is to be encoded. We store the encoded array in the encoded_col variable.
Line 10: We replace the Fruits column data with the encoded_col data.
Line 11: We display the updated data frame with label encoded column.

Limitation

The limitation of label encoding is that as it converts categorical columns into numerical ones by assigning numbers starting from 0, this may cause priority issues as the column with a higher number will be considered to have a higher priority than a number having lower numerical values.

As an example, in our example data set, Apple is encoded to 0, and Orange is encoded to 2. But there is no priority relation between the two fruits.

Conclusion

In conclusion, label encoding does have limitations. Still, it is a vital tool to pre-process the data and make it perfect to fit it on a machine learning model that only takes numerical values.

Fruit	Price ($)
Apple	2
Banana	3
Orange	4
Banana	3
Apple	2

Label encoding in Python

Example

Dataset

Label encoded dataset

Encoding a column

Limitation

Conclusion