Label encoding in Python
Label encoding is a data preprocessing technique used in machine learning projects that converts categorical columns into numerical values. It plays a significant role at times when we need to
In this Answer, we will explore how to convert categorical data stored as strings into numerical values using scikit-learn's LabelEncoder class.
Example
Before getting into the coding part and using scikit-learn, let us first understand the result of performing label encoding on a dataset. For that, let us consider an example dataset of fruits along with their prices. The dataset is shown below:
Dataset
Fruit | Price ($) |
Apple | 2 |
Banana | 3 |
Orange | 4 |
Banana | 3 |
Apple | 2 |
As we can see, the dataset contains two columns: "Fruit" and "Price ($)". If we want to fit this dataset on a machine learning model that only takes numerical values, we need to apply label encoding to the "Fruit" column. The result of applying label encoding will be:
Label encoded dataset
Fruit | Price ($) |
0 | 2 |
1 | 3 |
2 | 4 |
1 | 3 |
0 | 2 |
The output shows that the values of the "Fruit" column have been converted into numerical values starting from 0. The numerical values are not assigned at random. Rather, each distinct value receives a number according to its position in sorted (here, alphabetical) order: Apple becomes 0, Banana becomes 1, and Orange becomes 2.
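To make the numbering rule concrete, here is a minimal plain-Python sketch of the idea. It is only an illustration of sorting the distinct labels and numbering them from 0, not scikit-learn's actual implementation, and the variable names are chosen for this example.

```python
# Plain-Python sketch of the idea; not scikit-learn's implementation.
fruits = ["Apple", "Banana", "Orange", "Banana", "Apple"]

# Sort the distinct labels, then number them from 0 in that order.
mapping = {label: code for code, label in enumerate(sorted(set(fruits)))}

# Replace every label with its code.
encoded = [mapping[fruit] for fruit in fruits]

print(mapping)  # {'Apple': 0, 'Banana': 1, 'Orange': 2}
print(encoded)  # [0, 1, 2, 1, 0]
```

Note that Python compares strings by Unicode code point, so in this sketch a lowercase label such as "banana" would sort after "Orange".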
Encoding a column
Now, we will look at how to encode a dataset's column. We will create a data frame from the data given above and encode its fruit column, which the code names "Fruits". The code is shown below:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruits = pd.DataFrame({
    'Fruits': ["Apple", "Banana", "Orange", "Banana", "Apple"],
    'Price ($)': [2, 3, 4, 3, 2]  # numeric prices, so the final frame is fully numerical
})
encoder = LabelEncoder()  # create the encoder
encoded_col = encoder.fit_transform(fruits["Fruits"])  # learn the labels and encode them
fruits['Fruits'] = encoded_col  # overwrite the strings with their integer codes
print(fruits)

Once we run the code, the "Fruits" column is encoded, and the data frame is ready to be fitted on a machine learning model that only takes numerical values.
Note: scikit-learn's LabelEncoder encodes only a single column (a 1-dimensional array) at a time.
The above code is explained below:
Line 1: We import the pandas library, which is used to create the DataFrame.
Line 2: We import LabelEncoder from the sklearn.preprocessing package.
Lines 4–7: We create a DataFrame with the example data from the section above.
Line 8: We create an instance of the LabelEncoder class and store it in the encoder variable.
Line 9: We call the fit_transform method of the encoder object and pass it the 1-dimensional array to be encoded. We store the encoded array in the encoded_col variable.
Line 10: We replace the Fruits column data with the encoded_col data.
Line 11: We display the updated data frame with the label-encoded column.
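Because LabelEncoder handles one column at a time, a data frame with several categorical columns needs one fitted encoder per column. The sketch below shows one possible way to do this. The "Color" column and the encoders dictionary are hypothetical additions for illustration and are not part of the example dataset. The sketch also uses the encoder's classes_ attribute and inverse_transform method to inspect and reverse the encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# "Color" is a hypothetical second categorical column added for illustration.
fruits = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Orange", "Banana", "Apple"],
    "Color": ["Red", "Yellow", "Orange", "Yellow", "Green"],
    "Price ($)": [2, 3, 4, 3, 2],
})

# LabelEncoder works on one 1-dimensional array at a time,
# so keep a separate fitted encoder for each categorical column.
encoders = {}
for column in ["Fruits", "Color"]:
    encoders[column] = LabelEncoder()
    fruits[column] = encoders[column].fit_transform(fruits[column])

print(fruits)

# classes_ lists the original labels; a label's position is its code.
print(list(encoders["Fruits"].classes_))

# inverse_transform maps the codes back to the original labels.
print(list(encoders["Fruits"].inverse_transform(fruits["Fruits"])))
```

Keeping each fitted encoder around is a design choice in this sketch: it lets us apply the same mapping to new data later with transform, or recover the original strings with inverse_transform.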
Limitation
The limitation of label encoding is that the numbers it assigns imply an order. Because it converts categories into integers starting from 0, a model may treat a category with a higher number as having a higher priority than a category with a lower number.
For example, in our data set, Apple is encoded as 0 and Orange is encoded as 2, but there is no priority relation between the two fruits.
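The following minimal sketch illustrates the issue. It encodes the three fruit names and then performs comparisons and arithmetic on the codes, which are valid operations on integers but carry no meaning for fruit names. The variable names are chosen for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Apple", "Banana", "Orange"])

# Convert the NumPy integers to plain Python ints for readability.
apple, banana, orange = (
    int(code) for code in encoder.transform(["Apple", "Banana", "Orange"])
)

# The codes support comparisons and arithmetic that mean nothing for fruits.
print(orange > apple)        # True
print(abs(orange - apple))   # 2
print(abs(banana - apple))   # 1
```

A model that relies on numeric magnitude or distance could read these results as "Orange ranks above Apple" or "Banana is closer to Apple than Orange is", even though the data contains no such relation.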
Conclusion
In conclusion, label encoding does have limitations. Still, it is a vital tool for preprocessing data so that it can be fitted on a machine learning model that only takes numerical values.