Encoding
Explore how encoding transforms categorical features into numerical data suitable for machine learning algorithms. Understand when and how to use LabelEncoder, OneHotEncoder, and OrdinalEncoder within scikit-learn to prepare your dataset effectively for classification and other ML tasks.
Encoding refers to the process of converting categorical features into numerical features so that ML algorithms can use them. Categorical features take on a limited number of discrete values and often have no inherent order, which makes them difficult for most algorithms to handle directly. Encoding turns these features into numerical representations that ML algorithms can work with.
It's common in ML to have categorical features, such as "Sex," "Zip code," and "Profession," that need to be transformed before they can be ingested by an ML algorithm. The table below shows an example of this type of categorical data:
| Name | Sex | Zip code | Profession |
|------|-----|----------|------------|
| John Smith | Male | 12345 | Engineer |
| Amy Johnson | Female | 67890 | Teacher |
| Michael Davis | Male | 54321 | Doctor |
| Sarah Miller | Female | 98765 | Accountant |
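To make the problem concrete, here is a minimal sketch (not part of the lesson's own code) that loads the table above into a pandas DataFrame; the column names and values simply mirror the table, and the dtypes show why the data can't be fed to an ML algorithm as-is:

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "Name": ["John Smith", "Amy Johnson", "Michael Davis", "Sarah Miller"],
    "Sex": ["Male", "Female", "Male", "Female"],
    "Zip code": ["12345", "67890", "54321", "98765"],
    "Profession": ["Engineer", "Teacher", "Doctor", "Accountant"],
})

# Every column is stored as strings (object dtype), not numbers,
# so the categorical columns must be encoded before modeling.
print(df.dtypes)
```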
The scikit-learn library provides several tools for encoding features, including LabelEncoder, OneHotEncoder, and OrdinalEncoder.
The LabelEncoder method
The LabelEncoder method assigns integer values to each category, starting from 0.
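As a quick illustration (a sketch using the "Profession" values from the table above, not the lesson's exact code), LabelEncoder learns the set of categories and maps each one to an integer label:

```python
from sklearn.preprocessing import LabelEncoder

professions = ["Engineer", "Teacher", "Doctor", "Accountant"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(professions)

# Categories are sorted alphabetically and numbered from 0:
# ['Accountant', 'Doctor', 'Engineer', 'Teacher']
print(list(encoder.classes_))

# Each original value is replaced by its integer label: [2 3 1 0]
print(encoded)
```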