Why do we need feature engineering?
Feature engineering is a core concept and a crucial step in any machine learning pipeline, for reasons we will explore in this lesson.
Why feature engineering?
At the beginning of any machine learning project, raw data is messy and unsuitable for training a model. Hence, data exploration and cleaning should always come first. Exploring and visualizing the data helps us understand the dataset so we can effectively change data types, impute or remove missing values, and eliminate outliers or drop features that are not useful for the model to learn from. It also involves creating features from the data that represent the problem better. In the end, all of this results in a higher-performing model.
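To make this concrete, here is a minimal sketch of such a first pass using pandas. The file name and column names here are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical dataset; the file and column names are placeholders.
df = pd.read_csv("raw_data.csv")

# Explore: shape, data types, and missing values per column.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Fix a data type: e.g., a date stored as a plain string.
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Impute missing numerical values with the median.
df["income"] = df["income"].fillna(df["income"].median())

# Remove extreme outliers, here anything beyond the 99th percentile.
df = df[df["income"] <= df["income"].quantile(0.99)]

# Create a new feature that may represent the problem better.
df["account_age_days"] = (pd.Timestamp.today() - df["signup_date"]).dt.days
```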
Still, it is essential to understand that predictive analytics is not magic. Even though an algorithm’s learning phase is fundamental, it can only extract meaning from the data you provide.
Algorithms do not have the luxury of human intuition. Therefore, most of the time, the success of an algorithm depends on how you engineer the input features.
Course Sections
This course is divided into twelve sections, starting with the essence of feature engineering. The sections are structured as follows:
- Feature Types: This part will present the different types of variables, such as continuous, discrete, and categorical (nominal or ordinal) variables, as well as time-date and mixed variables.
- Common Issues: In this part, we will explore common issues that occur in real-world datasets, such as missing data, skewed variable distributions, data imputation, outliers, and more.
- Dealing with Missing Values: We will learn how to fill in missing data in your dataset using the most effective methods.
- Categorical Encoding: In this part, we will discuss the different approaches to transforming categorical variables into numbers, such as frequency encoding, one-hot encoding, and others (see the first sketch after this list).
- Feature Transformation: We will look at the mathematical transformations you can apply to improve the distribution of numerical variables, such as logarithmic or reciprocal transformations (see the second sketch after this list).
- Variable Discretization: This part looks at the different techniques used to discretize variables, such as equal-width and equal-frequency binning, discretization with decision trees, clustering, and more.
- Dealing with Outliers: This part will show how to recognize outliers and eliminate them from your dataset.
- Feature Scaling: This part will cover various techniques to scale features, such as standardization, min-max scaling, scaling to unit vector length, and more (see the third sketch after this list).
- Handling Time-Date and Mixed Variables: We will talk about several methods to generate new features from date, time, and mixed variables.
- Engineering Geospatial Features: In this part, we will see how to deal with geospatial features that are often represented as longitude and latitude.
- Resampling Imbalanced Datasets: This part will present resampling techniques used to address class imbalance, where the classes are not represented equally in a dataset.
- Advanced Feature Engineering: We will talk about advanced categorical encoding, advanced outlier detection, automated feature engineering, and more.
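To give a taste of what is ahead, here is a minimal sketch of one-hot and frequency encoding with pandas; the toy column below is made up for illustration:

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"city": ["Paris", "London", "Paris", "Tokyo", "London", "Paris"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Frequency encoding: replace each category with its relative frequency.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

print(one_hot.head())
print(df.head())
```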
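Likewise, a skewed numerical variable can often be improved with a simple mathematical transformation. A minimal sketch on synthetic data:

```python
import numpy as np

# A right-skewed variable (synthetic, for illustration).
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)

# Logarithmic transformation: compresses the long right tail.
x_log = np.log(x)

# Reciprocal transformation: another option when values are strictly positive.
x_recip = 1.0 / x

# Compare skewness (third standardized moment) before and after the log.
print(f"skew before: {np.mean((x - x.mean()) ** 3) / x.std() ** 3:.2f}")
print(f"skew after log: {np.mean((x_log - x_log.mean()) ** 3) / x_log.std() ** 3:.2f}")
```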
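And feature scaling takes only a couple of lines with scikit-learn; again, the feature matrix below is a toy example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix, for illustration.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])

# Standardization: zero mean, unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: rescale each feature to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```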
In each part of this course, we will learn many methods and techniques hands-on, exploring their advantages, limitations, the assumptions each method makes, and when you should consider applying each technique.