In this course, you will learn how to apply basic and more advanced feature engineering techniques to tabular data with Python. We will cover a range of techniques and methods for handling many common cases within a dataset, so that you can create features that help your machine learning models make better predictions.

Feature Engineering with Python

## One more library

To handle imbalanced datasets, we need one more library, imbalanced-learn, alongside core Python libraries such as NumPy, Pandas, and scikit-learn. The imbalanced-learn library is part of the scikit-learn-contrib projects. It gives us access to many advanced resampling methods, such as SMOTE and Tomek Links.

You can install this library in your environment using these commands:

```
# using pip
pip install -U imbalanced-learn
# using conda
conda install -c conda-forge imbalanced-learn
```
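
After installing, a quick check like the following can confirm that the package is visible to your interpreter. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the snippet assumes Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "imbalanced-learn" is the distribution name used by pip and conda;
# in code, the package is imported as `imblearn`.
imblearn_version = installed_version("imbalanced-learn")
if imblearn_version is None:
    print("imbalanced-learn is not installed in this environment")
else:
    print(f"imbalanced-learn {imblearn_version} is installed")
```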

## The metric problem
Accuracy is often the first metric we reach for when evaluating models on classification problems. However, it is a poor performance measure for highly imbalanced datasets.

For example, if 90% of the data belongs to class 1 in a binary classification problem, a classifier that always predicts class 1, without learning anything from the features, reaches 90% accuracy. The score looks good, but the model never identifies a single example of the minority class, so it is misleading.

### Example

Suppose we have a dataset of 1000 patients, of whom 10 have cancer and the other 990 are healthy. The majority class `Healthy` is 99 times larger than the minority class `Cancer`, so a model that labels every patient as healthy reaches an accuracy of 99%. Evaluating our model with accuracy alone is therefore dangerously misleading.

This may seem impressive at first. However, if we dig deeper, we find that this accuracy only reflects the underlying class distribution. A model trained to maximize accuracy can reach a high score simply by always predicting 'Healthy', while missing every cancer patient. As such, the model's apparent success is just an illusion.
That is why the choice of metric matters when working with imbalanced datasets.
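
The patient example can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not part of the original lesson: it uses scikit-learn's `DummyClassifier` as a stand-in for a model that always predicts the majority class, encodes `Healthy` as 0 and `Cancer` as 1, and uses placeholder features because such a baseline ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# 1000 patients: 990 healthy (label 0) and 10 with cancer (label 1).
y = np.array([0] * 990 + [1] * 10)
# A majority-class baseline ignores the features, so zeros serve as placeholders.
X = np.zeros((len(y), 1))

# Baseline that always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
# Recall on the cancer class: the share of cancer patients the model finds.
cancer_recall = recall_score(y, y_pred, pos_label=1)
# Balanced accuracy averages the recall of both classes.
balanced_accuracy = balanced_accuracy_score(y, y_pred)

print(f"Accuracy:          {accuracy:.2f}")
print(f"Recall (Cancer):   {cancer_recall:.2f}")
print(f"Balanced accuracy: {balanced_accuracy:.2f}")
```

The accuracy comes out at 0.99, while the recall on the `Cancer` class is 0.00 and the balanced accuracy is 0.50: the baseline finds none of the ten cancer patients.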

## Investigate the dataset
The first thing to do is to check whether your dataset is actually imbalanced.

We will use a generated dataset for this part and plot the count of each class using Seaborn.
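
The original code for this step is not included here, so the following is only a sketch of one way to do it. It assumes scikit-learn's `make_classification` as the data generator; the column name `target` and all parameter values are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification

# Generate a synthetic binary dataset with roughly 99% of samples in class 0.
# All parameter values here are illustrative assumptions.
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    weights=[0.99, 0.01],
    flip_y=0,
    random_state=42,
)

# Count the samples per class; a large gap signals an imbalanced dataset.
labels = pd.DataFrame({"target": y})
class_counts = labels["target"].value_counts()
print(class_counts)

# Plot the count of each class as a bar chart.
ax = sns.countplot(data=labels, x="target")
ax.set_xlabel("Class")
ax.set_ylabel("Number of samples")
ax.set_title("Class distribution")
plt.tight_layout()
# plt.show()  # uncomment when running as a plain script
```

Printing the counts alongside the plot is useful because the minority bar can be almost invisible when the imbalance is severe.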



Choosing a proper metric to evaluate your model's performance can be a real struggle if you have an imbalanced dataset and your project is a classification task. Why is it a problem? And how should you evaluate your model? We will cover all of that in this lesson. We will also learn about a new library that gives us access to many robust solutions for handling imbalanced datasets and how to install it. Finally, we will introduce a few methods to help you check whether your dataset is imbalanced so that you can fix it.



Setup & the metric problem

Introduction

Variable Types

Common Concerns in Datasets

Handling & Imputing Missing Values

Encoding Categorical Variables

Transforming Variables

Variable Discretization

Handling Outliers

Feature Scaling

Engineering Geospatial Data

Handling Date-Time and Mixed Variables

Resampling Imbalanced Data

Advanced Feature Engineering Techniques

Conclusion
