In this course, you will learn how to apply basic and more advanced feature engineering techniques to tabular data with Python. We will cover a range of techniques for handling common cases within a data set, so that you can create features that help your machine learning models make better predictions.

Feature Engineering with Python

## Advanced outlier detection
Previously in the course, we discussed several methods for detecting and handling outliers, but we relied only on statistical measurements to flag them. This section covers some more advanced algorithms for detecting these anomalies in datasets.

### DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a popular clustering method in machine learning that separates regions of high density from regions of low density. Like K-means, it groups samples into clusters, but the number of clusters is not specified in advance. DBSCAN is known for being robust to outliers: points that do not fit into any dense region are left out of every cluster and treated as noise.
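As a minimal sketch of this idea, the snippet below runs scikit-learn's `DBSCAN` on a small synthetic dataset and treats the points labeled `-1` (noise) as outlier candidates. The data and the values `eps=0.5` and `min_samples=5` are assumptions chosen for this toy example, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 2-D data, made up for illustration:
# two tight groups of points plus two isolated points.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(50, 2))
isolated = np.array([[2.5, 2.5], [10.0, -3.0]])
X = np.vstack([cluster_a, cluster_b, isolated])

# The number of clusters is not passed in; DBSCAN infers it from density.
# eps and min_samples are illustrative values for this toy data.
model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(X)

# scikit-learn labels noise points with -1.
outlier_mask = labels == -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("clusters found:", n_clusters)
print("outlier candidates:")
print(X[outlier_mask])
```

On real data, the features usually need to be scaled first, because `eps` is a distance and is therefore sensitive to the units of each column.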

We need to choose two hyperparameters:
- A positive number epsilon: the maximum distance between two samples for one to be considered in the neighborhood of the other.
- A natural number min_samples: the minimum number of samples in a neighborhood for a point to be considered a core point.
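To make the two hyperparameters concrete, here is a small NumPy sketch of the core point test alone, not the full clustering algorithm. The helper `find_core_points` is a hypothetical name introduced for illustration, and it assumes Euclidean distance and that a point counts as a member of its own neighborhood (the convention scikit-learn uses).

```python
import numpy as np

def find_core_points(X, eps, min_samples):
    """Return a boolean mask marking which rows of X are core points.

    Hypothetical helper for illustration; assumes Euclidean distance
    and counts each point as part of its own neighborhood.
    """
    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    diffs = X[:, None, :] - X[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    # Number of samples within eps of each point (including itself).
    neighbor_counts = (distances <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Four points on a small square plus one far-away point (made-up data).
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [3.0, 3.0]])
core_mask = find_core_points(X, eps=0.2, min_samples=4)
print(core_mask)
```

The four corner points each have four samples within `eps` and pass the test, while the far-away point has only itself in its neighborhood. The full pairwise distance matrix is fine for a toy example but grows quadratically with the number of samples.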

Here is an abstract diagram illustrating how DBSCAN finds the clusters and the noise points:



The purpose of this lesson is to cover some advanced methods and techniques to help you detect outliers in your data. We will explore how each technique works in detail, along with some short code snippets.


