
Related Tags

machine learning
communitycreator

What is dimensionality reduction?

Tamisha Dzifa Segbefia


In this shot, we will look at what dimensionality reduction is and why it is important.

Overview

  • What is dimensionality?
  • What is dimensionality reduction?
  • Why is dimensionality reduction important?
  • What are some dimensionality reduction methods?
  • Examples of dimensionality reduction in Machine Learning
  • What fields use dimensionality reduction?

What is dimensionality?

To understand dimensionality, we need to understand what a dataset in Machine Learning (ML) is.

A dataset is simply a collection of data. Many ML projects use tabular data. Tabular data is data that contains rows and columns of information, e.g., a spreadsheet. Dimensionality refers to the number of features or columns a dataset has. For example, in the image below, the dimensionality of the dataset is 10 as there are 10 columns.

A screenshot of the first five rows of the seaborn diamonds dataset
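As a quick illustration, here is a tiny hand-made table (loosely modeled on a few columns of the diamonds dataset, not the real data); its dimensionality is simply its number of columns:

```python
import pandas as pd

# A tiny hand-made table: each column is one feature.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut":   ["Ideal", "Premium", "Good"],
    "price": [326, 326, 327],
})

# The dimensionality of this dataset is its number of columns.
print(df.shape[1])  # 3
```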

What is dimensionality reduction?

Dimensionality reduction is the process of transforming data from being in a high dimensional space to being in a low dimensional space. It can also refer to a number of techniques that are employed to reduce the number of input features in a dataset.

Why is dimensionality reduction important?

While it is possible for a dataset to have 4 features, or even 50, what happens when the number of features grows to 1,000 or even 1 million?

Analyzing high-dimensional data can be computationally expensive, and the more features a model has to consider, the easier it is to run into problems such as overfitting. Dimensionality reduction is therefore important because it lets us reduce the number of features while retaining the information needed for the data analysis.

What are some dimensionality reduction methods?

Since this post is an introduction to dimensionality reduction, we won’t go into the details of the methods mentioned below. However, you can read more about them in your spare time.

Some dimensionality reduction methods include:

  • Principal Component Analysis (PCA)
  • Select K Best Features
  • Linear Discriminant Analysis (LDA)

Examples of dimensionality reduction in Machine Learning

In this section, we will look at dimensionality reduction in Machine Learning, where it often takes the form of feature selection: choosing a subset of relevant features for use in building a model. Applying feature selection reduces the dimensionality of the dataset.

For the following examples, the breast cancer dataset from the scikit-learn inbuilt datasets will be used.

import pandas as pd
import sklearn.datasets as datasets

# Load the breast cancer dataset into a DataFrame.
breast_cancer = datasets.load_breast_cancer()
cancer = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
cancer['target'] = breast_cancer.target
print(cancer.shape)

From the code above, we see that the dataset has 569 rows and 31 columns: 30 feature columns plus the target column (that is, what we are trying to predict).

Using Principal Component Analysis (PCA)

We can use the PCA technique to reduce the dimensionality of our dataset. The advantage of PCA is that it keeps the principal components that contribute most to the overall variance of the dataset.

Before we use PCA, we must scale the feature data. With scaling, the different variables are placed on a normalized scale. Scaling is important because it removes the dominating impact one variable might have over another because of its range (e.g., a weight of 60 kg seems much higher in magnitude than a height of 1.6 m).

In the following example, we will use the StandardScaler from sklearn. You can read more about it in the scikit-learn documentation.

The following code continues from the previous snippet.

from sklearn.preprocessing import StandardScaler
cancer_features = cancer.drop('target', axis=1)
scaler = StandardScaler()
scaler.fit(cancer_features)
scaled_data = scaler.transform(cancer_features)
print(scaled_data)

Now that the data has been scaled, we can perform PCA. With the PCA algorithm, you choose the number of components to keep and can change this number as you see fit. For this example, we will arbitrarily select 5 components.

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca.fit(scaled_data)
scaled_pca = pca.transform(scaled_data)
print(scaled_data.shape)
print(scaled_pca.shape)

From the code above, we have successfully reduced the dimensionality of the features from 30 to 5 using PCA.
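Rather than picking the number of components arbitrarily, one common approach is to inspect how much of the dataset's variance each component explains. As a self-contained sketch (reloading and rescaling the data so the snippet runs on its own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Reload and scale the data so this snippet stands alone.
X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X_scaled)

# Fraction of the total variance each component captures,
# and the total captured by all 5 components together.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the total is high enough for your purposes, 5 components is a reasonable choice; if not, you can increase n_components until it is.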

Using the SelectKBest function

In sklearn, there is a function called SelectKBest that allows us to select features according to the k highest scores. The function calculates a metric we choose, sorts the features according to their metric scores, and selects the k best features.

You can read more about SelectKBest in the scikit-learn documentation.

For the purposes of this example, we will select the best 6 features. We will apply SelectKBest to the original (unscaled) features; unlike PCA, it does not require the data to be scaled first. The metric we will use is f_classif, which is the ANOVA F-value between each feature and the label for classification tasks.

You can read more about f_classif in the scikit-learn documentation.

import sklearn.feature_selection as fs

# Select the 6 features with the highest ANOVA F-values.
best_k = fs.SelectKBest(fs.f_classif, k=6)
best_k.fit(cancer_features, cancer['target'])
best_k_features = best_k.transform(cancer_features)
print(cancer_features.shape)
print(best_k_features.shape)

From the code above, we have successfully reduced the dimensionality of the features from 30 to 6 using SelectKBest.
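To see which of the original 30 features SelectKBest kept, we can inspect its boolean support mask. A self-contained sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Reload the data so this snippet stands alone.
data = load_breast_cancer()

selector = SelectKBest(f_classif, k=6).fit(data.data, data.target)

# Boolean mask over the 30 original features; True = selected.
mask = selector.get_support()
print([name for name, keep in zip(data.feature_names, mask) if keep])
```

Unlike PCA, which builds new composite components, SelectKBest keeps a subset of the original columns, so the selected features remain directly interpretable.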

What fields use dimensionality reduction?

Dimensionality reduction is commonly used in fields that process high volumes of data, such as bioinformatics and signal processing. It is also applied to tasks such as noise reduction and data visualization.

An image showing noise reduction in a signal.

Recap and Conclusion

To conclude, you’ve read about what dimensionality is, as it relates to a dataset, and some issues that might arise when a dataset has many features. You’ve also seen the names of some dimensionality reduction methods and explored the implementation of two of them in Python. Finally, you’ve learned some fields that employ dimensionality reduction.

Hopefully, this article was helpful. Thanks for reading and have a great day.
