
Related Tags

machine learning
communitycreator

What is dimensionality reduction?

Tamisha Dzifa Segbefia


In this shot, we will look at what dimensionality reduction is and why it is important.

Overview

  • What is dimensionality?
  • What is dimensionality reduction?
  • Why is dimensionality reduction important?
  • What are some dimensionality reduction methods?
  • Examples of dimensionality reduction in Machine Learning
  • What fields use dimensionality reduction?

What is dimensionality?

To understand dimensionality, we need to understand what a dataset in Machine Learning (ML) is.

A dataset is simply a collection of data. Many ML projects use tabular data. Tabular data is data that contains rows and columns of information, e.g., a spreadsheet. Dimensionality refers to the number of features or columns a dataset has. For example, in the image below, the dimensionality of the dataset is 10 as there are 10 columns.

A screenshot of the first five rows of the seaborn diamonds dataset
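As a quick illustration, here is a tiny hand-made table (loosely modeled on a few columns of the diamonds dataset, not the real data); its dimensionality is simply its number of columns:

```python
import pandas as pd

# A tiny hand-made table: each column is one feature.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut":   ["Ideal", "Premium", "Good"],
    "price": [326, 326, 327],
})

# The dimensionality of this dataset is its number of columns.
print(df.shape[1])  # 3
```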

What is dimensionality reduction?

Dimensionality reduction is the process of transforming data from being in a high dimensional space to being in a low dimensional space. It can also refer to a number of techniques that are employed to reduce the number of input features in a dataset.

Why is dimensionality reduction important?

While it is possible for a dataset to have 4 features, or even 50, what happens when the number of features grows to 1,000 or even 1 million?

Analyzing high-dimensional data can be computationally expensive, and the more features a model has to consider, the easier it is to run into problems such as overfitting. Dimensionality reduction is therefore important because it lets us reduce the number of features while retaining the information needed for the data analysis.

What are some dimensionality reduction methods?

Since this post is an introduction to dimensionality reduction, we won’t go into the details of the methods mentioned below. However, you can read more about them in your spare time.

Some dimensionality reduction methods include:

  • Principal Component Analysis (PCA)
  • Select K Best Features
  • Linear Discriminant Analysis (LDA)

Examples of dimensionality reduction in Machine Learning

In this section, we will look at dimensionality reduction in Machine Learning, where it often takes the form of feature selection: choosing a subset of relevant features for use in building a model. Applying feature selection reduces the dimensionality of the dataset.

For the following examples, the breast cancer dataset from the scikit-learn inbuilt datasets will be used.

import pandas as pd
import sklearn.datasets as datasets

# Load the breast cancer dataset into a DataFrame.
breast_cancer = datasets.load_breast_cancer()
cancer = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
cancer['target'] = breast_cancer.target
print(cancer.shape)

From the code above, we see that the dataset has 569 rows and 31 columns: 30 feature columns plus the target column (that is, what we are trying to predict).

Using Principal Component Analysis (PCA)

We can use the PCA technique to reduce the dimensionality of our dataset. The advantage of PCA is that it keeps the principal components that contribute most to the overall variance of the dataset.

Before we use PCA, we must scale the feature data. With scaling, the different variables are placed on a normalized scale. Scaling is important because it removes the dominating impact one variable might have over another because of its range (e.g., a weight of 60 kg seems much higher in magnitude than a height of 1.6 m).

In the following example, we will use the StandardScaler from sklearn. You can read more about it in the scikit-learn documentation.

The following code continues from the previous snippet.

from sklearn.preprocessing import StandardScaler
cancer_features = cancer.drop('target', axis=1)
scaler = StandardScaler()
scaler.fit(cancer_features)
scaled_data = scaler.transform(cancer_features)
print(scaled_data)

Now that the data has been scaled, we can perform PCA. With the PCA algorithm, you choose the number of components to keep and can change this number as you see fit. For this example, we will arbitrarily select 5 components.

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca.fit(scaled_data)
scaled_pca = pca.transform(scaled_data)
print(scaled_data.shape)
print(scaled_pca.shape)

From the code above, we have successfully reduced the dimensionality of the features from 30 to 5 using PCA.
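Rather than picking the number of components arbitrarily, one common approach is to inspect how much of the dataset's variance each component explains. As a self-contained sketch (reloading and rescaling the data so the snippet runs on its own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Reload and scale the data so this snippet stands alone.
X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X_scaled)

# Fraction of the total variance each component captures,
# and the total captured by all 5 components together.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the total is high enough for your purposes, 5 components is a reasonable choice; if not, you can increase n_components until it is.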

Using the SelectKBest function

In sklearn, there is a function called SelectKBest that allows us to select features according to the k highest scores. The function calculates a metric we choose, sorts the features according to their metric scores, and selects the k best features.

You can read more about SelectKBest in the scikit-learn documentation.

For the purposes of this example, we will select the best 6 features. We will apply SelectKBest to the original (unscaled) features; unlike PCA, it does not require the data to be scaled first. The metric we will use is f_classif, which is the ANOVA F-value between each feature and the label for classification tasks.

You can read more about f_classif in the scikit-learn documentation.

import sklearn.feature_selection as fs

# Select the 6 features with the highest ANOVA F-values.
best_k = fs.SelectKBest(fs.f_classif, k=6)
best_k.fit(cancer_features, cancer['target'])
best_k_features = best_k.transform(cancer_features)
print(cancer_features.shape)
print(best_k_features.shape)

From the code above, we have successfully reduced the dimensionality of the features from 30 to 6 using SelectKBest.
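To see which of the original 30 features SelectKBest kept, we can inspect its boolean support mask. A self-contained sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Reload the data so this snippet stands alone.
data = load_breast_cancer()

selector = SelectKBest(f_classif, k=6).fit(data.data, data.target)

# Boolean mask over the 30 original features; True = selected.
mask = selector.get_support()
print([name for name, keep in zip(data.feature_names, mask) if keep])
```

Unlike PCA, which builds new composite components, SelectKBest keeps a subset of the original columns, so the selected features remain directly interpretable.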

What fields use dimensionality reduction?

Dimensionality reduction is commonly used in fields that process high volumes of data, such as bioinformatics and signal processing. It is also applied to tasks such as noise reduction and data visualization.

An image showing noise reduction in a signal.

Recap and Conclusion

To conclude, you’ve read about what dimensionality is, as it relates to a dataset, and some issues that might arise when a dataset has many features. You’ve also seen the names of some dimensionality reduction methods and explored the implementation of two of them in Python. Finally, you’ve learned some fields that employ dimensionality reduction.

Hopefully, this article was helpful. Thanks for reading and have a great day.
