Data Scrubbing Operation: Dimension Reduction

This lesson will introduce you to the concept of dimension reduction.

Quick overview

Dimension reduction, also known as descending dimension algorithms, transforms data into a lower-dimensional representation. This reduces the computational resources a model needs and makes patterns in the data easier to visualize.

Dimensions are the variables that describe the data, such as city of residence, country of residence, age, and gender. Up to four variables can be plotted on a scatterplot, but two-dimensional and three-dimensional plots are the easiest for the human eye to interpret.

The goal of a descending dimension algorithm is to arrive at a minimal set of variables that mimic the distribution of the original dataset’s variables. In addition, reducing the number of variables makes it easier to recognize patterns, including natural groupings, outliers, and anomalies.

It’s important to note that dimension reduction isn’t a case of deleting columns. Rather, it mathematically transforms the information in those columns so that it can be captured using fewer variables (columns). If, for example, you look at house prices, you might find multiple correlated variables (such as house area and postcode) that you can merge into a new variable that adequately represents those two variables. Thus, applying dimension reduction before running the core algorithm lets the model run faster, consume fewer computational resources, and, in some cases, produce more accurate predictions.
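To make the idea of merging correlated variables concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method of any particular library. The data is invented, and because a postcode is categorical, the sketch uses room count as a hypothetical numeric stand-in for the second variable. It relies on a known property of principal component analysis: for two standardized, positively correlated variables, the first principal component is their scaled sum.

```python
import math

# Hypothetical data: house area (square metres) and room count, which rise together.
area = [50.0, 65.0, 80.0, 95.0, 110.0, 140.0]
rooms = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0]

def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

z_area = standardize(area)
z_rooms = standardize(rooms)

# Correlation between the two standardized columns.
r = sum(a * b for a, b in zip(z_area, z_rooms)) / len(z_area)

# For two standardized, positively correlated variables, the first principal
# component points along (1, 1) / sqrt(2), so the merged variable is their scaled sum.
size_score = [(a + b) / math.sqrt(2) for a, b in zip(z_area, z_rooms)]

# Share of the total variance (2.0 for two standardized columns) kept by the new column.
explained = (1 + r) / 2

print(f"correlation: {r:.3f}")
print(f"variance kept by one merged column: {explained:.1%}")
```

The single `size_score` column replaces both originals. Nothing was deleted: each value is computed from both inputs, and the closer the correlation is to 1, the less information the merge loses.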

Another side benefit of this technique is the opportunity to visualize multidimensional data. Remember that a scatterplot can show at most four dimensions, and two or three are ideal (the fourth dimension is time). Descending dimension algorithms can condense a dataset with more than four dimensions into four or fewer variables and project those synthetic variables onto the visual workspace of a scatterplot.
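The projection described above can be sketched as follows. This example assumes principal component analysis as the descending dimension algorithm and implements it with power iteration in plain Python so that it runs without extra packages. The five-variable dataset is synthetic, and the function names (`center`, `covariance`, `top_eigenvector`, `project`) are hypothetical helpers written for this illustration. In practice, you would more likely reach for a library implementation.

```python
import math
import random

def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(matrix, iterations=500):
    """Power iteration: multiply and normalize repeatedly to find the dominant direction."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

def project(rows, n_components=2):
    """Map each row onto the n_components directions of greatest variance."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        vec = top_eigenvector(cov)
        eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                         for i in range(d) for j in range(d))
        components.append(vec)
        # Deflation: remove the direction just found so the next pass finds a new one.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(row[j] * comp[j] for j in range(d)) for comp in components]
            for row in centered]

# Synthetic dataset: five observed variables driven by two hidden factors plus noise.
random.seed(0)
rows = []
for _ in range(200):
    f1, f2 = random.gauss(0, 1), random.gauss(0, 1)
    rows.append([
        f1 + random.gauss(0, 0.1),
        2 * f1 + random.gauss(0, 0.1),
        f2 + random.gauss(0, 0.1),
        -f2 + random.gauss(0, 0.1),
        f1 + f2 + random.gauss(0, 0.1),
    ])

points = project(rows)
print(len(points), len(points[0]))  # 200 rows, each now described by 2 coordinates
```

Each entry in `points` is an (x, y) pair that can be passed to any scatterplot routine. Because the five columns were built from only two underlying factors, the two synthetic variables retain nearly all of the variation in the original data.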
