Dimensionality Reduction with PCA

Learn how to use principal component analysis.

Principal component analysis (PCA)

When a dataset contains highly correlated features, the redundant information can skew the model outcome. This is known as the multicollinearity problem. Using principal component analysis (PCA), we can reduce the number of attributes while retaining most of the original information.
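To make the redundancy concrete, here is a minimal sketch that measures how strongly two features are correlated. The data is synthetic and purely illustrative: `feature_1` and `feature_2` are hypothetical names, and the second is built as a noisy scaled copy of the first.

```python
import numpy as np

# Hypothetical synthetic data: feature_2 is almost a scaled copy of feature_1.
rng = np.random.default_rng(0)
feature_1 = rng.normal(size=200)
feature_2 = 2.0 * feature_1 + rng.normal(scale=0.1, size=200)

# np.corrcoef treats each row as one variable and returns the
# Pearson correlation matrix (2 x 2 here).
corr = np.corrcoef(np.vstack([feature_1, feature_2]))

# The off-diagonal entry is close to 1, so the two features
# carry largely the same information.
print(corr[0, 1])
```

An off-diagonal correlation near 1 or -1 is the kind of redundancy that PCA can collapse into a single component.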

PCA is a data transformation technique that combines the existing features into new components. Each component is a linear combination of the original features, chosen to capture as much of the data variance as possible. The components are uncorrelated with each other, and PCA ranks them by the share of variance each one explains. Later, we can select a subset of the transformed features (components) that represents most of the data variance.
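The transformation can be sketched with an eigendecomposition of the covariance matrix. This is one common way to compute PCA, not necessarily the one used in this course; the helper name `pca` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca(X):
    """Sketch of PCA for X with shape (n_samples, n_features).

    Returns (components, variances, scores), with components as columns,
    sorted from most to least variance explained.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices and returns eigenvalues
    # in ascending order, so we reverse them to rank the components.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    # Each score column is a linear combination of the original features.
    scores = X_centered @ eigvecs
    return eigvecs, eigvals, scores

# Hypothetical data with two correlated features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X = np.column_stack([x1, 0.8 * x1 + rng.normal(scale=0.3, size=300)])

components, variances, scores = pca(X)

# Covariance of the transformed data: the off-diagonal entries are
# numerically zero (uncorrelated components) and the diagonal holds
# the ranked variances.
score_cov = np.cov(scores, rowvar=False)
print(score_cov)
```

The diagonal of `score_cov` matches `variances`, which is what lets us rank the components by their contribution.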

Let’s assume we have a dataset with two features (feature 1 and feature 2). PCA calculates the first component so that the variance of the observations projected onto it is maximal, which is equivalent to minimizing the sum of squared perpendicular distances from the observations to the line. To do that, PCA draws a line through the observations, much like a regression line (the red line on slide 3), except that a regression line minimizes the vertical distances instead. This is the first component.

Since we have two features, PCA will construct two components. For the second component, PCA draws a line orthogonal to the first component (the blue line on slide 4). Again, it tries to maximize the spread and minimize the error. If we had more features, we would continue to apply the same principle and identify new axes, each orthogonal to all the previous ones.
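The two-feature case can be checked numerically. The sketch below, on hypothetical synthetic data, obtains both component directions from a singular value decomposition (an alternative to the covariance eigendecomposition) and confirms that they are orthogonal and that no other direction gives a larger projected variance than the first component.

```python
import numpy as np

# Hypothetical two-feature dataset.
rng = np.random.default_rng(1)
f1 = rng.normal(size=500)
f2 = 0.5 * f1 + rng.normal(scale=0.4, size=500)
X = np.column_stack([f1, f2])
Xc = X - X.mean(axis=0)  # PCA works on centered data

# Rows of Vt are unit-length component directions, ordered by
# decreasing singular value (and therefore decreasing variance).
_, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]

# The second component is orthogonal to the first.
dot = pc1 @ pc2

# Compare against many candidate directions between 0 and 180 degrees.
angles = np.linspace(0.0, np.pi, 181)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
projected_var = (Xc @ directions.T).var(axis=0, ddof=1)

best_candidate_var = projected_var.max()
pc1_var = (Xc @ pc1).var(ddof=1)
print(dot, pc1_var, best_candidate_var)
```

Here `pc1_var` is at least as large as the variance along any candidate direction, which is the "maximum spread" property of the first component.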

Now, if we take component 1 as the x-axis and component 2 as the y-axis and rotate our diagram accordingly, we’ll get a different perspective of the same dataset (slide 5).
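The rotation can be sketched as a change of coordinates. In this illustrative example on synthetic data, multiplying the centered observations by the component directions expresses every point on the new axes without changing its distance from the center.

```python
import numpy as np

# Hypothetical two-feature dataset.
rng = np.random.default_rng(2)
f1 = rng.normal(size=100)
X = np.column_stack([f1, f1 + rng.normal(scale=0.5, size=100)])
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Column 0 is the coordinate along component 1 (the new x-axis),
# column 1 the coordinate along component 2 (the new y-axis).
rotated = Xc @ Vt.T

# The change of axes is orthogonal (a rotation, possibly with a
# reflection), so distances from the center are unchanged.
original_norms = np.linalg.norm(Xc, axis=1)
rotated_norms = np.linalg.norm(rotated, axis=1)
print(np.allclose(original_norms, rotated_norms))
```

Only the viewpoint changes: the points keep their relative positions, but most of the spread now lies along the first axis.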

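PCA by itself doesn’t reduce the number of features: it produces as many components as there are features and sorts them by the variance they represent. The reduction happens when we keep only a subset of the principal components that together represent a comfortable share of the variance (around 70% to 80% is a common rule of thumb, though the right threshold depends on the application).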
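Selecting the subset can be sketched as follows. The helper name `components_for_variance`, the 0.8 threshold, and the synthetic five-feature dataset are all assumptions chosen for illustration.

```python
import numpy as np

def components_for_variance(X, threshold=0.8):
    """Return (k, ratio): the smallest number of components whose
    cumulative explained variance reaches `threshold`, and the
    per-component explained variance ratios in descending order."""
    Xc = X - X.mean(axis=0)
    # eigvalsh returns ascending eigenvalues of the symmetric covariance.
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    ratio = eigvals / eigvals.sum()
    cumulative = np.cumsum(ratio)
    # First position where the cumulative share reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratio)), ratio

# Hypothetical data: five features driven by two hidden factors plus noise.
rng = np.random.default_rng(3)
latent = rng.normal(size=(400, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.1, size=(400, 5))

k, ratio = components_for_variance(X, threshold=0.8)
print(k, np.round(ratio, 3))
```

Because only two hidden factors drive this dataset, one or two components are enough to pass the threshold, even though PCA still computes all five.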
