
Principal Component Analysis

Explore Principal Component Analysis to understand and apply dimensionality reduction techniques that simplify complex datasets while minimizing information loss. This lesson covers the curse of dimensionality, PCA properties, eigenvalue decomposition, and a practical PCA implementation using the Iris dataset.

In the previous chapter, we explored ensemble learning, where combining multiple models improved predictive performance. While these models are powerful, working with high-dimensional feature spaces can still be challenging due to computational costs, data sparsity, and the risk of overfitting.

Now, it’s time to simplify the data without losing essential information. This is where dimensionality reduction comes in. Before we dive into techniques like PCA, let’s first understand the curse of dimensionality, which refers to the challenges that arise when dealing with high-dimensional datasets and why reducing dimensions is so valuable.

Curse of dimensionality

The curse of dimensionality in machine learning refers to the challenges and computational complexities that arise when working with a large number of features (a high-dimensional feature space). As the number of features or dimensions increases, the amount of data needed to find reliable, meaningful patterns also increases, driving up data and computational demands and raising the risk of overfitting.

Example

Consider a product recommendation system where each product is described by multiple features such as price, size, color, brand, and so on. As the number of features increases, the number of possible combinations grows exponentially, making it harder to find meaningful relationships between products and user preferences. The data points become sparse in this high-dimensional space, which makes accurate predictions more challenging and requires more data to avoid unreliable results, illustrating the curse of dimensionality.
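
To make this sparsity concrete, here is a minimal sketch (assuming NumPy; the point counts and dimensions are illustrative only) that samples a fixed number of random points in the unit hypercube and shows that even the nearest neighbour of a point drifts further away as the number of dimensions grows:

```python
# Illustrative sketch: with a fixed number of points, space becomes sparse
# as dimensionality grows, so even the nearest neighbour ends up far away.
import numpy as np

rng = np.random.default_rng(42)
n_points = 1000

for d in (2, 10, 100, 1000):
    X = rng.random((n_points, d))                  # random points in the unit hypercube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point to the rest
    print(f"d = {d:4d} | nearest-neighbour distance ~ {dists.min():.3f}")
```

With two dimensions, the nearest neighbour is very close; with a thousand dimensions, it is far away, even though the number of points has not changed.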

It seems desirable to reduce the number of features while preserving the information. Does the term “compression” ring a bell?

Dimensionality reduction

Dimensionality reduction involves decreasing the number of features, either by selecting the most significant ones or by transforming them into a smaller set of new features. Not all dimensionality reduction methods aim to preserve information (so that the data can be reconstructed or decompressed); different methods can be built around different objectives.
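
As a quick illustration of these two routes, the sketch below (assuming scikit-learn, which this lesson does not prescribe) reduces the Iris data from four features to two, once by selecting the most informative original columns and once by transforming them into new ones; SelectKBest with f_classif is just one possible choice of selector.

```python
# Illustrative sketch: feature selection vs. feature transformation,
# both reducing the Iris data from 4 features to 2.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)                       # 150 samples, 4 features

# 1) Selection: keep the 2 original features most related to the target
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# 2) Transformation: build 2 brand-new features (principal components)
X_transformed = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_transformed.shape)            # (150, 2) (150, 2)
```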

PCA

Principal Component Analysis (PCA) is a dimensionality reduction technique that identifies key patterns and relationships within data by projecting it onto a lower-dimensional space while preserving as much variance (spread or information) as possible.

To understand PCA, we first need to understand dimensions. Imagine you’re in a video game where you can move forward, backward, left, and right. These are two dimensions. Now, imagine you can also fly up or dig down. That’s a third dimension. In data science, dimensions are like these directions, but they can be anything: age, height, income, and so on.

Note: We can visualize up to three dimensions easily, but what if we have more? That’s where PCA comes in. It helps us to reduce the number of dimensions while keeping the most important information intact.
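
Before digging into the mathematics, here is a minimal sketch of PCA in practice on the Iris dataset mentioned above, assuming scikit-learn (the library choice is ours, not mandated by the lesson): the four original features are projected onto two principal components, and we check how much of the original variance they retain.

```python
# Illustrative sketch: PCA on the Iris dataset, 4 features -> 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)                 # 150 samples, 4 features
X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scale

pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)                   # projected data, shape (150, 2)

print("Reduced shape:", Z.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

On Iris, the first two components retain most of the variance, which is exactly the “keep the most important information” idea described above.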

Properties of PCA

PCA operates by finding a new set of orthogonal (perpendicular) axes, called principal components ($PC_1, PC_2$, etc.), that are oriented in the directions where the data is most spread out.

To explain the essential properties of PCA, let’s take an example of $n$ data points in $d$-dimensional space forming the columns of the matrix $X_{d \times n}$. Furthermore, let the corresponding columns of the matrix $Z_{k \times n}$ represent the $k$-dimensional projections of the data points estimated using PCA.

Note: Dimensional projections refer to placing the original high-dimensional data points onto a new, simpler set of $k$ axes (the principal components) defined by the transformation matrix $\mathbf{W}$. This matrix $\mathbf{W}$ is constructed from the top $k$ eigenvectors (directions of maximum variance in the data, defining the principal components) of the covariance matrix (a matrix that measures how features vary together and captures their pairwise relationships). The goal is to capture the maximum spread of the data in this new space.
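
Under the column-per-data-point convention used here, this construction can be sketched directly with NumPy; the function name pca_projection_matrix and the random toy data are illustrative only.

```python
# Illustrative sketch: build the projection matrix W from the top-k
# eigenvectors of the covariance matrix, then project the data.
import numpy as np

def pca_projection_matrix(X, k):
    """X has shape (d, n): each column is one data point, as in the text."""
    X_centered = X - X.mean(axis=1, keepdims=True)       # center each feature
    cov = np.cov(X_centered)                             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # top-k eigenvectors, shape (d, k)
    return top_k.T                                       # W with shape (k, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))                            # toy data: d = 5, n = 200
W = pca_projection_matrix(X, k=2)
Z = W @ (X - X.mean(axis=1, keepdims=True))              # the projection Z = WX
print(W.shape, Z.shape)                                  # (2, 5) (2, 200)
```

In practice, libraries usually compute the same components through a singular value decomposition rather than an explicit eigendecomposition, but the resulting $\mathbf{W}$ plays the same role.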

Following are the key properties of PCA:

  • PCA is a linear method (the transformation): The transformation from the original high-dimensional data $\mathbf{X}$ to the reduced, $k$-dimensional data $\mathbf{Z}$ is a simple linear mapping (matrix multiplication). $\mathbf{W}_{k \times d}$ is the projection matrix whose rows are the chosen principal components.

$$Z = WX$$

  • The new axes are perpendicular (orthonormal bases): The rows of $\mathbf{W}$ (the principal components) are perfectly orthogonal (at $90^\circ$ angles) to each other, and each vector has a unit length (norm of 1).

Note: Because the principal components are orthonormal, they capture unique, non-overlapping information. If they weren’t perpendicular, the first component might repeat variance captured by the second, making the reduction inefficient.

  • Reconstruction is linear: The original data $\mathbf{X}$, denoted by $\hat{\mathbf{X}}$ when reconstructed from the reduced data $\mathbf{Z}$, can also be recovered linearly.

$$\hat{X} = W^T Z$$

  • PCA minimizes the reconstruction error: PCA is the optimal projection that minimizes the difference between the original data ($\mathbf{X}$) and the reconstructed data ($\hat{\mathbf{X}}$). The Frobenius norm ($\| \cdot \|_F^2$
...