What is Principal Component Analysis (PCA)?

Introduction

Principal Component Analysis (PCA) is a fundamental method for dimensionality reduction: it reduces the number of variables in a large dataset to a smaller set without losing much information. The goal is to simplify the dataset as much as possible while giving up as little information as possible, so that accuracy stays high. Reduced datasets are easier to analyze, evaluate, and visualize, which is why PCA is considered a core step in working with big data.

Steps for conducting PCA

PCA uses an orthogonal transformation to convert correlated variables into linearly uncorrelated variables, known as principal components. PCA is a five-step procedure, as explained below.

Step 1: Standardizing the dataset

We will be standardizing the dataset by using the following formula:

$x_{new} = \frac{x - \mu}{\sigma}$

where,

x = data point value

mu ( $\mu$ ) = mean of the feature

sigma ( $\sigma$ ) = standard deviation of the feature
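As a minimal sketch of this step, the snippet below standardizes each feature (column) of a small made-up dataset in plain Python. The function name `standardize` and the sample values are illustrative only, and the sketch assumes the sample standard deviation (dividing by n - 1), since the formula above does not say which variant is intended.

```python
from statistics import mean, stdev

def standardize(rows):
    """Apply x_new = (x - mu) / sigma to every column of a list of rows."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    # A constant column has sigma = 0 and would need special handling.
    sigmas = [stdev(col) for col in columns]
    return [
        [(x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]

# Hypothetical dataset: 3 observations, 2 features on different scales.
data = [[2.0, 10.0], [4.0, 20.0], [6.0, 60.0]]
z = standardize(data)
for row in z:
    print([round(value, 3) for value in row])
```

After this transformation every feature has mean 0 and standard deviation 1, so features measured on large scales no longer dominate the later steps.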

Step 2: Computing the covariance matrix

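Once we have obtained the standardized matrix, we compute the covariance matrix. Its entry in row i and column j is the covariance between features i and j, so the diagonal holds the variance of each feature and the matrix is symmetric.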

Once we have obtained the covariance matrix, we calculate the eigenvectors.
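The covariance computation can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the name `covariance_matrix` and the data are made up, and the sample covariance (dividing by n - 1) is assumed.

```python
def covariance_matrix(rows):
    """Return the p x p sample covariance matrix of an n x p dataset."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    p = len(columns)
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            # Sum of products of deviations from the mean, divided by n - 1.
            cov[i][j] = sum(
                (columns[i][k] - means[i]) * (columns[j][k] - means[j])
                for k in range(n)
            ) / (n - 1)
    return cov

# Hypothetical dataset: 3 observations, 2 features.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 8.0]]
C = covariance_matrix(data)
print(C)
```

The diagonal entries are the variances of the two features, and the two off-diagonal entries are equal because cov(F1, F2) = cov(F2, F1).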

Step 3: Identifying principal components by computing eigenvectors

Let A be a square matrix (in our case the covariance matrix), ν a vector, and λ a scalar that satisfies Aν = λν. Then λ, called the eigenvalue, is associated with the eigenvector ν of A.

Eigenvectors are non-zero vectors whose direction does not change when the linear transformation is applied; they are only scaled. Eigenvalues are the factors by which the eigenvectors are scaled.

To calculate the eigenvalues, we solve the characteristic equation:

$\det(A - \lambda I) = 0$

where,

A = Covariance matrix

lambda ( $\lambda$ ) = eigenvalues

I = Identity matrix

Solving this equation yields the eigenvalues. Substituting each eigenvalue λ back into Aν = λν then gives the corresponding eigenvector ν.
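To make the eigenvalue step concrete, here is a small sketch for the special case of a 2 x 2 symmetric matrix, where the equation for λ is a quadratic that can be solved in closed form. The function name `eigen_2x2_symmetric` and the example matrix are assumptions for illustration; larger matrices are normally handled by a numerical library routine instead.

```python
import math

def eigen_2x2_symmetric(a, b, d):
    """Return [(eigenvalue, unit eigenvector), ...] for [[a, b], [b, d]],
    ordered from largest to smallest eigenvalue."""
    if b == 0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)

    # det(A - lambda*I) = lambda^2 - (a + d)*lambda + (a*d - b*b) = 0
    trace = a + d
    root = math.sqrt((a - d) ** 2 + 4 * b * b)  # square root of the discriminant
    pairs = []
    for lam in ((trace + root) / 2, (trace - root) / 2):
        # (A - lambda*I) v = 0 is satisfied by v = (b, lambda - a).
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

# Hypothetical covariance matrix of two standardized, correlated features.
A = [[1.0, 0.8], [0.8, 1.0]]
pairs = eigen_2x2_symmetric(A[0][0], A[0][1], A[1][1])
for lam, v in pairs:
    print(round(lam, 4), [round(c, 4) for c in v])
```

For this matrix the eigenvalues are approximately 1.8 and 0.2, so most of the variance lies along the first eigenvector.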

Step 4: Constructing the feature vector

This is done by sorting the eigenvectors by their eigenvalues in descending order. We then discard the eigenvectors with the smallest eigenvalues and keep the top k eigenvectors, which form the columns of the feature vector.

A common rule is to choose the smallest k for which at least 99% of the variance is retained, that is, no more than 1% of the variance is lost.
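One way to pick k is to accumulate the sorted eigenvalues until their share of the total variance reaches a chosen threshold. The sketch below assumes a threshold of 99% retained variance by default; the function name `choose_k` and the eigenvalues are hypothetical.

```python
def choose_k(eigenvalues, threshold=0.99):
    """Smallest k whose top-k eigenvalues retain at least `threshold`
    of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues of a 4 x 4 covariance matrix.
eigenvalues = [0.5, 6.0, 2.5, 1.0]
print(choose_k(eigenvalues))        # strict default threshold
print(choose_k(eigenvalues, 0.90))  # looser threshold keeps fewer components
```

Lowering the threshold trades retained variance for a smaller number of components, which is often acceptable when the goal is a 2D or 3D plot.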

Step 5: Transforming the matrix

To transform the matrix, we will be using the following formula:

$M = FM \times FV$

where,

M = Transformed matrix

FM = Feature matrix (original dataset, standardized as in Step 1)

FV = Feature vector
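The transformation is an ordinary matrix product. As a rough sketch, assuming FM is stored as a list of n rows with p features each and FV as a p x k list whose columns are the selected eigenvectors (the name `project` and the values are illustrative):

```python
import math

def project(feature_matrix, feature_vector):
    """Compute M = FM x FV: (n x p) times (p x k) gives (n x k)."""
    return [
        [sum(x * w for x, w in zip(row, column))
         for column in zip(*feature_vector)]
        for row in feature_matrix
    ]

s = 1 / math.sqrt(2)
# Hypothetical standardized data lying along the diagonal direction.
FM = [[1.0, 1.0], [-1.0, -1.0], [0.0, 0.0]]
# One principal component (k = 1) stored as a single column.
FV = [[s], [s]]
M = project(FM, FV)
print(M)
```

Each row of M holds the coordinates of one observation along the chosen principal components, so here the two original features collapse into a single column.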

Code example

We will be using mtcars, a dataset built into R. First, we load the devtools library so that we can install the ggbiplot library for making our PCA plots. Once that is done, we run PCA on our data and plot the result to visualize the similarity between observations. Below, we have an example of running PCA in R.

Looking at the plot, we can see that three cars are clustered at the top. This shows their similarity, which we can verify: all three are sports cars, so the analysis makes sense. In the output window, we can also see how the principal components have been selected and what the scales are.

Code explanation

Lines 2 and 3: We check if the required libraries are installed for PCA. If not, the code will run into an error and you will need to install these packages.

Line 6: We use the built-in function prcomp for PCA.

Line 7: We then use the str function to display the structure of the PCA result. This can be seen in the output window.

Line 10: Finally, we use the ggbiplot function to plot the PCA graph.

Improved PCA

For better visualization, we can group our data by country of origin. This helps us see which countries prioritize which features in their cars.

Covariance matrix from Step 2, shown for three features F1, F2, and F3:

	F1	F2	F3
F1	cov(F1, F1) = var(F1)	cov(F1, F2)	cov(F1, F3)
F2	cov(F2, F1)	var(F2)	cov(F2, F3)
F3	cov(F3, F1)	cov(F3, F2)	var(F3)
