**Principal Component Analysis (PCA)** is a fundamental method for dimensionality reduction: it reduces the number of variables in a large dataset to a smaller set while losing as little information as possible. Our goal is to simplify the dataset as much as possible while trading away the least amount of information. Reduced datasets are easier to analyze, evaluate, and visualize, which is why PCA is considered a core step when working with big data.

PCA uses an orthogonal transformation to convert possibly correlated variables into linearly uncorrelated variables, also known as principal components. PCA is a five-step procedure, as explained below.

We will be standardizing the dataset by using the following formula:

z = (x − μ) / σ

where,

- `x` = data point value
- `μ` = mean of the feature
- `σ` = standard deviation of the feature

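The later examples in this article use R, but the standardization step itself is language-agnostic. Here is a minimal NumPy sketch; the data matrix `X` is an illustrative assumption:

```python
import numpy as np

# Each row is an observation, each column a feature (illustrative data).
X = np.array([[2.0, 4.0],
              [4.0, 8.0],
              [6.0, 12.0]])

# Standardize each feature: z = (x - mu) / sigma.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

# Each standardized column now has mean 0 and standard deviation 1.
print(Z.mean(axis=0))
print(Z.std(axis=0))
```

After this step, every feature is on the same scale, so no single feature dominates the covariance computation that follows.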
Once we have obtained the standardized matrix, we can get the covariance matrix using the technique below:

|        | F1                 | F2                 | F3                 |
|--------|--------------------|--------------------|--------------------|
| **F1** | cov(F1, F1) = var(F1) | cov(F1, F2)     | cov(F1, F3)        |
| **F2** | cov(F2, F1)        | var(F2)            | cov(F2, F3)        |
| **F3** | cov(F3, F1)        | cov(F3, F2)        | var(F3)            |
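Given a standardized data matrix with observations in rows, this covariance matrix can be computed in one call. A NumPy sketch (the matrix `Z` is an illustrative assumption):

```python
import numpy as np

# Standardized data: rows are observations, columns are features F1..F3.
Z = np.array([[ 1.0, -1.0,  0.5],
              [-1.0,  1.0, -0.5],
              [ 0.0,  0.0,  0.0]])

# rowvar=False treats each column as a variable, matching the table layout.
cov = np.cov(Z, rowvar=False)

# The diagonal holds var(F1), var(F2), var(F3);
# the matrix is symmetric: cov(Fi, Fj) == cov(Fj, Fi).
print(cov)
```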

Once we have obtained the covariance matrix, we calculate its eigenvalues and eigenvectors.

Let A be a square matrix (in our case the covariance matrix), ν a vector, and λ a scalar that satisfies Aν = λν. Then λ, called the eigenvalue, is associated with the eigenvector ν of A.

**Eigenvectors** are non-zero vectors whose direction does not change when a linear transformation is applied. **Eigenvalues** are the factors by which the eigenvectors are scaled.

To calculate the eigenvalues, we solve the characteristic equation:

det(A − λI) = 0

where,

- `A` = covariance matrix
- `λ` = eigenvalue
- `I` = identity matrix

Once we solve the equation, we will obtain multiple eigenvalues; substituting each eigenvalue back into (A − λI)ν = 0 then yields the corresponding eigenvector.
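In practice, the characteristic equation is solved numerically rather than by hand. A NumPy sketch using `eigh`, which is designed for symmetric matrices such as a covariance matrix (the example matrix is an illustrative assumption):

```python
import numpy as np

# A small symmetric covariance matrix (illustrative).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order; the eigenvectors
# are the columns of `vecs`.
vals, vecs = np.linalg.eigh(A)

# Each pair satisfies A @ v = lambda * v.
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)

print(vals)  # eigenvalues of A: [1. 3.]
```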

This will be done by sorting the eigenvectors by their eigenvalues. We will then discard the vectors with the least significant eigenvalues and select the top **k** eigenvectors with the highest eigenvalues.

**k** should be chosen as the smallest value for which at least 99% of the variance is retained.
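One common way to pick k is from the cumulative explained variance: sort the eigenvalues in descending order and take the smallest k whose cumulative share meets the threshold. A NumPy sketch (the eigenvalues and the 0.99 threshold are illustrative assumptions):

```python
import numpy as np

# Eigenvalues of a covariance matrix, sorted in descending order (illustrative).
eigenvalues = np.array([4.0, 2.5, 0.3, 0.15, 0.05])

# Fraction of total variance retained by the top-k components.
explained = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest k retaining at least 99% of the variance.
k = int(np.argmax(explained >= 0.99)) + 1
print(k, explained)
```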

To transform the matrix, we will be using the following formula:

M = FM × FV

where,

- `M` = transformed matrix
- `FM` = feature matrix (original dataset)
- `FV` = feature vector
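The final projection is just a matrix product. A NumPy sketch of M = FM × FV, where the feature vector FV holds the top-k eigenvectors as columns (the data and choice of k are illustrative assumptions):

```python
import numpy as np

# Standardized feature matrix: 4 observations x 3 features (illustrative).
FM = np.array([[ 1.0,  0.5, -0.5],
               [-1.0, -0.5,  0.5],
               [ 0.5,  1.0, -1.0],
               [-0.5, -1.0,  1.0]])

# Eigenvectors of the covariance matrix, stored as columns.
cov = np.cov(FM, rowvar=False)
vals, vecs = np.linalg.eigh(cov)

# Reorder to descending eigenvalue and keep the top k.
k = 2
FV = vecs[:, ::-1][:, :k]

# Project the data onto the principal components.
M = FM @ FV
print(M.shape)  # (4, 2): same observations, reduced to k features
```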

We will be using the built-in R dataset called `mtcars`. First, we import the `devtools` library so we can install the `ggbiplot` package for making our PCA plots. Once that is done, we run the PCA analysis on our data and plot it to visualize the similarity of our data. Below, we have an example of running PCA in **R**.

```r
#check if required libraries are installed
require(devtools)
require(ggbiplot)

#pca analysis command for running PCA
mtcars.pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)
str(mtcars.pca)

#plotting PCA
ggbiplot(mtcars.pca, labels = rownames(mtcars))
```

Looking at the plot, we can see three cars clustered at the top. This shows their similarity, and we can verify it: all three are sports cars, so the analysis makes sense. In the output window, we can also see which principal components have been selected and what our scales are.

Lines 2 and 3: We check whether the required libraries are installed for PCA. If not, the code will run into an error and you will need to install these packages.

Line 6: We use the built-in `prcomp` function to run PCA.

Line 7: We then use the `str` function to inspect our analysis. This can be seen in the output window.

Line 10: Finally, we use the `ggbiplot` function to plot the PCA graph.

For better visualization, we can cluster our data by country of origin. This helps us see which features carmakers from each country prioritize.

```r
#creating a cluster for countries
mtcars.country <- c(rep("Japan", 3), rep("US", 4), rep("Europe", 7), rep("US", 3), "Europe", rep("Japan", 3), rep("US", 4), rep("Europe", 3), "US", rep("Europe", 3))
#plotting pca for visualization
ggbiplot(mtcars.pca, ellipse = TRUE, labels = rownames(mtcars), groups = mtcars.country)
```

We can see that US cars are more focused on `hp`, `cyl`, `disp`, and `wt`, whereas Japanese cars are focused on `gear`, `drat`, `mpg`, and `qsec`. European cars tend to find a middle point between the two.

Line 2: We use the `c` and `rep` functions to create a list of countries for each car, which we pass to the `groups` parameter of our PCA plot.

Line 4: We use the `ggbiplot` function with the `groups` parameter to label cars by their origin so we can differentiate between cars belonging to different countries.

Copyright ©2024 Educative, Inc. All rights reserved
