The dimensionality of a dataset refers to the number of variables or attributes the data possesses. Datasets with many input variables can degrade the performance of the algorithms applied to them, a problem termed the curse of dimensionality.
To avoid this problem, it is necessary to reduce the number of input variables. Reducing the input features shrinks the dimensions of the feature space. This process is termed dimensionality reduction, and it covers the techniques used to reduce the number of features or dimensions in a dataset.
Applying dimensionality reduction to a dataset gives the following benefits:

- Less storage space is required for the data.
- Models train faster because fewer features need to be processed.
- Redundant and highly correlated features are removed, reducing multicollinearity.
- The data becomes easier to visualize in two or three dimensions.
Two methods are used when reducing dimensionality, and they are:

- Feature selection, which keeps a subset of the original features and discards the rest.
- Feature extraction, which derives a smaller set of new features from combinations of the originals.

A short sketch contrasting the two follows this list.
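As a rough illustration of the difference, the following sketch (assuming scikit-learn is installed, with the Iris dataset as sample data) keeps two of the original features via feature selection and derives two new features via feature extraction. The scoring function and component counts here are illustrative choices, not the only options:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Feature selection: keep 2 of the original 4 features,
# scored by the ANOVA F-statistic against the class labels.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: derive 2 new features that are linear
# combinations of all 4 original features.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)
```

Both results have two columns, but the selected columns are original measurements, while the extracted columns are new, derived variables.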
The following are some dimensionality reduction techniques as applied in Python.
Principal Component Analysis (PCA) works on the principle of reducing the number of variables in a dataset while preserving the vital information it contains. The features or variables present in the original set are linearly combined, and the resulting features are termed principal components. The first principal component captures the bulk of the variance present in the dataset, the second principal component captures the majority of the remaining variance, and each subsequent component follows the same pattern. The principal components are mutually uncorrelated.

Principal components are the eigenvectors of the data's covariance matrix. The covariance matrix is a square matrix holding the covariance between each pair of the initial variables, and the eigenvectors are computed from it using linear algebra. Every eigenvector has a corresponding eigenvalue, and the number of eigenvector/eigenvalue pairs equals the number of dimensions of the data. Eigenvalues are the coefficients attached to the eigenvectors, and they indicate how much variance each principal component carries.
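To make the eigenvector/eigenvalue relationship concrete, here is a minimal NumPy sketch, assuming the Iris dataset from scikit-learn as sample data, that performs PCA by eigen-decomposing the covariance matrix directly:

```python
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Center the data so the covariance matrix describes spread around the mean.
X_centered = X - X.mean(axis=0)

# Covariance matrix: 4x4 for the 4 Iris features.
cov = np.cov(X_centered, rowvar=False)

# Eigen-decomposition; eigh suits symmetric matrices such as a covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort the eigenpairs from largest to smallest eigenvalue (variance).
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each eigenvalue is the variance captured by its principal component.
print(eigenvalues / eigenvalues.sum())  # share of total variance per component

# Project the data onto the first two principal components.
X_projected = X_centered @ eigenvectors[:, :2]
print(X_projected.shape)  # (150, 2)
```

The sorted eigenvalues show directly why the first principal component holds the most variance and each later component holds less.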
The following are some applications of PCA:

- Visualizing high-dimensional data in two or three dimensions.
- Compressing images by keeping only the strongest components.
- Speeding up machine learning algorithms by reducing the number of input features.
- Removing noise and redundancy from correlated measurements.

The advantages and disadvantages of PCA are summarized below:
| Advantages | Disadvantages |
| --- | --- |
| 1. Correlated features are removed | 1. Independent variables become difficult to interpret |
| 2. Algorithm performance is enhanced | 2. Data has to be standardized before PCA is done |
| 3. Overfitting of data is reduced | 3. There might be loss of information |
| 4. Visualization of data is improved | |
Figure 1 shows two principal components of a dataset:
```python
# Import the Iris dataset from the sklearn library.
import sklearn.datasets as etheldata
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as PCAplot

X, y = etheldata.load_iris(return_X_y=True)

# Scale the data so that every feature contributes equally.
X_scaler = StandardScaler().fit_transform(X)
print(X_scaler[:4])  # preview the first four scaled rows

# Fit PCA on the scaled data and keep three components.
pca = PCA(n_components=3)
pca.fit(X_scaler)
X_tran = pca.transform(X_scaler)

# Visualization of output: plot the first two principal components.
fig, axe = PCAplot.subplots(dpi=400)
axe.scatter(X_tran[:, 0], X_tran[:, 1], c=y, marker='o')
fig.savefig("output/img.png")  # the output/ directory must already exist
PCAplot.show()
```
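To check how much of the dataset's variance the three components retain, the fitted PCA object exposes the `explained_variance_ratio_` attribute:

```python
# Share of total variance captured by each of the three components.
print(pca.explained_variance_ratio_)
# Roughly [0.73, 0.23, 0.04] for the standardized Iris data
```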