In machine learning, we need enough data to train models that generalize well. At the same time, a large dataset may contain redundant features that contribute little to the learning process; instead, they slow down training and waste time and effort.
In such cases, we can use dimensionality reduction methods to remove these unnecessary features and keep only the most important ones. In this shot, we will discuss some of the most commonly used methods.
When exploring any dataset, an ML engineer's first step is to locate the missing values and impute them. However, if a feature has too many missing values, it may not help train the model, and such attributes can be dropped. The missing value ratio method suggests that when the percentage of missing values in a column exceeds a defined threshold, that column should be dropped.
The following command can be used to calculate the percentage of missing values, where train contains the pandas DataFrame of the training data.
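A minimal sketch of the missing value ratio method, assuming a small, made-up train DataFrame (the column names and the 40% threshold below are illustrative choices):

```python
import numpy as np
import pandas as pd

# Hypothetical training data with some missing values
train = pd.DataFrame({
    "age":    [25, np.nan, 40, np.nan],
    "income": [50000, 60000, np.nan, 80000],
})

# Percentage of missing values in each column
missing_pct = train.isnull().sum() / len(train) * 100
print(missing_pct)  # age: 50.0, income: 25.0

# Drop every column whose missing-value ratio exceeds the threshold
threshold = 40
train = train[missing_pct[missing_pct <= threshold].index]
```

Here `age` is missing in half the rows, so it is dropped, while `income` survives the filter.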
There may be some variables in the data that have almost the same value for all cases. For example, if a variable always has a value around 5, there is no need to keep it, because it carries almost no information about the target variable.
We can find these features by calculating the variance. Those attributes that have minimal variance can be removed from the data without too much of a negative impact.
The following command can be used to find the variance of the different features.
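A minimal sketch of the low variance filter on a made-up DataFrame (the 0.01 variance threshold is an arbitrary choice for illustration):

```python
import pandas as pd

# Hypothetical training data; one feature is nearly constant
train = pd.DataFrame({
    "nearly_constant": [5.00, 5.01, 5.00, 5.02, 5.01],
    "useful":          [1.0, 7.0, 3.0, 9.0, 4.0],
})

# Variance of each numeric feature
variances = train.var()

# Drop features whose variance falls below a chosen threshold
threshold = 0.01
low_variance = variances[variances < threshold].index
train = train.drop(columns=low_variance)
```

After the filter, the nearly constant column is gone while the informative one remains.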
A high correlation between two features means that they follow the same trends and patterns, so keeping only one of them loses little information. To find the correlation between the different features in a dataset, we can use the following command:
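A minimal sketch using pandas' `DataFrame.corr`, on a made-up DataFrame in which two features measure the same quantity in different units:

```python
import pandas as pd

# Made-up data: height_cm and height_in are perfectly correlated
train = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "height_in": [59, 63, 67, 71],
    "weight_kg": [55, 80, 60, 90],
})

# Pairwise (Pearson) correlation between every pair of features
corr_matrix = train.corr()
print(corr_matrix)
```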
This command allows us to see the correlation between every pair of features. If the correlation between a pair exceeds a threshold such as 0.6, we may drop one of the two attributes.
When choosing which one to drop, we should keep the feature that has the higher correlation with the target variable.
Decision trees can be used to rank the features in a dataset by importance, which allows us to then remove the less important ones. Many trees are built against the target variable, each trained on a small random subset of the attributes, much like a random forest. After this process, the attributes that were most frequently selected as the best split can be retained.
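A sketch of this idea using scikit-learn's RandomForestClassifier on synthetic data (the dataset and the choice to keep the top 3 features are illustrative; `feature_importances_` aggregates how useful each attribute was across the trees):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data: 6 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
train = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Each tree in the forest is trained on a random subset of rows and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train, y)

# Importance scores sum to 1; keep the highest-scoring features
importances = pd.Series(forest.feature_importances_, index=train.columns)
top_features = importances.nlargest(3).index.tolist()
```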
In the backward feature elimination method, we drop features one at a time to see the effect of each. In each round, the feature whose removal hurts performance the least is eliminated.
In the beginning, the model is trained using all n variables. Then, we remove each feature in turn and train the model on the remaining n-1 features. The feature whose removal increases the error the least is dropped, leaving a total of n-1 columns.
In the next iteration, we repeat the process with the remaining features, training the model on n-2 features at a time. This carries on until no more features can be removed.
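The procedure above can be sketched with plain NumPy, under the assumption of a least-squares linear model with mean squared error as the performance measure (the model, the metric, and the stopping rule are illustrative choices):

```python
import numpy as np

def fit_error(X, y):
    # Least-squares linear fit; returns the mean squared training error
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ coef - y) ** 2)

def backward_elimination(X, y, n_keep):
    cols = list(range(X.shape[1]))
    while len(cols) > n_keep:
        # Try dropping each remaining feature; remove the one whose
        # absence increases the error the least
        errors = {c: fit_error(X[:, [k for k in cols if k != c]], y)
                  for c in cols}
        cols.remove(min(errors, key=errors.get))
    return cols

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + 3 * X[:, 1]          # feature 2 is pure noise
kept = backward_elimination(X, y, 2)   # the noise feature is dropped first
```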
The forward feature selection method is the opposite of backward feature elimination. In this technique, we start training with only one feature and progressively add the feature that causes the greatest performance increase at each step.
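A matching sketch of forward selection, under the same illustrative assumptions (least-squares linear model, mean squared error):

```python
import numpy as np

def fit_error(X, y):
    # Least-squares linear fit; returns the mean squared training error
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ coef - y) ** 2)

def forward_selection(X, y, n_keep):
    remaining = list(range(X.shape[1]))
    selected = []
    while len(selected) < n_keep:
        # Add the feature that gives the lowest error when combined
        # with the features chosen so far
        errors = {c: fit_error(X[:, selected + [c]], y) for c in remaining}
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + 3 * X[:, 1]       # feature 2 is pure noise
chosen = forward_selection(X, y, 2)
```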
Principal component analysis, or PCA, is a technique that reduces the dimension of the data while retaining most of its variance. It creates a new set of orthogonal dimensions, where each new component is a linear combination of the original features. The transformation is arranged so that the first principal component explains the largest share of the dataset's variance, the second the next largest, and so on.
To reduce the dimension from n to k, the first k principal components can be selected for training.
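A minimal PCA sketch via the singular value decomposition in pure NumPy (the 5-to-2 reduction and the random data are arbitrary examples):

```python
import numpy as np

def pca_reduce(X, k):
    # Center the data, then project onto the top-k principal components
    Xc = X - X.mean(axis=0)
    # Rows of Vt are orthogonal directions, ordered by variance explained
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced = pca_reduce(X, 2)   # shape (100, 2)
```

Because the SVD orders the singular values from largest to smallest, the first column of the reduced data always carries at least as much variance as the second.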