Feature Selection (Filter Methods)

In this lesson, you will learn about Feature Selection, which refers to the process of choosing the most appropriate features for building a model.

Feature Selection

Feature (or Variable) Selection refers to the process of selecting the features used to predict the target or output. The purpose of Feature Selection is to select the features that contribute the most to output prediction. The following line from the abstract of a Machine Learning Journal article sums up the purpose of Feature Selection.

The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.

These benefits of Feature Selection are usually quoted:

  • Reduces overfitting: Overfitting was explained in the previous lessons. If a model is overfitting, reducing the number of features is one way to mitigate it.

  • Improves accuracy: A model that overfits less performs better on unseen data, which ultimately improves the model's accuracy.

  • Reduces training time: Fewer features mean less data to process, so training is faster.

Feature Selection methods fall into several categories; this lesson focuses on Filter Methods.

Filter Methods

Filter Methods select features based on their statistical scores with respect to the output column. The selection of features is independent of any Machine Learning algorithm. The following rules of thumb apply:

  • The more a feature is correlated with the output column (the column to be predicted), the better the model is expected to perform.

  • Features should be as uncorrelated with each other as possible. When some input features are correlated with other input features, the situation is known as Multicollinearity. It is recommended to remove such redundancy for better model performance.

Removing features with low variance

In Scikit Learn, VarianceThreshold is a simple baseline approach to Feature Selection. It removes all features whose variance does not meet a given threshold. By default, this means removing features that have the same value in every row (zero-variance features). Such features provide no value in building a machine learning predictive model.
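A minimal sketch of VarianceThreshold, using a small toy array (the data below is illustrative, not from the lesson):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The first column is constant (zero variance) and carries no information
X = np.array([
    [0, 2.0, 1],
    [0, 1.5, 3],
    [0, 2.5, 2],
    [0, 2.2, 4],
])

# With the default threshold of 0.0, only zero-variance features are dropped
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)         # (4, 2): the constant column was removed
print(selector.get_support())  # boolean mask of the kept features
```

`get_support()` is handy for mapping the reduced matrix back to the original column names.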

Univariate Selection Methods

Univariate Feature Selection methods select the best features based on univariate statistical tests. Scikit Learn provides the following Univariate Feature Selection methods.

  • SelectKBest
  • SelectPercentile

The two methods above are the most commonly used.

  • SelectKBest: It gives us the K best features. It takes two arguments: k, which specifies how many features to select, and score_func, which specifies the statistical test to use. The value k='all' keeps every feature and gives us the score of each input feature against the output column.

  • score_func: It can be one of the following scoring functions:

  1. For regression: f_regression and mutual_info_regression

  2. For classification: chi2, f_classif, and mutual_info_classif

f_regression: It is meant for regression problems. It calculates the correlation between each input feature and the output column, which is converted to an F-score and then to a p-value.

mutual_info_regression: Mutual Information (MI) is a measure of the mutual dependence between two variables. It is a non-negative value. An MI value of zero indicates that the variables under consideration are independent, while a high MI value indicates a strong dependence between them. This scoring function is meant for regression problems only, that is, when the output variable is continuous-valued. Mutual Information requires a larger number of samples to estimate the dependence between two variables reliably.
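A quick sketch of mutual_info_regression on synthetic data (the data-generating setup here is an illustrative assumption): the target depends only on the first feature, so its MI score should dominate.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
# y is driven by feature 0 only; features 1 and 2 are pure noise
y = 2 * X[:, 0] + 0.1 * rng.rand(200)

mi = mutual_info_regression(X, y, random_state=0)
print(mi)  # feature 0 scores highest; the noise features score near zero
```

Note that MI scores are on an arbitrary scale; they are useful for ranking features, not for interpreting absolute magnitudes.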

chi2: It is meant for classification problems. It computes the chi-squared statistic (χ²) between each non-negative feature and the output categorical variable. The chi-squared test measures the dependency between variables and helps us exclude features that are independent of the target variable.

f_classif: It is the classification counterpart of f_regression. It computes the ANOVA F-value and is best suited to numerical inputs with a categorical output.

mutual_info_classif: It is the classification counterpart of mutual_info_regression.
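Putting SelectKBest and chi2 together, here is a minimal sketch on the iris dataset (chosen because its four measurements are non-negative, as chi2 requires):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # 4 non-negative features, 3 classes

# Keep the 2 features with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, X_new.shape)  # (150, 4) (150, 2)
print(selector.scores_)      # chi-squared score of each original feature
```

Passing k='all' instead would keep every feature while still populating `scores_`, which is useful for inspecting the full ranking.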

  • SelectPercentile: It selects features according to a percentile of the highest scores. It takes two arguments: score_func, the same argument as seen above for SelectKBest, and percentile, which specifies what percentage of the top-scoring features to keep.

  • Correlation Matrix: It gives us the pairwise correlation between the features, which makes multicollinearity easy to spot.
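The last two items can be sketched together on the iris dataset (the dataset and the 50th-percentile cutoff are illustrative choices, not prescribed by the lesson):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

data = load_iris()
X, y = data.data, data.target

# Keep the top 50% of features ranked by the ANOVA F-score
selector = SelectPercentile(score_func=f_classif, percentile=50)
X_new = selector.fit_transform(X, y)
print(X.shape, X_new.shape)  # (150, 4) (150, 2)

# Pairwise correlation matrix of the input features
df = pd.DataFrame(X, columns=data.feature_names)
corr = df.corr()
print(corr.round(2))  # highly correlated pairs signal multicollinearity
```

On iris, petal length and petal width are very strongly correlated, so by the multicollinearity rule of thumb one of them could be dropped.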

Coding examples

The coding examples below demonstrate use cases for the measures described above.

Code 1
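A sketch that chains the filter methods from this lesson in a single scikit-learn Pipeline (the iris dataset and k=2 are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    # Step 1: drop any zero-variance (constant) features
    ("variance", VarianceThreshold(threshold=0.0)),
    # Step 2: keep the 2 best remaining features by ANOVA F-score
    ("kbest", SelectKBest(score_func=f_classif, k=2)),
])
X_selected = pipe.fit_transform(X, y)
print(X_selected.shape)  # (150, 2)
```

Wrapping the steps in a Pipeline keeps the selection reproducible and lets the same transformations be applied to new data with `pipe.transform`.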
