What are the parameters for feature selection?

Dimensionality reduction is the process of reducing a large number of variables under consideration to a smaller set of principal variables. We use it to manage large numbers of variables in fields such as signal processing, speech recognition, and bioinformatics.

One reason for dimensionality reduction is that larger data sets take more time and space to process. Furthermore, not all features are equally important. Some features may provide no information at all, while others may duplicate information already provided by other features. Selecting a good set of features helps us decrease space and time complexity, and it can improve accuracy in both supervised and unsupervised learning.

Take a simple email classification problem where we want to determine whether a given email is spam. A list of features (such as the contents of the email, whether its title is unique, the sender's domain address, etc.) can be considered here. We can reduce the number of features used to decide whether an email is spam by combining the title and the domain address into a single feature, so that the classifier evaluates them together rather than separately. Our number of features drops from $x$ to $x-1$.

There are multiple ways to perform dimensionality reduction. Some of these methods include:

Feature Selection Methods: Feature selection methods reduce the number of input variables to those thought to be most relevant for predicting the target variable. This means selecting the most important features in a dataset, which requires an understanding of which features are relevant to the model and which are not. There are several feature selection methods, and a few will be discussed in this shot.
Matrix Factorization: Matrix factorization is used in many dimensionality reduction approaches. The basic idea is to find two (or more) matrices whose product best approximates the original matrix. In some domains, the original matrices are very large, sparse, and unordered. Factoring them yields a set of more manageable, compact, and ordered matrices.
Manifold Learning: Manifold learning is a type of unsupervised estimator that attempts to represent datasets as low-dimensional manifolds embedded in high-dimensional spaces. High-dimensional datasets can be difficult to visualize, so the dimension of a dataset must be decreased in some way to reveal its structure. Taking a random projection of the data is the simplest technique to achieve this. Though this allows for some visibility of the data structure, the randomness of the choice leaves a lot to be desired. Some manifold learning approaches include multidimensional scaling (MDS), locally linear embedding (LLE), and isometric mapping (IsoMap).
Autoencoder Methods: An autoencoder is an unsupervised artificial neural network that compresses data to a smaller dimension and then reconstructs the input from that compressed form. It finds a lower-dimensional representation of the data by focusing on the most relevant features and eliminating noise and redundancy. Because an autoencoder attempts to reconstruct the input data, the number of output units must equal the number of input units. An encoder and a decoder are usually employed here. The encoder compresses the data it receives, while the decoder restores it as closely as possible to its original state. An autoencoder is a type of feature extraction method for dimensionality reduction.
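
To make the matrix factorization idea from the list above more concrete, here is a minimal sketch that approximates a matrix by the product of a column vector and a row vector (a rank-1 factorization) using power iteration. The function name `rank_one_factors`, the iteration count, and the toy matrix are assumptions made for illustration. The sketch also assumes a non-zero matrix; real projects would normally rely on a numerical library instead.

```python
import math


def rank_one_factors(matrix, iterations=100):
    """Approximate `matrix` by the outer product of vectors u and v.

    Alternates u = M v and v = M^T u (power iteration), keeping v at unit
    length. Assumes the matrix is not all zeros.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Final u carries the scale, so that matrix[i][j] is close to u[i] * v[j].
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return u, v


# Toy matrix (an assumption): every row is a multiple of [1, 2],
# so a rank-1 factorization can reproduce it.
matrix = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
u, v = rank_one_factors(matrix)
approx = [[u[i] * v[j] for j in range(2)] for i in range(3)]
print(approx)
```

Here six stored numbers are replaced by five (three in `u`, two in `v`). The saving grows quickly for larger matrices that are well approximated by a few such factors.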

Components of dimensionality reduction

Dimensionality reduction is made up of two components:

• Feature selection: In this step, we strive to locate a subset of the original set of variables or features so that we can model the problem with a smaller set. There are three common methods:

Filter methods: Instead of the error rate, filter methods score a feature subset using a proxy measure. This measure is cheap to compute while still capturing the usefulness of the feature set. Filters are less computationally costly than wrappers, but they produce a feature set that isn't tailored to a specific type of predictive model. A filter's feature set is more general than a wrapper's, so it usually gives lower prediction performance. On the other hand, the feature set does not carry the assumptions of a prediction model and is thus better suited to revealing the relationships between the features. Filter methods have also been used as a pre-processing step for wrapper methods, allowing wrappers to be applied to larger problems.
Wrapper methods: Wrapper methods score feature subsets using a predictive model. Each new subset is used to train a model, which is then tested on a hold-out set. The score for that subset is the number of errors made on the hold-out set. Based on the results of the previous model, we decide whether to add features to or remove features from the subset, which reduces the problem to a search over feature subsets. Because a new model is trained for every subset, these approaches are frequently very time-consuming to compute.
Embedded methods: Embedded methods combine the qualities of both filter and wrapper methods by performing feature selection as part of model training. They rely on algorithms with built-in feature selection. LASSO regression is one of the most common examples: its L1 penalty, which also reduces overfitting, can shrink the coefficients of unhelpful features to exactly zero. RIDGE regression uses a similar built-in penalty, but it only shrinks coefficients toward zero without eliminating them, so it does not select features by itself.

• Feature extraction: This transforms the data from a high-dimensional space into a space with fewer dimensions.

In this shot, we will focus on feature selection.
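
To illustrate the wrapper approach described above, the following sketch runs a greedy forward search: it repeatedly adds the feature that most reduces the number of errors on a hold-out set and stops when no candidate improves the score. The nearest-centroid classifier, the function names, and the toy data are all assumptions chosen to keep the example self-contained. Any predictive model could take the classifier's place.

```python
def centroid_errors(train, test, subset):
    """Train a nearest-centroid classifier on `subset`; count hold-out errors."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(subset))
        for j, index in enumerate(subset):
            acc[j] += features[index]
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [s / counts[label] for s in acc] for label, acc in sums.items()
    }

    def predict(features):
        picked = [features[i] for i in subset]
        return min(
            centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(picked, centroids[label])
            ),
        )

    return sum(predict(features) != label for features, label in test)


def forward_selection(train, test, n_features):
    """Greedily add the feature that most reduces hold-out errors."""
    selected = []
    best_errors = len(test) + 1  # worse than any achievable score
    while len(selected) < n_features:
        candidates = [i for i in range(n_features) if i not in selected]
        scored = [
            (centroid_errors(train, test, selected + [i]), i) for i in candidates
        ]
        errors, feature = min(scored)
        if errors >= best_errors:
            break  # no remaining feature improves the score
        selected.append(feature)
        best_errors = errors
    return selected, best_errors


# Toy data (an assumption): feature 0 separates the classes, feature 1 is noise.
train = [([1.0, 5.0], "a"), ([1.2, 1.0], "a"), ([3.0, 4.0], "b"), ([3.1, 1.0], "b")]
test = [([0.9, 4.0], "a"), ([3.2, 2.0], "b"), ([1.1, 0.5], "a"), ([2.9, 5.5], "b")]

selected, errors = forward_selection(train, test, n_features=2)
print(selected, errors)  # [0] 0
```

Even on this tiny example, the search trains one model per candidate subset, which is why wrapper methods become expensive as the number of features grows.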

What is feature selection?

Feature selection is the process of distinguishing highly predictive features from redundant or irrelevant ones. Techniques from machine learning and statistics are used to pick a subset of relevant features for use in building models. The following are some of the reasons why feature selection approaches are used:

• To simplify models so that they are easier for researchers and users to interpret.

• To shorten training times.

• To avoid the curse of dimensionality.

• To make the data more compatible with a learning model class.

When utilizing a feature selection technique, the key concept is that the data contains some features that are either redundant or irrelevant and hence may be deleted without causing significant information loss.
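
One way to see an irrelevant feature being discarded is with an embedded method such as LASSO. The sketch below is a minimal coordinate-descent implementation, not a production solver. The function names, the penalty value, the number of sweeps, and the toy data are assumptions made for illustration. The target depends only on the first feature, so we expect the L1 penalty to set the second weight to zero.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`, clipping at zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=50):
    """Minimize (1/2n) * ||y - Xw||^2 + penalty * ||w||_1.

    Assumes no feature column is all zeros.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Correlate feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                prediction = sum(w[k] * X[i][k] for k in range(d))
                partial_residual = y[i] - prediction + w[j] * X[i][j]
                rho += X[i][j] * partial_residual
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, penalty) / scale
    return w


# Toy data (an assumption): y = 2 * first feature; the second feature is irrelevant.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]

weights = lasso_coordinate_descent(X, y, penalty=0.1)
print(weights)  # roughly [1.9, 0.0]
```

The first weight is shrunk slightly below 2 by the penalty, while the second is eliminated, so the corresponding feature can be dropped from the model.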

Filter metrics for feature selection

The following filter metrics can be used for feature selection:

correlation
entropy

Feature Correlation

Feature correlation is an important step in the feature selection phase of data pre-processing, especially if the features are continuous. Correlation is a technique for measuring how the variables in a dataset are related to one another. If $x1$ and $x2$ are two correlated features of a data set, then a model gains little from having both, because $x1$ and $x2$ contribute similar information. Hence only one representative of such features is required. For dimensionality reduction purposes, a correlation algorithm selects one representative feature from each correlated group and ignores the redundant ones. The Pearson correlation coefficient can be used to determine the correlation between features. To calculate it, take the covariance of the two features $x1$ and $x2$ and divide it by the product of their standard deviations, as shown below:

$P(x1, x2) = \dfrac{\operatorname{cov}(x1, x2)}{\sigma_{x1}\,\sigma_{x2}}$

If two variables are strongly correlated, we can predict one from the other. As a result, if two features are strongly correlated, the model only requires one of them, as the other provides little or no extra information.
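
The correlation filter described above can be sketched in a few lines. This example computes the Pearson coefficient directly from its definition and keeps only the first feature of each strongly correlated group. The function names, the 0.9 threshold, and the toy data are assumptions made for illustration, and the sketch assumes that no feature is constant (a constant feature has zero standard deviation).

```python
import math


def pearson(x, y):
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if it is not strongly correlated with one already kept.

    `features` maps a feature name to its list of values; `threshold` is an
    illustrative cut-off, not a recommended value.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (an assumption): height_in repeats the information in height_cm.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40.0, 38.0, 44.0, 39.0, 42.0],
}

selected = drop_correlated(data)
print(selected)  # ['height_cm', 'shoe_size']
```

Which feature of a correlated pair survives depends on the iteration order here. A real pipeline might instead keep the feature that is cheaper to collect or easier to interpret.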

Entropy (H)

The entropy score of a feature $f1$ is computed by removing $f1$ and then computing the entropy of the remaining features. The greater the information content of $f1$, the lower the entropy value (excluding $f1$). The score of every feature is determined in this way. Finally, which features are selected is decided by either a threshold value or a further relevancy check. Entropy is commonly used in unsupervised learning because the dataset has no class field, so the entropy of the features themselves can provide significant information.
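
As a minimal sketch of this leave-one-out procedure, the example below removes each feature in turn and computes the entropy of what remains. It assumes discrete features and uses the Shannon entropy of the joint empirical distribution; the text above does not fix a particular entropy measure, so this choice, the function names, and the toy data are assumptions.

```python
import math
from collections import Counter


def joint_entropy(rows):
    """Shannon entropy (in bits) of the empirical distribution over rows."""
    counts = Counter(rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_without(rows, index):
    """Entropy of the data set after removing the feature at `index`."""
    reduced = [row[:index] + row[index + 1:] for row in rows]
    return joint_entropy(reduced)


# Toy data (an assumption): columns are f1, f2, f3, and f3 is constant.
rows = [
    (0, 0, 1),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 1),
]

scores = {f"f{i + 1}": entropy_without(rows, i) for i in range(3)}
print(scores)  # {'f1': 1.0, 'f2': 1.0, 'f3': 2.0}
```

Removing `f1` or `f2` lowers the remaining entropy from 2 bits to 1 bit, so both carry information. Removing the constant `f3` changes nothing, which marks it as a candidate for elimination.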

Conclusion

There is no one-size-fits-all approach to selecting features. Just as there is no best machine learning algorithm, there is no best set of input variables or best feature selection method. Not universally, at least.

Instead, you must experiment carefully and methodically to determine what works best for your problem. Try a variety of models trained on different subsets of features, chosen using different statistical measures, and compare their results.

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)