How to remove irrelevant features in machine learning

In machine learning, the quality of the input data significantly impacts model performance. Irrelevant features, those that carry no useful information about the prediction target, can hurt accuracy, increase computational overhead, and lead to overfitting. In contrast, effective feature selection saves computing resources while improving model interpretability and generalization.

How are irrelevant features removed?

Machine learning engineers use feature selection to retain only the most relevant features in a dataset, which makes the model more efficient and easier to interpret. Irrelevant features can be removed with several families of techniques:

  1. Filter-based techniques: These techniques decide which features to keep and which to drop based on statistical criteria computed independently of any model. Examples include the variance threshold, the chi-squared test, and correlation-based filtering.

  2. Embedded techniques: As the name suggests, these embed feature selection within model training itself. Examples include Lasso regression, which shrinks the coefficients of uninformative features to zero, and decision trees, which rank features by importance (a minimal sketch follows this list).

  3. Wrapper techniques: These evaluate candidate feature subsets by training the model multiple times, once per subset, and keeping the subset that performs best. Examples include forward selection and backward elimination (see the recursive feature elimination sketch after this list).

  4. Hybrid techniques: A hybrid approach mixes and matches the techniques above. For example, we might preselect features with a filter-based method and then refine the subset with an embedded technique (a brief pipeline sketch also follows this list).
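
Below is a minimal sketch of an embedded technique, assuming a hypothetical numeric feature DataFrame X_train and target y_train. It uses scikit-learn’s SelectFromModel with a Lasso estimator, keeping only the features whose coefficients Lasso leaves nonzero; the alpha value is an example choice.

# Embedded technique sketch (assumes hypothetical X_train and y_train)
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
embedded_selector = SelectFromModel(Lasso(alpha=0.1))   # Lasso shrinks uninformative coefficients to zero
embedded_selector.fit(X_train, y_train)                 # Fitting on the training data
selected_features = X_train.columns[embedded_selector.get_support()]  # Features with nonzero coefficients
print("Features kept by Lasso:", list(selected_features))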
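
A wrapper-style backward elimination can be sketched with recursive feature elimination (RFE). The logistic regression estimator and the number of features to keep are illustrative choices, and X_train and y_train are the same hypothetical names as above.

# Wrapper technique sketch: recursive feature elimination (assumes hypothetical X_train and y_train)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe_selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)  # Keep 5 features (example value)
rfe_selector.fit(X_train, y_train)   # Repeatedly refits the model, dropping the weakest feature each round
print("Features kept by RFE:", list(X_train.columns[rfe_selector.support_]))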
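
Finally, a hybrid approach can be sketched as a scikit-learn Pipeline that chains a filter step with an embedded step; again, X_train, y_train, and the parameter values are only illustrative.

# Hybrid technique sketch: filter step followed by an embedded step (assumes hypothetical X_train and y_train)
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
from sklearn.linear_model import Lasso
hybrid_selector = Pipeline([
    ("drop_constants", VarianceThreshold(threshold=0)),    # Filter step: remove zero-variance features
    ("lasso_select", SelectFromModel(Lasso(alpha=0.1))),   # Embedded step: keep features with nonzero Lasso coefficients
])
reduced_train_data = hybrid_selector.fit_transform(X_train, y_train)
print("Shape after hybrid selection:", reduced_train_data.shape)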

Code example

Let's dive into the code below and explore the different methods to remove irrelevant features from the dataset:

# Perform imports and load the dataset
import pandas as pd     # Importing the pandas library
import numpy as np      # Importing the numpy library
from sklearn.model_selection import train_test_split # Importing train_test_split function for data splitting
from sklearn.feature_selection import VarianceThreshold # Importing VarianceThreshold method to remove features with low/specified variance
standard_customer_dataset = pd.read_csv(r"Standard_Customer_Data.csv", nrows=66)    # Reading only 66 rows of data 
standard_customer_dataset.shape  # Shape of data (rows x cols)

# Perform train test split
predict_x = standard_customer_dataset.drop("TARGET",axis=1)    # Dropping the target column from the feature set
predict_y = standard_customer_dataset.TARGET       # Extracting the target column values
# Splitting the data into train and test sets with a 70-30 ratio
train_customer_data_x, test_customer_data_x, train_customer_data_y, test_customer_data_y = train_test_split(predict_x,predict_y,test_size=0.3,random_state=41)

# Remove constant features
filter_constant_values = VarianceThreshold(threshold=0)  # Removing features with zero variance (constant features)
constant_data_values = filter_constant_values.fit_transform(train_customer_data_x)  # Fitting the filter and transforming the training data
# Iterate over the training columns and use get_support() to collect the constant columns
constant_columns = []
for col in train_customer_data_x:
    if col not in train_customer_data_x.columns[filter_constant_values.get_support()]:
        constant_columns.append(col)
print("\nConstant column: \n", constant_columns)  
all_constants = train_customer_data_x.drop(constant_columns,axis=1) # Dropping/removing constant features columns
all_constants.shape   # Shape after removal

# Remove quasi-constant features
# Same process as above with a different threshold
my_quasi_constant_filter = VarianceThreshold(threshold=0.01)
quasi_constant_customer_data = my_quasi_constant_filter.fit_transform(train_customer_data_x)
quasi_constant_columns = []
for col in train_customer_data_x.columns:
    if col not in train_customer_data_x.columns[my_quasi_constant_filter.get_support()]:
        quasi_constant_columns.append(col)
print("\nQuasi constant column: \n", quasi_constant_columns) 
quasi_constant_columns_to_drop = train_customer_data_x.drop(quasi_constant_columns, axis=1)
quasi_constant_columns_to_drop.shape

# Remove duplicate columns
transpose_quasi_constant_data = train_without_quasi_constants.T  # Transposing so that duplicate columns become duplicate rows
transpose_quasi_constant_data.shape

# Print sum of duplicated columns
duplicated_columns = transpose_quasi_constant_data.duplicated()     # Flagging duplicated columns (rows of the transposed data)
sum_duplicated_columns = duplicated_columns.sum()   # Counting the duplicated columns
print("Number of duplicated columns:", sum_duplicated_columns)
dropped_duplicates = transpose_quasi_constant_data.drop_duplicates(keep='first').T  # Dropping duplicated columns and transposing back
print("Shape after dropping duplicates:", dropped_duplicates.shape)
Dataset preprocessing and filter-based feature selection
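
The variance-based steps above only remove features that carry little information on their own. As a further filter-based refinement, mentioned earlier, we can also drop one feature from every pair of highly correlated features. The sketch below assumes the dropped_duplicates DataFrame produced above contains only numeric columns, and 0.9 is an example threshold.

# Remove highly correlated features (correlation-based filter sketch)
correlation_matrix = dropped_duplicates.corr().abs()   # Absolute pairwise correlations between features
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))  # Keep each pair once
highly_correlated = [col for col in upper_triangle.columns if (upper_triangle[col] > 0.9).any()]  # One feature per correlated pair
decorrelated_data = dropped_duplicates.drop(highly_correlated, axis=1)   # Dropping the highly correlated features
print("Shape after dropping highly correlated features:", decorrelated_data.shape)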

Benefits of removing irrelevant features

  • Reduces computational complexity: By selecting only the relevant features, we significantly shrink the dataset, which lowers the computational cost of training and inference.

  • Improves performance: Removing noisy, irrelevant features reduces the risk of overfitting and enhances the model’s accuracy.

  • Efficient training and testing: With fewer features, models can be trained and tested more quickly.

Summary

Feature selection is a useful technique that improves a machine learning model’s efficiency, accuracy, and interpretability. By separating the most informative features from the irrelevant ones early on, we can avoid the curse of dimensionality and the overfitting that comes with it, ultimately improving the model’s performance.
