
Feature Space

Learn feature engineering techniques by implementing feature space exploration, subspace analysis, and feature transformation.

Feature engineering techniques are crucial for improving model performance, particularly when dealing with non-linear data or high-dimensional datasets. This lesson explores the theoretical foundation of feature space, the practical concept of data subsetting (subspace), and the power of feature transformation.

In both supervised learning and clustering, each data point $\mathbf{x}$ in the training dataset is represented as a $d$-dimensional vector in $\mathbb{R}^d$, such that $\mathbf{x} \in \mathbb{R}^d$. The elements of $\mathbf{x}$ are referred to as features. Therefore, $\mathbf{x}$ is also referred to as a feature vector.

Feature space

A feature space is a mathematical space that represents the features or attributes of a given dataset. Each observation in the dataset is represented by a vector in this space, where each dimension of the vector corresponds to a specific feature.

For example, suppose we have a dataset of cars in which each car is described by its make, model, year, horsepower, and fuel efficiency. The feature space for this dataset would be a five-dimensional space, where each dimension corresponds to one of these features. By analyzing the patterns and relationships among the feature vectors in the feature space, we can gain insights into the underlying structure and characteristics of the dataset.
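To make this concrete, here is a minimal sketch that represents two cars as points in a five-dimensional feature space. The numeric codes for make and model are hypothetical label encodings chosen purely for illustration; any consistent numeric encoding (such as one-hot) would work.

import numpy as np
# Hypothetical label encodings: make (0 = Toyota, 1 = Honda),
# model (0 = Corolla, 1 = Accord); chosen only for illustration
# Feature vector layout: [make, model, year, horsepower, fuel efficiency (mpg)]
corolla = np.array([0, 0, 2015, 132, 29])
accord = np.array([1, 1, 2020, 252, 33])
# Each car is a point in R^5, so the feature space here has d = 5
print(corolla.shape)  # (5,)
# The distance between feature vectors measures how similar two cars are
# (in practice, features are usually standardized first so that
# large-scale features such as year do not dominate the distance)
print(np.linalg.norm(corolla - accord))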

Subspace example

The concept of a subspace in machine learning often refers to the practical process of data subsetting or data filtering, where we select a specific group of data points based on feature values.

Note: It is important to remember that, mathematically, a vector subspace must be closed under vector addition and scalar multiplication, and must contain the zero vector. The operation below creates a subset of data points, not a mathematical vector subspace.

Consider a dataset of cars $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, where each $\mathbf{x}_i \in \mathbb{R}^d$ is a feature vector. Let’s define a subset that only includes cars with horsepower greater than 250:

import numpy as np
# Example feature vectors for cars dataset
# [Make, Model, Year, Horsepower, Fuel efficiency (mpg)]
car1 = np.array(['Toyota', 'Corolla', 2015, 132, 29])
car2 = np.array(['Honda', 'Accord', 2020, 252, 33])
car3 = np.array(['Tesla', 'Model S', 2018, 518, 98])
car4 = np.array(['Ford', 'Mustang', 2010, 315, 22])
car5 = np.array(['Chevrolet', 'Impala', 2012, 300, 23])
# Create array of feature vectors
cars = np.array([car1, car2, car3, car4, car5])
# Define subspace of cars with horsepower greater than 250
# (the mixed-type array stores all values as strings, so cast horsepower to int)
horsepower_subspace = cars[cars[:, 3].astype(int) > 250, :]
# Print subspace
print("Subspace of cars with horsepower greater than 250:\n")
print(horsepower_subspace)

Here is the explanation for the code above:

  • First, we define five example feature vectors, one per car, following the layout [Make, Model, Year, Horsepower, Fuel efficiency (mpg)].
  • Next, we stack these vectors into a single array in which each row represents one car’s features.
  • We then define the subspace of cars with horsepower greater than 250 by keeping only the rows whose horsepower feature exceeds 250; the values are cast to int because the mixed-type array stores them as strings.
  • Finally, we print the resulting subspace.

Note: In real-world datasets, the feature space is, strictly speaking, only a subset of $\mathbb{R}^d$, but its dimension refers to the number of components in the feature vector, say $d$. Therefore, we can think of it as a $d$-dimensional vector space that contains the feature vectors.

Feature transformations

Feature transformation is a powerful technique that changes the representation of the features so that simpler models, such as linear regression, can capture non-linear relationships in the original data.
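To see why this works, consider the following minimal sketch (the toy data and the squared-feature transform are illustrative assumptions, not part of the lesson’s dataset). It fits a linear model without an intercept to data generated from a quadratic relationship, first on the raw feature $x_i$ and then on the transformed feature $x_i^2$:

import numpy as np
# Toy data with a quadratic relationship (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(0, 0.2, size=x.shape)
# Model 1: fit y = w1 * x on the raw feature
w1, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
# Model 2: fit y = w2 * z on the transformed feature z = x^2
z = x**2
w2, *_ = np.linalg.lstsq(z.reshape(-1, 1), y, rcond=None)
# Compare fit quality via mean squared error
mse_raw = np.mean((y - w1[0] * x) ** 2)
mse_transformed = np.mean((y - w2[0] * z) ** 2)
print(f"MSE with raw feature x:           {mse_raw:.3f}")
print(f"MSE with transformed feature x^2: {mse_transformed:.3f}")

The second fit achieves a far lower error because, after the transformation, the relationship between the feature and the target is linear, which is exactly the idea formalized in the example below.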

Consider the regression dataset with $d = 1$. The feature vectors are, in fact, real numbers here. Assume the targets $y_i$ are also real numbers. Also consider two different models $f_{w_1}(x_i) = w_1 x_i$ ...