Feature Selection and Feature Engineering

Learn how companies like Facebook, Twitter, Airbnb, Uber, and DoorDash design feature selection and feature engineering pipelines to build scalable, high-performance machine learning systems.

Feature engineering is the process of transforming raw data into meaningful, model-ready features. In real-world ML system design, good features often matter more than the choice of model.

This lesson introduces the most widely used feature engineering techniques in production systems and explains when, why, and how to use each one.

1. One-hot encoding for categorical features

One-hot encoding converts categorical variables into binary vectors where each category is represented by a 0 or 1.

When to use one-hot encoding

  • Categorical features with low to medium cardinality

  • Linear models and tree-based models

  • Structured data (not text)

One-hot encoding example
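As a minimal sketch of what one-hot encoding produces, consider a toy "color" column (this example and its values are illustrative, not from the original figure). Using pandas.get_dummies, each unique category becomes its own binary column; note that the output dtype depends on your pandas version (bool in pandas 2.x, uint8 earlier):

```python
import pandas as pd

# A toy categorical column with three categories
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# Each unique category becomes a binary indicator column,
# ordered alphabetically: blue, green, red
encoded = pd.get_dummies(df["color"])
print(encoded)
```

Row 0 ("red") gets a 1 only in the "red" column, row 1 ("green") only in "green", and so on.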

Common problems

  • Expensive computation and high memory consumption are the major problems with one-hot encoding. Columns with many unique values produce high-dimensional feature vectors. For example, a column with one million unique values produces feature vectors with a dimensionality of one million.

  • One-hot encoding is not suitable for Natural Language Processing tasks. The vocabulary of a natural language is usually very large, so we can’t use one-hot encoding to represent each word: the resulting vectors are too big to store in memory.

Best practices

  • Depending on the application, levels/categories that are not important can be grouped together into an “Other” class.
  • Make sure that the pipeline can handle unseen values in the test set.

In Python, there are many ways to do one-hot encoding, for example, pandas.get_dummies and scikit-learn’s OneHotEncoder. pandas.get_dummies does not “remember” the encoding learned during training, so if the test data contains new values, it can lead to an inconsistent mapping. OneHotEncoder is a scikit-learn transformer, so you can fit it once and apply the same mapping consistently during training and prediction.

One-hot encoding in tech companies

  • It’s not practical to use one-hot encoding to handle large-cardinality features, i.e., features with hundreds or thousands of unique values. Companies like Instacart and DoorDash use more advanced techniques to handle them.

2. Feature hashing

Feature hashing maps high-cardinality categorical features into a fixed-size vector using a hash function.

Why feature hashing is useful

  • Handles thousands or millions of categories

  • Fixed memory footprint

  • No need to store a category dictionary

Feature hashing example

  • First, you choose the dimensionality of your feature vectors. Then, using a hash function, you convert each value of your categorical attribute (or each token in your collection of documents) into a number, and convert that number into an index of your feature vector. The process is illustrated in the diagram below.
An illustration of the hashing trick with a desired dimensionality of 5 for an attribute with K original values
  • Let’s illustrate what it would look like to convert the text “The quick brown fox” into a feature vector. Suppose our hash function returns the following values for each word in the phrase:

    the = 5
    quick = 4
    brown = 4
    fox = 3
    
  • Let’s define a hash function, h, that takes a string as input and outputs a non-negative integer, and let the desired dimensionality be 5. Applying the hash function to each word and taking the result modulo 5 to obtain the word’s index, we get:

    h(the) mod 5 = 0
    h(quick) mod 5 = 4
    h(brown) mod 5 = 4
    h(fox) mod 5 = 3
    
  • In this example:

    • h(the) mod 5 = 0 means that we have one word in dimension 0 of the feature vector.

    • h(quick) mod 5 = 4 and h(brown) mod 5 = 4 means that we have two words in dimension 4 of the feature vector.

    • h(fox) mod 5 = 3 means that we have one word in dimension 3 of the feature vector.

    • As you can see, there are no words in dimensions 1 or 2 of the vector, so we keep them as 0.

  • Finally, we have the feature vector as: [1, 0, 0, 1, 2].

  • As you can see, there is a collision between the words “quick” and “brown”: both are represented by dimension 4. The lower the desired dimensionality, the higher the chance of collision. To reduce the probability of collision, we can increase the number of dimensions. This is the trade-off between speed and quality of learning.
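The steps above can be sketched in Python. Here MD5 (from the standard library) is used as a stand-in deterministic hash function, so the exact indices, and therefore the collision pattern, will differ from the toy values above:

```python
import hashlib

def hash_token(token: str, dim: int) -> int:
    # Deterministic stand-in hash: interpret the MD5 digest as an
    # integer, then take it modulo the desired dimensionality.
    # (Python's built-in hash() is randomized per process, so avoid it.)
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % dim

def hash_features(tokens, dim):
    # Each token increments the count at its hashed index, giving a
    # fixed-size vector regardless of how many distinct tokens exist.
    vec = [0] * dim
    for token in tokens:
        vec[hash_token(token, dim)] += 1
    return vec

tokens = "the quick brown fox".split()
print(hash_features(tokens, 5))  # a length-5 count vector whose entries sum to 4
```

For production use, scikit-learn provides this technique as sklearn.feature_extraction.FeatureHasher (and HashingVectorizer for raw text), which uses MurmurHash3 internally.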

Commonly used hash functions are MurmurHash3, Jenkins, CityHash, and ...