Feature Selection and Feature Engineering

Learn how companies like Facebook, Twitter, Airbnb, Uber, and DoorDash design feature selection and feature engineering pipelines to build scalable, high-performance machine learning systems.

Feature engineering is the process of transforming raw data into meaningful, model-ready features. In real-world ML system design, good features often matter more than the choice of model.

This lesson introduces the most widely used feature engineering techniques in production systems and explains when, why, and how to use each one.

1. One-hot encoding for categorical features

One-hot encoding converts categorical variables into binary vectors where each category is represented by a 0 or 1.

When to use one-hot encoding

  • Categorical features with low to medium cardinality

  • Linear models and tree-based models

  • Structured data (not text)

One-hot encoding example
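As a minimal sketch of what one-hot encoding produces, consider a toy "color" column (this example and its values are illustrative, not from the original figure). Using pandas.get_dummies, each unique category becomes its own binary column; note that the output dtype depends on your pandas version (bool in pandas 2.x, uint8 earlier):

```python
import pandas as pd

# A toy categorical column with three categories
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# Each unique category becomes a binary indicator column,
# ordered alphabetically: blue, green, red
encoded = pd.get_dummies(df["color"])
print(encoded)
```

Row 0 ("red") gets a 1 only in the "red" column, row 1 ("green") only in "green", and so on.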

Common problems

  • Expensive computation and high memory consumption are the major problems with one-hot encoding. Columns with many unique values produce high-dimensional feature vectors. For example, a column with one million unique values produces feature vectors with a dimensionality of one million.

  • One-hot encoding is not suitable for Natural Language Processing tasks. The vocabulary of a natural language is usually very large, so we can’t use one-hot encoding to represent each word: the resulting vectors are too big to store in memory.

Best practices

  • Depending on the application, levels/categories that are not important can be grouped together into an “Other” class.
  • Make sure that the pipeline can handle unseen values in the test set.

In Python, there are many ways to do one-hot encoding, for example, pandas.get_dummies and scikit-learn’s OneHotEncoder. pandas.get_dummies does not “remember” the encoding learned during training, so if the test data contains new values, it can lead to an inconsistent mapping. OneHotEncoder is a scikit-learn transformer, so you can fit it once and apply the same mapping consistently during training and prediction.

One-hot encoding in tech companies

  • It’s not practical to use one-hot encoding to handle large-cardinality features, i.e., features with hundreds or thousands of unique values. Companies like Instacart and DoorDash use more advanced techniques to handle them.

2. Feature hashing

Feature hashing maps high-cardinality categorical features into a fixed-size vector using a hash function.

Why feature hashing is useful

  • Handles thousands or millions of categories

  • Fixed memory footprint

  • No need to store a category dictionary

Feature hashing example

  • First, you choose the dimensionality of your feature vectors. Then, using a hash function, you convert each value of your categorical attribute (or each token in your collection of documents) into a number, and convert that number into an index of your feature vector. The process is illustrated in the diagram below.
An illustration of the hashing trick with a desired dimensionality of 5 for an attribute with K original values
  • Let’s illustrate what it would look like to convert the text “The quick brown fox” into a feature vector. Suppose our hash function returns the following values for each word in the phrase:

    the = 5
    quick = 4
    brown = 4
    fox = 3
    
  • Let’s define a hash function, h, that takes a string as input and outputs a non-negative integer, and let the desired dimensionality be 5. Applying the hash function to each word and taking the result modulo 5 to obtain the word’s index, we get:

    h(the) mod 5 = 0
    h(quick) mod 5 = 4
    h(brown) mod 5 = 4
    h(fox) mod 5 = 3
    
  • In this example:

    • h(the) mod 5 = 0 means that we have one word in dimension 0 of the feature vector.

    • h(quick) mod 5 = 4 and h(brown) mod 5 = 4 means that we have two words in dimension 4 of the feature vector.

    • h(fox) mod 5 = 3 means that we have one word in dimension 3 of the feature vector.

    • As you can see, there are no words in dimensions 1 or 2 of the vector, so we keep them as 0.

  • Finally, we have the feature vector as: [1, 0, 0, 1, 2].

  • As you can see, there is a collision between the words “quick” and “brown”: both are represented by dimension 4. The lower the desired dimensionality, the higher the chance of collision. To reduce the probability of collision, we can increase the number of dimensions. This is the trade-off between speed and quality of learning.
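The steps above can be sketched in Python. Here MD5 (from the standard library) is used as a stand-in deterministic hash function, so the exact indices, and therefore the collision pattern, will differ from the toy values above:

```python
import hashlib

def hash_token(token: str, dim: int) -> int:
    # Deterministic stand-in hash: interpret the MD5 digest as an
    # integer, then take it modulo the desired dimensionality.
    # (Python's built-in hash() is randomized per process, so avoid it.)
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % dim

def hash_features(tokens, dim):
    # Each token increments the count at its hashed index, giving a
    # fixed-size vector regardless of how many distinct tokens exist.
    vec = [0] * dim
    for token in tokens:
        vec[hash_token(token, dim)] += 1
    return vec

tokens = "the quick brown fox".split()
print(hash_features(tokens, 5))  # a length-5 count vector whose entries sum to 4
```

For production use, scikit-learn provides this technique as sklearn.feature_extraction.FeatureHasher (and HashingVectorizer for raw text), which uses MurmurHash3 internally.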

Commonly used hash functions are MurmurHash3, Jenkins, CityHash, and ...