Feature Selection and Feature Engineering
Learn how tech companies like Facebook, Twitter, and Airbnb design their feature selection and feature engineering to serve billions of users.
1. One hot encoding
One hot encoding is a very common technique in feature engineering. It converts categorical variables into a one-hot numeric array.
- One hot encoding is very popular when you have to deal with categorical features that have medium cardinality.
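To make the idea concrete, here is a minimal sketch of one hot encoding a toy column with pandas; the "color" column and its values are invented purely for illustration.

```python
import pandas as pd

# Toy categorical feature; the "color" column and its values are made-up example data.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each unique category becomes its own 0/1 column.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```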
Common problems
- Expensive computation and high memory consumption are major problems with one hot encoding. A high number of unique values creates high-dimensional feature vectors. For example, if there are one million unique values in a column, one hot encoding produces feature vectors with a dimensionality of one million.
- One hot encoding is not suitable for Natural Language Processing tasks. A vocabulary (think of Microsoft Word's dictionary) is usually large, and we can't use one hot encoding to represent each word, because the resulting vectors are too big to store in memory.
Best practices
- Depending on the application, some levels/categories that are not important can be grouped together in an "Other" class.
- Make sure that the pipeline can handle unseen data in the test set.
- In Python, there are many ways to do one hot encoding, for example, pandas.get_dummies and sklearn's OneHotEncoder. pandas.get_dummies does not "remember" the encoding used during training, so if the testing data contains new values, it can lead to an inconsistent mapping. OneHotEncoder is a scikit-learn Transformer; therefore, you can use it consistently during training and predicting, as the sketch below illustrates.
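The following is a minimal sketch of that difference, assuming pandas and scikit-learn are available; the "city" column and its values are made up to show an unseen category appearing at prediction time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test data; "berlin" never appears during training.
train = pd.DataFrame({"city": ["london", "paris", "tokyo"]})
test = pd.DataFrame({"city": ["paris", "berlin"]})

# OneHotEncoder is fit once on the training data, so the same mapping
# is reused at prediction time; unseen categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["city"]])
print(encoder.transform(test[["city"]]).toarray())
# [[0. 1. 0.]   <- "paris" uses the column learned during training
#  [0. 0. 0.]]  <- unseen "berlin" is ignored instead of breaking the pipeline

# pandas.get_dummies re-derives the columns from whatever data it sees,
# so train and test can end up with different, inconsistent columns.
print(pd.get_dummies(train["city"]).columns.tolist())  # ['london', 'paris', 'tokyo']
print(pd.get_dummies(test["city"]).columns.tolist())   # ['berlin', 'paris']
```

In practice, the fitted encoder is usually embedded in a scikit-learn Pipeline so that exactly the same transformation is applied at training and serving time.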
One hot encoding in tech companies
- It’s not practical to use one hot encoding to handle large cardinality features, i.e., features that have hundreds or thousands of unique values. Companies like Instacart and DoorDash use more advanced techniques to handle large cardinality features.
2. Feature hashing
Feature hashing, also known as the hashing trick, converts text data or categorical attributes with high cardinality into a feature vector of arbitrary dimensionality.
Benefits
- Feature hashing is very useful for features that have high cardinality, with hundreds or thousands of unique values. The hashing trick limits the growth in dimensionality and memory by allowing multiple values to be encoded as the same index.
Feature hashing example
- First, you choose the dimensionality of your feature vectors. Then, using a hash function, you convert all values of your categorical attribute (or all tokens in your collection of documents) into a number. Finally, you convert this number into an index of your feature vector. The process is illustrated in the example below.
- Let's illustrate what it would look like to convert the text "The quick brown fox" into a feature vector. First, define a hash function, h, that takes a string as input and outputs a non-negative integer, and suppose it gives the following values for each word in the phrase:
  h(the) = 5, h(quick) = 4, h(brown) = 4, h(fox) = 3
- Let the desired dimensionality be 5. By applying the hash function to each word and taking the result modulo 5 to obtain the index of the word, we get:
  h(the) mod 5 = 0
  h(quick) mod 5 = 4
  h(brown) mod 5 = 4
  h(fox) mod 5 = 3
- In this example:
  - h(the) mod 5 = 0 means that we have one word in dimension 0 of the feature vector.
  - h(quick) mod 5 = 4 and h(brown) mod 5 = 4 mean that we have two words in dimension 4 of the feature vector.
  - h(fox) mod 5 = 3 means that we have one word in dimension 3 of the feature vector.
  - There are no words in dimensions 1 or 2 of the vector, so we keep them as 0.
- Finally, we have the feature vector [1, 0, 0, 1, 2].
- As you can see, there is a collision between the words "quick" and "brown": they are both represented by dimension 4. The lower the desired dimensionality, the higher the chance of collision. To reduce the probability of collision, we can increase the number of dimensions. This is the trade-off between speed and quality of learning.
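The same procedure is easy to sketch in code. The hashing_trick helper below is a hypothetical name introduced here for illustration: it simply applies h(token) mod dim, as in the example above. Python's built-in hash is randomized between runs, so real pipelines use a stable hash such as MurmurHash3, which is what scikit-learn's HashingVectorizer uses under the hood.

```python
from sklearn.feature_extraction.text import HashingVectorizer

def hashing_trick(tokens, dim, h=hash):
    """Bucket each token into index h(token) mod dim and count occurrences.

    Python's built-in hash() is randomized per process, so this is only a sketch;
    production systems use a stable hash such as MurmurHash3.
    """
    vec = [0] * dim
    for token in tokens:
        vec[h(token) % dim] += 1
    return vec

print(hashing_trick("the quick brown fox".split(), dim=5))
# A length-5 count vector, e.g. [1, 0, 0, 1, 2]; the exact indices depend on the hash.

# The same idea with scikit-learn's HashingVectorizer (MurmurHash3 under the hood):
vectorizer = HashingVectorizer(n_features=5, norm=None, alternate_sign=False)
print(vectorizer.transform(["the quick brown fox"]).toarray()[0])
```

Increasing n_features lowers the chance of collisions at the cost of a larger vector, which is exactly the speed-versus-quality trade-off described above.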
Commonly used hash functions are MurmurHash3, Jenkins, CityHash, and MD5. ...