Feature Extraction

In this lesson, let's learn how to extract features from raw text.

What is feature extraction?

Feature extraction is different from feature selection. Feature extraction focuses on converting complex data, such as text or images, into numerical features. Images and text are complex, unstructured data, and traditional Machine Learning algorithms cannot process them directly. Such data must be preprocessed to extract the corresponding features in preparation for downstream tasks. Deep Learning, by contrast, supports end-to-end training; for example, a neural network can process raw JPEG files without any manual feature engineering.

sklearn provides functions for processing both images and text, but in this lesson, we only focus on text.

Text processing is an important field in Machine Learning. However, raw data (a sequence of tokens) cannot be processed directly by models. We need to process the raw data and extract some kind of fixed-size numerical feature vector for the model. We call this general process of converting raw text documents into numerical feature vectors vectorization.

What is sparsity?

Sparsity is a characteristic of natural language. In vectorization, the length of each feature vector is generally the size of the vocabulary of the corpus. If the vocabulary contains ten thousand terms, then the vector length is ten thousand. But a relatively short text uses only a handful of those terms, so only a few entries of its vector are nonzero, and everything else is zero.
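
To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer (introduced in the next section); the corpus is illustrative:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps",
]
X = CountVectorizer().fit_transform(corpus)
# X is stored as a scipy.sparse matrix precisely because
# most of its entries are zero.
print(X.shape)  # (2, vocabulary size)
print(X.nnz)    # number of nonzero entries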

How does CountVectorizer work?

CountVectorizer implements both tokenization and occurrence counting in a single class. It has many useful parameters; let's have a look at the most important ones, followed by a short sketch that exercises a few of them.

  • strip_accents: Remove accents during preprocessing; accepts 'ascii' or 'unicode'.
  • lowercase: Convert all characters to lowercase before tokenizing.
  • preprocessor: A callable that overrides the default preprocessing stage.
  • tokenizer: A callable that overrides the default tokenizer.
  • stop_words: Remove very common words, such as "the", "a", and "and". You can pass a list of words, or pass 'english' to use the built-in list.
  • ngram_range: A tuple. The default is (1, 1), which means unigrams only. If you pass (2, 2), it means only bigrams; (1, 2) means unigrams and bigrams.
  • analyzer: The default value is 'word', which means that features are based on words. If you pass 'char', features are based on characters.
  • max_df: A float between zero and one (a proportion of documents) or an integer (an absolute count). When building the vocabulary, ignore terms with a document frequency strictly higher than the given threshold.
  • min_df: A float between zero and one (a proportion of documents) or an integer (an absolute count). When building the vocabulary, ignore terms with a document frequency strictly lower than the given threshold.
  • max_features: If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
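
The sketch below exercises a few of these parameters; the corpus and parameter values are illustrative, not prescriptive:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The apple is red.",
    "I like the red apple.",
]

vec = CountVectorizer(
    lowercase=True,        # the default: fold case before tokenizing
    stop_words="english",  # drop common words such as "the" and "is"
    ngram_range=(1, 2),    # extract unigrams and bigrams
    max_features=10,       # keep only the 10 most frequent terms
)
X = vec.fit_transform(corpus)
# get_feature_names_out is available in recent scikit-learn versions.
print(vec.get_feature_names_out())  # the learned vocabulary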

Now, let’s see how to use CountVectorizer. As seen in the code below, you need to create a CountVectorizer object with CountVectorizer(), fit it on the corpus, and then transform the corpus. The resulting feature is a matrix with the count of each word in each sample.

from sklearn.feature_extraction.text import CountVectorizer

# corpus is a list of strings in this example.
corpus = [
    "I have an apple.",
    "The apple is red",
    "I like the apple"
]

counterVec = CountVectorizer()
# Learn the vocabulary of the corpus.
counterVec.fit(corpus)
# corpus_data is a sparse matrix of token counts:
# one row per document, one column per vocabulary term.
corpus_data = counterVec.transform(corpus)
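
As a usage note, fitting and transforming on the same corpus can be combined into a single call, which is equivalent to calling fit followed by transform:

# Equivalent shortcut: learn the vocabulary and vectorize in one step.
corpus_data = counterVec.fit_transform(corpus)
print(corpus_data.toarray())  # dense view, suitable for small corpora only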
