Feature Extraction

Explore how feature extraction transforms raw text into numerical feature vectors for machine learning models. Learn how CountVectorizer tokenizes text and counts word occurrences, including unigrams and bigrams, and get an introduction to TF-IDF for improved feature representation.

We'll cover the following...

- Extract bigrams with CountVectorizer

What is feature extraction

Feature extraction is different from feature selection, which chooses a subset of features that already exist. Feature extraction converts complex data, such as text or images, into numerical features. Images and text are complex, structured data that traditional Machine Learning algorithms cannot process directly. Such data must be preprocessed to extract the corresponding features before it is passed to downstream tasks. Deep Learning, by contrast, supports end-to-end training; for example, a neural network can be trained on the decoded pixels of JPEG images without any manual feature engineering.

scikit-learn (sklearn) provides functions for processing both images and text, but this lesson focuses only on text.

Text processing is an important area of Machine Learning. However, raw text (a sequence of tokens) cannot be processed directly by models. We need to process the raw text and extract a fixed-size numerical feature vector for the model. We call the general process of converting the raw text documents into numerical feature ...
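As a rough illustration of turning raw text into fixed-size count vectors, here is a minimal sketch using sklearn's `CountVectorizer`. The two-sentence corpus is hypothetical and invented only for this example; `ngram_range=(1, 2)` is one possible setting that asks for both unigrams and bigrams, and the default tokenizer (lowercasing, tokens of two or more word characters) is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# ngram_range=(1, 2) extracts unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))

# fit_transform learns the vocabulary and returns a sparse
# document-term matrix with one row per document.
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each n-gram to its column index in X.
vocab = vectorizer.vocabulary_
counts = X.toarray()

print(X.shape)                        # (number of documents, vocabulary size)
print(counts[0, vocab["the"]])        # count of the unigram "the" in document 0
print(counts[0, vocab["sat on"]])     # count of the bigram "sat on" in document 0
```

Every document is mapped to a vector of the same length (the vocabulary size), regardless of how many words it contains, which is what makes the result usable as model input.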
