To use text data for predictive modeling, the text must first be split into individual words, or tokens; this process is called tokenization. The tokens then need to be encoded as integers or floating-point values so they can serve as inputs to machine learning algorithms. This process is called feature extraction (or vectorization).
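The two steps can be sketched with the Python standard library alone. This is a minimal illustration, not a library implementation: the helper names `tokenize`, `build_vocabulary`, and `vectorize` are hypothetical, and the choice to lowercase the text and keep only tokens of two or more word characters is an assumption made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(documents):
    """Map each distinct token to an integer index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Encode a document as a list of token counts, one entry per vocabulary index."""
    counts = Counter(tokenize(text))
    ordered_tokens = sorted(vocabulary, key=vocabulary.get)
    return [counts.get(tok, 0) for tok in ordered_tokens]


documents = ["John is a good boy. John watches basketball"]
vocabulary = build_vocabulary(documents)
vector = vectorize(documents[0], vocabulary)

print(vocabulary)  # {'basketball': 0, 'boy': 1, 'good': 2, 'is': 3, 'john': 4, 'watches': 5}
print(vector)      # [1, 1, 1, 1, 2, 1]
```

The single-character word "a" is dropped by the token pattern, and "john" gets a count of 2 because it appears twice in the document.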
Scikit-learn’s CountVectorizer converts a collection of text documents to a matrix of term/token counts, with one row per document and one column per term. It also supports pre-processing the text, such as lowercasing and stop-word removal, before the vector representation is generated. This makes it a flexible feature representation module for text.
The code below shows how to use CountVectorizer in Python.
```python
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["John is a good boy. John watches basketball"]

vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())
```
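As a follow-up sketch, the pre-processing options and the fitted vocabulary can be exercised together. The custom stop-word list `["is"]` and the new sentence `"John plays basketball"` are assumptions chosen for illustration; `lowercase` and `stop_words` are constructor parameters of CountVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["John is a good boy. John watches basketball"]

# lowercase=True is the default; stop_words takes a list of words to ignore
# (the list ["is"] is an illustrative assumption)
vectorizer = CountVectorizer(lowercase=True, stop_words=["is"])
vectorizer.fit(text)

# list the learned terms in column order
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)  # ['basketball', 'boy', 'good', 'john', 'watches']

# encode a document the vectorizer has not seen before
new_doc = ["John plays basketball"]
new_vector = vectorizer.transform(new_doc).toarray()
print(new_vector)  # [[1 0 0 1 0]]
```

Words that were not seen during fitting, such as "plays", are ignored at transform time, so the encoded vector always has the same length as the fitted vocabulary.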