To use textual data for predictive modeling, the text must first be split into individual words or tokens. This process is called tokenization. The tokens then need to be encoded as integers or floating-point values so they can serve as inputs to machine learning algorithms. This process is called feature extraction (or vectorization).
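To make the two steps concrete, here is a minimal pure-Python sketch of tokenization followed by count-based vectorization. The helper names (tokenize, build_vocabulary, vectorize) are hypothetical, and the tokenization rule (lowercase, keep runs of two or more word characters) is an assumption chosen for illustration, not a definitive implementation.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed rule: lowercase the text and keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(documents):
    # Map each distinct token to an integer index, in alphabetical order
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(document, vocabulary):
    # Count how often each vocabulary token occurs in the document
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for tok, idx in vocabulary.items():
        vector[idx] = counts.get(tok, 0)
    return vector


docs = ["John is a good boy. John watches basketball"]
vocab = build_vocabulary(docs)
vector = vectorize(docs[0], vocab)

print(vocab)
print(vector)
```

Each position in the resulting list corresponds to one vocabulary entry, so "john" receives a count of 2 while every other token receives 1.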
Scikit-learn’s CountVectorizer converts a collection of text documents into a matrix of token counts, with one row per document and one column per vocabulary term. It can also pre-process the text, for example by lowercasing it or removing stop words, before generating the vector representation. This makes it a flexible feature representation module for text.
The code below shows how to use CountVectorizer in Python.
    from sklearn.feature_extraction.text import CountVectorizer

    # list of text documents
    text = ["John is a good boy. John watches basketball"]

    vectorizer = CountVectorizer()

    # tokenize and build vocab
    vectorizer.fit(text)
    print(vectorizer.vocabulary_)

    # encode document
    vector = vectorizer.transform(text)

    # summarize encoded vector
    print(vector.shape)
    print(vector.toarray())
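The pre-processing mentioned earlier is controlled through constructor arguments. The sketch below reuses the same sentence and assumes two common settings: stop_words="english", which drops words from scikit-learn's built-in English stop word list, and ngram_range=(1, 2), which counts adjacent word pairs as well as single words. The specific settings are illustrative choices, not a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# same example document as above
text = ["John is a good boy. John watches basketball"]

# stop_words="english": remove common English words such as "is"
# ngram_range=(1, 2): count single tokens and pairs of adjacent tokens
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))

# fit_transform learns the vocabulary and encodes the documents in one call
counts = vectorizer.fit_transform(text)
vocab = vectorizer.vocabulary_

# list the learned terms alphabetically, then show the count matrix
print(sorted(vocab))
print(counts.toarray())
```

Because the stop words are removed before the n-grams are built, word pairs are formed from the tokens that remain, so the vocabulary no longer contains "is" but does contain pairs such as "watches basketball".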