CountVectorizer in Python

To use textual data for predictive modeling, the text must first be split into individual words (tokens); this process is called tokenization. These tokens then need to be encoded as integers or floating-point values for use as inputs to machine learning algorithms. This process is called feature extraction (or vectorization).
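
To make the two steps concrete, here is a minimal sketch that tokenizes and vectorizes one sentence by hand, using only the standard library. The regular expression is an assumption chosen to mirror CountVectorizer's default token pattern (tokens of two or more word characters), and the variable names are illustrative only.

```python
import re
from collections import Counter

doc = "John is a good boy. John watches basketball"

# Tokenization: lowercase the text and split it into word tokens.
# Tokens shorter than two characters (such as "a") are dropped.
tokens = re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: map each distinct token to an integer column index.
vocabulary = {token: index for index, token in enumerate(sorted(set(tokens)))}

# Vectorization: count how often each vocabulary term occurs,
# ordered by the term's column index.
counts = Counter(tokens)
vector = [counts[token] for token in sorted(vocabulary, key=vocabulary.get)]

print(tokens)
print(vocabulary)
print(vector)  # [1, 1, 1, 1, 2, 1]
```

The count of 2 belongs to "john", which appears twice in the sentence.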

Scikit-learn’s CountVectorizer converts a collection of text documents into a matrix of term/token counts, with one row per document and one column per vocabulary term. It also supports pre-processing the text (for example, lowercasing and stop-word removal) before generating the vector representation. This makes it a flexible feature representation module for text.
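
As a rough illustration of the pre-processing options, the sketch below passes a few constructor parameters to CountVectorizer. The two sample documents and the one-word stop list are assumptions made for this example, not recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John is a good boy.", "John watches basketball"]

# lowercase=True is the default and is spelled out here for clarity.
# stop_words drops the listed tokens before counting.
# ngram_range=(1, 2) counts single words and adjacent word pairs.
vectorizer = CountVectorizer(lowercase=True, stop_words=["is"], ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# The vocabulary keys are the terms that became columns.
terms = sorted(vectorizer.vocabulary_)
print(terms)
print(matrix.shape)
print(matrix.toarray())
```

Each row of the matrix corresponds to one document, and each column to one word or word pair that survived pre-processing.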

Code

The code below shows how to use CountVectorizer in Python.

from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["John is a good boy. John watches basketball"]
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
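
The printed array is easier to read when each count is paired with its term. The following sketch, which assumes the same one-sentence input as above, lines the terms up with their counts and then encodes a new sentence to show that words not seen during fitting are ignored. The second sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["John is a good boy. John watches basketball"]
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(text)

# Order the terms by column index so they line up with the counts.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = dict(zip(terms, vector.toarray()[0].tolist()))
print(counts)

# Words that were not seen during fit ("plays") are ignored by transform.
new_vector = vectorizer.transform(["John plays basketball"])
print(new_vector.toarray())  # [[1 0 0 0 1 0]]
```

In the second output, only the columns for "basketball" and "john" are non-zero, because the vocabulary is fixed once fit has been called.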

Copyright ©2024 Educative, Inc. All rights reserved