What is the bag-of-words model?

Machine learning algorithms require well-defined, fixed-length inputs and outputs, so text must be converted into a number or a vector of numbers before it can be modeled. The difficulty is that raw text is messy by nature: it is not numeric, and documents vary in length.

What is a bag-of-words?

A bag-of-words model, or BoW for short, is a way of extracting features from text so that the text can be used in modeling, for example with machine learning algorithms. It is a simple and flexible approach to feature extraction from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It requires two things:

A vocabulary of known words
A measure of the presence of known words

The model is called a “bag” of words because it’s only concerned with whether or not specific words occur in the document, not where they occur. Any information about the order or structure of the words is discarded.

Example model

Let’s make the bag-of-words model using sentences as documents.

Step 1: Collect data

Below is a snippet of the first few lines of text from the book “A Tale of Two Cities” by Charles Dickens:

It was the best of times.
It was the worst of times.
It was the age of wisdom.
It was the age of foolishness.

For this example, let’s treat each line as a separate document.

Step 2: Design the vocabulary

Now we can make a list of all of the unique words in our model vocabulary, ignoring case and punctuation:

“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”

Among the 24 words in 4 sentences, there are 10 unique words.
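The vocabulary step can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the helper names `tokenize` and `build_vocabulary` are hypothetical, and the sketch assumes that stripping ASCII punctuation and splitting on whitespace is enough for this small example.

```python
import string

# The four example documents from Step 1.
documents = [
    "It was the best of times.",
    "It was the worst of times.",
    "It was the age of wisdom.",
    "It was the age of foolishness.",
]


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_vocabulary(docs):
    """Return the unique words across all documents, in order of first appearance."""
    vocabulary = []
    seen = set()
    for doc in docs:
        for word in tokenize(doc):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


vocabulary = build_vocabulary(documents)
total_words = sum(len(tokenize(doc)) for doc in documents)

print(total_words)      # 24
print(len(vocabulary))  # 10
print(vocabulary)
```

Keeping the words in order of first appearance is a choice made here so that the result matches the list above; any fixed ordering works, as long as the same one is used for every document.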

Step 3: Create document vectors

Next, we score the words in each document. Our vocabulary has 10 words, so each document becomes a fixed-length vector of 10 entries, one per vocabulary word. The simplest scoring method marks each word with a boolean value: 0 for absent and 1 for present.

Using this notation, the following sentence,

“It was the best of times”

can be converted into a binary vector.

The scoring of the document would look like:

“it” = 1
“was” = 1
“the” = 1
“best” = 1
“of” = 1
“times” = 1
“worst” = 0
“age” = 0
“wisdom” = 0
“foolishness” = 0

As a binary vector, this would look like:

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
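The scoring above can be sketched in Python as follows. The function name `binary_vector` and the simple tokenizer are assumptions made for illustration; the vocabulary is written out by hand in the same order as the list in Step 2.

```python
import string

# Vocabulary from Step 2, in the same order as listed there.
vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def binary_vector(document, vocabulary):
    """Score each vocabulary word as 1 if it occurs in the document, else 0."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]


print(binary_vector("It was the best of times", vocabulary))
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The same function can be applied to any other document. The result always has one entry per vocabulary word, so every document maps to a vector of the same length.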

The other three documents would look like:

"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

This method gives every document a fixed-length numeric representation (disregarding the words’ actual ordering), so the text can be passed as input to machine learning models.
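To illustrate that word order is disregarded, the short sketch below scores the first example sentence and a scrambled copy of it. The `binary_vector` helper is a hypothetical name, and the scrambled sentence is made up for this demonstration.

```python
# Vocabulary from Step 2, in the same order as listed there.
vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]


def binary_vector(document, vocabulary):
    """Score each vocabulary word as 1 if it occurs in the document, else 0."""
    present = set(document.lower().split())
    return [int(word in present) for word in vocabulary]


original = binary_vector("it was the best of times", vocabulary)
scrambled = binary_vector("times of best the was it", vocabulary)

print(original)               # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(original == scrambled)  # True
```

Both sentences contain the same set of words, so they map to the same vector even though the second one is not meaningful English.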
