Vocabulary

Get familiar with what "vocabulary" means in NLP tasks.

Chapter Goals:

  • Learn about the text corpus and vocabulary in NLP tasks
  • Create a function that tokenizes a text corpus

A. Corpus vocabulary

In the context of NLP tasks, the text corpus refers to the set of texts used for the task. For example, if we were building a model to analyze news articles, our text corpus would be the entire set of articles or papers we used to train and evaluate the model.

The set of unique words used in the text corpus is referred to as the vocabulary. When processing raw text for NLP, nearly every step revolves around this vocabulary.

print(text_corpus) # a list of different texts (sentences)
print(vocabulary) # a list of the unique words that make up those texts
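
To make this concrete, here is a small sketch (using a toy corpus, the same one that appears later in this chapter) of how a vocabulary could be collected by hand: split each text into words, strip the punctuation, and keep only the unique words.

text_corpus = ['bob ate apples, and pears', 'fred ate apples!']

vocabulary = set()
for text in text_corpus:
    # strip the punctuation used in this toy corpus, then split on whitespace
    for word in text.lower().replace(',', '').replace('!', '').split():
        vocabulary.add(word)

print(vocabulary) # the unique words: 'bob', 'ate', 'apples', 'and', 'pears', 'fred'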

In addition to using the words of a text corpus as the vocabulary, you could also use a character-based vocabulary. This would consist of each unique character in the text corpus (e.g. each letter). In this course, we’ll be focusing on word-based vocabularies, which are much more common than their character-based counterparts.

B. Tokenization

We can use the vocabulary to find the number of times each word appears in the corpus, figure out which words are the most common or uncommon, and filter each text document based on the words that appear in it. However, the most important part of the vocabulary is that it allows us to represent each piece of text by the specific words that appear in it.

Rather than being represented as one long string, a piece of text can be represented as a vector/list of its vocabulary words. This process is known as tokenization, where each individual vocabulary word in a piece of text is a token.

Below we show an example of tokenization on a text corpus.

print(text_corpus) # a list of texts
print(processed_corpus) # the texts broken down into lists of vocabulary words

In the example above, the punctuation is filtered out of the text corpus. While filtering out punctuation is standard practice, in some cases (e.g. generating long text) it may be necessary to keep punctuation in the vocabulary. It is a good idea to understand the NLP task you are going to perform before filtering out any part of the data.
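
As a rough sketch of this kind of processing (a hand-rolled example, not the method used later in this chapter), a text corpus can be tokenized by stripping punctuation, lowercasing, and splitting on whitespace:

import re

def tokenize(text):
    # remove punctuation, lowercase, and split on whitespace
    return re.sub(r'[^\w\s]', '', text.lower()).split()

text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
processed_corpus = [tokenize(text) for text in text_corpus]
print(processed_corpus)
# [['bob', 'ate', 'apples', 'and', 'pears'], ['fred', 'ate', 'apples']]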

C. Tokenizer object

Using TensorFlow, we can convert a text corpus into tokenized sequences using the Tokenizer object. The Tokenizer class is part of the tf.keras submodule, which is TensorFlow’s implementation of Keras, a high-level API for machine learning.

The Tokenizer object contains the functions fit_on_texts and texts_to_sequences, which are used to initialize the object with a text corpus and convert pieces of text into sequences of tokens, respectively.

import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer()
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)
new_texts = ['bob ate pears', 'fred ate pears']
print(tokenizer.texts_to_sequences(new_texts))
print(tokenizer.word_index)
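# With this corpus, 'ate' and 'apples' each appear twice, so they receive the
# smallest IDs; the printed output should look roughly like:
# [[3, 1, 5], [6, 1, 5]]
# {'ate': 1, 'apples': 2, 'bob': 3, 'and': 4, 'pears': 5, 'fred': 6}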

The Tokenizer automatically converts each vocabulary word to an integer ID (IDs are given to words by descending frequency). This allows the tokenized sequences to be used in NLP algorithms (which work on vectors of numbers). In the above example, the texts_to_sequences function converts each vocabulary word in new_texts to its corresponding integer ID.

D. Tokenizer parameters

The Tokenizer object can be initialized with a number of optional parameters. By default, the Tokenizer filters out punctuation characters (along with tabs and newlines) and splits each text on whitespace. You can specify custom filtering with the filters parameter. The parameter takes in a string, where each character in the string is filtered out.
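
For example, the sketch below (reusing the toy corpus from above) initializes a Tokenizer whose filters string removes only the characters '!' and '.'. Since the comma is no longer filtered, 'apples,' and 'apples' end up as two different vocabulary words.

import tensorflow as tf
# custom filters: only '!' and '.' are stripped from the texts
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!.')
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)
# the comma is kept, so 'apples,' and 'apples' are separate vocabulary words
print(tokenizer.word_index)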

When a new text contains words not in the corpus vocabulary, those words are known as out-of-vocabulary (OOV) words. The texts_to_sequences function automatically filters out all OOV words. However, if we want to represent each OOV word with a special vocabulary token (e.g. 'OOV'), we can initialize the Tokenizer with the oov_token parameter.

import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='OOV')
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)
print(tokenizer.texts_to_sequences(['bob ate bacon']))
print(tokenizer.word_index)
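
In the example above, 'bacon' does not appear in the text corpus, so it is an OOV word. Because the Tokenizer was initialized with an oov_token, texts_to_sequences maps 'bacon' to the ID of the 'OOV' token instead of dropping it. The OOV token is placed at the front of the word index (it receives ID 1), so the regular vocabulary words start at ID 2.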

The num_words parameter lets us specify the maximum number of vocabulary words to use. For example, if we set num_words=100 when initializing the Tokenizer, it will only use the most frequent words when converting texts to sequences (specifically those with IDs below 100, i.e. the 99 most frequent, since word IDs start at 1) and filter out the remaining vocabulary words. This can be useful when the text corpus is large and you need to limit the vocabulary size to increase training speed or prevent overfitting on infrequent words.

import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=2)
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)
# 'ate' and 'apples' are the two most frequent words, but with num_words=2
# only words with an ID below 2 are kept, i.e. just 'ate'
# for the sentence 'bob ate pears', only 'ate' will be kept
# since 'ate' maps to an integer ID of 1, the only value
# in the token sequence will be 1
print(tokenizer.texts_to_sequences(['bob ate pears']))
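
Running this example prints [[1]]. Note that if the Tokenizer had also been initialized with an oov_token, the words filtered out by num_words would be replaced with the OOV token's ID rather than dropped.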

Time to Code!

The code for this section of the course involves building up an embedding model. Specifically, you will be building out the EmbeddingModel object. In this chapter, you’ll be completing the tokenize_text_corpus function.

You’ll notice that in the model initialization, the Tokenizer object is already created, with its maximum vocabulary size set to vocab_size. However, the Tokenizer object has not yet been initialized with a text corpus.

In the tokenize_text_corpus function, we’ll first initialize the Tokenizer with the text corpus, texts.

Call self.tokenizer.fit_on_texts on texts.

After initializing the Tokenizer with the text corpus, we can use it to convert the text corpus into tokenized sequences.

Set sequences equal to self.tokenizer.texts_to_sequences applied to texts. Then return sequences.

import tensorflow as tf

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Convert a list of text strings into word sequences
    def tokenize_text_corpus(self, texts):
        # CODE HERE
        pass
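
For reference, here is one possible completion of tokenize_text_corpus, following the two steps described above (the example usage at the end, with its toy corpus and arbitrary embedding_dim, is only an illustration):

import tensorflow as tf

# Skip-gram embedding model (completed sketch)
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Convert a list of text strings into word sequences
    def tokenize_text_corpus(self, texts):
        # initialize the Tokenizer's vocabulary with the text corpus
        self.tokenizer.fit_on_texts(texts)
        # convert each text into a sequence of integer word IDs
        sequences = self.tokenizer.texts_to_sequences(texts)
        return sequences

# Example usage with a toy corpus
model = EmbeddingModel(vocab_size=100, embedding_dim=10)
print(model.tokenize_text_corpus(['bob ate apples, and pears', 'fred ate apples!']))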