Chapter Goals:

Learn about the text corpus and vocabulary in NLP tasks
Create a function that tokenizes a text corpus

A. Corpus vocabulary

In the context of NLP tasks, the text corpus refers to the set of texts used for the task. For example, if we were building a model to analyze news articles, our text corpus would be the entire set of articles or papers we used to train and evaluate the model.

The set of unique words used in the text corpus is referred to as the vocabulary. When processing raw text for NLP, nearly every step is built around the vocabulary.
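As a rough illustration, the vocabulary of a small corpus can be built in plain Python. The sketch below assumes a naive approach that lowercases each text and splits on whitespace; the toy corpus and the helper name build_vocabulary are made up for this example.

```python
# Hypothetical toy corpus, used only for illustration
text_corpus = ['bob ate apples', 'fred ate pears']

def build_vocabulary(texts):
    """Return the set of unique words across all texts (naive whitespace split)."""
    vocabulary = set()
    for text in texts:
        vocabulary.update(text.lower().split())
    return vocabulary

vocabulary = build_vocabulary(text_corpus)
print(sorted(vocabulary))
```

Note that 'ate' appears in both texts but only once in the vocabulary.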

B. Tokenization

We can use the vocabulary to find the number of times each word appears in the corpus, figure out which words are the most common or uncommon, and filter each text document based on the words that appear in it. However, the most important part of the vocabulary is that it allows us to represent each piece of text by the specific words that appear in it.
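The counting and filtering uses mentioned above can be sketched with Python's collections.Counter. This is only an illustration under the same naive whitespace-splitting assumption; the corpus and the helper name count_words are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration
text_corpus = ['bob ate apples', 'fred ate pears']

def count_words(texts):
    """Count how many times each word appears across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

word_counts = count_words(text_corpus)

# The single most frequent word and its count
print(word_counts.most_common(1))

# Keep only the documents that contain a given word
docs_with_apples = [t for t in text_corpus if 'apples' in t.lower().split()]
print(docs_with_apples)
```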

Rather than being represented as one long string, a piece of text can be represented as a vector/list of its vocabulary words. This process is known as tokenization, where each individual vocabulary word in a piece of text is a token.

Below we show an example of tokenization on a text corpus.
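A minimal sketch of the idea in plain Python, assuming that punctuation is stripped and words are separated by whitespace. This is a simplification for illustration, not the Keras Tokenizer itself, and the helper name tokenize is hypothetical.

```python
import string

text_corpus = ['bob ate apples, and pears', 'fred ate apples!']

def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

# Each text becomes a list of word tokens instead of one long string
tokenized_corpus = [tokenize(text) for text in text_corpus]
print(tokenized_corpus)
```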

D. Tokenizer parameters

The Tokenizer object (tf.keras.preprocessing.text.Tokenizer) can be initialized with a number of optional parameters. By default, the Tokenizer filters out punctuation and whitespace. You can specify custom filtering with the filters parameter, which takes a string; each character in that string is filtered out of the text.

When a new text contains words not in the corpus vocabulary, those words are known as out-of-vocabulary (OOV) words. The texts_to_sequences function automatically filters out all OOV words. However, if we want to represent each OOV word with a special vocabulary token (e.g. 'OOV'), we can initialize the Tokenizer with the oov_token parameter.
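To make the effect of an OOV token concrete, here is a plain-Python sketch that mimics the behavior described above. It is not the Keras implementation: the helper name texts_to_ids, the word_index mapping, and the choice of ID 1 for the OOV token are all assumptions of this example.

```python
def texts_to_ids(texts, word_index, oov_token=None):
    """Map each text to a list of integer IDs.

    Words missing from word_index are dropped, unless oov_token is given,
    in which case they are replaced by the ID of oov_token.
    """
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            if word in word_index:
                ids.append(word_index[word])
            elif oov_token is not None:
                ids.append(word_index[oov_token])
        sequences.append(ids)
    return sequences

# Hypothetical vocabulary; 'OOV' is given its own ID
word_index = {'OOV': 1, 'ate': 2, 'apples': 3, 'bob': 4}

# 'bananas' is not in the vocabulary
dropped = texts_to_ids(['bob ate bananas'], word_index)
replaced = texts_to_ids(['bob ate bananas'], word_index, oov_token='OOV')
print(dropped)   # the OOV word is filtered out
print(replaced)  # the OOV word is replaced by the OOV token's ID
```

Replacing OOV words rather than dropping them keeps the sequence the same length as the original sentence, which can matter when word positions carry meaning.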

Python 3.5

import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=2)
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)
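# num_words=2 keeps only word IDs below 2, i.e. the single most common word
# 'ate' and 'apples' both appear twice, but 'ate' is seen first and gets ID 1
# the tokenizer will filter out all other words, including 'apples' (ID 2)
# for the sentence 'bob ate pears', only 'ate' will be kept
# since 'ate' maps to an integer ID of 1, the only value
# in the token sequence will be 1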
print(tokenizer.texts_to_sequences(['bob ate pears']))

Time to Code!

The code for this section of the course involves building up an embedding model. Specifically, you will be building out the EmbeddingModel object. In this chapter, you’ll be completing the tokenize_text_corpus function.

You’ll notice that in the model initialization, the Tokenizer object is already set, with its maximum vocabulary size set to vocab_size. However, the Tokenizer object has not yet been initialized with a text corpus.

In the tokenize_text_corpus function, we’ll first initialize the Tokenizer with the text corpus, texts.

Call self.tokenizer.fit_on_texts on texts.

After initializing the Tokenizer with the text corpus, we can use it to convert the text corpus into tokenized sequences.

Set sequences equal to self.tokenizer.texts_to_sequences applied to texts. Then return sequences.

Python 3.5

import tensorflow as tf
# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)
    # Convert a list of text strings into word sequences
    def tokenize_text_corpus(self, texts):
        # CODE HERE
        pass
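One possible completion, following the two steps described above. So that the sketch runs without TensorFlow installed, it uses a tiny stand-in tokenizer (SimpleTokenizer, hypothetical) that exposes the same two method names; in the course code, self.tokenizer is the Keras Tokenizer created in __init__, and only the body of tokenize_text_corpus carries over.

```python
class SimpleTokenizer:
    """Hypothetical stand-in exposing fit_on_texts and texts_to_sequences."""

    def __init__(self):
        self.word_index = {}

    def fit_on_texts(self, texts):
        # Assign IDs in order of first appearance (a simplification:
        # this stand-in does not rank words by frequency)
        for text in texts:
            for word in text.lower().split():
                self.word_index.setdefault(word, len(self.word_index) + 1)

    def texts_to_sequences(self, texts):
        # Words that were never seen during fitting are dropped
        return [[self.word_index[w] for w in text.lower().split()
                 if w in self.word_index] for text in texts]


class EmbeddingModelSketch:
    """Illustrative variant of EmbeddingModel that takes its tokenizer as an argument."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    # Convert a list of text strings into word sequences
    def tokenize_text_corpus(self, texts):
        self.tokenizer.fit_on_texts(texts)                    # build the vocabulary
        sequences = self.tokenizer.texts_to_sequences(texts)  # words -> integer IDs
        return sequences


model = EmbeddingModelSketch(SimpleTokenizer())
sequences = model.tokenize_text_corpus(['bob ate apples', 'fred ate pears'])
print(sequences)
```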
