Subword Tokenization Algorithms

Subword tokenization is widely used in many state-of-the-art natural language models, including BERT and GPT-3, and it is very effective at handling out-of-vocabulary (OOV) words. In this section, we will learn how subword tokenization works in detail. Before diving into subword tokenization, let's first take a look at word-level tokenization.
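To see why OOV words are a problem in the first place, consider a minimal sketch of word-level tokenization. The vocabulary and the sentence below are hypothetical, chosen only for illustration:

# A toy word-level vocabulary (hypothetical, for illustration only)
word_vocab = {"the", "boy", "is", "playing"}

sentence = "the boy is pretraining"
for word in sentence.split():
    if word in word_vocab:
        print(word)        # known word: kept as-is
    else:
        print("<unk>")     # OOV word: a word-level tokenizer can only emit an unknown token

A subword tokenizer, by contrast, would break the unseen word into smaller known pieces, for example splitting "pretraining" into something like "pre", "##train", "##ing" in a WordPiece-style scheme, so no information is thrown away.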

Word-level tokenization

Let's suppose we have a training dataset. From this dataset, we first build a vocabulary.

Building the vocabulary

To build the vocabulary, we split the text in the training dataset on whitespace and add all the unique words to the vocabulary. In practice, a vocabulary consists of many thousands of words (tokens), but for the sake of this example, let's suppose our vocabulary holds just a handful of words, as in the sketch below:
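The following minimal sketch implements exactly this procedure. The training text here is a hypothetical stand-in, not the dataset used in this section, so the resulting vocabulary is purely illustrative:

# Build a vocabulary by splitting the training text on whitespace
# and collecting the unique words (training text is hypothetical)
training_text = "I played the game and he played the game too"

vocabulary = set()
for word in training_text.split():
    vocabulary.add(word)

print(sorted(vocabulary))
# ['I', 'and', 'game', 'he', 'played', 'the', 'too']

A real vocabulary would be built the same way, just over the entire training corpus instead of a single sentence.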
