Subword Tokenization Algorithms
Learn about word-level tokenization and subword tokenization.
Subword tokenization is widely used in many state-of-the-art natural language processing models, including BERT and GPT-3, and it is very effective at handling out-of-vocabulary (OOV) words. In this section, we will understand how subword tokenization works in detail. Before diving into subword tokenization, let's first take a look at word-level tokenization.
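To see why this matters, here is a quick preview using the Hugging Face transformers library (a minimal sketch; the input word pretraining is just an illustrative choice). BERT's WordPiece tokenizer can break a word that is not in its vocabulary into known subword pieces instead of discarding it:

```python
from transformers import BertTokenizer

# Load BERT's pretrained WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A rare or unseen word is split into known subword units;
# continuation pieces are marked with the ## prefix.
print(tokenizer.tokenize("pretraining"))
# e.g. ['pre', '##train', '##ing']
```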
Word-level tokenization
Let's suppose we have a training dataset. From this training set, we build a vocabulary.
Building the vocabulary
To build the vocabulary, we split the text in the training set by white space and add every unique word to the vocabulary. Generally, the vocabulary consists of many words (tokens), but just for the sake of an example, let's suppose our vocabulary consists of only a handful of words, as in the sketch below.
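Here is a minimal Python sketch of this step. The toy training sentences are hypothetical, chosen purely for illustration:

```python
# Minimal sketch: building a word-level vocabulary from a toy training set.
# The sentences below are hypothetical examples chosen for illustration.
training_set = [
    "I enjoyed the game",
    "I played the game and walked home",
]

vocabulary = set()
for sentence in training_set:
    # Split on white space and keep every unique word
    for word in sentence.split():
        vocabulary.add(word)

print(sorted(vocabulary))
# ['I', 'and', 'enjoyed', 'game', 'home', 'played', 'the', 'walked']
```

With word-level tokenization, any word not in this set becomes an out-of-vocabulary token, which is exactly the limitation subword tokenization addresses.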