Byte Pair Encoding
Explore how Byte Pair Encoding (BPE) creates vocabularies by merging frequent symbol pairs. Learn the step-by-step process of building a vocabulary and how BPE tokenizes words into subwords, handling rare and unseen words efficiently to improve NLP model performance.
Subword tokenization algorithms
Let's learn about a few interesting subword tokenization algorithms that are used to create a vocabulary. Once the vocabulary is created, we can use it for tokenization. We'll go over the following three popular subword tokenization algorithms:
Byte pair encoding
Byte-level byte pair encoding
WordPiece
Byte pair encoding
Let's understand how Byte Pair Encoding (BPE) works with the help of an example. Suppose we have a dataset. First, we extract all the words from the dataset along with their counts. Say the extracted words and their counts are (cost, 2), (best, 2), (menu, 1), (men, 1), and (camel, 1).
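To make this step concrete, here is a minimal sketch of the word-counting step in Python. The dataset variable is a hypothetical toy corpus, chosen so the counts come out matching the example:

```python
from collections import Counter

# A hypothetical toy dataset chosen so the counts match the example
dataset = ["cost best", "cost best", "menu men camel"]

# Extract every word from the dataset along with its count
word_counts = Counter(word for line in dataset for word in line.split())
print(word_counts)
# Counter({'cost': 2, 'best': 2, 'menu': 1, 'men': 1, 'camel': 1})
```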
Splitting the words into characters
Now, we split all the words into characters and create a character sequence. The following table shows each character sequence along with its word count:

Character sequence    Count
c o s t               2
b e s t               2
m e n u               1
m e n                 1
c a m e l             1
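Continuing the sketch, splitting each word into a character sequence is a one-liner; word_counts here is hard-coded to the counts from the previous step:

```python
# Word counts from the previous step
word_counts = {"cost": 2, "best": 2, "menu": 1, "men": 1, "camel": 1}

# Split each word into a space-separated character sequence,
# keeping its word count alongside it
char_seqs = {" ".join(word): count for word, count in word_counts.items()}
print(char_seqs)
# {'c o s t': 2, 'b e s t': 2, 'm e n u': 1, 'm e n': 1, 'c a m e l': 1}
```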
Defining vocabulary size
Next, we define the vocabulary size. Let's suppose we set it to 14, which means the final vocabulary will contain 14 tokens. Now, let's understand how to create the vocabulary using BPE.
Creating the vocabulary
First, we add all the unique characters present in the character sequences to the vocabulary. In our example, those characters are a, b, c, e, l, m, n, o, s, t, and u, giving 11 initial tokens. Then, we repeatedly find the most frequent pair of symbols, merge it, and add the merged symbol to the vocabulary as a new token, stopping once the vocabulary reaches the target size of 14, as sketched below.
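The following is a minimal, self-contained sketch of this vocabulary-building loop, in the spirit of the original BPE algorithm; the function names get_pair_counts and merge_pair are illustrative, and ties between equally frequent pairs are broken arbitrarily here:

```python
import re
from collections import Counter

# Character sequences with word counts, as in the table above
char_seqs = {"c o s t": 2, "b e s t": 2, "m e n u": 1, "m e n": 1, "c a m e l": 1}

def get_pair_counts(seqs):
    """Count adjacent symbol pairs across all sequences, weighted by word count."""
    pairs = Counter()
    for seq, count in seqs.items():
        symbols = seq.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    return pairs

def merge_pair(pair, seqs):
    """Replace every occurrence of the symbol pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), seq): count for seq, count in seqs.items()}

# Step 1: the vocabulary starts with all unique characters (11 tokens here)
vocab = {ch for seq in char_seqs for ch in seq.split()}
vocab_size = 14

# Steps 2..n: merge the most frequent symbol pair until the vocabulary is full
while len(vocab) < vocab_size:
    pairs = get_pair_counts(char_seqs)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    char_seqs = merge_pair(best, char_seqs)
    vocab.add("".join(best))
    print(best, "->", "".join(best))
```

With 11 unique characters and a target size of 14, the loop performs three merges. The first two merge s and t into st (the pair occurs 4 times across the weighted sequences) and m and e into me (3 times); the third merge depends on how the tie among the remaining pairs is broken.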