Byte Pair Encoding
Explore how Byte Pair Encoding (BPE) creates vocabularies by merging frequent symbol pairs. Learn the step-by-step process of building a vocabulary and how BPE tokenizes words into subwords, handling rare and unseen words efficiently to improve NLP model performance.
Subword tokenization algorithms
Let's learn about a few interesting subword tokenization algorithms that are used to create a vocabulary. Once the vocabulary is created, we can use it for tokenization. We'll go over the following three popular subword tokenization algorithms:
Byte pair encoding
Byte-level byte pair encoding
WordPiece
Byte pair encoding
Let's understand how Byte Pair Encoding (BPE) works with the help of an example. Suppose we have a dataset. First, we extract all the words from the dataset along with their counts. Say the extracted words and their counts are (cost, 2), (best, 2), (menu, 1), (men, 1), and (camel, 1).
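To make this step concrete, here is a minimal sketch of the word-counting step in Python. The dataset variable is a hypothetical toy corpus, chosen so the counts come out matching the example:

```python
from collections import Counter

# A hypothetical toy dataset chosen so the counts match the example
dataset = ["cost best", "cost best", "menu men camel"]

# Extract every word from the dataset along with its count
word_counts = Counter(word for line in dataset for word in line.split())
print(word_counts)
# Counter({'cost': 2, 'best': 2, 'menu': 1, 'men': 1, 'camel': 1})
```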
Splitting the words into characters
Now, we split all the words into characters and create a character sequence. The following table shows each character sequence along with its word count:

Character sequence    Count
c o s t               2
b e s t               2
m e n u               1
m e n                 1
c a m e l             1
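Continuing the sketch, splitting each word into a character sequence is a one-liner; word_counts here is hard-coded to the counts from the previous step:

```python
# Word counts from the previous step
word_counts = {"cost": 2, "best": 2, "menu": 1, "men": 1, "camel": 1}

# Split each word into a space-separated character sequence,
# keeping its word count alongside it
char_seqs = {" ".join(word): count for word, count in word_counts.items()}
print(char_seqs)
# {'c o s t': 2, 'b e s t': 2, 'm e n u': 1, 'm e n': 1, 'c a m e l': 1}
```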
Defining vocabulary size
Next, we define the vocabulary size. Let's suppose we set it to 14, which means the final vocabulary will contain 14 tokens. Now, let's understand how to create the vocabulary using BPE.
Creating the vocabulary
First, we add all the unique characters present in the character sequences to the vocabulary. In our example, those characters are a, b, c, e, l, m, n, o, s, t, and u, giving 11 initial tokens. Then, we repeatedly find the most frequent pair of symbols, merge it, and add the merged symbol to the vocabulary as a new token, stopping once the vocabulary reaches the target size of 14, as sketched below.
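The following is a minimal, self-contained sketch of this vocabulary-building loop, in the spirit of the original BPE algorithm; the function names get_pair_counts and merge_pair are illustrative, and ties between equally frequent pairs are broken arbitrarily here:

```python
import re
from collections import Counter

# Character sequences with word counts, as in the table above
char_seqs = {"c o s t": 2, "b e s t": 2, "m e n u": 1, "m e n": 1, "c a m e l": 1}

def get_pair_counts(seqs):
    """Count adjacent symbol pairs across all sequences, weighted by word count."""
    pairs = Counter()
    for seq, count in seqs.items():
        symbols = seq.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    return pairs

def merge_pair(pair, seqs):
    """Replace every occurrence of the symbol pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), seq): count for seq, count in seqs.items()}

# Step 1: the vocabulary starts with all unique characters (11 tokens here)
vocab = {ch for seq in char_seqs for ch in seq.split()}
vocab_size = 14

# Steps 2..n: merge the most frequent symbol pair until the vocabulary is full
while len(vocab) < vocab_size:
    pairs = get_pair_counts(char_seqs)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    char_seqs = merge_pair(best, char_seqs)
    vocab.add("".join(best))
    print(best, "->", "".join(best))
```

With 11 unique characters and a target size of 14, the loop performs three merges. The first two merge s and t into st (the pair occurs 4 times across the weighted sequences) and m and e into me (3 times); the third merge depends on how the tie among the remaining pairs is broken.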