The WordPiece Tokenizer
Discover how the WordPiece tokenizer breaks down words into manageable subwords by checking against BERT's vocabulary of 30,000 tokens. Understand the process of token splitting, including handling out-of-vocabulary words, and how special tokens like [CLS] and [SEP] integrate into BERT's input preparation.
BERT uses a special type of tokenizer called a WordPiece tokenizer, which follows a subword tokenization scheme. Let's understand how the WordPiece tokenizer works with the help of an example. Consider the following sentence:

"Let us start pretraining the model"
Tokenize the sentence
Now, if we tokenize the sentence using the WordPiece tokenizer, we obtain the tokens shown here:

tokens = [let, us, start, pre, ##train, ##ing, the, model]
We can observe that while tokenizing the sentence using the WordPiece tokenizer, the word 'pretraining' is split into the subwords 'pre', '##train', and '##ing'. This happens because the WordPiece tokenizer first checks whether a whole word is present in BERT's vocabulary of 30,000 tokens; since 'pretraining' is not, it is split into subwords that are. The ## prefix indicates that a subword continues the preceding token rather than starting a new word, and if no valid split can be found at all, the word is mapped to the [UNK] (unknown) token. Finally, before the token sequence is fed to BERT, the special [CLS] token is added at the beginning of the sentence and the [SEP] token at the end.
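To see this in practice, here is a minimal sketch using the Hugging Face transformers library (an assumption; the lesson itself may use a different setup). It tokenizes the example sentence with the pretrained bert-base-uncased WordPiece vocabulary and then shows how [CLS] and [SEP] are added when the input is encoded; note that the exact subword split can vary slightly depending on the checkpoint's vocabulary.

```python
from transformers import BertTokenizer

# Load the pretrained WordPiece tokenizer used by bert-base-uncased.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "Let us start pretraining the model"

# Plain WordPiece tokenization: an out-of-vocabulary word is split
# into in-vocabulary subwords, marked with the ## continuation prefix.
tokens = tokenizer.tokenize(sentence)
print(tokens)
# Expected (per the lesson): ['let', 'us', 'start', 'pre', '##train', '##ing', 'the', 'model']

# encode() prepares the actual model input: it tokenizes, adds the
# special [CLS] and [SEP] tokens, and maps each token to its vocabulary ID.
input_ids = tokenizer.encode(sentence)
print(tokenizer.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'let', 'us', 'start', 'pre', '##train', '##ing', 'the', 'model', '[SEP]']
```

The ## prefix is what makes the split reversible: subwords carrying it are glued back onto the preceding token when the text is detokenized, so 'pre' + '##train' + '##ing' can be reconstructed as 'pretraining'.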