Byte-level Byte Pair Encoding & WordPiece Algorithms

Now, let's discuss two other subword tokenization algorithms—Byte-level byte pair encoding and WordPiece.


Byte-level byte pair encoding algorithm

Byte-level byte pair encoding (BBPE) is another popularly used algorithm. It works very similar to BPE, but instead of using a character-level sequence, it uses a byte-level sequence.

Tokenizing with byte-level byte pair encoding

Let's understand how BBPE works with the help of an example.

Example: best

Let's suppose our input text consists of just the word 'best'. We know that in BPE, we convert the word into a character sequence, so we will have the following:

Get hands-on with 1200+ tech skills courses.