Tokenization Methods and Vocabularies
Explore how tokenization methods like BPE, WordPiece, and SentencePiece split text into manageable units for language models. Understand how vocabulary size trades off against sequence length, memory use, and inference cost. This lesson guides you through foundational tokenization strategies and their impact on model performance and multilingual text handling.
The previous lesson showed that a transformer’s decoder projects its output into a probability distribution over a vocabulary of tokens. But how is that vocabulary constructed, and how is raw text split into those tokens in the first place? These questions sit at the foundation of every language model, because the vocabulary defines the entire set of symbols a model can read and produce.
Language has an enormous surface area. Misspellings, morphological variants like “running” and “ran,” compound words, and multilingual scripts all contribute to a practically infinite number of distinct word forms. A model’s vocabulary must balance coverage of this diversity against tractability, keeping the token set small enough to learn effectively. Subword tokenization is the dominant solution in modern LLMs, occupying the sweet spot between two extremes. Word-level tokenization creates huge vocabularies and cannot handle words it has never seen during training. Character-level tokenization covers everything but produces extremely long sequences, making it harder for the model to learn meaningful patterns. Subword methods split text into pieces that are larger than individual characters but smaller than full words, capturing common roots, prefixes, and suffixes as reusable units.
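To make the sequence-length side of that trade-off concrete, here is a minimal sketch in plain Python (no tokenizer library; the sentence is just an illustrative example) that splits the same text at word and character granularity and compares the resulting token counts.

```python
# Compare sequence lengths at word-level vs. character-level granularity.
# Illustrative only: real tokenizers use learned vocabularies, not str.split().
sentence = "Tokenization splits raw text into units a model can process."

word_tokens = sentence.split()   # one token per whitespace-separated word
char_tokens = list(sentence)     # one token per character, spaces included

print(f"word-level:      {len(word_tokens):3d} tokens -> {word_tokens}")
print(f"character-level: {len(char_tokens):3d} tokens -> {char_tokens[:10]}...")

# Word-level keeps sequences short but needs a vocabulary entry for every
# distinct word form and fails on unseen words; character-level needs only a
# tiny vocabulary but makes every sequence several times longer.
```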
Consider the word “unhappiness.” A subword tokenizer might split it into ["un", "happi", "ness"], allowing the model to generalize across morphological variants it has never encountered as whole words. This lesson covers the three methods that make this possible: BPE, WordPiece, and SentencePiece.
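To see how trained tokenizers actually split this word, a short sketch like the one below (assuming the Hugging Face transformers library is installed; the three checkpoint names are just common public models, one per family covered in this lesson) loads each tokenizer and prints its pieces. The exact splits depend on each tokenizer's learned vocabulary, so they may differ from the illustrative ["un", "happi", "ness"] above.

```python
# Inspect how three public tokenizers split the same word.
# Assumes: pip install transformers (sentencepiece may be needed for some checkpoints).
from transformers import AutoTokenizer

word = "unhappiness"

for name in ["bert-base-uncased", "gpt2", "xlm-roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name:20s} -> {tokenizer.tokenize(word)}")

# bert-base-uncased uses WordPiece ('##' marks word-internal pieces),
# gpt2 uses byte-level BPE, and xlm-roberta-base uses SentencePiece
# ('▁' marks pieces that begin a new word).
```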
The following diagram illustrates how these three levels of granularity compare when applied to a single sentence.