Tokenization Methods and Vocabularies
Explore how tokenization methods like BPE, WordPiece, and SentencePiece split text into manageable units for language models. Understand how vocabulary size trades off against sequence length, memory use, and inference cost. This lesson guides you through foundational tokenization strategies and their impact on model performance and multilingual text handling.
The previous lesson showed that a transformer’s decoder projects its output into a probability distribution over a vocabulary of tokens. But how is that vocabulary constructed, and how is raw text split into those tokens in the first place? These questions sit at the foundation of every language model, because the vocabulary defines the entire set of symbols a model can read and produce.
Language has an enormous surface area. Misspellings, morphological variants like “running” and “ran,” compound words, and multilingual scripts all contribute to a practically infinite number of distinct word forms. A model’s vocabulary must balance coverage of this diversity against tractability, keeping the token set small enough to learn effectively. Subword tokenization is the dominant solution in modern LLMs, occupying the sweet spot between two extremes. Word-level tokenization creates huge vocabularies and cannot handle words it has never seen during training. Character-level tokenization covers everything but produces extremely long sequences, making it harder for the model to learn meaningful patterns. Subword methods split text into pieces that are larger than individual characters but smaller than full words, capturing common roots, prefixes, and suffixes as reusable units.
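To make the sequence-length side of that trade-off concrete, here is a minimal sketch in plain Python (no tokenizer library; the sentence is just an illustrative example) that splits the same text at word and character granularity and compares the resulting token counts.

```python
# Compare sequence lengths at word-level vs. character-level granularity.
# Illustrative only: real tokenizers use learned vocabularies, not str.split().
sentence = "Tokenization splits raw text into units a model can process."

word_tokens = sentence.split()   # one token per whitespace-separated word
char_tokens = list(sentence)     # one token per character, spaces included

print(f"word-level:      {len(word_tokens):3d} tokens -> {word_tokens}")
print(f"character-level: {len(char_tokens):3d} tokens -> {char_tokens[:10]}...")

# Word-level keeps sequences short but needs a vocabulary entry for every
# distinct word form and fails on unseen words; character-level needs only a
# tiny vocabulary but makes every sequence several times longer.
```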
Consider the word “unhappiness.” A subword tokenizer might split it into ["un", "happi", "ness"], allowing the model to generalize across morphological variants it has never encountered as whole words. This lesson covers the three methods that make this possible: BPE, WordPiece, and SentencePiece.
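To see how trained tokenizers actually split this word, a short sketch like the one below (assuming the Hugging Face transformers library is installed; the three checkpoint names are just common public models, one per family covered in this lesson) loads each tokenizer and prints its pieces. The exact splits depend on each tokenizer's learned vocabulary, so they may differ from the illustrative ["un", "happi", "ness"] above.

```python
# Inspect how three public tokenizers split the same word.
# Assumes: pip install transformers (sentencepiece may be needed for some checkpoints).
from transformers import AutoTokenizer

word = "unhappiness"

for name in ["bert-base-uncased", "gpt2", "xlm-roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name:20s} -> {tokenizer.tokenize(word)}")

# bert-base-uncased uses WordPiece ('##' marks word-internal pieces),
# gpt2 uses byte-level BPE, and xlm-roberta-base uses SentencePiece
# ('▁' marks pieces that begin a new word).
```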
The following diagram illustrates how these three levels of granularity compare when applied to a single sentence.