From Words to Meaning: Tokenization and Embeddings
Get introduced to how tokenization and embeddings transform raw text into numerical representations that language models can process.
Let’s begin the journey with our guide prompt: “Twinkle, twinkle, little.” Our prompt is written in English, but computers don’t understand words; they understand numbers. So, what is the very first thing that happens when we send our prompt to an LLM? How does it translate our language into its own?
In this lesson, we will explore the first two crucial steps of this translation process. We’ll learn about tokenization, the process of breaking text into smaller pieces, and embeddings, the process of converting those pieces into meaningful numerical representations.
Breaking text into pieces with tokenization
When an LLM is given text as a string of characters, it first converts the string into a sequence of chunks. Tokenization is the process of breaking a piece of text into smaller units called tokens. These tokens can be words, parts of words (subwords), or even individual characters and punctuation.
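To see this in action, here is a minimal sketch using the open-source tiktoken library (an illustrative choice; the tokenizer we use later in this lesson may differ). It splits our guide prompt into tokens and shows the integer ID assigned to each one:

```python
# Minimal tokenization sketch using the tiktoken library (pip install tiktoken).
# "cl100k_base" is one of tiktoken's built-in BPE vocabularies; this is an
# illustrative choice, not necessarily the tokenizer this lesson uses later.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Twinkle, twinkle, little"
token_ids = enc.encode(text)                       # text -> list of integer token IDs
tokens = [enc.decode([tid]) for tid in token_ids]  # map each ID back to its text chunk

print(token_ids)  # the numeric IDs the model actually sees
print(tokens)     # the text chunks; exact splits depend on the learned vocabulary
```

Notice that token boundaries don't always line up with word boundaries: frequent fragments, punctuation, and even leading spaces often become tokens of their own.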
Byte pair encoding
But how does the tokenizer determine what constitutes a token? Most modern tokenizers, including the one we’re about to use, are based on an algorithm called Byte Pair Encoding (BPE).
At a high level, BPE is a data compression algorithm that “learns” an efficient, frequency-based vocabulary for a language. It begins by examining a massive corpus of text and treating every individual character as a token. Then, it iteratively finds the most frequently occurring pair of adjacent tokens and merges them into a single new token, adding this new token to its vocabulary. It repeats this process thousands of times.
Imagine a BPE algorithm learning from an English text that contains common words like "the", "this", "that", "them", "they", "thing", etc. When processing the text:
It might first merge 't' and 'h' into a new token 'th'.
Then it might merge 'th' and ...
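To make the merge procedure concrete, here is a toy, from-scratch sketch of the BPE training loop described above (purely illustrative; real tokenizers train on far larger corpora and run thousands of merges). It counts adjacent token pairs in a tiny corpus, weighted by made-up word frequencies, and repeatedly merges the most frequent pair:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged token."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair into one token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: the common words from the example above, with made-up frequencies.
words = {
    tuple("the"): 10,
    tuple("this"): 6,
    tuple("that"): 6,
    tuple("them"): 4,
    tuple("they"): 4,
    tuple("thing"): 3,
}

for step in range(5):  # real tokenizers repeat this thousands of times
    pair = most_frequent_pair(words)
    if pair is None:
        break
    print(f"merge {step + 1}: {pair[0]!r} + {pair[1]!r} -> {(pair[0] + pair[1])!r}")
    words = merge_pair(words, pair)
```

With these made-up frequencies, the first merge is 't' + 'h' -> 'th', because that pair appears in every word of the toy corpus; subsequent merges then build longer fragments from there.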