
Tokenization and Embeddings

Explore how raw text is transformed into numerical vectors via tokenization and embedding processes, the foundation of AI language models. Understand why tokenization strategies matter, including byte-pair encoding, static versus contextual embeddings, and how these concepts impact model efficiency and semantic understanding. Gain insights into the text-to-vector pipeline critical for AI engineering interviews and production search systems.

Before a transformer processes a single word, that word must stop being a word. It must become a number: specifically, a vector of floating-point numbers that the network can multiply, add, and differentiate through. The entire pipeline from raw text to numerical representation is what this lesson is about, and interviewers probe it relentlessly because it exposes whether you understand the system at its seams, not just at its headline layer.

A candidate who cannot explain why word-level tokenization breaks down, or who thinks embeddings encode meaning dimension-by-dimension, will struggle to answer follow-up questions about context windows, vocabulary size trade-offs, or retrieval system design. These questions directly determine how you chunk documents for RAG, why multilingual models behave the way they do, and why some queries return semantically wrong but superficially similar results.

This lesson follows the text-to-vector pipeline in the order it actually executes: first tokenization, then static embeddings, then contextual embeddings, and finally the retrieval trade-offs that determine how those embeddings get used in production.

Tokenization and embeddings are not preprocessing steps that happen before the “real” model. They are the input layer of the model. GPT-5.2, Claude Opus 4.6, and Gemini 3 all begin by running your text through a tokenizer and an embedding lookup table. Understanding this pipeline is not optional background knowledge. It determines how these models handle long documents, rare words, code, and non-English text. Every frontier model behavior you will be asked about in an interview traces back to decisions made here.

Why can’t you feed raw text directly into a neural network?

Neural networks are mathematical functions. They perform matrix multiplications, compute gradients, and apply nonlinearities. None of these operations are defined over the string “hello”. Text must be converted into numbers before a network can do anything with it. The conversion happens in two stages: tokenization maps text to integer IDs, and embedding maps those IDs to dense vectors.
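The two stages can be sketched in a few lines. This is a toy illustration, not a real model's tokenizer: the vocabulary, the `<unk>` fallback, the 4-dimensional embeddings, and the random weights are all placeholder assumptions chosen for readability.

```python
import numpy as np

# Stage 1: tokenization — map text to integer IDs.
# Toy word-level vocabulary with an unknown-word fallback.
vocab = {"hello": 0, "world": 1, "<unk>": 2}

def tokenize(text):
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# Stage 2: embedding — map each ID to a dense vector via a lookup table.
# Real models learn these weights; here they are random, and 4 dimensions
# stands in for the hundreds or thousands a production model uses.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

ids = tokenize("hello world")
vectors = embedding_table[ids]  # one dense vector per token

print(ids)            # [0, 1]
print(vectors.shape)  # (2, 4)
```

The key structural point survives the simplification: the network never sees the string "hello", only the row of `embedding_table` that its ID indexes into.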

The first question is: what unit of text should each integer ID represent?

Character-level tokenization is the most obvious starting point. Map every character to an ID and feed the sequence to the network. It works, but it is deeply inefficient. The word “unbelievable” becomes twelve tokens. A 500-word paragraph becomes 2,500+ tokens. Context windows fill up fast, sequences become harder to model because the relationships between meaningful units are now stretched across many steps, and the model must learn to compose meaning from scratch at the character level rather than building on pre-existing word knowledge.
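The blow-up is easy to verify. A minimal character-level tokenizer, using Unicode code points as IDs purely for convenience:

```python
# Toy character-level tokenizer: one token per character.
# Using code points as IDs is an arbitrary choice for illustration;
# any character-to-integer mapping behaves the same way.

def char_tokenize(text):
    return [ord(c) for c in text]

word = "unbelievable"
print(len(char_tokenize(word)))  # 12 tokens for a single word

# Rough scaling: ~5 characters per English word plus a space
# gives on the order of 3,000 tokens for a 500-word paragraph.
paragraph = " ".join(["token"] * 500)
print(len(char_tokenize(paragraph)))  # 2999
```

Every one of those tokens occupies a slot in the context window and a step in the sequence the model must reason across.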

Word-level tokenization is the opposite extreme. Assign an ID to every word. Now “unbelievable” is one token, which is efficient. But this approach breaks in three ways:

  • Vocabulary explosion: the vocabulary must include every word the model might encounter. English alone has over a million ...