
The Heart of the LLM: The Attention Mechanism

Learn how self-attention works and how each token absorbs meaning from its relevant neighbors.

In our last lesson, we crafted the perfect input for our model. We started with a matrix of semantic embeddings, injected the concept of order by adding positional encodings, and finally, stabilized the result with layer normalization. We now have a fully prepared matrix, rich with meaning and position.

But our vectors, as prepared as they are, are still isolated. They exist in parallel but have no awareness of each other. The vector for “little” has no idea that the vector for “Twinkle” is its most important neighbor. How do we enable these vectors to communicate and build a true, contextual understanding of the prompt?

In this lesson, we will explore the brilliant solution to this problem: the self-attention mechanism. We’ll learn how tokens “talk” to each other to build a context-aware representation of our prompt.

The self-attention mechanism

The idea that solves this problem is called self-attention. This mechanism enables the model to dynamically weigh the importance of every token in the input sequence when processing a single token.

A good analogy is to think of it like a networking event. When you’re trying to explain your role, you don’t give the same generic speech to everyone. You “pay attention” to who you’re talking to. You might emphasize the technical aspects when talking to an engineer and the business aspects when talking to a project manager. Self-attention allows each token to do the same thing. It refines its own meaning based on the other tokens in the room (the prompt).

So how does the model implement this idea mathematically? It uses a framework borrowed from information retrieval systems. For each token’s embedding vector, the model generates three new, smaller vectors: a query, a key, and a value.

  • Query (Q): This vector is what the current token is looking for. It’s your search term in the YouTube search bar, like “how to bake bread.” It represents the question: “Who in this sequence of words is relevant to me?”

  • Key (K): This vector is what a token has to offer. It’s like the title and tags of a YouTube video (e.g., “baking,” “sourdough,” “beginner recipe”). It’s an advertisement of its own identity: “This is the kind of information I contain.”

  • Value (V): This vector contains the actual information the model retrieves once a match between a query and a key is found. In our YouTube analogy, it’s the video’s content itself, the part you actually watch and learn from after searching.

The process is like a search: for each token, its Query scans the Keys of every token in the sequence (including its own). A strong match between a Query and a Key means that the corresponding Value is highly relevant and should be blended into the current token’s new representation.
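To make this concrete, here is a minimal sketch in PyTorch of how the three vectors are produced. The sizes `d_model`, `d_k`, and `seq_len` are illustrative placeholders, not values from this lesson: each token’s embedding is multiplied by three learned weight matrices to give its Query, Key, and Value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_k = 16, 8   # illustrative sizes, not the lesson's actual dimensions
seq_len = 4            # e.g. the four tokens of "Twinkle, twinkle, little star"

# Stand-in for the prepared embedding matrix from the previous lesson.
x = torch.randn(seq_len, d_model)

# Three learned projections turn each embedding into its Query, Key, and Value.
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)  # each is (seq_len, d_k): one small vector per token
```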

Since our goal is to build a generative model that predicts the next word using the attention mechanism, we must enforce one critical rule: it cannot “cheat” by looking ahead. During training, a token must not be allowed to see the tokens that come after it. The solution is a causal mask, which ensures a token can only look at itself and the tokens that came before it. This constraint is the defining feature of a decoder-only model, and it is what makes step-by-step generation possible.
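The sketch below shows one common way to apply such a mask, assuming the pairwise attention scores have already been computed (the score values here are random placeholders): future positions are set to negative infinity before the softmax, so they receive exactly zero attention weight.

```python
import torch

seq_len = 4
# Placeholder pairwise attention scores; in a real model these come from Q and K.
scores = torch.randn(seq_len, seq_len)

# Causal mask: position i may only attend to positions j <= i.
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Future positions become -inf, so the softmax assigns them zero weight.
masked_scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights)  # each row sums to 1 and is zero above the diagonal
```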

At the heart of self-attention lies one elegant formula that captures everything we’ve just described:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here:

  • $Q$, $K$, and $V$ are the matrices formed by stacking every token’s query, key, and value vectors.

  • $d_k$ is the dimension of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products at a stable scale before the softmax.
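To connect the formula back to code, here is a minimal end-to-end sketch; the function name, shapes, and random inputs are illustrative assumptions rather than code from this course.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Scaled dot-product attention following the formula above (illustrative sketch)."""
    d_k = Q.size(-1)
    # Compare every Query against every Key, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if causal:
        # Causal mask: forbid attending to future positions.
        seq_len = scores.size(-1)
        mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention weights; each row sums to 1
    return weights @ V                       # blend the Values by those weights

# Tiny usage example with random Q, K, V (shapes chosen only for illustration).
Q, K, V = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one context-aware vector per token
```

Each row of the output is the new, context-aware representation of one token: a weighted blend of the Value vectors that its Query matched most strongly.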