
Attention Mechanisms

Learn how attention mechanisms enable models like GPT and Claude to understand context. This lesson covers the math behind attention, the difference between self-attention and cross-attention, multi-head attention, and practical variants like GQA and MQA. Master the core concept that underpins modern transformers and prepare for technical AI interview questions.

Attention is the single most important concept in modern AI engineering interviews. It is the mechanism that powers every frontier model from GPT to Claude to Gemini, and interviewers test it at every level: intuition, math, architecture, and production trade-offs. A shallow answer here will end an interview fast.

This lesson builds attention from the ground up, covers multi-head attention, explains the self-attention versus cross-attention distinction, and introduces GQA and MQA: the attention variants used in almost every modern open-weight model that most candidates have never heard of.

Attention, in simple terms, is a weighted average. Every token produces three vectors (Q, K, V); dot products between Q and K determine how much each token attends to every other token, and those weights aggregate the V vectors. Everything else in this lesson is a structured elaboration of that one sentence.
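That one sentence can be written as a few lines of numpy. This is a minimal sketch (random vectors, no masking, no batching, no learned projections) just to show the weighted-average structure:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq, seq): query-key similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted average of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 tokens, 8-dim heads
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # one contextual vector per token: (4, 8)
```

The output row for each token is literally a convex combination of the V rows, with coefficients set by how well that token's query matches every key.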

Why did self-attention replace recurrent networks?

Before transformers, sequence models were recurrent. An RNN reads tokens one at a time, left to right, maintaining a hidden state that is passed from step to step. The problem is fundamental: to relate token 1 to token 100, the signal from token 1 must survive 99 sequential updates to the hidden state. In practice it rarely does. Gradients vanish over long distances and the hidden state becomes a bottleneck that cannot carry everything.

Self-attention solves this with a direct connection. Every token attends to every other token in a single operation, regardless of how far apart they are. Token 1 and token 100 interact directly, with no intermediate steps.

There is a second advantage: parallelism. RNNs are inherently sequential and you cannot compute step 5 until step 4 is done. Self-attention processes the entire sequence simultaneously as a set of matrix multiplications, which maps perfectly to GPU hardware. This is why transformers scaled and RNNs did not.

One important consequence of treating the input as a set rather than a sequence: self-attention is permutation-equivariant. Shuffle the input tokens and you get the same attention outputs, just reordered. The model has no built-in sense of order.
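Permutation equivariance is easy to check empirically. In this sketch (random embeddings and random projection matrices, standing in for learned weights), shuffling the input rows produces exactly the shuffled version of the original output:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # All three vectors come from the same input via learned linear maps.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))  # 5 tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))

perm = rng.permutation(5)  # shuffle the token order
out_original = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Same outputs, just reordered: attention alone is blind to position.
print(np.allclose(out_shuffled, out_original[perm]))  # True
```

Note this holds only for unmasked attention with no positional signal; a causal mask or positional encodings deliberately break this symmetry.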

This is why positional encodings are required, which we will cover in the coming lessons.

How does scaled dot-product attention work?

To build a contextual embedding, a model needs to figure out which surrounding words matter most. It does this by projecting every token into three distinct vectors through learned linear transformations:

  • Query (Q): "What am I looking for?" (What the current token wants to find in others).

  • Key (K): "What do I offer?" (What each token advertises about itself for matching).

  • Value (V): "What information do I actually carry?" (The content that gets retrieved).
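The three projections above are just three learned matrices applied to the same token embeddings. A sketch with random matrices in place of learned weights (`W_q`, `W_k`, `W_v` and the dimensions are illustrative, not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 8

# One embedding per token; the same matrices are applied to every token.
tokens = rng.standard_normal((3, d_model))
W_q = rng.standard_normal((d_model, d_k))  # learned query projection
W_k = rng.standard_normal((d_model, d_k))  # learned key projection
W_v = rng.standard_normal((d_model, d_k))  # learned value projection

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# Raw attention score: how well token 0's query matches token 2's key,
# scaled by sqrt(d_k) to keep dot products in a stable range.
score = float(Q[0] @ K[2]) / np.sqrt(d_k)
print(Q.shape, K.shape, V.shape)  # each is (3, 8)
```

Each token thus ends up with its own query, key, and value, and every pairwise score is a single scaled dot product.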

To determine how much focus Token A should give to Token B, the network takes the dot product of A's Query and B's Key (Q_i · K_j ...