Scaled Dot-Product Attention
Explore the scaled dot-product attention formula that drives transformer-based language models. Understand how queries, keys, and values interact through a four-step process: similarity calculation, scaling to prevent softmax saturation, conversion to a probability distribution, and weighted summation into context-aware outputs. Gain the insight needed for debugging, optimizing attention heads, and improving training stability.
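In the previous lesson, you built an intuition for queries, keys, and values using the library analogy: a query represents what you are looking for, keys represent the labels on each book, and values represent the actual content inside those books. That intuition now becomes a precise mathematical formula. The full scaled dot-product attention equation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $Q$, $K$, and $V$ are the matrices of query, key, and value vectors, and $d_k$ is the dimension of each key vector.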
This single line of math is the engine behind transformer-based large language models. When a production LLM deployed on a platform like Amazon SageMaker generates a token during inference, it executes this computation in every attention head of every layer, so the operation repeats an enormous number of times over the course of a conversation. Understanding each piece of this formula is essential for debugging model behavior, choosing attention head sizes, and diagnosing training instability.
This lesson dissects the formula into four steps, each building on the last.
Step 1 computes raw similarity scores through the dot product $QK^\top$. Step 2 applies the scaling factor $1/\sqrt{d_k}$ (where $d_k$ is the key dimension) to stabilize gradients. Step 3 uses softmax to convert raw scores into a probability distribution. Step 4 multiplies those probabilities by $V$
...
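The four steps above can be sketched in a few lines of code. The following is a minimal illustration, not a production implementation: it uses plain Python lists so that every operation is visible, whereas real models use batched tensor libraries. The function names `softmax` and `scaled_dot_product_attention` and the toy values for `Q`, `K`, and `V` are assumptions made up for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    d_v = len(V[0])

    # Step 1: raw similarity, one dot product per (query, key) pair.
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) for k in K] for q in Q]

    # Step 2: divide every score by sqrt(d_k).
    scale = math.sqrt(d_k)
    scaled = [[s / scale for s in row] for row in scores]

    # Step 3: softmax across the keys, separately for each query.
    weights = [softmax(row) for row in scaled]

    # Step 4: weighted sum of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights


# Toy inputs (hypothetical): one query that matches the first key.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
print("attention weights:", weights)
print("output:", output)
```

With these toy inputs the query aligns with the first key, so the first attention weight comes out larger (roughly 0.67 versus 0.33) and the output leans toward the first value vector.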
With the full picture in view, the following sections examine each step.