
Scaled Dot-Product Attention

Explore the scaled dot-product attention formula that drives transformer-based language models. Understand how queries, keys, and values interact through a four-step process: similarity calculation, scaling to prevent softmax saturation, conversion of scores into a probability distribution, and weighted summation to produce context-aware outputs. This understanding is essential for debugging, optimizing attention heads, and improving training stability.

In the previous lesson, you built an intuition for queries, keys, and values using the library analogy: a query represents what you are looking for, keys represent the labels on each book, and values represent the actual content inside those books. That intuition now becomes a precise mathematical formula. The full scaled dot-product attention equation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This single line of math is the engine behind transformer-based large language models. When a production LLM deployed on a platform like Amazon SageMaker generates a token during inference, it runs this computation in every attention head of every layer, which adds up to an enormous number of evaluations over the course of a conversation. Understanding each piece of this formula is essential for debugging model behavior, choosing attention head sizes, and diagnosing training instability.

This lesson dissects the formula into four steps, each building on the last.

  • Step 1 computes raw similarity scores through the dot product $QK^T$.

  • Step 2 applies the scaling factor $\frac{1}{\sqrt{d_k}}$ to stabilize gradients.

  • Step 3 uses softmax to convert raw scores into a probability distribution.

  • Step 4 multiplies those probabilities by ...
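
The four steps can be traced in a few lines of code. The following is a minimal sketch in plain Python, not a production implementation: the function names `softmax` and `scaled_dot_product_attention`, the list-of-lists layout, and the toy values for `Q`, `K`, and `V` are all assumptions made for illustration. It also assumes that the final step is a weighted sum over the rows of the value matrix, as described in the lesson summary.

```python
import math


def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Illustrative attention over lists of lists.

    Q: n_q rows of length d_k, K: n_k rows of length d_k,
    V: n_k rows of length d_v. Returns (outputs, weights).
    """
    d_k = len(K[0])
    d_v = len(V[0])
    outputs, all_weights = [], []
    for q in Q:
        # Step 1: raw similarity, one dot product per key (one row of QK^T).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Step 2: scale every score by 1 / sqrt(d_k).
        scaled = [s / math.sqrt(d_k) for s in scores]
        # Step 3: softmax converts scaled scores into attention weights.
        weights = softmax(scaled)
        # Step 4: weighted sum of the value rows.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example (hypothetical numbers): one query that matches the first key.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights[0])  # the first key receives the larger weight
print(output[0])   # a blend of the value rows, pulled toward the first one
```

Because the query aligns with the first key, the first value row dominates the output. Real implementations perform the same steps as batched matrix operations on tensors rather than Python loops.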

Scaled dot-product attention mechanism showing Q, K, V tensor transformations through matrix multiplication, scaling, and softmax operations

With the full picture in view, the following sections examine each step.

The dot product computes similarity

How

...