
Intuition Behind Attention: Why It Works

Understand how the attention mechanism in transformers focuses on relevant tokens dynamically. This lesson uses analogies and examples to reveal how queries, keys, and values interact to overcome information bottlenecks, enabling efficient parallel computation and improved language generation. Grasping this intuition lays the foundation for the more advanced transformer details that follow.

Every token in a transformer carries a rich vector that encodes both its meaning and its position in the sequence. The previous lesson showed how embeddings and positional encodings produce these vectors. But there is a critical gap: the model still has no way to decide which tokens matter most for the task it is performing right now. Consider a concrete translation example. The French sentence “Le chat noir dort sur le canapé” needs to become “The black cat sleeps on the couch” in English. When the model is generating the word “black,” it must focus heavily on “noir” while largely ignoring “sur” and “le.” Without a focusing mechanism, every token contributes equally, and the relevant signal drowns in noise. Attention is the transformer’s learned ability to dynamically assign relevance, acting like a spotlight that shifts depending on what the model is currently trying to produce. This lesson builds the full intuition visually and narratively, while the next lesson on Scaled Dot-Product Attention will formalize the math.
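To make the spotlight analogy concrete, here is a minimal NumPy sketch of the core computation, a simplified, unscaled version of the dot-product attention that the next lesson formalizes. The four-dimensional token vectors and the query are made-up values chosen purely for illustration, not outputs of a trained model; real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Illustrative 4-dim key vectors for the French source tokens.
source_tokens = ["Le", "chat", "noir", "dort"]
keys = np.array([
    [0.1, 0.0, 0.2, 0.1],   # "Le"
    [0.9, 0.1, 0.0, 0.3],   # "chat"
    [0.0, 1.0, 0.1, 0.0],   # "noir"
    [0.2, 0.0, 0.8, 0.1],   # "dort"
])
values = keys.copy()        # reuse keys as values to keep the sketch small

# Query for the decoder step that is producing "black". By assumption,
# training has aligned this query with the direction "noir" points in
# (its magnitude is exaggerated here so the softmax comes out peaked).
query = np.array([0.0, 3.0, 0.3, 0.0])

scores = keys @ query       # relevance: dot product of the query with each key
weights = softmax(scores)   # normalize scores into weights that sum to 1

for token, w in zip(source_tokens, weights):
    print(f"{token:>5}: {w:.2f}")

# The resulting representation is a weighted sum of the value vectors,
# dominated by "noir" because it received most of the attention weight.
context = weights @ values
```

Running the sketch prints a weight distribution concentrated on "noir" (roughly 0.85, with the remaining tokens splitting the rest), which is exactly the spotlight behavior: every token still contributes, but only faintly compared with the relevant one.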

The following diagram illustrates how attention creates selective connections between source and target tokens during translation.

Translation alignment showing attention weights between French and English tokens with thicker lines indicating stronger relevance

Why equal weighting fails

Early sequence-to-sequence (seq2seq) models, neural architectures that map an input sequence to an output sequence by first encoding the input into a fixed-length vector and then decoding that vector into the target sequence, took a naive approach. They compressed the entire input into a single fixed-length context vector by averaging or summarizing all token representations. This created a severe information bottleneck. Compressing a full paragraph into one vector forces the ...