Multi-Head Attention, GQA, and MQA
Understand multi-head attention, the mechanism that allows an LLM to analyze language from multiple perspectives simultaneously.
In our last lesson, we built the complete, masked self-attention mechanism. We successfully created a process where tokens can communicate with past tokens to build a rich, contextual understanding of our prompt.
However, the mechanism we built, as powerful as it is, represents a single, unified perspective. It’s like having one expert review a complex document. They might be a brilliant grammarian but completely miss the subtle semantic themes. Language is layered and complex. Is one perspective truly enough to understand it? What if we could have a whole “committee of experts” analyze our prompt at the same time?
The solution to the “single perspective” problem is multi-head attention.
Multi-head attention
This concept is much simpler than it sounds. Instead of having just one set of W_Q, W_K, and W_V weight matrices, multi-head attention has multiple independent sets. It runs the entire self-attention process we learned earlier multiple times in parallel, once for each “head.”
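To make the idea concrete, here is a minimal NumPy sketch (toy sizes, random weights, not the course's actual code): it simply runs the masked self-attention from the last lesson once per head, with each head owning its own independent W_Q, W_K, and W_V. The values seq_len = 4, d_model = 8, and num_heads = 2 are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads                  # each head works in a smaller subspace

X = rng.normal(size=(seq_len, d_model))        # token embeddings for a 4-token prompt

head_outputs = []
for h in range(num_heads):
    # Each head gets its OWN independent set of projection matrices.
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # (seq_len, d_head) each

    # The same masked self-attention as before, run once per head.
    scores = (Q @ K.T) / np.sqrt(d_head)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf                     # causal mask: no attending to future tokens
    head_outputs.append(softmax(scores) @ V)   # (seq_len, d_head)

print([out.shape for out in head_outputs])     # two heads, each producing (4, 4)
```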
This is a fundamental learning technique. We do not pre-program Head 1 to be the “grammar expert” or Head 7 to be the “pronoun expert.” Instead, we provide the architectural capacity for specialization, and the model itself discovers the most effective way to use that capacity during training. Over time, driven by the single goal of making better predictions, different heads naturally learn to specialize in tracking different types of relationships:
Head 1 (The grammarian): Might focus on grammatical relationships. It might link “little” to “twinkle” as an adjective to a noun.
Head 2 (The semantic expert): Might focus on conceptual similarity. It might note that “Twinkle” and “twinkle” refer to the same concept.
Head 3 (The positional analyst): Might focus on word order and proximity.
…and so on, for 8, 12, or even more heads.
By having these experts work in parallel, the model can capture a rich, multi-faceted understanding of the language in a single pass. The technical implementation of this involves a two-part transformation: first, we split our single input into multiple parallel “workspaces” for our experts, and then, after they have completed their work, we synthesize their findings into a single, unified output.
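As a rough sketch of that first, “splitting” step (again with illustrative toy shapes, and describing a common implementation convention rather than anything specific to this course): instead of materializing separate small matrices for every head, frameworks typically compute one full-size projection and reshape its output into per-head slices. The same split is applied to the key and value projections, and each head then attends within its own slice.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

X = rng.normal(size=(seq_len, d_model))       # the single input we start from

# One full-size query projection ...
W_Q = rng.normal(size=(d_model, d_model))
Q = X @ W_Q                                   # (seq_len, d_model)

# ... whose output is "split" into parallel workspaces by reshaping the
# feature dimension into (num_heads, d_head) slices, one per expert.
Q_heads = Q.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
print(Q_heads.shape)                          # (num_heads, seq_len, d_head)
```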
Output weight matrix (W_O)
In multi-head attention, the W_O matrix is absolutely essential: each head produces its own small output vector, and W_O is the learned weight matrix that takes all of those findings, concatenated side by side, and mixes them into a single, unified vector of the model’s original dimension, ready for the rest of the network.
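Here is a toy sketch of that synthesis step, using random placeholder arrays in place of the real per-head attention outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

# Stand-ins for the per-head results (real code would use the attention outputs).
head_outputs = [rng.normal(size=(seq_len, d_head)) for _ in range(num_heads)]

# Step 1: concatenate the heads' findings side by side -> back to (seq_len, d_model).
concat = np.concatenate(head_outputs, axis=-1)

# Step 2: mix them with the learned output matrix W_O (d_model x d_model).
W_O = rng.normal(size=(d_model, d_model))
output = concat @ W_O
print(output.shape)                           # (4, 8): one unified vector per token
```

Without this mixing step, each head’s findings would stay locked inside its own isolated slice of the output.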