Multi-Head Attention, GQA, and MQA
Understand multi-head attention, the mechanism that allows an LLM to analyze language from multiple perspectives simultaneously.
In our last lesson, we built the complete, masked self-attention mechanism. We successfully created a process where tokens can communicate with past tokens to build a rich, contextual understanding of our prompt.
However, the mechanism we built, as powerful as it is, represents a single, unified perspective. It’s like having one expert review a complex document. They might be a brilliant grammarian but completely miss the subtle semantic themes. Language is layered and complex. Is one perspective truly enough to understand it? What if we could have a whole “committee of experts” analyze our prompt at the same time?
The solution to the “single perspective” problem is multi-head attention.
Multi-head attention
This concept is much simpler than it sounds. Instead of having just one set of W_Q, W_K, and W_V weight matrices, multi-head attention has multiple independent sets. It runs the entire self-attention process we learned earlier multiple times in parallel, once for each “head.”
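To make the idea concrete, here is a minimal NumPy sketch (toy sizes, random weights, not the course's actual code): it simply runs the masked self-attention from the last lesson once per head, with each head owning its own independent W_Q, W_K, and W_V. The values seq_len = 4, d_model = 8, and num_heads = 2 are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads                  # each head works in a smaller subspace

X = rng.normal(size=(seq_len, d_model))        # token embeddings for a 4-token prompt

head_outputs = []
for h in range(num_heads):
    # Each head gets its OWN independent set of projection matrices.
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # (seq_len, d_head) each

    # The same masked self-attention as before, run once per head.
    scores = (Q @ K.T) / np.sqrt(d_head)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf                     # causal mask: no attending to future tokens
    head_outputs.append(softmax(scores) @ V)   # (seq_len, d_head)

print([out.shape for out in head_outputs])     # two heads, each producing (4, 4)
```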
This is a fundamental learning technique. We do not pre-program Head 1 to be the “grammar expert” or Head 7 to be the “pronoun expert.” Instead, we provide the architectural capacity for specialization, and the model itself discovers the most effective way to use that capacity during training. Over time, driven by the single goal of making better predictions, different heads naturally learn to specialize in tracking different types of relationships:
Head 1 (The grammarian): Might focus on grammatical relationships. It might link “little” to “twinkle” as an adjective to a noun.
Head 2 (The semantic expert): Might focus on conceptual similarity. It might note that “Twinkle” and “twinkle” refer to the same concept.
Head 3 (The positional analyst): Might focus on word order and proximity.
…and so on, for 8, 12, or even more heads.
By having these experts work in parallel, the model can capture a rich, multi-faceted understanding of the language in a single pass. The technical implementation of this involves a two-part transformation: first, we split our single input into multiple parallel “workspaces” for our experts, and then, after they have completed their work, we synthesize their findings into a single, unified output.
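As a rough sketch of that first, “splitting” step (again with illustrative toy shapes, and describing a common implementation convention rather than anything specific to this course): instead of materializing separate small matrices for every head, frameworks typically compute one full-size projection and reshape its output into per-head slices. The same split is applied to the key and value projections, and each head then attends within its own slice.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

X = rng.normal(size=(seq_len, d_model))       # the single input we start from

# One full-size query projection ...
W_Q = rng.normal(size=(d_model, d_model))
Q = X @ W_Q                                   # (seq_len, d_model)

# ... whose output is "split" into parallel workspaces by reshaping the
# feature dimension into (num_heads, d_head) slices, one per expert.
Q_heads = Q.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
print(Q_heads.shape)                          # (num_heads, seq_len, d_head)
```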
Output weight matrix (W_O)
In multi-head attention, the W_O matrix is absolutely essential: each head produces its own small output vector, and W_O is the learned weight matrix that takes all of those findings, concatenated side by side, and mixes them into a single, unified vector of the model’s original dimension, ready for the rest of the network.
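Here is a toy sketch of that synthesis step, using random placeholder arrays in place of the real per-head attention outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

# Stand-ins for the per-head results (real code would use the attention outputs).
head_outputs = [rng.normal(size=(seq_len, d_head)) for _ in range(num_heads)]

# Step 1: concatenate the heads' findings side by side -> back to (seq_len, d_model).
concat = np.concatenate(head_outputs, axis=-1)

# Step 2: mix them with the learned output matrix W_O (d_model x d_model).
W_O = rng.normal(size=(d_model, d_model))
output = concat @ W_O
print(output.shape)                           # (4, 8): one unified vector per token
```

Without this mixing step, each head’s findings would stay locked inside its own isolated slice of the output.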