Multi-Head Self-Attention
Learn how transformers capture multiple relationships in parallel.
We saw how self-attention finds relationships between words in a sequence — for example, linking “love” more strongly with “I” and “you” than with “Hello.”
But here’s the thing: a single self-attention layer focuses on one set of relationships at a time. Language is richer than that.
A sentence might have:
Grammatical dependencies (“I” → “love”)
Semantic connections (“love” ↔ “you”)
Positional cues (“Hello” at the start indicates a greeting)
A single attention “head” might latch onto one of these, but we want our model to notice all of them at once.
Why multiple heads?
Multi-head self-attention runs several self-attention operations in parallel, each with its own learnable projection of Queries, Keys, and Values.
Think of it as giving the model multiple sets of eyes — one head might pay attention to subject–verb links, another to nearby words, another to long-distance context.
Here’s the process:
1. Project the input embeddings into multiple smaller spaces, one set per head.
2. Apply self-attention independently in each head.
3. Concatenate the outputs from all heads.
4. Project back into the original embedding size.
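To see how those four steps fit together, here is a minimal sketch in PyTorch. The class and parameter names (MultiHeadSelfAttention, embed_dim, num_heads) are illustrative, not part of the lesson, and the sketch omits masking and dropout for clarity.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Step 1: learnable projections for Queries, Keys, and Values (all heads at once).
        self.w_q = nn.Linear(embed_dim, embed_dim)
        self.w_k = nn.Linear(embed_dim, embed_dim)
        self.w_v = nn.Linear(embed_dim, embed_dim)
        # Step 4: projection back to the original embedding size.
        self.w_o = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, embed_dim = x.shape

        # Step 1: project, then split each embedding into one slice per head.
        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.w_q(x))  # (batch, num_heads, seq_len, head_dim)
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Step 2: scaled dot-product attention, computed independently in each head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        weights = scores.softmax(dim=-1)
        context = weights @ v  # (batch, num_heads, seq_len, head_dim)

        # Step 3: concatenate the heads back into a single vector per token.
        context = context.transpose(1, 2).reshape(batch, seq_len, embed_dim)

        # Step 4: final output projection back to the original embedding size.
        return self.w_o(context)
```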
Mathematically:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

where each head is:

$$\text{head}_i = \text{Attention}\big(Q W_i^Q,\; K W_i^K,\; V W_i^V\big) = \text{softmax}\!\left(\frac{(Q W_i^Q)(K W_i^K)^\top}{\sqrt{d_k}}\right) V W_i^V$$

Here, $W_i^Q$, $W_i^K$, and $W_i^V$ are the learnable projection matrices for head $i$, $W^O$ is the output projection that maps the concatenated heads back to the original embedding size, and $d_k$ is the dimension of each head's Queries and Keys.
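To make the dimensions concrete, here is a quick shape check using the sketch above. The sizes, an embedding dimension of 512 split across 8 heads of 64 dimensions each, mirror the original Transformer paper but are otherwise arbitrary.

```python
# Hypothetical sizes: batch of 1, sequence of 4 tokens, 512-dim embeddings, 8 heads.
mha = MultiHeadSelfAttention(embed_dim=512, num_heads=8)
x = torch.randn(1, 4, 512)   # (batch, seq_len, embed_dim)
out = mha(x)
print(out.shape)             # torch.Size([1, 4, 512]) -- same shape as the input
```

Each head works in a 64-dimensional subspace (512 / 8), and the final projection $W^O$ restores the full 512-dimensional embedding, so multi-head attention can be stacked or swapped in wherever single-head attention fits.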