# Multi-Head Self-Attention

Explore how multi-head attention expands upon self-attention.

The idea of self-attention can be expanded to multi-head attention. In essence, we run the attention mechanism several times in parallel.

Each time, we project the Query, Key, and Value matrices into different lower-dimensional spaces with an independent set of weights and compute the attention there. Each individual output is called a "head". The projection is achieved by multiplying each matrix with a separate weight matrix, denoted as ${W}_{i}^{Q}, {W}_{i}^{K}, {W}_{i}^{V} \in R^{d_{model} \times d_{k}}$, where $i$ is the head index.

To keep the computational cost comparable to single-head attention, the dimension of each head is the model dimension divided by the number of heads. Specifically, the vanilla transformer uses $d_{model}=512$ and $h=8$ heads, which gives a per-head dimension of $d_k = d_{model}/h = 64$.
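This dimension arithmetic is easy to verify directly:

```python
d_model = 512        # model (embedding) dimension in the vanilla transformer
h = 8                # number of attention heads
d_k = d_model // h   # per-head dimension
print(d_k)  # 64
```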

With multi-head attention, the model has multiple independent paths (ways) to understand the input.

The heads are then concatenated and transformed using a square weight matrix ${W}^{O} \in R^{d_{model} \times d_{model}}$, since $d_{model}=h d_{k}$.

Putting it all together, we get:

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$

where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$

where again:

${W}_{i}^{Q}, {W}_{i}^{K}, {W}_{i}^{V} \in {R}^{d_{\text{model}} \times d_{k}}$
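The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weights are random for demonstration, and the per-head attention uses the scaled dot-product form from the vanilla transformer, $\text{softmax}(QK^T/\sqrt{d_k})V$:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists of h matrices of shape (d_model, d_k)
    # W_O: output projection of shape (h * d_k, d_model)
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h
X = rng.standard_normal((10, d_model))  # sequence of 10 token embeddings
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))

out = multi_head(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (10, 512)
```

Note how the output has the same shape as the input, which is what allows transformer blocks to be stacked.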

Since the heads are independent of each other, the self-attention computations can be performed in parallel on different workers.
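In practice, rather than looping over heads, implementations typically fuse the $h$ projection matrices into a single $d_{model} \times d_{model}$ matrix and split the result into heads with a reshape, so all heads are computed in one batched operation. A sketch of this trick, with hypothetical names and random weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_batched(X, W_Q, W_K, W_V, W_O, h):
    # W_Q, W_K, W_V, W_O: (d_model, d_model); all h head projections fused
    n, d_model = X.shape
    d_k = d_model // h

    def project(W):
        # Project once, then split into h heads: (h, n, d_k)
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = project(W_Q), project(W_K), project(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    heads = softmax(scores) @ V                       # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O                               # (n, d_model)

rng = np.random.default_rng(0)
d_model, h, n = 512, 8, 10
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_batched(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (10, 512)
```

On a GPU, the batched matrix multiplications over the head axis are exactly the parallelism described above.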
