Multi-Head Self-Attention

Learn how transformers capture multiple relationships in parallel.

We saw how self-attention finds relationships between words in a sequence — for example, linking “love” more strongly with “I” and “you” than with “Hello.”

But here’s the thing: a single self-attention operation computes just one set of attention weights, so it tends to capture one kind of relationship at a time. Language is richer than that.
A sentence might have:

  • Grammatical dependencies (“I” → “love”)

  • Semantic connections (“love” ↔ “you”)

  • Positional cues (“Hello” at the start indicates a greeting)

A single attention “head” might latch onto one of these, but we want our model to notice all of them at once.

Why multiple heads?

Multi-head self-attention runs several self-attention operations in parallel, each with its own learnable projection of Queries, Keys, and Values.
Think of it as giving the model multiple sets of eyes — one head might pay attention to subject–verb links, another to nearby words, another to long-distance context.

Here’s the process (a code sketch follows the steps):

  1. Project the input embeddings into multiple smaller spaces — one set for each head.

  2. Apply self-attention independently in each head.

  3. Concatenate the outputs from all heads.

  4. Project back into the original embedding size.
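
To make those four steps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions rather than a production implementation: the function name multi_head_self_attention, the weight names w_q, w_k, w_v, w_o, and the single unbatched sequence are all choices made for this example, not something from the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads  # each head works in a smaller subspace

    # Step 1: project the embeddings, then split into one slice per head.
    # Resulting shape: (num_heads, seq_len, d_k).
    def split_heads(m):
        return m.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Step 2: scaled dot-product self-attention independently in each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, seq_len, seq_len)
    heads = softmax(scores) @ v                       # (num_heads, seq_len, d_k)

    # Step 3: concatenate the head outputs along the feature dimension.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

    # Step 4: project back to the original embedding size.
    return concat @ w_o
```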

Mathematically:

\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O
\]

where each head is:

\[
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V) = \mathrm{softmax}\!\left(\frac{Q W_i^Q \,(K W_i^K)^\top}{\sqrt{d_k}}\right) V W_i^V
\]

Here, $W_i^Q$, $W_i^K$, and $W_i^V$ are the learnable projection matrices that map the input into each head’s smaller Query, Key, and Value spaces, and $W^O$ is the output projection that maps the concatenated heads back to the original embedding size.
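
As a quick check that the shapes in the formula line up ($h$ heads of size $d_k = d_{\text{model}} / h$, concatenated and multiplied by $W^O$), you can run the sketch above with random weights; the sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2   # small, arbitrary sizes

x = rng.normal(size=(seq_len, d_model))  # one sequence of token embeddings
w_q, w_k, w_v, w_o = [rng.normal(size=(d_model, d_model)) for _ in range(4)]

out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (4, 8): same shape as the input, thanks to the final W^O projection
```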