Multi-Head Self-Attention

Learn how transformers capture multiple relationships in parallel.

We saw how self-attention finds relationships between words in a sequence — for example, linking “love” more strongly with “I” and “you” than with “Hello.”

But here’s the thing: a single self-attention operation computes just one set of attention weights, so it tends to capture one kind of relationship at a time. Language is richer than that.
A sentence might have:

  • Grammatical dependencies (“I” → “love”)

  • Semantic connections (“love” ↔ “you”)

  • Positional cues (“Hello” at the start indicates a greeting)

A single attention “head” might latch onto one of these, but we want our model to notice all of them at once.

Why multiple heads?

Multi-head self-attention runs several self-attention operations in parallel, each with its own learnable projection of Queries, Keys, and Values.
Think of it as giving the model multiple sets of eyes — one head might pay attention to subject–verb links, another to nearby words, another to long-distance context.

Here’s the process (a code sketch follows the steps):

  1. Project the input embeddings into multiple smaller spaces — one set for each head.

  2. Apply self-attention independently in each head.

  3. Concatenate the outputs from all heads.

  4. Project back into the original embedding size.
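
To make those four steps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions rather than a production implementation: the function name multi_head_self_attention, the weight names w_q, w_k, w_v, w_o, and the single unbatched sequence are all choices made for this example, not something from the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads  # each head works in a smaller subspace

    # Step 1: project the embeddings, then split into one slice per head.
    # Resulting shape: (num_heads, seq_len, d_k).
    def split_heads(m):
        return m.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Step 2: scaled dot-product self-attention independently in each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, seq_len, seq_len)
    heads = softmax(scores) @ v                       # (num_heads, seq_len, d_k)

    # Step 3: concatenate the head outputs along the feature dimension.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

    # Step 4: project back to the original embedding size.
    return concat @ w_o
```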

Mathematically:

\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O
\]

where each head is:

\[
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V) = \mathrm{softmax}\!\left(\frac{Q W_i^Q \,(K W_i^K)^\top}{\sqrt{d_k}}\right) V W_i^V
\]

Here, $W_i^Q$, $W_i^K$, and $W_i^V$ are the learnable projection matrices that map the input into each head’s smaller Query, Key, and Value spaces, and $W^O$ is the output projection that maps the concatenated heads back to the original embedding size.
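
As a quick check that the shapes in the formula line up ($h$ heads of size $d_k = d_{\text{model}} / h$, concatenated and multiplied by $W^O$), you can run the sketch above with random weights; the sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2   # small, arbitrary sizes

x = rng.normal(size=(seq_len, d_model))  # one sequence of token embeddings
w_q, w_k, w_v, w_o = [rng.normal(size=(d_model, d_model)) for _ in range(4)]

out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (4, 8): same shape as the input, thanks to the final W^O projection
```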