# Multi-Head Self-Attention

Explore how multi-head attention expands upon self-attention.

The idea of self-attention can be expanded to multi-head attention. In essence, we run the attention mechanism several times in parallel.

Each time, we project the Query, Key, and Value matrices into different lower-dimensional spaces with an independent set of weights and compute the attention there. Each individual output is called a "head". The projection is achieved by multiplying each matrix with a separate weight matrix, denoted as ${W}_{i}^{Q}, {W}_{i}^{K}, {W}_{i}^{V} \in R^{d_{model} \times d_{k}}$, where $i$ is the head index.

To keep the computational cost comparable to single-head attention, the dimension of each head is the model dimension divided by the number of heads. Specifically, the vanilla transformer uses $d_{model}=512$ and $h=8$ heads, which gives a per-head dimension of $d_k = d_{model}/h = 64$.
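This dimension arithmetic is easy to verify directly:

```python
d_model = 512        # model (embedding) dimension in the vanilla transformer
h = 8                # number of attention heads
d_k = d_model // h   # per-head dimension
print(d_k)  # 64
```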

With multi-head attention, the model has multiple independent paths (ways) to understand the input.

The heads are then concatenated and transformed using a square weight matrix ${W}^{O} \in R^{d_{model} \times d_{model}}$, since $d_{model}=h d_{k}$.

Putting it all together, we get:

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$

where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$

where again:

${W}_{i}^{Q}, {W}_{i}^{K}, {W}_{i}^{V} \in {R}^{d_{\text{model}} \times d_{k}}$
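The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weights are random for demonstration, and the per-head attention uses the scaled dot-product form from the vanilla transformer, $\text{softmax}(QK^T/\sqrt{d_k})V$:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists of h matrices of shape (d_model, d_k)
    # W_O: output projection of shape (h * d_k, d_model)
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h
X = rng.standard_normal((10, d_model))  # sequence of 10 token embeddings
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))

out = multi_head(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (10, 512)
```

Note how the output has the same shape as the input, which is what allows transformer blocks to be stacked.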

Since the heads are independent of each other, the self-attention computations can be performed in parallel on different workers.
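In practice, rather than looping over heads, implementations typically fuse the $h$ projection matrices into a single $d_{model} \times d_{model}$ matrix and split the result into heads with a reshape, so all heads are computed in one batched operation. A sketch of this trick, with hypothetical names and random weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_batched(X, W_Q, W_K, W_V, W_O, h):
    # W_Q, W_K, W_V, W_O: (d_model, d_model); all h head projections fused
    n, d_model = X.shape
    d_k = d_model // h

    def project(W):
        # Project once, then split into h heads: (h, n, d_k)
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = project(W_Q), project(W_K), project(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    heads = softmax(scores) @ V                       # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O                               # (n, d_model)

rng = np.random.default_rng(0)
d_model, h, n = 512, 8, 10
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_batched(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (10, 512)
```

On a GPU, the batched matrix multiplications over the head axis are exactly the parallelism described above.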
