Multi-Head Attention
Explore the multi-head attention mechanism within the transformer architecture. Learn how the query, key, and value matrices are derived from the encoder and decoder outputs, how attention scores are computed, and how the final attention matrix used for language understanding tasks is produced.
The following figure shows the transformer model with both the encoder and decoder. As we can observe, the multi-head attention sublayer in each decoder receives two inputs: one is from the previous sublayer, masked multi-head attention, and the other is the encoder representation:
Let's represent the encoder representation with R.
How the multi-head attention layer works
Now, let's look into the details and learn how exactly this multi-head attention layer works. The first step in the multi-head attention mechanism is creating the query, key, and value matrices. We learned that we can create the query, key, and value matrices by multiplying the input matrix by the weight matrices. But in this layer, we have two input matrices: one is the encoder representation, R, and the other is the attention matrix, M, obtained from the previous sublayer (masked multi-head attention).
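As a quick refresher on the single-input case, here is a minimal NumPy sketch of creating the query, key, and value matrices by multiplying one input matrix by weight matrices. The sizes and weight names (seq_len, d_model, W_Q, W_K, W_V) are illustrative assumptions, and the weights are random placeholders rather than trained parameters.

```python
import numpy as np

# Illustrative sizes (assumptions): 4 tokens, model dimension 8
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))   # a single input matrix

# Weight matrices (random placeholders standing in for learned parameters)
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

# Query, key, and value matrices are linear projections of the same input
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(Q.shape, K.shape, V.shape)          # (4, 8) each
```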
Computing query, key, and value matrices
We create the query matrix, Q, using the attention matrix, M, obtained from the previous sublayer, and we create the key and value matrices, K and V, using the encoder representation, R.

The query matrix, Q, is created by multiplying the attention matrix, M, by the weight matrix, W^Q. The key and value matrices, K and V, are created by multiplying the encoder representation, R, by the weight matrices W^K and W^V, respectively. This is shown in the following figure: ...
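To make the two-input case concrete, the following sketch derives Q from the decoder-side attention matrix M and derives K and V from the encoder representation R, then computes the attention scores and the resulting attention output. The shapes (src_len, tgt_len, d_model) and the random weights are assumptions for illustration only; a full implementation would also split these projections across multiple heads.

```python
import numpy as np

rng = np.random.default_rng(1)
src_len, tgt_len, d_model = 6, 4, 8      # assumed source length, target length, model dimension

R = rng.normal(size=(src_len, d_model))  # encoder representation
M = rng.normal(size=(tgt_len, d_model))  # output of the masked multi-head attention sublayer

W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q = M @ W_Q                              # queries come from the decoder side (M)
K = R @ W_K                              # keys come from the encoder representation (R)
V = R @ W_V                              # values come from the encoder representation (R)

# Scaled dot-product attention: scores, row-wise softmax, weighted sum of values
scores = Q @ K.T / np.sqrt(d_model)      # shape (tgt_len, src_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attention = weights @ V                  # shape (tgt_len, d_model)

print(attention.shape)
```

Because the queries come from the decoder and the keys and values come from the encoder, each target position attends over all source positions, which is how the decoder conditions its output on the encoder representation.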