Multi-Head Attention
Explore the multi-head attention mechanism within the transformer architecture. Learn how the query, key, and value matrices are derived from the encoder and decoder outputs, how attention scores are computed, and how the final attention matrix used for language understanding tasks is produced.
The following figure shows the transformer model with both the encoder and decoder. As we can observe, the multi-head attention sublayer in each decoder receives two inputs: one is from the previous sublayer, masked multi-head attention, and the other is the encoder representation:
Let's represent the encoder representation with R.
How the multi-head attention layer works
Now, let's look into the details and learn how exactly this multi-head attention layer works. The first step in the multi-head attention mechanism is creating the query, key, and value matrices. We learned that we can create the query, key, and value matrices by multiplying the input matrix by the weight matrices. But in this layer, we have two input matrices: one is the encoder representation, R, and the other is the attention matrix, M, obtained from the previous sublayer (masked multi-head attention).
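As a quick refresher on the single-input case, here is a minimal NumPy sketch of creating the query, key, and value matrices by multiplying one input matrix by weight matrices. The sizes and weight names (seq_len, d_model, W_Q, W_K, W_V) are illustrative assumptions, and the weights are random placeholders rather than trained parameters.

```python
import numpy as np

# Illustrative sizes (assumptions): 4 tokens, model dimension 8
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))   # a single input matrix

# Weight matrices (random placeholders standing in for learned parameters)
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

# Query, key, and value matrices are linear projections of the same input
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(Q.shape, K.shape, V.shape)          # (4, 8) each
```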
Computing query, key, and value matrices
We create the query matrix, Q, using the attention matrix, M, obtained from the previous sublayer, and we create the key and value matrices, K and V, using the encoder representation, R.

The query matrix, Q, is created by multiplying the attention matrix, M, by the weight matrix, W^Q. The key and value matrices, K and V, are created by multiplying the encoder representation, R, by the weight matrices W^K and W^V, respectively. This is shown in the following figure: ...
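To make the two-input case concrete, the following sketch derives Q from the decoder-side attention matrix M and derives K and V from the encoder representation R, then computes the attention scores and the resulting attention output. The shapes (src_len, tgt_len, d_model) and the random weights are assumptions for illustration only; a full implementation would also split these projections across multiple heads.

```python
import numpy as np

rng = np.random.default_rng(1)
src_len, tgt_len, d_model = 6, 4, 8      # assumed source length, target length, model dimension

R = rng.normal(size=(src_len, d_model))  # encoder representation
M = rng.normal(size=(tgt_len, d_model))  # output of the masked multi-head attention sublayer

W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q = M @ W_Q                              # queries come from the decoder side (M)
K = R @ W_K                              # keys come from the encoder representation (R)
V = R @ W_V                              # values come from the encoder representation (R)

# Scaled dot-product attention: scores, row-wise softmax, weighted sum of values
scores = Q @ K.T / np.sqrt(d_model)      # shape (tgt_len, src_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attention = weights @ V                  # shape (tgt_len, d_model)

print(attention.shape)
```

Because the queries come from the decoder and the keys and values come from the encoder, each target position attends over all source positions, which is how the decoder conditions its output on the encoder representation.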