Masked Multi-Head Attention

Explore the masked multi-head attention mechanism in transformer decoders used for sequence-to-sequence tasks such as language translation. Understand how masking prevents the model from attending to future tokens during training, so that it learns to generate the target sequence step by step, just as it does at test time. Learn about query, key, and value computations, how the mask is implemented, and how attention scores are calculated and combined.

In our English-to-French translation task, say our training dataset looks like the one shown here:

A sample training set:

| Source sentence | Target sentence |
| --- | --- |
| I am good | Je vais bien |
| Good morning | Bonjour |
| Thank you very much | Merci beaucoup |

From the preceding dataset, we can see that each example pairs a source sentence with a target sentence. We saw earlier that the decoder predicts the target sentence word by word, one time step at a time, and that this happens only during testing.

During training, since we already have the correct target sentence, we can feed the whole target sentence as input to the decoder, with one small modification. We learned that the decoder takes the <sos> token as its first input and, at every time step, appends the previously predicted word to its input to predict the next word, continuing until the <eos> token is reached. So, we can simply add the <sos> token to the beginning of our target sentence and send that as the input to the decoder.
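To make this shifting concrete, here is a minimal sketch in plain Python. Splitting on whitespace and using the literal strings `<sos>` and `<eos>` are simplifying assumptions; a real pipeline would use a proper tokenizer and vocabulary indices.

```python
# Teacher forcing: the decoder input is the target shifted right with <sos>,
# and the labels are the target followed by <eos>.

target_sentence = "Je vais bien"
target_tokens = target_sentence.split()          # ['Je', 'vais', 'bien']

decoder_input = ["<sos>"] + target_tokens        # fed to the decoder
decoder_labels = target_tokens + ["<eos>"]       # what the decoder should predict

print(decoder_input)    # ['<sos>', 'Je', 'vais', 'bien']
print(decoder_labels)   # ['Je', 'vais', 'bien', '<eos>']
```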

Say we are translating the English sentence 'I am good' into the French sentence 'Je vais bien'. We add the <sos> token to the beginning of the target sentence and send <sos> Je vais bien as the input to the decoder, and the decoder then predicts the output as Je vais bien <eos>.
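The sketch below shows, under simplifying assumptions, how the look-ahead mask is applied inside scaled dot-product attention. It uses NumPy, a single head, and random matrices standing in for the query, key, and value inputs; in an actual decoder, Q, K, and V are computed per head from learned projection matrices, and the masked value is effectively negative infinity.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    Q, K, V: arrays of shape (seq_len, d_k). Illustrative only.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)

    # Mask future positions: row i may only attend to columns <= i.
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)            # large negative -> weight ~ 0

    # Row-wise softmax over the masked scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # attention output

# Toy example: 4 target positions (<sos>, Je, vais, bien), d_k = 8
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = masked_attention(Q, K, V)
print(out.shape)   # (4, 8)
```

Because of the mask, the attention weights for the position of 'Je' ignore 'vais' and 'bien', so during training each position can only use the words that would already have been generated at test time.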