The Transformer's Decoder
Explore the components of the Transformer's decoder, including masked multi-head self-attention, encoder-decoder attention, and residual connections. This lesson explains how sequential token prediction works in NLP tasks such as machine translation, and how the decoder combines the input and output sentences to produce accurate predictions.
The decoder consists of all the aforementioned components plus two novel ones. As before:
- The output sequence is fed in its entirety, and word embeddings are computed.
- Positional encoding is again applied.
- The vectors are passed to the first decoder block.
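The embedding and positional-encoding steps above can be sketched in plain NumPy. The vocabulary size, model dimension, and token IDs below are illustrative assumptions, not values from the lesson; in a real model the embedding matrix is learned.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sines at even indices, cosines at odd."""
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1)
    i = np.arange(d_model)[None, :]     # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

vocab_size, d_model = 1000, 64                    # toy sizes (assumption)
embedding = np.random.randn(vocab_size, d_model)  # learned in practice
tokens = np.array([5, 42, 7])                     # toy output-token IDs

# Look up embeddings for the whole output sequence, then add positions.
x = embedding[tokens] + positional_encoding(len(tokens), d_model)
print(x.shape)  # (3, 64)
```

The resulting `x` is what enters the first decoder block: one vector per output token, carrying both word identity and position.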
Each decoder block includes:
- A masked multi-head self-attention layer
- A residual connection followed by a normalization layer
- A new multi-head attention layer (known as encoder-decoder attention)
- A second residual connection and normalization layer ...
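The masking in the first sub-layer is what enforces sequential prediction: each position may attend only to itself and earlier positions. Here is a minimal single-head sketch in NumPy with toy dimensions (all sizes and weights are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    """Single-head masked self-attention: positions attend only backwards."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)             # (seq, seq)
    mask = np.triu(np.ones_like(scores), k=1)   # 1s above the diagonal = future
    scores = np.where(mask == 1, -1e9, scores)  # block attention to the future
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq, d = 4, 8                                   # toy sizes (assumption)
x = rng.standard_normal((seq, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, w = masked_self_attention(x, Wq, Wk, Wv)

# Attention weights above the diagonal are (numerically) zero:
print(np.allclose(np.triu(w, k=1), 0.0))  # True
```

Encoder-decoder attention has the same structure, except the queries come from the decoder while the keys and values come from the encoder's output, and no causal mask is needed there.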