The Transformer's Decoder

Formulate the Transformer's decoder and learn about masked multi-head self-attention.

The decoder consists of all the aforementioned components plus two novel ones. As before:

  1. The output (target) sequence is fed in its entirety, and its word embeddings are computed.

  2. Positional encoding is again applied.

  3. The vectors are passed to the first decoder block (a sketch of these three steps follows this list).
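
Below is a minimal PyTorch sketch of these three steps, assuming the sinusoidal positional encodings of the original paper; the sizes (d_model = 512, a 10,000-word vocabulary, and a sequence length of 8) are purely illustrative and not taken from this lesson.

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 512, 10_000, 8   # illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)   # step 1: word embeddings

def positional_encoding(seq_len, d_model):
    """Standard sinusoidal positional encodings, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10_000.0) / d_model))                     # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

tokens = torch.randint(0, vocab_size, (1, seq_len))            # the full output sequence
x = embedding(tokens) + positional_encoding(seq_len, d_model)  # step 2: add positional encoding
# step 3: x is now what the first decoder block receives
```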

Each decoder block includes:

  1. A masked multi-head self-attention layer

  2. A residual connection followed by a normalization layer (Add & Norm)

  3. A new multi-head attention layer (known as encoder-decoder attention)

  4. A second residual connection and normalization layer

  5. A feed-forward (linear) layer with a third residual connection and normalization layer (a sketch of the full block follows this list)
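
The sketch below shows how these five pieces might fit together in PyTorch. The class name DecoderBlock, the feed-forward width d_ff = 2048, and the use of nn.MultiheadAttention are illustrative assumptions rather than the lesson's own code; the causal (upper-triangular) mask is what makes the first attention layer "masked", since it prevents each position from attending to later positions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention, encoder-decoder attention,
    and a feed-forward layer, each wrapped in an Add & Norm step."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out):
        # 1-2: masked self-attention, then residual connection + normalization
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        attn, _ = self.self_attn(x, x, x, attn_mask=causal)   # future positions are hidden
        x = self.norm1(x + attn)
        # 3-4: encoder-decoder attention (queries from the decoder, keys/values from the encoder)
        attn, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn)
        # 5: feed-forward layer with the third residual connection and normalization
        return self.norm3(x + self.ff(x))

x = torch.randn(1, 8, 512)         # decoder input: embeddings + positional encoding
enc_out = torch.randn(1, 8, 512)   # stand-in for the encoder stack's output
y = DecoderBlock()(x, enc_out)     # shape (1, 8, 512), same as x
```

Note that the residual addition happens before each normalization (the post-norm arrangement of the original paper); only the decoder's self-attention receives the causal mask, while the encoder-decoder attention may look at every encoder position.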

The decoder block is likewise repeated N=6 times. The final output is passed through a last linear layer, and the output probabilities are computed with the standard softmax function.
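
A small self-contained sketch of this final step; here dec_out merely stands in for the output of the sixth decoder block, and the sizes are again illustrative.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10_000             # illustrative sizes
dec_out = torch.randn(1, 8, d_model)          # stand-in for the output of the 6th decoder block

final_linear = nn.Linear(d_model, vocab_size)          # final linear projection onto the vocabulary
probs = torch.softmax(final_linear(dec_out), dim=-1)   # one probability distribution per position
print(probs.shape)                            # torch.Size([1, 8, 10000])
```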
