
The Decoder: Generating Output

Explore the function of transformer decoders in generating text output. Understand the role of masked self-attention in maintaining causal flow, how cross-attention integrates encoder context, and how the autoregressive loop produces tokens sequentially. Gain insights into decoder architecture and optimizations like KV-caching for efficient generation.

The encoder has done its job. It processed the full input sequence and produced rich, bidirectional contextualized representations for every token. Now the decoder faces a fundamentally different challenge. It must consume those encoder representations and generate output one token at a time, never peeking ahead at tokens it has not yet produced. Think of it like a simultaneous interpreter translating a speech. The interpreter hears the full source sentence (the encoder's job) but must produce the translation word by word, left to right, committing to each word before knowing which words will follow.

Consider a concrete use case in machine translation. An English sentence like “The cat was tired” enters the encoder, which builds contextual representations for all four tokens simultaneously. The decoder then generates the French translation “Le chat était fatigué” one word at a time. When it is producing “était,” it can reference “Le” and “chat” but must not see “fatigué,” because that word has not been generated yet.

This lesson covers the three mechanisms that make this possible. First, masked self-attention prevents the decoder from looking at future tokens. Second, cross-attention connects the decoder to the encoder's output, grounding each generated token in the input sequence. Third, the autoregressive generation loop produces tokens one at a time using the full decoder stack. Understanding these mechanisms is essential for grasping how modern LLMs like GPT generate text.
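To see how these pieces fit together before we examine each one, here is a minimal sketch of the generation loop in Python. The `decoder`, `bos_id`, and `eos_id` names are illustrative placeholders, not part of any particular library, and real systems replace the greedy `argmax` with more sophisticated sampling:

```python
def generate(decoder, encoder_output, bos_id, eos_id, max_len=50):
    """Autoregressive decoding sketch: grow the output one token at a time."""
    tokens = [bos_id]                             # start from a begin-of-sequence token
    for _ in range(max_len):
        logits = decoder(tokens, encoder_output)  # scores over the vocabulary, per position
        next_id = int(logits[-1].argmax())        # greedy choice for the next token
        tokens.append(next_id)                    # feed it back in on the next step
        if next_id == eos_id:                     # stop once end-of-sequence is produced
            break
    return tokens
```

Note that every generated token is appended to the input for the next step; this feedback is what makes the process autoregressive, and it is why each step must be barred from seeing positions that do not exist yet.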

Masked self-attention

The decoder’s first sub-layer is self-attention, but unlike the encoder’s unmasked version from the previous lesson, it applies a causal mask: a triangular mask applied to the attention scores that prevents each token position from attending to any future position, enforcing a left-to-right information flow. For each position ...
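As a concrete sketch of this mask (written in PyTorch, which the lesson itself does not use; the function name and tensor shapes are assumptions for illustration), the idea is to set every attention score above the diagonal to negative infinity before the softmax, so those future positions receive exactly zero weight:

```python
import torch

def masked_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask.

    q, k, v: tensors of shape (seq_len, d_k); position i may only
    attend to positions 0..i.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)

    # True above the diagonal marks the "future" positions to hide.
    seq_len = scores.size(-1)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

    # -inf scores become zero attention weight after the softmax.
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```

In the translation example above, this is what stops the position generating “était” from placing any attention weight on “fatigué.”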