
Decoding Strategies and Inference Optimization for LLMs

Understand how large language models generate text step-by-step through different decoding strategies such as greedy, beam search, and sampling. Learn the role of KV-caching and speculative decoding in accelerating autoregressive generation. Discover how test-time compute changes scaling and impacts performance in real-world AI systems, preparing you to explain these concepts confidently in AI engineer interviews.

Attention and architecture get the spotlight, but inference is where models live or die in production. A candidate who can explain the attention mechanism but cannot articulate how autoregressive generation actually works, why it's slow, or how modern systems make it fast will raise red flags for any applied AI role. Decoding strategy, KV-caching, and speculative decoding are not optional knowledge. They determine latency, cost, and quality in every production LLM deployment.

Generating text is fundamentally different from running a classifier. A classifier does one forward pass and outputs a label. An autoregressive language model does N forward passes to generate N tokens, each pass conditioned on everything before it. Every optimization in this lesson exists because of that sequential dependency.

How does a language model actually generate text?

At each generation step, the model takes the full sequence of tokens so far, runs a forward pass through all its layers, and outputs a probability distribution over the vocabulary for the next token. That distribution contains one probability per vocabulary entry, often 50,000 to 100,000+ numbers. Decoding is the decision: which token do you pick?
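To make the step concrete, here is a minimal sketch of one decoding step with a hypothetical toy "model": the `forward` function and `VOCAB` below are stand-ins invented for illustration, not any real API. The real forward pass runs every transformer layer; the shape of the output is what matters here, one logit per vocabulary entry, turned into probabilities by a softmax.

```python
import math

# Hypothetical toy vocabulary; real models have 50,000-100,000+ entries.
VOCAB = ["the", "cat", "sat", "mat", "<eos>"]

def forward(tokens):
    # Stand-in for the full forward pass: returns one logit per
    # vocabulary entry. (Toy rule: favor tokens not yet used.)
    return [1.0 if v not in tokens else -1.0 for v in VOCAB]

def next_token_distribution(tokens):
    logits = forward(tokens)
    # Softmax: subtract the max for numerical stability, exponentiate,
    # normalize. The result is one probability per vocabulary entry.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = next_token_distribution(["the", "cat"])
```

Decoding is then a rule for picking one token from `probs`; everything that follows in this lesson is a different choice of that rule.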

The naive choice is greedy decoding: always pick the highest-probability token. It is fast and deterministic, but it produces bland, repetitive text. Because the model always commits to the locally best option, it can paint itself into a corner where the globally better sentence required a slightly riskier first word. It also degenerates into repetition loops: once a phrase appears, repeating it often becomes the locally most probable continuation, which reinforces itself with every step.
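A minimal sketch of the greedy loop, again with a hypothetical toy scoring function (everything here except the argmax-and-append pattern is invented for illustration). The toy logits deliberately favor repeating the last token, which is enough to show how greedy decoding walks into a repetition loop:

```python
# Hypothetical toy vocabulary and scoring, just to show the loop shape.
VOCAB = ["a", "b", "c", "<eos>"]

def next_token_logits(tokens):
    # Stand-in for a full forward pass: biased toward repeating the
    # last token, which is exactly how repetition loops arise.
    last = tokens[-1] if tokens else None
    return [2.0 if v == last else 1.0 for v in VOCAB]

def greedy_decode(prompt, max_new_tokens=5):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # Greedy: always commit to the single highest-scoring token.
        tokens.append(VOCAB[logits.index(max(logits))])
        if tokens[-1] == "<eos>":
            break
    return tokens

print(greedy_decode(["a", "b"]))  # keeps appending "b": the repetition trap
```

Once "b" is the last token it gets the highest logit, so greedy picks it again, and the loop never escapes.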

What is beam search and why does sampling usually work better for chat models?

Beam search is greedy decoding with k parallel candidates. At each step, instead of committing to one token, you keep the top k sequences by cumulative log-probability. After N steps, you return the highest-scoring completed sequence. This is better than pure greedy for structured tasks like translation or summarization, where the correct output is narrow and well-defined.

For open-ended generation, beam search produces text that is too safe. Because it selects for high-probability sequences, it gravitates toward the most average, expected continuation. All k beams often converge on the same boring phrase.
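The bookkeeping behind beam search is simple enough to sketch. The next-token distribution below is a hypothetical toy invented for illustration; the part that matters is the beam loop: expand every beam by every token, score candidates by cumulative log-probability, and keep only the top k:

```python
import math

# Hypothetical toy vocabulary and next-token scores, for illustration only.
VOCAB = ["x", "y", "<eos>"]

def next_token_logprobs(tokens):
    # Stand-in for a forward pass: arbitrary toy logits, then log-softmax.
    logits = [len(tokens) % 2 + 1.0, 1.5, 0.5]
    m = max(logits)
    z = math.log(sum(math.exp(l - m) for l in logits))
    return [l - m - z for l in logits]

def beam_search(prompt, k=2, steps=3):
    # Each beam is (tokens, cumulative log-probability).
    beams = [(list(prompt), 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, lp in zip(VOCAB, next_token_logprobs(tokens)):
                candidates.append((tokens + [tok], score + lp))
        # Keep the k highest-scoring sequences, not just one.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]
```

With k = 1 this collapses to greedy decoding; larger k trades compute for a wider search, and the convergence problem above is visible here too: high-scoring beams tend to share long prefixes.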

Sampling fixes this by introducing controlled randomness. There are three knobs:

  • Temperature T scales the logits before softmax: p_i = exp(logits_i / T) / Σ_j exp(logits_j / T) ...