Sampling Strategies
Explore how sampling strategies influence token selection in generative AI. Understand beam search, top-k, and nucleus sampling, the trade-offs they make between determinism and diversity, and their practical use cases for optimizing text generation across applications.
When a language model generates text, it does not directly produce a single next word. Instead, at each step, it outputs a probability distribution over all tokens in its vocabulary. This distribution reflects how likely the model believes each token is, given the text generated so far.
For example, consider the prompt:
“The capital of France is”
After processing this prompt, the model might produce a probability distribution like the following:

| Token | Probability |
| --- | --- |
| Paris | 0.72 |
| Lyon | 0.12 |
| Marseille | 0.07 |
| London | 0.03 |
The model has not committed to a single answer. It has expressed uncertainty by assigning probabilities to multiple possibilities. The question is: how do we turn this distribution into an actual token choice?
This decision is handled by the sampling strategy.
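Concretely, the model emits one raw score (a logit) per vocabulary token, and a softmax turns those scores into the probability distribution shown above. Here is a minimal sketch, assuming NumPy; the token list and logit values are illustrative, not real model output:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw model scores (logits) into a probability distribution."""
    shifted = logits - logits.max()   # subtract max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical logits for four candidate tokens (made-up numbers).
tokens = ["Paris", "Lyon", "Marseille", "London"]
logits = np.array([4.0, 2.3, 1.7, 0.9])
for token, p in zip(tokens, softmax(logits)):
    print(f"{token}: {p:.2f}")
```

Every decoding strategy in this lesson starts from a distribution like this one; they differ only in how they pick a token from it.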
Greedy decoding
The most straightforward approach is greedy decoding. At each step, the model selects the token with the highest probability.
In the example above, greedy decoding would always select: “Paris.”
Greedy decoding is deterministic. Given the same prompt and model, it will always produce the same output. While this can be useful for debugging or tasks where variability is undesirable, it has important limitations.
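As a sketch, greedy decoding reduces to a single argmax over the distribution. The numbers below are taken from the example table above:

```python
# Toy next-token distribution for "The capital of France is".
probs = {"Paris": 0.72, "Lyon": 0.12, "Marseille": 0.07, "London": 0.03}

def greedy_decode(probs: dict[str, float]) -> str:
    """Always pick the single most probable token."""
    return max(probs, key=probs.get)

print(greedy_decode(probs))  # "Paris", on every run
```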
Why greedy decoding is not enough
Greedy decoding works well for short, factual completions, but it often performs poorly for longer or more open-ended generation.
Consider a storytelling prompt:
“Once upon a time, there was a brave knight who”
At each step, the most probable token is often a common, safe continuation. Over many steps, this leads to outputs that are repetitive, generic, or overly cautious. The model tends to follow high-probability paths that quickly converge on predictable phrasing.
For example, greedy decoding may repeatedly favor tokens like:
“was”
“had”
“the”
This can result in text that feels dull or stuck in loops.
Controlled randomness
Sampling strategies address this issue by introducing controlled randomness into the selection of tokens. Instead of always choosing the most probable token, the model is allowed to sample from the distribution in a structured way.
Returning to the earlier example:
| Token | Probability |
| --- | --- |
| Paris | 0.72 |
| Lyon | 0.12 |
| Marseille | 0.07 |
| London | 0.03 |
A sampling-based approach might still choose “Paris” most of the time, but it allows lower-probability tokens like “Lyon” or “Marseille” to be selected occasionally. This variability becomes especially important when generating longer sequences, where early choices strongly influence later ones.
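A minimal sketch of this behavior, sampling directly from the full distribution (real decoders typically truncate or reshape the distribution first, as the later sections show):

```python
import random
from collections import Counter

probs = {"Paris": 0.72, "Lyon": 0.12, "Marseille": 0.07, "London": 0.03}

def sample_token(probs: dict[str, float]) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

# Over many draws, token frequencies track the distribution:
# roughly 72% "Paris", 12% "Lyon", and so on.
print(Counter(sample_token(probs) for _ in range(10_000)))
```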
All sampling strategies navigate the same fundamental trade-off:
Determinism vs. diversity
Coherence vs. creativity
Safety vs. exploration
Different strategies resolve this trade-off in different ways. Some prioritize the most likely sequences, while others deliberately allow for variation. In the following sections, we will examine three commonly used approaches: beam search, top-k sampling, and nucleus (top-p) sampling, and see how each one strikes this balance differently.
Beam search
Beam search is a decoding strategy that produces more reliable, coherent outputs by simultaneously exploring multiple possible continuations. Instead of committing to a single token choice at each step, beam search keeps track of several promising partial sequences and expands them in parallel.
The key idea is simple: do not put all your probability mass on a single path too early.
The core idea behind beam search
Beam search maintains a fixed number of candidate sequences, called the beam width, usually denoted B.
At each generation step:
Every sequence in the beam is expanded by one token.
All expanded sequences are scored using their cumulative probability.
Only the top B sequences are kept.
The rest are discarded.
This process repeats ...
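Here is a minimal sketch of that loop. The `step_fn` toy model is a stand-in assumption, returning the same handful of continuations at every step, and scores are accumulated as log-probabilities, the standard numerically stable way to multiply probabilities:

```python
import math

def beam_search(step_fn, beam_width: int = 3, max_steps: int = 5):
    """step_fn(seq) -> list of (token, prob) continuations for seq."""
    beams = [(0.0, [])]  # (cumulative log-probability, token sequence)
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            for token, p in step_fn(seq):  # expand each beam by one token
                candidates.append((score + math.log(p), seq + [token]))
        # Keep only the top `beam_width` sequences; discard the rest.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical toy model: the same three continuations at every step.
def toy_step(seq):
    return [("knight", 0.5), ("dragon", 0.3), ("castle", 0.2)]

for score, seq in beam_search(toy_step, beam_width=2, max_steps=3):
    print(f"{math.exp(score):.3f}  {' '.join(seq)}")
```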