Beam Search

Learn how beam search works in the context of large language model (LLM) decoding, why it’s a widely used strategy for generating coherent text, and how to identify and fix its limitations.

“What is beam search in the context of LLM decoding, and its limitations?” This is a commonly asked question because decoding strategies are crucial for generating high-quality text with large language models (LLMs). Top AI companies—from research-focused labs like OpenAI and Anthropic to industry teams at Meta AI, Hugging Face, and startups like Mistral AI—care about how their models produce text. Interviewers at these organizations want to ensure you understand not just how to train models, but also how to guide a model’s output. Decoding algorithms like beam search directly impact the coherence, diversity, and quality of an LLM’s generated text, making it a favorite topic in technical interviews.

You’ll likely encounter this question when interviewing for an AI/ML engineer role, where you may need to fine-tune or deploy language models. By asking about beam search, the interviewer tests a few things: your grasp of LLM internals, your understanding of sequence generation trade-offs, and whether you know how to improve or modify decoding strategies for better results. In other words, they want to see if you can reason about why an LLM might generate dull or repetitive answers and how to fix that.

Before diving into beam search, it helps to understand what “search” means in LLM decoding. We’ll cover this in detail in the next section, including how greedy, beam, and sampling strategies differ in exploring the space of possible sequences. Let’s get started!

What is search in LLMs?

When an LLM generates text, it does so one token at a time, and at each step it outputs a probability distribution over all possible next tokens. Deciding which token to pick from that distribution is decoding, or “search,” because we are effectively searching for the best sequence the model can produce. You can imagine the generation process as a tree of possibilities: the root is the start of your sequence, each tree level represents the next token position, and each branch is a possible token. The job of a decoding strategy is to traverse this tree in a smart way to build a complete sequence.
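For intuition, here’s roughly what a single decoding step produces. The prefix, tokens, and probabilities below are invented for illustration; a real LLM assigns a probability to every token in a vocabulary of tens of thousands:

```python
# Hypothetical output of one decoding step: given the prefix so far, the
# model assigns a probability to every possible next token (shown here
# for a toy 4-token vocabulary).
prefix = ["The", "cat"]
next_token_probs = {"sat": 0.45, "ran": 0.25, "is": 0.20, "the": 0.10}
assert abs(sum(next_token_probs.values()) - 1.0) < 1e-9  # a valid distribution
```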

Greedy search is the most straightforward strategy: at each step, take the single highest-probability token. Greedy search is fast and cheap, but it’s shortsighted. Picking the locally most likely token can lead to a suboptimal overall sequence. For example, the second-best token at one step might lead to a much better sentence later. Greedy search “commits” to a path without considering alternatives, much like a person who always takes the seemingly best immediate move in a game, potentially missing a winning long-term strategy. Technically, greedy decoding doesn’t always produce the globally most likely sequence under the model’s probability distribution. It often results in generic or repetitive text because the model keeps choosing the obvious high-probability continuations.
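To make this concrete, here is a minimal greedy-decoding sketch. The `TOY_LM` table and `next_token_probs` helper are hypothetical stand-ins for a real model’s next-token distribution; the tokens and probabilities are invented for illustration:

```python
# A toy "language model": maps a prefix (tuple of tokens) to a
# probability distribution over the next token.
TOY_LM = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.3, "<EOS>": 0.2},
    ("the", "cat"): {"sat": 0.8, "<EOS>": 0.2},
}

def next_token_probs(prefix):
    # Fall back to ending the sequence if the prefix is unknown.
    return TOY_LM.get(prefix, {"<EOS>": 1.0})

def greedy_decode(max_len=10):
    tokens = ()
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        # Greedy choice: commit to the single highest-probability token.
        best = max(probs, key=probs.get)
        tokens += (best,)
        if best == "<EOS>":
            break
    return tokens

print(greedy_decode())  # ('the', 'cat', 'sat', '<EOS>')
```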

At the other extreme, one could attempt exhaustive search, evaluating every possible sequence and picking the one with the highest overall probability. But this is utterly infeasible for language generation: with a vocabulary of 50,000 tokens and a sequence of just 20 tokens, there are 50,000^20 ≈ 10^94 possible sequences. Exhaustive search is exponentially expensive in sequence length. Practical decoding therefore requires a heuristic search that balances quality and computational cost. This is where beam search comes in: an algorithm that lies between greedy and exhaustive search.
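You can verify the scale of that number directly:

```python
# Counting every possible 20-token sequence over a 50,000-token vocabulary.
vocab_size = 50_000
seq_len = 20
num_sequences = vocab_size ** seq_len  # exact integer in Python
print(f"{num_sequences:.2e}")  # ~9.54e+93 possible sequences
```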

What is beam search?

Beam search is a popular decoding algorithm that improves on greedy search by keeping multiple possibilities open at each step. The core idea is that instead of following just one path down the generation tree, we maintain K parallel candidates (where K is called the beam width). Rather than committing to the single best next token at each generation step, we look at the top K tokens and branch out, temporarily exploring up to K different continuations for each partial sequence. We then pick the K best resulting sequences (by overall probability) and discard the rest. In effect, beam search advances a “beam” of the most promising hypotheses at each step, hence the name.

Here’s how beam search works step by step:

  1. Start with the special beginning-of-sequence token (let’s call it <SOS>). This is our initial sequence. The score (overall probability) of this sequence is 1 (or log score 0, since log(1)=0).

  2. Use the model to get probabilities for the next token. Instead of picking just one token, pick the top K tokens. This results in K one-token sequences (each sequence is <SOS> followed by one of these top tokens). We now have K hypotheses in our beam.

  3. For each hypothesis in the beam, ask the model for the probabilities of the next token given that sequence. Each hypothesis can branch out into up to K new candidates, giving as many as K×K candidate sequences in total. Score each candidate by its overall (log) probability, keep the K best, and discard the rest. Repeat this expand-and-prune step until the hypotheses finish (e.g., emit an end-of-sequence token) or reach a maximum length; a runnable sketch of this loop follows these steps.
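Putting the steps together, here is a minimal beam-search sketch. As before, `TOY_LM` and `next_token_probs` are hypothetical stand-ins for a real model’s next-token distribution, and the loop works in log probabilities so scores add instead of multiply:

```python
import math

# Toy "language model": maps a prefix (tuple of tokens) to a probability
# distribution over the next token. All values are invented for illustration.
TOY_LM = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.3, "<EOS>": 0.2},
    ("a",): {"dog": 0.7, "cat": 0.2, "<EOS>": 0.1},
}

def next_token_probs(prefix):
    # Fall back to ending the sequence if the prefix is unknown.
    return TOY_LM.get(prefix, {"<EOS>": 1.0})

def beam_search(beam_width=2, max_len=3):
    # Each hypothesis is (token_tuple, cumulative_log_prob).
    # Start: empty prefix (standing in for <SOS>), log score log(1) = 0.
    beam = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beam:
            if tokens and tokens[-1] == "<EOS>":
                candidates.append((tokens, score))  # keep finished hypotheses
                continue
            # Branch this hypothesis into every possible continuation.
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        # Prune: keep only the beam_width highest-scoring hypotheses.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam

for tokens, score in beam_search():
    print(tokens, f"logP={score:.3f}")
```

Note that with `beam_width=1` this reduces to greedy search; larger widths trade extra compute for a broader exploration of the generation tree.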