Sampling finds A good sequence. Beam search aims for THE highest-probability sequence. Used in machine translation, code completion, and structured outputs. Slower than greedy but more thorough. The math: maintain top-K partial sequences instead of one.

Advertisement

The procedure

def beam_search(model, prompt, K=5, max_len=100):
    beams = [(prompt, 0.0)]   # (sequence, log_prob)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            logits = model(seq)[-1]
            log_probs = log_softmax(logits)
            for token, tp in topK(log_probs, K):
                candidates.append((seq+[token], lp+tp))
        beams = topK(candidates, K, key=lambda x: x[1])
        if all(seq[-1] == EOS for seq, _ in beams): break
    return beams[0][0]

At each step: extend each beam by K candidates. Keep top K total. Final answer: the highest-scoring complete sequence. K is the beam width — typical 4-10.

Log probabilities not probabilities

Probabilities multiply; log probabilities add. Adding is numerically stable; multiplying tiny floats vanishes. Always work in log-probability space for any search over sequences.

Advertisement

Length normalization

# Without normalization: short sequences win unfairly
# (more steps = more negative log probs added)
#
# Standard: score = log_prob / length^α
# α typically 0.6-0.7 for translation

Without length penalty, beam search prefers short outputs. Common fix: normalize by length raised to a power. Less needed for chat where EOS naturally ends things; critical for translation, summarization.

Speed cost

Beam K: K× slower than greedy (must batch K hypotheses through model). Memory: K× KV caches. Skips sampling diversity. Right for: tasks where 'the best' is well-defined. Wrong for: chat (loses creativity), free-form generation.

Modern picture

Beam search dominated 2014-2020 for translation, summarization. Now: sampling + temperature is preferred for most LLM use cases. Beam search remains for: machine translation production (NMT), structured code generation (some implementations), constrained generation.

Beam: top-K parallel sequences. Adds; length-normalizes. Slower but more thorough than sampling. Niche use in modern LLMs.