Sampling Strategies — Greedy, Top-K, Top-P, Temperature

Once the model produces logits, you need to pick a token. The choice between greedy, top-k, top-p, and temperature shapes output style. The math is short; the practical differences are large.

Greedy

token = argmax(logits)

Pick the token with the highest probability. Deterministic. Fast. Tends to be repetitive — once a sequence enters a loop, it stays there. Right for: deterministic tasks (classification, extraction). Wrong for: creative writing.
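
As a minimal sketch of the rule above in plain Python: the function name greedy_token and the example logits are made up for illustration. Because softmax preserves ordering, taking the argmax of the raw logits selects the same token as taking the argmax of the probabilities, so no softmax is needed.

```python
def greedy_token(logits):
    """Return the index of the largest logit (ties go to the lowest index)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical scores for a 4-token vocabulary.
logits = [1.2, 3.4, 0.5, 3.1]
token = greedy_token(logits)
```

Calling it twice on the same logits always returns the same index, which is the determinism the section describes.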

Sampling with temperature

probs = softmax(logits / T)
token = sample(probs)

# T < 1: peaked, near-greedy
# T = 1: model's learned distribution
# T > 1: flatter, more diverse

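This samples from the softmax distribution instead of taking its maximum, and the temperature T reshapes that distribution before sampling. Many chat deployments default to roughly T=0.7-0.8. As T approaches 0 the distribution collapses onto the argmax, so T=0 is equivalent to greedy; implementations special-case T=0 as argmax to avoid dividing by zero.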
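
The effect of T can be sketched with the standard library alone. The helper names (softmax, sample_with_temperature) and the example logits are illustrative assumptions, not any particular library's API; the T == 0 branch mirrors the argmax special case mentioned above.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(logits, T, rng=random):
    """Sample a token index from softmax(logits / T); T == 0 means greedy."""
    if T == 0:
        # Special case: avoid dividing by zero, fall back to argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([x / T for x in logits])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]                      # hypothetical 3-token vocabulary
sharp = softmax([x / 0.5 for x in logits])    # T < 1: more peaked
base = softmax(logits)                        # T = 1: unchanged
flat = softmax([x / 2.0 for x in logits])     # T > 1: flatter
```

Comparing sharp[0], base[0], and flat[0] shows the top token's probability shrinking as T grows.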

Top-k

# Keep the k largest logits; set the rest to -inf:
kth_largest = sort_desc(logits)[k - 1]
masked_logits = where(logits >= kth_largest, logits, -inf)
token = sample(softmax(masked_logits / T))

Truncate to the top K candidates before sampling, which eliminates very-low-probability tokens. Commonly used values are K=40-100. Problem: K is fixed, but the shape of the distribution varies. Sometimes the top 5 tokens carry 99% of the mass; sometimes 200 are needed.
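
A rough standard-library sketch of top-k truncation follows. The names top_k_filter and sample_top_k are hypothetical, k >= 1 and T > 0 are assumed, and ties at the cutoff are all kept, so slightly more than k tokens can survive when logits are equal.

```python
import math
import random


def top_k_filter(logits, k):
    """Copy logits, setting everything below the k-th largest to -inf."""
    if k >= len(logits):
        return list(logits)
    kth_largest = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth_largest else -math.inf for x in logits]


def sample_top_k(logits, k, T=1.0, rng=random):
    """Sample from softmax(top_k_filter(logits, k) / T), assuming T > 0."""
    masked = top_k_filter(logits, k)
    m = max(masked)
    # exp(-inf) is 0.0, so masked tokens get zero weight.
    weights = [math.exp((x - m) / T) for x in masked]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0, 0.5]        # hypothetical 4-token vocabulary
masked = top_k_filter(logits, 2)
token = sample_top_k(logits, 2)
```

With k=2 only indices 1 and 2 can ever be sampled, no matter how the remaining mass is distributed.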

Top-p (nucleus)

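# Sort desc; keep the smallest prefix with cumulative prob >= p:
probs = softmax(logits)
order = argsort_desc(probs)
cum = cumsum(probs[order])
k_dynamic = count(cum < p) + 1    # number of tokens needed to reach mass p
keep = order[:k_dynamic]
set probs outside keep to 0, renormalize
token = sample(probs)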

Adaptive K: keeps however many tokens are needed to reach probability mass p. Typical p=0.9-0.95. It keeps more tokens when the model is uncertain and fewer when it is confident. A common default in current chat systems.
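
To make the adaptive behaviour concrete, here is a small standard-library sketch. The names top_p_filter and sample_top_p are hypothetical, and applying temperature before the nucleus cut is an assumption of this sketch rather than something the text specifies. The two example distributions are invented to show a confident and an uncertain case.

```python
import math
import random


def top_p_filter(probs, p):
    """Zero out tokens outside the nucleus and renormalize the rest.

    The nucleus is the shortest prefix of the descending-sorted
    probabilities whose cumulative mass reaches p.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_top_p(logits, p, T=1.0, rng=random):
    """Sample from the nucleus of softmax(logits / T), assuming T > 0."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    total = sum(exps)
    probs = top_p_filter([e / total for e in exps], p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


confident = top_p_filter([0.96, 0.02, 0.01, 0.01], 0.9)
uncertain = top_p_filter([0.3, 0.3, 0.2, 0.2], 0.9)
```

In the confident case a single token already covers p=0.9, so the nucleus has one member; in the uncertain case all four tokens are needed.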

Repetition penalty

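# Decrease logits of tokens already in context:
for t in set(recent_tokens):       # each distinct token is penalized once
    if logits[t] > 0:
        logits[t] /= penalty       # positive logit: shrink toward 0
    else:
        logits[t] *= penalty       # negative logit: push further down
# typical penalty: 1.1-1.5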

Heuristic to discourage loops. Penalizes recently-emitted tokens. Hacky; can degrade quality (the model genuinely needs to repeat 'the' often). Use sparingly. Better: use a model that doesn't loop in the first place.
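
The sign check matters because dividing a negative logit by a penalty greater than 1 would move it toward zero and make the token more likely, not less. A minimal sketch, with a hypothetical function name and made-up logits, that returns a penalized copy instead of modifying the input:

```python
def apply_repetition_penalty(logits, recent_tokens, penalty=1.2):
    """Return a copy of logits with already-seen tokens made less likely."""
    out = list(logits)
    for t in set(recent_tokens):      # penalize each distinct token once
        if out[t] > 0:
            out[t] /= penalty         # positive logit: shrink toward 0
        else:
            out[t] *= penalty         # negative logit: push further down
    return out


logits = [2.4, -1.0, 0.6]             # hypothetical 3-token vocabulary
penalized = apply_repetition_penalty(logits, recent_tokens=[0, 1, 1])
```

Tokens 0 and 1 both end up with lower logits than before, while token 2, which has not appeared, is left alone.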

Greedy: deterministic. T+top-p: modern default for chat. Repetition penalty: hack for loop avoidance.