A pretrained LLM only predicts next tokens. A useful chat model also needs alignment: following instructions, declining harmful requests, and preferring a helpful tone. RLHF and its simpler successor DPO are the standard recipes. The math is approachable.
SFT — supervised fine-tuning first
Start with instruction-response pairs. Standard cross-entropy training. Teaches format and basic helpfulness. Typically 50K-500K examples. Output: a model that follows instructions but may not always pick the best response.
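As a rough illustration of the cross-entropy objective, the sketch below averages the negative log-likelihood over response tokens only. Masking out the prompt tokens is a common choice but is an assumption here, not something the recipe above prescribes. The function name `sft_loss` and the example numbers are hypothetical.

```python
import math

def sft_loss(token_logprobs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_response:    True for response tokens, False for prompt tokens
                    (prompt masking is an assumption of this sketch).
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)

# Hypothetical example: 2 prompt tokens followed by 3 response tokens.
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)  # (ln 2 + ln 4 + ln 2) / 3
```

In a real training loop the log-probabilities come from the model's output distribution and the loss is minimized by gradient descent; this scalar version only shows what is being averaged.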
Reward model
# Given two responses A, B to the same prompt:
# Humans label which is preferred
# Train a reward model R(prompt, response) such that
# R(prompt, A) > R(prompt, B) when A is preferred
#
# Loss: -log sigmoid(R(prompt, A) - R(prompt, B))
The reward model is a separate network (often initialized from the SFT model). It predicts a scalar preference score for each response and is used to score outputs during RL training.
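The pairwise loss above can be sketched on plain scalars. This is a minimal illustration, assuming the two reward scores have already been computed by some reward model. The names `log_sigmoid` and `preference_loss` are hypothetical helpers, not part of any library.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def preference_loss(r_preferred, r_rejected):
    """-log sigmoid(R(prompt, A) - R(prompt, B)), with A the preferred response."""
    return -log_sigmoid(r_preferred - r_rejected)

# Equal scores: the model is undecided, so the loss is log(2).
undecided = preference_loss(0.0, 0.0)

# Correct ranking gives a small loss; the reversed ranking gives a large one.
correct = preference_loss(2.0, 0.0)
reversed_ranking = preference_loss(0.0, 2.0)
```

The loss depends only on the difference between the two scores, so the reward model's absolute scale is not pinned down by this objective.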
PPO — RL with reward model
Standard policy-gradient algorithm. Generate responses, score with reward model, update policy to favor high-reward outputs. Includes a KL penalty against the SFT baseline so the policy cannot drift far from it and exploit flaws in the reward model. Complex pipeline; multiple model copies in memory (policy, reference, reward model, and usually a value model); expensive.
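The KL penalty is the piece most easily shown in isolation. The sketch below subtracts a scaled log-ratio between the policy and the SFT reference from the reward model's score. It deliberately omits the rest of PPO (clipped surrogate objective, value function, advantage estimation). The function name `kl_shaped_reward` and the default `beta=0.1` are assumptions for illustration.

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL penalty against the reference model.

    reward:      scalar score from the reward model for one sampled response.
    logp_policy: per-token log-probs of that response under the current policy.
    logp_ref:    per-token log-probs of the same response under the SFT model.
    beta:        penalty strength (hypothetical default).
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob lists must cover the same tokens")
    # Sequence-level log-ratio: a single-sample estimate of KL(policy || ref).
    kl_estimate = sum(logp_policy) - sum(logp_ref)
    return reward - beta * kl_estimate

# Hypothetical numbers: the policy assigns the response more probability
# than the reference does, so part of the reward is taken back.
shaped = kl_shaped_reward(1.0, [-1.0, -1.0], [-1.5, -2.0], beta=0.1)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement.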
DPO — direct preference optimization
# Skip the reward model entirely.
# Use preference pairs (A preferred over B) to directly optimize:
# loss = -log sigmoid(β * [log(π(A|x)/π_ref(A|x)) - log(π(B|x)/π_ref(B|x))])
#
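# π = current model, π_ref = SFT baseline, β = strength of the KL constraint
The KL-constrained reward-maximization objective that PPO optimizes has a closed-form optimal policy. Substituting that solution into the reward-model loss expresses the reward as a log-ratio of policy to reference, which gives a simple preference loss on the policy itself. Train directly on preference pairs. Memory use is close to supervised training, plus a frozen reference model (or its precomputed log-probabilities). Standard in 2026 for most alignment use cases.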
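The DPO loss can be sketched the same way as the reward-model loss, with log-ratios against the reference standing in for reward scores. This is a minimal scalar version, assuming each argument is the summed log-probability of a whole response given the prompt. The names `dpo_loss` and `log_sigmoid` and the default `beta=0.1` are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """DPO loss for one pair where response A is preferred over B.

    logp_*:     log π(response | x) under the model being trained.
    ref_logp_*: log π_ref(response | x) under the frozen SFT baseline.
    """
    ratio_a = logp_a - ref_logp_a  # log(π(A|x) / π_ref(A|x))
    ratio_b = logp_b - ref_logp_b  # log(π(B|x) / π_ref(B|x))
    return -log_sigmoid(beta * (ratio_a - ratio_b))

# At initialization the policy equals the reference, so the loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising A's likelihood and lowering B's, relative to the reference,
# reduces the loss.
after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because the reference log-probabilities never change, they can be computed once per dataset and cached, which is what keeps the memory cost near that of supervised training.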
Practical recipe
Start with SFT on instruction data. Collect preference pairs (or use existing sets: UltraFeedback, Anthropic's Helpful and Harmless data, HH-RLHF). Apply DPO for 1-3 epochs. Validate on AlpacaEval and MT-Bench. Total compute: ~2-5× the SFT cost. Far cheaper than PPO.