Advertisement
loss = -log σ(β · (log π(A) - log π(B)) - β · (log π_ref(A) - log π_ref(B)))
What you're seeing
DPO: use preference pairs directly. No reward model. No PPO.
★ KEY TAKEAWAY
DPO: skip the reward model — use preference pairs to directly optimize. The math is a closed-form derivation of PPO + KL constraint.
▶ WHAT TO TRY
- Slide preferred logp up — loss drops sharply.
- Slide rejected logp up — loss rises (model has to prefer A over B).
- β controls how aggressively to follow preference signal.