▶ Interactive Lab

LR Schedule — Warmup + Cosine

Visualize the canonical learning-rate schedule for LLM training.

Standard LLM schedule: linear warmup → cosine decay → small min LR.

What you're seeing

Warmup typically covers 1-2% of total steps. Cosine decay then runs from the peak down to ~10% of peak. Both choices are justified empirically.
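One way to write the plotted curve down, as a sketch rather than the lab's exact formula: the symbols below are assumptions introduced for illustration. Here t is the current step, T the total number of steps, T_w the number of warmup steps (about 0.01T to 0.02T), η_max the peak learning rate, and η_min = r·η_max the floor, with r ≈ 0.1. This version starts warmup from zero; some implementations start from a small nonzero value instead.

```latex
\eta(t) =
\begin{cases}
  \eta_{\max}\,\dfrac{t}{T_w}, & 0 \le t < T_w, \\[2ex]
  \eta_{\min} + \dfrac{1}{2}\,(\eta_{\max} - \eta_{\min})
  \left(1 + \cos\!\left(\pi\,\dfrac{t - T_w}{T - T_w}\right)\right), & T_w \le t \le T.
\end{cases}
```

At t = T_w the cosine term equals 1 and the rate is exactly η_max, so the two pieces meet without a jump. At t = T the cosine term equals -1 and the rate lands on η_min.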

★ KEY TAKEAWAY
Warm up linearly to the peak, then cosine-decay to ~10% of peak. This is the standard LLM training schedule.
▶ WHAT TO TRY
  • Slide Warmup % — with no warmup the curve starts at the peak LR, which can cause divergence early in training.
  • Slide Min/Peak — many schedules decay to about 10% of peak rather than zero, so late training keeps refining.
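The two sliders can also be reproduced offline. The following is a minimal sketch, not the lab's own code: the function name `lr_at` and its parameters `warmup_frac` and `min_ratio` are hypothetical stand-ins for the Warmup % and Min/Peak controls, and warmup is assumed to start from zero.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_frac=0.01, min_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    warmup_frac: fraction of total_steps spent warming up (hypothetical name).
    min_ratio:   final LR as a fraction of peak_lr (hypothetical name).
    """
    warmup_steps = round(warmup_frac * total_steps)

    # Linear warmup from 0 up to peak_lr (assumption: warmup starts at zero).
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Cosine decay from peak_lr down to min_ratio * peak_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    min_lr = min_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Mimic the sliders: compare the first and last step under different settings.
total, peak = 1000, 3e-4
for warmup_frac in (0.0, 0.02):
    print("warmup_frac", warmup_frac, "-> lr at step 0:",
          lr_at(0, total, peak, warmup_frac=warmup_frac))
for min_ratio in (0.0, 0.1):
    print("min_ratio", min_ratio, "-> lr at final step:",
          lr_at(total, total, peak, min_ratio=min_ratio))
```

Setting `warmup_frac=0.0` makes the very first step run at the peak rate, which is the case the Warmup % slider lets you inspect. Changing `min_ratio` only moves the floor the cosine settles on and leaves the peak unchanged.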