The loss curve is your primary diagnostic during training. Knowing the common shapes (healthy descent, spikes, plateaus, divergence) lets you intervene early and save hours of wasted compute.

Healthy curve

Steep initial drop in the first 100-1000 steps, then a smooth decline with minor wiggles. Gradient norm stays relatively stable. Validation loss tracks training loss with a slight gap. Validation perplexity (exp(loss)) is plausible for the task.
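To make the perplexity and train/validation gap checks concrete, here is a minimal sketch. It assumes the loss is a mean per-token cross-entropy in nats; the function names and the numbers are hypothetical and only for illustration.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_nll)


def generalization_gap(train_loss: float, val_loss: float) -> float:
    """Validation loss minus training loss; positive when validation is worse."""
    return val_loss - train_loss


# Hypothetical readings taken at the same training step.
train_loss, val_loss = 2.10, 2.25
print(f"validation perplexity: {perplexity(val_loss):.2f}")
print(f"train/val gap: {generalization_gap(train_loss, val_loss):.2f}")
```

A useful reference point: a model guessing uniformly over a vocabulary of size V has loss ln V and perplexity V, so a validation perplexity close to V suggests the model has learned very little.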

Spikes

Sudden upward jump that sometimes recovers on its own. Causes: a bad batch (corrupt data or extreme outliers), an LR that is momentarily too high, numerical overflow. Fixes: lower the LR, lengthen warmup, add gradient clipping, skip the optimizer step when the gradient norm is huge.
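Gradient clipping and step skipping can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a drop-in implementation: gradients are represented as a flat list of floats, and the names `clip_or_skip`, `max_norm`, and `skip_norm` and their default thresholds are hypothetical. In PyTorch, the clipping part is what `torch.nn.utils.clip_grad_norm_` does.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for g in grads))


def clip_or_skip(grads, max_norm=1.0, skip_norm=100.0):
    """Return (grads_to_apply, norm).

    grads_to_apply is None when the step should be skipped entirely
    (non-finite norm, or norm above the hypothetical skip_norm threshold).
    Otherwise the gradients are rescaled so their global norm is <= max_norm.
    """
    norm = global_norm(grads)
    if not math.isfinite(norm) or norm > skip_norm:
        return None, norm  # caller should skip the optimizer step
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads, norm


# Usage: a gradient with norm 5.0 is scaled down to norm 1.0.
clipped, norm_before = clip_or_skip([3.0, 4.0], max_norm=1.0)
print(norm_before, global_norm(clipped))
```

Logging the norm returned here at every step also gives you the gradient-norm trace that the other diagnostics in this guide rely on.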

Plateau then continued descent

Loss flattens for a while, then resumes dropping. This is normal: the model has worked through one capability tier and is finding the next. Don't kill the run prematurely. Watch the gradient norm; if it has not collapsed toward zero, the model is still learning.

Divergence

Loss climbs steadily. Almost always a config bug: wrong LR, missing gradient clipping, broken data loader. Kill the run immediately and fix it. Never 'wait it out'; divergence doesn't self-correct.
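One way to catch divergence automatically is to compare a moving average of the loss against the best moving average seen so far. The sketch below is one possible heuristic, not a standard recipe: the class name `DivergenceMonitor`, the window size, and the `tolerance` ratio are all assumptions, and it presumes a positive loss such as cross-entropy.

```python
import math
from collections import deque


class DivergenceMonitor:
    """Flags a run whose smoothed loss rises well above its best smoothed value."""

    def __init__(self, window=50, tolerance=1.5):
        self.recent = deque(maxlen=window)  # last `window` loss values
        self.best_avg = math.inf            # lowest moving average seen so far
        self.tolerance = tolerance          # allowed ratio above best_avg

    def update(self, loss):
        """Record one loss value; return True if the run looks divergent."""
        if not math.isfinite(loss):
            return True  # NaN or inf loss: stop immediately
        self.recent.append(loss)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history to judge yet
        avg = sum(self.recent) / len(self.recent)
        self.best_avg = min(self.best_avg, avg)
        return avg > self.tolerance * self.best_avg


# Usage with a tiny window: a stable loss, then a sustained climb.
monitor = DivergenceMonitor(window=5, tolerance=1.5)
for loss in [2.0] * 5 + [4.0] * 5:
    if monitor.update(loss):
        print("divergence suspected, stopping run")
        break
```

Averaging over a window means a single spike that recovers does not trip the monitor; only a sustained climb does.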

Train ≪ val (overfitting)

Training loss low, validation loss high or increasing. Common in fine-tuning with too many epochs. Fix: fewer epochs, more regularization, smaller LoRA rank, more data. Track validation from step 0; don't just look at training.
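A common way to arrive at "fewer epochs" in practice is early stopping on validation loss. The text above does not prescribe it; this is an illustrative sketch, and the class name `EarlyStopping` and the parameters `patience` and `min_delta` are hypothetical.

```python
class EarlyStopping:
    """Stops training after `patience` evaluations without validation improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # evaluations to tolerate without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = float("inf")    # best validation loss seen so far
        self.bad_evals = 0          # consecutive evaluations without improvement

    def should_stop(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Usage with hypothetical per-epoch validation losses.
stopper = EarlyStopping(patience=2)
for epoch, val_loss in enumerate([1.0, 0.8, 0.9, 0.95]):
    if stopper.should_stop(val_loss):
        print(f"stopping at epoch {epoch}, best validation loss {stopper.best}")
        break
```

Pair this with saving a checkpoint whenever `best` improves, so the weights you keep are the ones from before validation loss started rising.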

Healthy: smooth decline. Spikes: lower LR + clip. Plateau: be patient. Divergence: bug, restart. Overfitting: fewer epochs, more regularization. Track validation always.