Cross-Entropy Loss Surface


Controls: True class; P(class 0) = 0.50; P(class 1) = 0.30

loss = -log(prob of correct class). Confident wrong → huge loss. Confident right → near 0.

What you're seeing

3-class example. Adjust the predicted distribution. Loss = -log(p[true_class]), using the natural log. P(class 2) is set automatically to 1 - P(0) - P(1).
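A minimal Python sketch of the calculation described above, in case it helps to see it outside the widget. The function name `cross_entropy` and the input checks are illustrative assumptions, not the page's own code; the starting values 0.50 and 0.30 are the ones shown on the controls.

```python
import math

def cross_entropy(p0: float, p1: float, true_class: int) -> float:
    """Loss for one 3-class prediction. Hypothetical helper, not the widget's code."""
    # P(class 2) is whatever probability mass is left over.
    p2 = 1.0 - p0 - p1
    if p0 < 0.0 or p1 < 0.0 or p2 < -1e-12:
        raise ValueError("P(0) and P(1) must be non-negative and sum to at most 1")
    probs = [p0, p1, max(p2, 0.0)]  # clamp tiny negative rounding error
    p_true = probs[true_class]
    if p_true == 0.0:
        return math.inf  # -log(0) is unbounded
    return -math.log(p_true)  # natural log

# Control values from the page: P(0) = 0.50, P(1) = 0.30, so P(2) = 0.20
for k in range(3):
    print(f"true class {k}: loss = {cross_entropy(0.50, 0.30, k):.3f}")
```

With these values the loss is smallest when the true class is 0 (the class given the most probability) and largest when it is 2.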

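The asymmetry matters: the loss can never drop below 0, however confident a correct prediction is, but it grows without bound as the probability assigned to the true class approaches 0. A confident wrong prediction can therefore cost orders of magnitude more than a confident right one saves. This is why a single bad batch can spike training loss.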
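To see how much a single example can move an average, here is a rough sketch. The batch size of 32 and the probabilities 0.95 and 1e-6 are made-up numbers for illustration, and `mean_loss` is a hypothetical helper.

```python
import math

def mean_loss(p_true_per_example):
    """Average of -log(p) over a batch, where p is the prob given to each true class."""
    return sum(-math.log(p) for p in p_true_per_example) / len(p_true_per_example)

# Assumed batch: every example is confidently right.
clean_batch = [0.95] * 32
# Same batch, except one example is confidently wrong.
bad_batch = [0.95] * 31 + [1e-6]

print(f"clean batch mean loss: {mean_loss(clean_batch):.3f}")
print(f"one bad example:       {mean_loss(bad_batch):.3f}")
```

Under these assumptions, one example out of 32 raises the mean loss several-fold, because its individual loss (about 13.8) dwarfs the roughly 0.05 contributed by each of the others.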

★ KEY TAKEAWAY

Cross-entropy = -log(prob of correct class). Confident-and-right is cheap; confident-and-wrong is catastrophic.

▶ WHAT TO TRY

Set the predicted prob of the true class very low (1%) — loss spikes to ~4.6.
Set it high (95%) — loss drops to ~0.05.
This asymmetry is why one bad batch can spike training loss.
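The numbers above can be checked with a short sweep. This is only a sketch: `loss_at` is a hypothetical name, and the intermediate probabilities are arbitrary picks to fill in the curve between the two suggested settings.

```python
import math

def loss_at(p_true: float) -> float:
    """Cross-entropy when the true class is given probability p_true (natural log)."""
    return -math.log(p_true)

# 0.01 and 0.95 are the settings suggested above; the rest are arbitrary.
for p_true in (0.01, 0.10, 0.50, 0.95, 0.99):
    print(f"p = {p_true:.2f}  loss = {loss_at(p_true):.3f}")
```

Going from 0.95 to 0.99 barely changes the loss, while going from 0.10 to 0.01 adds about 2.3.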