loss = -log(prob of correct class). Confident wrong → huge loss. Confident right → near 0.
What you're seeing
3-class example. Adjust the predicted distribution. Loss = -log(p[true_class]), using the natural log. P(class 2) is set automatically to 1 - P(0) - P(1), so the three probabilities always sum to 1.
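A minimal sketch of the same calculation in Python, in case it helps to see it outside the widget. The function name `cross_entropy_3class` and its arguments are made up for illustration and are not part of the demo.

```python
import math

def cross_entropy_3class(p0, p1, true_class):
    """Loss for one 3-class prediction (hypothetical helper).

    P(class 2) is derived as 1 - P(0) - P(1), mirroring the widget.
    """
    p2 = 1.0 - p0 - p1
    probs = [p0, p1, p2]
    if min(probs) < 0:
        raise ValueError("P(0) + P(1) must not exceed 1, and neither may be negative")
    p_true = probs[true_class]
    # -log(0) is unbounded; return infinity instead of letting math.log raise.
    return math.inf if p_true == 0 else -math.log(p_true)

# A uniform guess over three classes costs log(3), about 1.10.
uniform_loss = cross_entropy_3class(1 / 3, 1 / 3, true_class=2)
print(f"uniform guess: {uniform_loss:.3f}")
```

A uniform guess is a useful baseline: a loss above log(3) means the model is doing worse than not committing to any class.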
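The asymmetry matters: as the probability of the true class approaches 0, the loss grows without bound, while a confident right prediction can only bring the loss down toward 0. A confident wrong prediction therefore costs far more than a confident right one saves. This is why a single bad batch can spike training loss.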
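To make the spike concrete, here is a rough sketch with invented numbers: a batch of 32 examples where the model gives the true class 95% each time, compared with the same batch after one example is swapped for a confidently wrong prediction (true-class probability 1e-6). The batch size and probabilities are assumptions chosen for illustration.

```python
import math

def mean_loss(true_class_probs):
    """Average cross-entropy over a batch, given P(true class) per example."""
    return sum(-math.log(p) for p in true_class_probs) / len(true_class_probs)

# Hypothetical batches: values are illustrative, not taken from a real run.
clean_batch = [0.95] * 32
bad_batch = [0.95] * 31 + [1e-6]  # one confidently wrong example

clean = mean_loss(clean_batch)
spiked = mean_loss(bad_batch)
print(f"clean batch mean loss:   {clean:.3f}")
print(f"one bad example in batch: {spiked:.3f}")
```

Under these assumptions, a single example out of 32 is enough to raise the batch mean several times over, because its loss alone is larger than the other 31 combined.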
★ KEY TAKEAWAY
Cross-entropy = -log(prob of correct class). Confident-and-right is cheap; confident-and-wrong is catastrophic.
▶ WHAT TO TRY
- Set the predicted prob of the true class very low (1%) — loss spikes to ~4.6.
- Set it high (95%) — loss drops to ~0.05.
- Compare the two: the loss at 1% is roughly 90 times the loss at 95%.
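The numbers in the list above can be checked with a short sweep over a few true-class probabilities. This is only a sketch; the particular probabilities are picked to bracket the two values mentioned.

```python
import math

# Probability assigned to the true class, from confidently wrong to confidently right.
probs = [0.01, 0.10, 0.50, 0.95, 0.99]
losses = {p: -math.log(p) for p in probs}

for p, loss in losses.items():
    print(f"p(true class) = {p:.2f} -> loss = {loss:.3f}")
```

The loss falls steeply at first and then flattens: moving from 1% to 10% removes far more loss than moving from 95% to 99%.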