Gradients from K micro-batches are accumulated, then a single optimizer step is taken at the end.
What you're seeing
Each micro-batch's loss is divided by K, so the accumulated gradient equals the average gradient over the K micro-batches. Peak memory stays at the level of a single micro-batch.
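The scaling claim above can be checked numerically. The following is a minimal sketch, not the demo's own code: it assumes a one-parameter least-squares model, a mean-reduced loss, and equal-sized micro-batches. The function `grad_mse` and the toy data are hypothetical names introduced only for illustration.

```python
# Sketch: accumulated gradient of (loss / K) vs. full-batch gradient.
# Assumptions: model y ≈ w * x, mean-squared-error loss, equal-sized micro-batches.
# grad_mse and the toy data below are illustrative, not part of the demo.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
K = 4                    # number of micro-batches
size = len(xs) // K      # examples per micro-batch

accumulated = 0.0
for k in range(K):
    mb_x = xs[k * size:(k + 1) * size]
    mb_y = ys[k * size:(k + 1) * size]
    # Dividing the micro-batch loss by K divides its gradient by K.
    accumulated += grad_mse(w, mb_x, mb_y) / K

# Gradient computed over all examples at once, for comparison.
full_batch = grad_mse(w, xs, ys)
```

Under these assumptions `accumulated` and `full_batch` agree up to floating-point rounding. If the micro-batches had unequal sizes, dividing by K would no longer give the exact full-batch average.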
★ KEY TAKEAWAY
K micro-batches → one optimizer step. Memory of one micro-batch, gradient of an effective K× batch.
▶ WHAT TO TRY
- Set K to 32 and click Run, then watch 32 micro-batches build up before a single optimizer step.
- Each micro-batch's loss is divided by K, so the accumulated gradient equals the average.
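The schedule in the steps above (K micro-batches, then one update) can be sketched as a plain loop. This is an illustrative outline, assuming plain gradient descent on a single scalar weight and using constant stand-in gradient values; `train`, `lr`, and `micro_batch_grads` are hypothetical names, and a real run would recompute each gradient from the current weights.

```python
# Sketch: one optimizer step per K micro-batches.
# Assumptions: scalar weight, plain gradient descent, and precomputed
# stand-in gradients (a real loop would compute them from data).

def train(micro_batch_grads, K, lr=0.1, w=0.0):
    steps = 0
    accumulated = 0.0
    for i, g in enumerate(micro_batch_grads, start=1):
        accumulated += g / K          # scale so the sum is an average
        if i % K == 0:                # every K-th micro-batch
            w -= lr * accumulated     # single optimizer step
            accumulated = 0.0         # reset the gradient buffer
            steps += 1
    return w, steps

# 64 micro-batches with K = 32 produce exactly two optimizer steps.
grads = [1.0] * 64
w, steps = train(grads, K=32)
```

Only one running gradient buffer is kept between micro-batches, which is why the memory cost does not grow with K.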