Gradient Accumulation — Belgavi.AI Lab

Advertisement

K 8

K micro-batches accumulate gradients; one optimizer step at end.

Each micro-batch's loss is divided by K so the accumulated grad = average grad. Memory stays low.

★ KEY TAKEAWAY

K micro-batches → one optimizer step. Memory of one micro-batch, gradient of an effective K× batch.

▶ WHAT TO TRY

Set K to 32 and click Run — watch 32 micro-batches build up before one optimizer step.
Each micro-batch's loss is divided by K, so the accumulated gradient equals the average.