Forward pass ≈ 2·params·seq FLOPs. Backward pass ≈ 2× the forward cost.
What you're seeing
One training step costs about 3× the compute of an inference (forward-only) pass over the same tokens, plus the optimizer step.
★ KEY TAKEAWAY
Forward FLOPs ≈ 2·params·seq. Backward costs about 2× the forward pass, so a full step ≈ 3× forward, plus the optimizer step.
▶ WHAT TO TRY
- Pick a model size and seq length.
- At 50 GFLOPS (CPU), a training step takes seconds even for small models: 100M params at seq 512 needs about 3×10^11 FLOPs, or roughly 6 s.
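
The estimate above can be reproduced with a few lines of Python. This is a minimal sketch of the rule of thumb only: the function names are made up for illustration, the 50 GFLOPS figure is treated as sustained throughput (real hardware rarely reaches its peak), and the optimizer step is left out because the text gives no formula for it.

```python
# Rule-of-thumb FLOP estimate for one training step.
# Assumptions (not measurements): forward ≈ 2·params·seq,
# backward ≈ 2× forward, optimizer step ignored.

def forward_flops(params: int, seq_len: int) -> float:
    """Approximate FLOPs for one forward pass over seq_len tokens."""
    return 2.0 * params * seq_len


def training_step_flops(params: int, seq_len: int) -> float:
    """Forward plus backward FLOPs, i.e. about 3x the forward pass."""
    fwd = forward_flops(params, seq_len)
    bwd = 2.0 * fwd  # backward costs roughly twice the forward pass
    return fwd + bwd


def step_seconds(params: int, seq_len: int,
                 flops_per_second: float = 50e9) -> float:
    """Estimated wall-clock time at a given sustained throughput."""
    return training_step_flops(params, seq_len) / flops_per_second


if __name__ == "__main__":
    # Hypothetical model sizes and sequence lengths, chosen for illustration.
    for params, seq_len in [(10_000_000, 256), (100_000_000, 512)]:
        total = training_step_flops(params, seq_len)
        secs = step_seconds(params, seq_len)
        print(f"{params:>12,} params, seq {seq_len}: "
              f"{total:.3e} FLOPs, {secs:.2f} s at 50 GFLOPS")
```

Changing `flops_per_second` to match a GPU's sustained throughput shows how quickly the same step shrinks on faster hardware.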