FFN layers dominate the parameter count. Attention comes second. The embedding share is negligible for big models.
What you're seeing
For d=2048 and L=24 (model width and layer count): ~600M params, with FFN ~60%, attention ~25%, and embedding ~10% (tied).
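The split can be approximated by hand. The sketch below is a rough estimate under stated assumptions, not the widget's own formula: four d×d attention projections per layer, a two-matrix FFN with hidden width 4d, a hypothetical vocabulary of 50,000 tokens, and no biases, layer norms, or positional embeddings. The names `param_breakdown` and `shares` are illustrative. With these assumptions the total for d=2048, L=24 comes out well above the ~600M quoted above, so the widget presumably uses different settings (vocabulary size, FFN width, or attention variant); treat the output as a ballpark only.

```python
def param_breakdown(d, n_layers, vocab_size=50_000, ffn_mult=4, tied=True):
    """Approximate weight counts per component.

    Assumptions (not taken from the widget): standard multi-head attention,
    a two-matrix FFN, and no biases, layer norms, or positional embeddings.
    """
    attention = n_layers * 4 * d * d          # Q, K, V and output projections
    ffn = n_layers * 2 * d * (ffn_mult * d)   # up- and down-projection
    # Untied models keep a separate output matrix, doubling the embedding cost.
    embedding = vocab_size * d * (1 if tied else 2)
    return {"attention": attention, "ffn": ffn, "embedding": embedding}


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    counts = param_breakdown(d=2048, n_layers=24)
    print(f"total: {sum(counts.values()) / 1e6:.0f}M")
    for name, frac in shares(counts).items():
        print(f"{name:>9}: {frac:.1%}")
```

Under these assumptions the FFN always holds exactly twice the attention weights, because 8d² per layer is twice 4d². Any other ratio implies a different FFN width or attention layout.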
★ KEY TAKEAWAY
FFN layers hold ~60% of a transformer's params, attention ~25%, and embedding ~10% (tied) or ~20% (untied). For small language models (SLMs), the embedding share matters more.
▶ WHAT TO TRY
- Slide d down to see embedding share grow.
- Toggle Tied to share the input and output embedding matrices, which halves the embedding cost.
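Both experiments can be reproduced offline. The sketch below is a minimal approximation, assuming 12·d² weights per layer (4·d² attention plus 8·d² FFN), a hypothetical 50,000-token vocabulary, and L=24. The function name `embedding_share` is illustrative, and the widget's own settings may differ. Block weights scale with d² while the embedding scales with d, so the embedding share rises as d shrinks.

```python
def embedding_share(d, n_layers, vocab_size=50_000, ffn_mult=4, tied=True):
    """Fraction of weights held by the embedding (approximate).

    Assumes 4*d*d attention weights and 2*ffn_mult*d*d FFN weights per layer;
    biases, layer norms, and positional embeddings are ignored.
    """
    blocks = n_layers * (4 * d * d + 2 * ffn_mult * d * d)
    embedding = vocab_size * d * (1 if tied else 2)  # untied doubles the cost
    return embedding / (blocks + embedding)


if __name__ == "__main__":
    # Sweep the model width downward-to-upward and compare tied vs untied.
    for d in (256, 512, 1024, 2048, 4096):
        tied = embedding_share(d, 24, tied=True)
        untied = embedding_share(d, 24, tied=False)
        print(f"d={d:>4}  tied={tied:.1%}  untied={untied:.1%}")
```

Untying doubles the embedding parameter count, but the share grows by less than a factor of two because the total grows as well.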