Computes the mean and variance across the feature dimension, independently for each token.
What you're seeing
For each token: subtract the mean and divide by the standard deviation, then apply the learnable γ (scale) and β (shift). The plot shows the raw input, the mean line, the variance band, and the normalized output.
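The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the visualization: the function name `layer_norm`, the small stabilizer `eps`, the use of the biased variance, and the example shapes are all assumptions made for this sketch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token (row) over its feature dimension (last axis).

    eps is an assumed small constant that keeps the division stable
    when a token's variance is close to zero.
    """
    mean = x.mean(axis=-1, keepdims=True)    # one mean per token
    var = x.var(axis=-1, keepdims=True)      # one (biased) variance per token
    x_hat = (x - mean) / np.sqrt(var + eps)  # subtract mean, divide by std
    return gamma * x_hat + beta              # learnable scale and shift

# Hypothetical input: 4 tokens with 8 features each, far from mean 0 / std 1.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 8))

# With gamma = 1 and beta = 0 the output is the plain normalized signal.
gamma = np.ones(8)
beta = np.zeros(8)
y = layer_norm(x, gamma, beta)

print(y.mean(axis=-1))  # per-token means, each close to 0
print(y.std(axis=-1))   # per-token stds, each close to 1
```

Both statistics are taken over `axis=-1`, so each row is normalized using only its own features.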
★ KEY TAKEAWAY
LayerNorm: subtract mean, divide by std, then learnable scale + shift. Per-token; no batch dependence.
▶ WHAT TO TRY
- Click Resample to see how arbitrary inputs always end up with mean ≈ 0 and std ≈ 1 after normalization (that is, before γ and β rescale the result).
- The learned γ (scale) and β (shift) let the model un-normalize where useful.
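The two claims above, no batch dependence and γ/β moving the output away from mean 0 and std 1, can be checked numerically. The sketch below is illustrative only: `layer_norm`, `eps`, the input shapes, and the constant values chosen for γ and β are assumptions. In a trained model γ and β are learned per feature and are generally not constant.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Per-token statistics over the feature (last) axis; eps is an assumed stabilizer.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Hypothetical batch: 3 tokens, 16 features, arbitrary range.
rng = np.random.default_rng(1)
batch = rng.uniform(-10.0, 10.0, size=(3, 16))

# Constant gamma and beta, chosen only so the effect is easy to read off.
gamma = np.full(16, 2.0)
beta = np.full(16, 0.5)

out_batch = layer_norm(batch, gamma, beta)        # all tokens together
out_single = layer_norm(batch[:1], gamma, beta)   # first token on its own

# No batch dependence: the first token's output ignores the other tokens.
same = np.allclose(out_batch[0], out_single[0])

# Scale and shift: each token now has mean close to beta and std close to gamma.
token_means = out_batch.mean(axis=-1)
token_stds = out_batch.std(axis=-1)

print(same)
print(token_means)
print(token_stds)
```

With a constant γ and β the per-token mean and standard deviation land near β and γ respectively, which is the sense in which the model can undo the normalization where that helps.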