Encoder block

Multi-head self-attention followed by a position-wise feed-forward network (FFN). Each sublayer is wrapped in a residual connection and LayerNorm. Stack N identical blocks.
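
A minimal NumPy sketch of one encoder block, stacked twice. Assumptions not in the notes: post-LN ordering (normalize after the residual add), a ReLU FFN, LayerNorm without learned gain/bias, no dropout, and the illustrative names `encoder_block`, `init_params`, and the `w_*` parameter keys.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance (no gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ v                          # (heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["w_o"]

def encoder_block(x, p, n_heads):
    # Sublayer 1: self-attention, residual add, LayerNorm.
    x = layer_norm(x + multi_head_self_attention(x, p, n_heads))
    # Sublayer 2: position-wise FFN, residual add, LayerNorm.
    hidden = np.maximum(0.0, x @ p["w1"] + p["b1"])
    return layer_norm(x + hidden @ p["w2"] + p["b2"])

def init_params(rng, d_model, d_ff, scale=0.1):
    p = {name: rng.normal(scale=scale, size=(d_model, d_model))
         for name in ("w_q", "w_k", "w_v", "w_o")}
    p["w1"] = rng.normal(scale=scale, size=(d_model, d_ff))
    p["b1"] = np.zeros(d_ff)
    p["w2"] = rng.normal(scale=scale, size=(d_ff, d_model))
    p["b2"] = np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
d_model, d_ff, n_heads, n_layers, seq_len = 16, 32, 4, 2, 5
layers = [init_params(rng, d_model, d_ff) for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))
for p in layers:          # "stack N times": each block has its own weights
    x = encoder_block(x, p, n_heads)
print(x.shape)  # (5, 16)
```

The block maps (seq, d_model) to (seq, d_model), which is what makes stacking possible. Many recent models use pre-LN (normalize before each sublayer) instead; that variant only changes where `layer_norm` is called.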

Decoder block

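Masked self-attention, then cross-attention over the encoder output, then FFN, each with residual + LayerNorm. Stack N times. The causal mask blocks attention to later positions, so every position can be trained on next-token prediction in a single parallel pass.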
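
A small NumPy sketch of the two attention steps in a decoder block, showing what the causal mask guarantees. For brevity it assumes a single head and omits the learned Q/K/V projections, residuals, LayerNorm, and FFN; `attention` and `causal_mask` are illustrative names.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; masked-out scores become -inf,
    # so they receive exactly zero weight after the softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def causal_mask(n):
    # Lower-triangular: position i may attend to positions 0..i only.
    return np.tril(np.ones((n, n), dtype=bool))

rng = np.random.default_rng(0)
d = 8
tgt = rng.normal(size=(4, d))     # decoder-side token vectors
memory = rng.normal(size=(6, d))  # encoder output

# Masked self-attention over the target sequence.
self_out = attention(tgt, tgt, tgt, mask=causal_mask(4))
# Cross-attention: queries from the decoder, keys/values from the encoder.
cross_out = attention(self_out, memory, memory)
print(cross_out.shape)  # (4, 8)

# Perturb the last target token: earlier positions must not change.
tgt2 = tgt.copy()
tgt2[-1] += 1.0
self_out2 = attention(tgt2, tgt2, tgt2, mask=causal_mask(4))
print(np.allclose(self_out[:-1], self_out2[:-1]))  # True
```

Because position i never sees positions after i, the output at position i can be scored against token i+1 for all i at once, which is the parallel next-token training the mask makes possible.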

Positional encoding

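Self-attention is permutation-equivariant: without position information it cannot tell token order. Add sinusoidal or learned position embeddings to the token embeddings. RoPE (rotary position embedding), which instead rotates queries and keys by a position-dependent angle, is the common default in recent LLMs.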
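
A rough NumPy sketch of the two fixed schemes. Assumptions: the conventional base of 10000, an even embedding size, and the interleaved (even, odd) pairing for RoPE; some implementations pair the first and second halves of the vector instead. `sinusoidal_positions` and `rope` are illustrative names.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model, base=10000.0):
    # PE[pos, 2i] = sin(pos / base^(2i/d)), PE[pos, 2i+1] = cos(same angle).
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / base ** (i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, positions, base=10000.0):
    # Rotate each (even, odd) feature pair of x by angle position * frequency.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
    angles = positions[:, None] * freqs[None, :]   # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Sinusoidal: added to token embeddings before the first block.
pe = sinusoidal_positions(4, 8)
print(pe.shape)  # (4, 8)

# RoPE: applied to queries and keys; the score depends only on the offset.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    # Dot product of q placed at position m and k placed at position n.
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()

print(np.isclose(score(3, 1), score(10, 8)))  # True
```

The last check shows why RoPE is a relative scheme: shifting both positions by the same amount leaves the attention score unchanged, whereas the sinusoidal table is added to the input once and encodes absolute position.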