After L transformer blocks, you have a hidden state h ∈ ℝ^d for each token. To predict the next token, project h to vocabulary size and apply a softmax. This projection is typically the largest single matmul in an inference forward pass and is often a serving bottleneck.

Linear projection to vocab

h_final ∈ ℝ^(N × d)         (after L blocks + final norm)
logits  = h_final · W_out   ∈ ℝ^(N × V)

W_out ∈ ℝ^(d × V)

V is the vocabulary size, roughly 32K to 130K for modern LLMs, and d is typically 768 to 4096. At the upper end, W_out has 4096 × 130000 ≈ 532M parameters, which often makes it the single largest weight tensor in the model.
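The shapes and the parameter count above can be checked with a small sketch. It uses plain Python lists to stay dependency-free (a real model would use a tensor library), and the names `project_to_vocab` and `output_params` are illustrative, not taken from any library. No bias term is assumed.

```python
def project_to_vocab(h_final, w_out):
    """Compute logits = h_final · W_out for h_final (N x d) and W_out (d x V)."""
    d = len(w_out)
    V = len(w_out[0])
    assert all(len(row) == d for row in h_final), "hidden size must match W_out rows"
    return [
        [sum(row[k] * w_out[k][j] for k in range(d)) for j in range(V)]
        for row in h_final
    ]


def output_params(d, V):
    """Number of weights in W_out (no bias assumed)."""
    return d * V


# Toy sizes: N=2 tokens, d=3 hidden units, V=4 vocabulary entries.
h_final = [[1.0, 0.0, 2.0],
           [0.5, 1.0, 0.0]]
w_out = [[1.0, 0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]

logits = project_to_vocab(h_final, w_out)
print(len(logits), len(logits[0]))      # N rows, V columns
print(output_params(4096, 130_000))     # the large example from the text
```

The output has one row of V logits per token position, which is why the cost of this step scales with both the sequence length and the vocabulary size.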

Logits to probabilities

for each token position i:
  probs[i] = softmax(logits[i])         ∈ ℝ^V
  loss[i]  = -log(probs[i, target_i])
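A minimal sketch of the two formulas, assuming plain Python lists for the logits. Subtracting the row maximum before exponentiating is a standard stability trick and not something the text prescribes. The function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)                           # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# The second row would overflow with a naive exp(1000.0).
logits = [[2.0, 1.0, 0.1], [1000.0, 999.0, 0.0]]
targets = [0, 1]

probs = [softmax(row) for row in logits]
losses = [cross_entropy(row, t) for row, t in zip(logits, targets)]
print(probs[0], losses)
```

Computing the loss from the log-sum-exp directly avoids taking the log of a probability that may have underflowed to zero.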

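During training, the loss is computed at every position. During autoregressive inference, only the logits at the LAST position are needed, because the tokens at earlier positions are already known. This is a big win: the projection costs d·V multiply-adds for one token instead of N·d·V for N tokens.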
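The saving can be illustrated by counting multiply-accumulates and by projecting only the final row of the hidden states. This is a sketch under assumptions: the prompt length N=1024 is an arbitrary example value, d and V are the Phi-3 sizes used elsewhere in this article, and the helper names are hypothetical.

```python
def projection_macs(num_positions, d, V):
    """Multiply-accumulates for projecting num_positions hidden states to V logits."""
    return num_positions * d * V


def last_position_logits(h_final, w_out):
    """Project only the final row of h_final (N x d) through W_out (d x V)."""
    h_last = h_final[-1]
    V = len(w_out[0])
    return [sum(h_last[k] * w_out[k][j] for k in range(len(h_last))) for j in range(V)]


N, d, V = 1024, 3072, 32064          # N is an assumed prompt length
full = projection_macs(N, d, V)      # training-style: logits at every position
last = projection_macs(1, d, V)      # generation: logits at the last position only
print(full // last)                  # the ratio equals N

# Toy check: with an identity W_out the logits are just the last hidden state.
print(last_position_logits([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]))
```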

Tied embeddings — share E and W_out

# Standard untied:
W_out (d × V)   # separate from input embedding E (V × d)

# Tied:
W_out = Eᵀ      # same matrix, transposed

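Tying eliminates the separate W_out parameters entirely, saving d·V weights. For a model with Phi-3's dimensions (d=3072, V≈32K), that is ~98M parameters, or roughly 3% of the model. The saving matters most for small language models, where the vocabulary matrices are a larger share of the total and memory budgets are tight.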
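A toy sketch of tying, assuming a single stored matrix E of shape V × d that is read row-wise on the input side and used as Eᵀ on the output side. The class `TiedEmbedding` is illustrative and not the API of any framework.

```python
class TiedEmbedding:
    """Toy head where the input embedding E (V x d) also serves as W_out = E^T."""

    def __init__(self, E):
        self.E = E                      # the only stored matrix

    def embed(self, token_id):
        """Input side: look up row token_id of E."""
        return self.E[token_id]

    def logits(self, h):
        """Output side: h · E^T, a dot product of h with every row of E."""
        return [sum(hk * ek for hk, ek in zip(h, row)) for row in self.E]


E = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]                        # V=3, d=2
model = TiedEmbedding(E)
print(model.logits(model.embed(2)))     # token 2's embedding scored against all rows

saved = 3072 * 32064                    # weights not stored when W_out is tied
print(saved)
```

Because the output side reuses the rows of E, a gradient update to either role changes both, which is the trade-off that comes with the parameter saving.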

Sampling vs argmax

Greedy decoding takes token = argmax(logits): deterministic and fast, but sometimes repetitive. Sampling draws the token from softmax(logits/T), usually with top-k or top-p truncation; the added randomness tends to give better creative output. The choice is made per deployment, and perceived quality differences between deployments often come down to sampling defaults.
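The two strategies can be sketched with the standard library alone. This is one common way to combine temperature, top-k, and top-p (top-k applied first, then top-p over the original probabilities), not a reference implementation. The parameter names and defaults are assumptions.

```python
import math
import random


def greedy(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token id from softmax(logits / temperature) after optional truncation."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    # Candidates as (probability, token id), most likely first.
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]                 # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, i in ranked:                     # smallest prefix whose mass reaches top_p
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    ids = [i for _, i in ranked]
    probs = [p for p, _ in ranked]
    return rng.choices(ids, weights=probs, k=1)[0]   # choices() renormalizes the weights


logits = [2.0, 0.5, 1.0, -1.0]
rng = random.Random(0)                          # seeded for repeatable runs
print(greedy(logits))
print(sample(logits, temperature=0.8, top_k=3, top_p=0.9, rng=rng))
```

Setting `top_k=1` reduces sampling to greedy decoding, and lowering the temperature toward zero has a similar effect by sharpening the distribution.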

CPU-specific cost

Each generation step performs one d × V matrix-vector product. For Phi-3: 3072 × 32064 ≈ 98.5M multiply-adds per step, comparable to the FFN cost of a single transformer block. On CPU, this projection can dominate per-token time if it is left unquantized; INT4 quantization brings it back in line with the attention layers.
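A back-of-the-envelope sketch of why quantization helps. It assumes every weight of W_out is read once per generation step and ignores the extra bytes that quantization scales and zero points add. The helper `head_cost` is hypothetical.

```python
def head_cost(d, V, bits_per_weight):
    """Multiply-accumulates and weight bytes read for one d x V projection step."""
    macs = d * V
    weight_bytes = macs * bits_per_weight // 8
    return macs, weight_bytes


d, V = 3072, 32064                       # sizes quoted in the text
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    macs, nbytes = head_cost(d, V, bits)
    print(f"{name}: {macs / 1e6:.1f}M MACs, {nbytes / 1e6:.1f} MB of weights per step")
```

The multiply count is the same at every precision; what changes is how many bytes must stream from memory on each step, which is usually the limiting factor for single-token decoding on CPU.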

logits = h · W_out, then softmax → probs. W_out can exceed 500M parameters, or be tied to E at no extra parameter cost. Computing logits at the last position only cuts the projection cost by a factor of N.