Softmax converts a vector of real numbers into a probability distribution. It shows up at the output of transformer language models (logits → token probabilities) and inside attention (scores → attention weights). The properties that matter aren't obvious until you work through the formula.

The formula

softmax(z)[i] = exp(z[i]) / sum over j of exp(z[j])

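Input: a vector z ∈ ℝᴷ of real-valued 'logits'. Output: a probability distribution p ∈ ℝᴷ where each p[i] ∈ (0, 1) (for K ≥ 2) and the entries sum to 1. In floating point, very small entries can underflow to 0. K is the vocabulary size (typically tens to hundreds of thousands of tokens for modern LLMs) when softmax produces token probabilities, or the number of positions being attended to when it is used inside attention.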
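
As a quick check of the input/output contract, here is a minimal sketch that transcribes the formula directly into plain Python. The logit values are made up for illustration, and this naive version skips the overflow protection that real implementations need.

```python
import math

def softmax(z):
    """Direct transcription of the formula: exp(z[i]) / sum_j exp(z[j])."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # illustrative values, not from a real model
probs = softmax(logits)

print([round(p, 3) for p in probs])   # roughly [0.659, 0.242, 0.099]
print(sum(probs))                     # 1.0 up to floating-point rounding
```

The largest logit gets the largest probability, every entry is strictly between 0 and 1, and the entries sum to 1.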

Why exp?

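exp() guarantees every output is strictly positive, as probabilities require. It is also strictly increasing, so a larger logit always gets a larger probability and the ranking of the logits is preserved. Because exp(z[i]+c) = exp(c)·exp(z[i]), the factor exp(c) cancels in any ratio: exp(z[i]+c)/exp(z[j]+c) = exp(z[i])/exp(z[j]). Softmax is therefore shift-invariant: adding the same constant to all logits doesn't change the distribution. That invariance is what makes the numerical stability trick possible.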
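
Shift invariance is easy to check numerically. The sketch below, in plain Python with made-up logits and an arbitrary shift c, compares softmax(z) with softmax(z + c). The two should agree up to floating-point rounding.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]           # illustrative logits
c = 5.0                       # arbitrary constant added to every logit
shifted = [v + c for v in z]

p = softmax(z)
q = softmax(shifted)

# Largest elementwise difference between the two distributions.
max_diff = max(abs(a - b) for a, b in zip(p, q))
print(max_diff)               # a rounding-level number, not exactly 0
```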

Numerical stability trick

softmax(z) = softmax(z - max(z))

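import numpy as np

def softmax(z):                    # z: 1-D NumPy array of logits
    z_safe = z - np.max(z)         # all entries ≤ 0; the largest is exactly 0
    exp_z = np.exp(z_safe)         # every value ≤ 1, so no overflow
    return exp_z / np.sum(exp_z)   # denominator ≥ 1, so no division by zero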

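Without subtracting the max, exp(z) overflows to infinity once a logit exceeds about 88.7 in float32 or about 11.1 in float16, and the division then produces NaN. After the subtraction the largest exponent is exactly 0, so the largest term is exp(0) = 1 and the denominator is at least 1. Standard implementations do this subtraction. The math is identical; the floats survive.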
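
To see the failure concretely, here is a small sketch in plain Python. The function names softmax_naive and softmax_stable are illustrative, and the logits are deliberately extreme. Python floats are float64, so overflow only starts above roughly 709.78, and math.exp raises OverflowError instead of returning infinity. Array libraries typically return inf and then NaN instead of raising.

```python
import math

def softmax_naive(z):
    # math.exp raises OverflowError once its argument exceeds ~709.78 (float64).
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]   # every exponent is <= 0
    total = sum(exps)                     # >= 1, since the max entry contributes exp(0)
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]            # deliberately extreme logits

try:
    softmax_naive(big)
    naive_failed = False
except OverflowError:
    naive_failed = True

stable = softmax_stable(big)
print(naive_failed)                       # True
print([round(p, 4) for p in stable])      # roughly [0.09, 0.2447, 0.6652]
```

The stable result matches softmax([-2, -1, 0]), which is what shift invariance predicts.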

Temperature and sharpness

softmax(z / T)

Dividing logits by a temperature T > 0 before softmax controls sharpness. T=1: the model's learned distribution. T<1: more peaked (more deterministic). T>1: flatter (more diverse). T→0: approaches a one-hot distribution on the argmax (greedy decoding). T→∞: approaches uniform. Temperature is one of the most commonly tuned inference-time knobs.
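
The effect of T is easiest to see by running the same logits through several temperatures. This is a minimal sketch in plain Python. The function name softmax_with_temperature, the logits, and the chosen temperatures are all illustrative, and rejecting T ≤ 0 is a design choice made for this example.

```python
import math

def softmax_with_temperature(z, T=1.0):
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)                       # subtract the max for stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # illustrative values
for T in (0.1, 0.5, 1.0, 2.0, 100.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

At T=0.1 nearly all the mass sits on the largest logit, and at T=100 the three probabilities are close to 1/3 each. The ranking of the entries is the same at every temperature.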

Gradient of softmax

∂softmax(z)[i]/∂z[j] = softmax(z)[i] · (δ[i,j] - softmax(z)[j])

The Kronecker delta δ[i,j] is 1 when i=j else 0. This Jacobian is essential for backpropagation through attention. It's also why combining softmax with cross-entropy gives a clean gradient (next article).
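
One way to build confidence in the Jacobian formula is to compare it against finite differences. The sketch below is plain Python. The helper names softmax_jacobian and numerical_jacobian, the step size eps, and the logits are assumptions made for illustration, and the numerical version is only a sanity check, not how frameworks compute gradients.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = p[i] * (delta[i,j] - p[j])."""
    p = softmax(z)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def numerical_jacobian(z, eps=1e-6):
    """Central finite differences, one input coordinate at a time."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus = list(z)
        minus = list(z)
        plus[j] += eps
        minus[j] -= eps
        p_plus = softmax(plus)
        p_minus = softmax(minus)
        for i in range(n):
            J[i][j] = (p_plus[i] - p_minus[i]) / (2 * eps)
    return J

z = [2.0, 1.0, 0.1]                       # illustrative logits
J = softmax_jacobian(z)
J_num = numerical_jacobian(z)

max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
print(max_err)                            # small: the two Jacobians agree
print([sum(row) for row in J])            # each row sums to ~0
```

Each row sums to zero because the outputs always sum to 1, so any change in one probability is offset by the others. The matrix is also symmetric, since the off-diagonal entries are -p[i]·p[j].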

Softmax: exp(z-max(z)) normalized. Temperature reshapes. Gradient combines cleanly with cross-entropy.