Softmax converts a vector of real numbers into a probability distribution. It shows up at the output of transformer language models (logits → token probabilities) and inside attention (scores → attention weights). The properties that matter aren't obvious until you derive them.
The formula
softmax(z)[i] = exp(z[i]) / sum over j of exp(z[j])
Input: a vector z ∈ ℝᴷ of real-valued 'logits'. Output: a probability distribution p ∈ ℝᴷ where each p[i] ∈ (0, 1) and they sum to exactly 1. K is the vocabulary size (~32K-130K for modern LLMs) or the sequence length (for attention).
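As a rough illustration, the formula can be transcribed almost word for word into plain Python. The function name `softmax` and the three logit values are made up for this sketch, and this version deliberately skips any overflow protection.

```python
import math

def softmax(z):
    """Direct transcription of the formula; not overflow-safe for large logits."""
    exps = [math.exp(v) for v in z]      # numerator terms exp(z[i])
    total = sum(exps)                    # denominator: sum over j of exp(z[j])
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                 # illustrative values, not from a real model
probs = softmax(logits)
print(probs)                             # roughly [0.659, 0.242, 0.099]
print(sum(probs))                        # 1 up to floating-point rounding
```

Every output lies strictly between 0 and 1, and the largest logit receives the largest probability.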
Why exp?
exp() guarantees the output is positive (probabilities must be non-negative). It's also monotonically increasing, so a larger logit always gets a larger probability. Because exp(z[i]+c) = exp(c)·exp(z[i]), the factor exp(c) cancels in any ratio: exp(z[i]+c)/exp(z[j]+c) = exp(z[i])/exp(z[j]). Softmax is therefore shift-invariant: adding a constant to all logits doesn't change the distribution. That invariance is what makes the numerical stability trick in the next section possible.
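Shift invariance is easy to check numerically. The sketch below uses a hypothetical `softmax` helper and arbitrary values for the logits and the shift `c`; it compares the distribution before and after adding `c` to every logit.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]                      # arbitrary logits
c = 5.0                                  # arbitrary shift applied to every logit
p = softmax(z)
q = softmax([v + c for v in z])

# The two distributions should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(p, q))
print(max_diff)
```

Any difference that shows up comes from rounding, not from the math.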
Numerical stability trick
softmax(z) = softmax(z - max(z))
z_safe = z - max(z) # all entries ≤ 0
exp_z = exp(z_safe) # no overflow
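result = exp_z / sum(exp_z)
Without subtracting the max, exp(z) overflows once a logit gets large enough: above roughly 88.7 in float32 and roughly 11.1 in float16. After the subtraction the largest term is exp(0) = 1, so nothing overflows and the denominator is at least 1. Production implementations do this subtraction as a matter of course. The math is identical; the floats survive.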
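A runnable version of the trick, as a minimal sketch in plain Python. The names `softmax_naive` and `softmax_stable` are illustrative. Python floats are 64-bit, where `math.exp` raises `OverflowError` above roughly 709.78, so the example uses deliberately exaggerated logits; lower-precision formats overflow much sooner.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]          # may overflow for large logits
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]      # every exponent is <= 0, so exp() <= 1
    total = sum(exps)                        # >= 1: the max entry contributes exp(0)
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]               # exaggerated logits to force overflow

try:
    softmax_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable = softmax_stable(big)
print(naive_overflowed)                      # True
print(stable)                                # roughly [0.090, 0.245, 0.665]
```

The stable result matches what softmax gives for [0, 1, 2], which is the same vector shifted by a constant.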
Temperature and sharpness
softmax(z / T)
Dividing the logits by a temperature T > 0 before softmax controls sharpness. T=1: the model's learned distribution. T<1: more peaked (more deterministic). T>1: flatter (more diverse). T→0: approaches argmax (greedy decoding). T→∞: approaches uniform. Temperature is one of the most important inference-time knobs.
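The effect of temperature can be seen on a small example. This is a sketch under stated assumptions: `softmax_t` is a hypothetical helper, the logits are made up, and the three temperatures are picked only to show the sharp, neutral, and flat regimes.

```python
import math

def softmax_t(z, T=1.0):
    """Softmax of z / T, with the max subtracted for stability."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # illustrative values
sharp = softmax_t(logits, T=0.1)             # nearly one-hot on the largest logit
neutral = softmax_t(logits, T=1.0)           # plain softmax
flat = softmax_t(logits, T=10.0)             # close to uniform

for name, dist in (("T=0.1", sharp), ("T=1", neutral), ("T=10", flat)):
    print(name, [round(p, 3) for p in dist])
```

The ranking of the entries never changes with T; only how much probability mass the top entry takes.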
Gradient of softmax
∂softmax(z)[i]/∂z[j] = softmax(z)[i] · (δ[i,j] - softmax(z)[j])
The Kronecker delta δ[i,j] is 1 when i=j and 0 otherwise. This Jacobian is essential for backpropagation through attention. It's also why combining softmax with cross-entropy gives a clean gradient (next article).
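One way to gain confidence in the Jacobian formula is to compare it against finite differences. The sketch below is an illustration only: `softmax_jacobian` and `numeric_jacobian` are hypothetical helper names, and the logits and step size `eps` are arbitrary choices.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = p[i] * (delta[i,j] - p[j])."""
    p = softmax(z)
    k = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(k)]
            for i in range(k)]

def numeric_jacobian(z, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    k = len(z)
    J = [[0.0] * k for _ in range(k)]
    for j in range(k):
        up = list(z)
        up[j] += eps
        down = list(z)
        down[j] -= eps
        p_up, p_down = softmax(up), softmax(down)
        for i in range(k):
            J[i][j] = (p_up[i] - p_down[i]) / (2 * eps)
    return J

z = [2.0, 1.0, 0.1]                          # illustrative logits
J = softmax_jacobian(z)
J_num = numeric_jacobian(z)
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
print(max_err)                               # small: the two Jacobians agree
```

Each row of J sums to zero, which reflects the fact that the outputs always sum to 1: nudging any logit can only move probability mass around, not create it.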