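The KV cache is the single most important inference optimization for autoregressive transformers. It cuts the attention work of each decode step from O(T²), where the whole prefix of length T is re-encoded, to O(T), where one new query attends to T cached keys and values. Knowing its math and memory cost helps you reason about long-context inference.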
Why cache K and V (not Q)?
# At decode step t, attention is:
Q_t = x_t · W_Q # 1 vector (just new token)
K = stack(K_0, K_1, ..., K_t) # all keys so far
V = stack(V_0, V_1, ..., V_t) # all values so far
out_t = softmax(Q_t · Kᵀ / sqrt(d)) · V

The Q of a past position is never needed again, because that position's output has already been computed. The K and V of past positions are read at every future step. So cache K and V, and compute Q only for the new token.
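To check that caching changes the cost and not the result, here is a minimal single-head sketch in plain Python. The helper names (`matvec`, `attend`, `rand_matrix`), the tiny dimension `d = 4`, and the random weights are illustrative assumptions, not part of any library. It decodes six tokens twice, once appending to a cache and once recomputing K and V for the whole prefix, and compares the outputs.

```python
import math
import random

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Single-head attention of one query over a list of keys and values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d = 4
W_Q, W_K, W_V = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
xs = rand_matrix(6, d)  # six token embeddings (hypothetical inputs)

# Cached decode: project only the new token and append its K and V.
k_cache, v_cache, cached_out = [], [], []
for x_t in xs:
    k_cache.append(matvec(x_t, W_K))
    v_cache.append(matvec(x_t, W_V))
    cached_out.append(attend(matvec(x_t, W_Q), k_cache, v_cache))

# Naive decode: recompute K and V for the whole prefix at every step.
naive_out = []
for t in range(len(xs)):
    keys = [matvec(x, W_K) for x in xs[: t + 1]]
    values = [matvec(x, W_V) for x in xs[: t + 1]]
    naive_out.append(attend(matvec(xs[t], W_Q), keys, values))

max_diff = max(abs(a - b)
               for c_row, n_row in zip(cached_out, naive_out)
               for a, b in zip(c_row, n_row))
print("largest difference between cached and naive outputs:", max_diff)
```

The cached loop performs one K and one V projection per step, while the naive loop performs t + 1 of each at step t, which is where the savings come from.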
Cache append, not recompute
# After processing new token t:
k_t = x_t · W_K # 1 vector
v_t = x_t · W_V # 1 vector
k_cache = concat(k_cache, k_t) # grows by 1
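v_cache = concat(v_cache, v_t) # grows by 1

Each step adds one row to k_cache and v_cache; the rest of the cache is reused unchanged. In practice, implementations avoid a literal concat (which would reallocate and copy every step): they pre-allocate a fixed buffer of max_seq rows and write the new row at position [t].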
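The pre-allocated variant can be sketched as follows. The `KVCache` class, its method names, and the one-layer, one-head layout are hypothetical and chosen only for illustration; real engines store tensors on the accelerator and often page the buffer.

```python
class KVCache:
    """Fixed-capacity cache for one layer and one head (illustrative only)."""

    def __init__(self, max_seq, d_head):
        # Allocate all rows up front so appending never reallocates.
        self.k = [[0.0] * d_head for _ in range(max_seq)]
        self.v = [[0.0] * d_head for _ in range(max_seq)]
        self.max_seq = max_seq
        self.d_head = d_head
        self.length = 0  # number of valid rows written so far

    def append(self, k_t, v_t):
        if self.length >= self.max_seq:
            raise IndexError("KV cache is full")
        if len(k_t) != self.d_head or len(v_t) != self.d_head:
            raise ValueError("row has the wrong dimension")
        # Write in place at position t instead of concatenating.
        self.k[self.length][:] = k_t
        self.v[self.length][:] = v_t
        self.length += 1

    def view(self):
        """Return only the rows that have been written."""
        return self.k[: self.length], self.v[: self.length]


cache = KVCache(max_seq=4, d_head=2)
cache.append([1.0, 2.0], [3.0, 4.0])
cache.append([5.0, 6.0], [7.0, 8.0])
keys, values = cache.view()
print(cache.length, keys, values)
```

The buffer size stays at max_seq rows for the whole generation; only `length` changes, so attention must read the valid prefix and ignore the unwritten rows.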
Memory cost
# Per layer (h = number of KV heads, bytes = bytes per element):
K_size = h * d_head * seq * batch * bytes
V_size = K_size
# Total across L layers (the factor 2 covers K and V):
total = 2 * L * h * d_head * seq * batch * bytes

For Llama 3 8B (L=32, h=8 KV heads with GQA, d_head=128, BF16 = 2 bytes), the cache costs 2 · 32 · 8 · 128 · 2 = 131,072 bytes = 128 KB per token. At seq=8192, batch=1 that is 1 GB; at seq=32768 it is 4 GB.
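The same arithmetic as a small helper, so the numbers can be reproduced for other configurations. The function name `kv_cache_bytes` and its parameter names are made up for this sketch; the Llama 3 8B values are the ones quoted above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """Total KV cache size in bytes; the leading 2 counts both K and V."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


llama3_8b = dict(n_layers=32, n_kv_heads=8, d_head=128)  # BF16 by default

per_token = kv_cache_bytes(seq_len=1, **llama3_8b)
at_8k = kv_cache_bytes(seq_len=8192, **llama3_8b)
at_32k = kv_cache_bytes(seq_len=32768, **llama3_8b)

print(per_token / 2**10, "KiB per token")
print(at_8k / 2**30, "GiB at seq=8192")
print(at_32k / 2**30, "GiB at seq=32768")
```

Note that the cache grows linearly in both seq and batch, so a batch of 16 at seq=8192 needs 16 times the single-sequence figure.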
GQA reduces it
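Grouped Query Attention (GQA) shares each K, V head across a group of query heads. Llama 3 8B has 32 query heads but only 8 KV heads, a 4× cache reduction relative to standard multi-head attention. MLA (DeepSeek) instead caches a compressed latent representation of K and V, for roughly an additional 4× reduction. Without techniques like these, long-context inference at useful batch sizes would often be impractical.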
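A short sketch of the head sharing, using the 32 query heads and 8 KV heads quoted above. The function `kv_head_for` is hypothetical, and the contiguous grouping (query heads 0 to 3 share KV head 0, and so on) is an assumption made for illustration.

```python
def kv_head_for(query_head, n_q_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_q_heads // n_kv_heads
    return query_head // group_size


n_q_heads, n_kv_heads = 32, 8
mapping = [kv_head_for(h, n_q_heads, n_kv_heads) for h in range(n_q_heads)]

# Only the KV heads are cached, so the cache shrinks by this factor
# compared with caching one K, V pair per query head.
reduction = n_q_heads / n_kv_heads

print(mapping)
print("cache reduction vs one KV head per query head:", reduction)
```

Only the number of cached heads changes; every query head still computes its own attention scores against the shared keys.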
Quantizing the KV cache
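# vLLM and TensorRT-LLM: FP8 KV cache
# - 2x memory reduction vs BF16
# - typically <1% quality drop
# llama.cpp: block-quantized 4-bit KV cache (e.g. q4_0)
# - about 3.5x reduction vs 16-bit once per-block scales are counted
# - typically 1-2% quality drop

KV cache quantization is one of the highest-impact inference optimizations: it lets you serve longer contexts or bigger batches within the same memory budget. The quality figures above are rough and vary with the model and task. It is supported by the major production inference engines.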
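To make the precision-for-memory trade concrete, here is a minimal sketch of quantizing one cached row. It uses symmetric 8-bit integer quantization with a single scale per row, which is a simplification chosen for readability; it is not the FP8 format or the llama.cpp block format mentioned above. The function names and the sample row are hypothetical.

```python
def quantize_row(row, bits=8):
    """Symmetric absmax quantization: integers in [-qmax, qmax] plus one scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    absmax = max(abs(x) for x in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [round(x / scale) for x in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate floats when the row is read for attention."""
    return [qi * scale for qi in q]


k_row = [0.12, -1.5, 0.73, 0.004]  # made-up key vector
q, scale = quantize_row(k_row)
restored = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(k_row, restored))

print("quantized:", q)
print("scale:", scale)
print("largest round-trip error:", max_err)
```

Each stored value shrinks from 16 bits to 8, at the price of a rounding error of at most half a scale step per element; small entries next to a large outlier lose the most relative precision, which is one reason production formats use finer-grained scales.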