Attention Variants in 2026

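Multi-Head Attention (MHA) is the canonical design, but its KV cache makes it expensive at inference time. Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-head Latent Attention (MLA) each shrink the KV cache at a small cost in quality. Knowing which variant is deployed where helps you read papers and pick architectures.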

MHA → MQA: drop per-head K,V

MQA shares a single K,V head across all query heads. The KV cache shrinks by a factor equal to the number of query heads, typically 8-32x, at the cost of a slight quality drop. Used in some inference-optimized models.
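To make the cache arithmetic concrete, here is a minimal sketch that counts the bytes needed to store K and V. The model shape (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements) is a hypothetical configuration chosen for illustration, and `kv_cache_bytes` is a helper invented here, not a library function.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer and every token."""
    # The leading 2 accounts for the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, for illustration only.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)
n_query_heads = 32

mha = kv_cache_bytes(n_kv_heads=n_query_heads, **cfg)  # one K,V head per query head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)              # one K,V head shared by all

print(f"MHA cache: {mha / 2**20:.0f} MiB")
print(f"MQA cache: {mqa / 2**20:.0f} MiB")
print(f"reduction: {mha // mqa}x")
```

The reduction factor is simply the number of query heads, because only `n_kv_heads` changes between the two calls.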

GQA: middle ground

Grouped-Query Attention. Query heads are split into groups, and each group shares one K,V head. This balances MHA quality against MQA cache savings. Used in the larger Llama 2 models, Llama 3, Mixtral, and most current open models.
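The grouping can be sketched as a mapping from each query head to the K,V head it reads. The helper `kv_head_for` is hypothetical, and the sketch assumes that consecutive query heads form a group, which is a common convention rather than a requirement.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the K,V head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    # Assumption: consecutive query heads belong to the same group.
    return query_head // group_size


n_query_heads = 8
mappings = {}
for name, n_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mappings[name] = [
        kv_head_for(h, n_query_heads, n_kv_heads) for h in range(n_query_heads)
    ]
    print(name, mappings[name])
```

Seen this way, MHA and MQA are the two extremes of GQA: as many K,V heads as query heads, or just one.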

MLA: DeepSeek's compression

Multi-head Latent Attention. Compresses K,V into a lower-rank latent, caches only that latent, and expands it back into K,V when attention is computed. Roughly 4x KV cache reduction with no quality drop on reported benchmarks. Used in DeepSeek V2/V3.
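The compress-then-expand idea can be sketched with plain matrix-vector products. This is a toy illustration, not DeepSeek's implementation: the dimensions, the random weights, and the helper names (`matvec`, `random_matrix`, `w_down`, `w_up_k`, `w_up_v`) are all assumptions, and details such as positional-encoding handling are left out.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned projections."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, n_heads, head_dim, d_latent = 16, 4, 4, 8  # toy sizes, not DeepSeek's

w_down = random_matrix(d_latent, d_model, rng)             # hidden -> latent
w_up_k = random_matrix(n_heads * head_dim, d_latent, rng)  # latent -> all K heads
w_up_v = random_matrix(n_heads * head_dim, d_latent, rng)  # latent -> all V heads

hidden = [rng.gauss(0.0, 1.0) for _ in range(d_model)]     # one token's hidden state

latent = matvec(w_down, hidden)  # the only per-token vector written to the cache
keys = matvec(w_up_k, latent)    # rebuilt from the latent at attention time
values = matvec(w_up_v, latent)

mha_floats_per_token = 2 * n_heads * head_dim  # full K and V for every head
mla_floats_per_token = len(latent)             # just the shared latent
print(mha_floats_per_token, mla_floats_per_token)
```

With these toy sizes the cache holds 8 floats per token instead of 32. The actual saving in a real model depends on the latent width chosen relative to the number of heads and the head dimension.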

GQA is the modern default. MLA is a 2024+ improvement for cache-constrained inference. MQA is mostly retired.
