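Multi-Head Attention (MHA) is the canonical design, but it is expensive at inference time: every head caches its own keys and values for every token, so the KV cache grows with head count, layer count, and context length. Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-head Latent Attention (MLA) shrink that cache, usually at a small quality cost. Knowing which variant is deployed where helps you read papers and pick architectures.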
MHA → MQA: drop per-head K,V
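MQA keeps a separate query projection per head but shares a single K,V pair across all query heads. The KV cache shrinks by a factor equal to the number of heads, typically 8-32x. The cost is a slight quality drop. Used in some inference-optimized models, such as PaLM, Falcon-7B, and StarCoder.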
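To make the cache arithmetic concrete, here is a minimal sketch that counts KV-cache bytes for MHA and MQA. The function name `kv_cache_bytes` and the model shape (32 layers, 32 heads, head dimension 128, 4096 tokens, 2-byte elements) are assumptions chosen for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer and every token."""
    # The leading factor 2 covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, for illustration only.
n_layers, n_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)  # one K,V pair per query head
mqa = kv_cache_bytes(n_layers, 1, head_dim, seq_len)        # one K,V pair shared by all heads

print(f"MHA: {mha / 2**20:.0f} MiB")   # MHA: 2048 MiB
print(f"MQA: {mqa / 2**20:.0f} MiB")   # MQA: 64 MiB
print(f"reduction: {mha // mqa}x")     # reduction: 32x
```

The reduction factor is simply the head count, which is why the savings depend on how many heads the original model had.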
GQA: middle ground
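Grouped-Query Attention. Query heads are split into groups, and each group shares one K,V pair. With H query heads and G groups, the cache is H/G times smaller than MHA; G = 1 is MQA and G = H is MHA. This balances MHA quality against MQA cache savings. Used in Llama 2 70B, Llama 3, Mixtral, and most current open models.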
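The sharing pattern can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: a single sequence, no causal mask, queries and keys already projected, and each K,V head serving a run of consecutive query heads. The function name and tensor sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d). No mask, single sequence."""
    n_q_heads, _, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each K,V head is reused by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))  # 8 query heads
k = rng.standard_normal((2, 5, 16))  # 2 K,V heads, so 4 query heads per group
v = rng.standard_normal((2, 5, 16))

out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

Only the 2-head `k` and `v` need to be cached; the repeat happens at attention time. Passing a single K,V head gives MQA, and passing as many K,V heads as query heads gives MHA.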
MLA: DeepSeek's compression
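Multi-head Latent Attention. Each token's K,V are compressed into one low-rank latent vector shared by all heads. Only the latent is cached; per-head K,V are reconstructed from it with up-projections at attention time. At DeepSeek-V2's published dimensions the cache is roughly 4x smaller than a GQA cache with 8 K,V heads (the ratio against full MHA is much larger), and DeepSeek reports no quality drop in its benchmarks. Used in DeepSeek V2/V3.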
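The compress-then-decompress idea can be sketched as follows. This is a simplified illustration, not DeepSeek's implementation: it omits the decoupled rotary-position key, query compression, masking, and softmax, and all sizes and weight names (`W_down`, `W_up_k`, `W_up_v`) are hypothetical. It also shows why decompression need not be explicit: the key up-projection can be folded into the query, so attention scores are computed directly against the cached latent.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head, d_latent = 6, 64, 4, 16, 8  # hypothetical sizes

h = rng.standard_normal((seq, d_model))                    # token hidden states
W_down = rng.standard_normal((d_model, d_latent))          # shared down-projection
W_up_k = rng.standard_normal((n_heads, d_latent, d_head))  # per-head K up-projection
W_up_v = rng.standard_normal((n_heads, d_latent, d_head))  # per-head V up-projection
q = rng.standard_normal((n_heads, seq, d_head))            # queries, assumed already projected

# Only this latent is cached: d_latent numbers per token
# instead of 2 * n_heads * d_head for MHA.
latent_cache = h @ W_down                                  # (seq, d_latent)

# Explicit path: rebuild per-head K and V from the latent when attending.
k = np.einsum("sc,hcd->hsd", latent_cache, W_up_k)         # (n_heads, seq, d_head)
v = np.einsum("sc,hcd->hsd", latent_cache, W_up_v)         # (n_heads, seq, d_head)
scores = q @ k.transpose(0, 2, 1)                          # (n_heads, seq, seq)

# Absorbed path: fold W_up_k into the query so K is never materialized.
q_latent = np.einsum("hsd,hcd->hsc", q, W_up_k)            # (n_heads, seq, d_latent)
scores_absorbed = q_latent @ latent_cache.T                # (n_heads, seq, seq)

assert np.allclose(scores, scores_absorbed)
print("cached per token:", d_latent, "vs MHA:", 2 * n_heads * d_head)
# cached per token: 8 vs MHA: 128
```

The two score tensors match because the up-projection is linear, which is what lets an implementation trade the stored K,V tensors for a slightly larger query-side matrix multiply.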
GQA is the modern default. MLA, introduced in 2024, is an improvement for cache-constrained inference. MQA is mostly retired in new models.