'LLM cache' refers to three different things at three different layers. Each saves different costs. Conflating them produces bad architecture choices.
KV cache (per-request)
Transformer caches past key/value tensors to avoid recomputing attention for already-seen tokens. Automatic inside the model. Memory ~2 × layers × seq_len × hidden_dim × batch. Dominant memory cost during inference.
Prefix cache (cross-request)
Same system prompt or context across many requests. Cache the KV tensors for that shared prefix. vLLM, SGLang, Anthropic's prompt caching — all do this. 5-10x cost reduction for cacheable workloads.
Semantic cache (cross-query)
Embed query, search for semantically similar past query, return cached answer if hit. Saves model call entirely. Tools: GPTCache. Useful for FAQ-like deterministic workloads, useless when queries are unique.