LLM API calls cost money and add latency. Caching is the highest-ROI optimization. Three caching strategies stack: exact-match for repeat queries, semantic for near-duplicates, prompt-prefix for shared system prompts. Combined, they can cut API spend 50-80%.
Exact-match
Hash the full prompt; cache the response. Hit rate ~10-30% on chatbot traffic. Use Redis with TTL ~1 day. Be careful with timestamps or randomness in the prompt — they kill the cache.
Semantic caching
Embed the query; check vector DB for similar previous queries; return their cached response if similarity > 0.95. Hit rate ~30-50%. Tools: GPTCache, langchain's RedisSemanticCache. Trade-off: occasional stale or off-topic responses; tune similarity threshold.
Prompt-prefix caching
Anthropic and OpenAI now support explicit cache hints on the prefix of a prompt (the system message + tools + few-shot examples). The provider stores the KV-cache for that prefix; subsequent calls with the same prefix are 5-10x faster and ~50% cheaper. Use for: long system prompts, RAG with stable retriever output.
Cache invalidation
Bump the cache key when the prompt template changes (include a version hash). For semantic cache, invalidate when the source-of-truth changes (e.g., when product catalog updates). Use TTL to bound staleness even without explicit invalidation.
Cost-benefit analysis
Caching makes sense when (request cost × hit rate) > (cache infra cost). At $0.001/req and 30% hit rate over 10M req/day = $9K/mo saved. Redis cluster + embedding service = $500/mo. ROI: 18x. Always cache.