LLM Caching Strategies — Belgavi.AI Lab

LLM API calls cost money and add latency. Caching is the highest-ROI optimization. Three caching strategies stack: exact-match for repeat queries, semantic for near-duplicates, prompt-prefix for shared system prompts. Combined, they can cut API spend 50-80%.

Advertisement

Exact-match

Hash the full prompt; cache the response. Hit rate ~10-30% on chatbot traffic. Use Redis with TTL ~1 day. Be careful with timestamps or randomness in the prompt — they kill the cache.

Semantic caching

Embed the query; check vector DB for similar previous queries; return their cached response if similarity > 0.95. Hit rate ~30-50%. Tools: GPTCache, langchain's RedisSemanticCache. Trade-off: occasional stale or off-topic responses; tune similarity threshold.

Advertisement

Prompt-prefix caching

Anthropic and OpenAI now support explicit cache hints on the prefix of a prompt (the system message + tools + few-shot examples). The provider stores the KV-cache for that prefix; subsequent calls with the same prefix are 5-10x faster and ~50% cheaper. Use for: long system prompts, RAG with stable retriever output.

Cache invalidation

Bump the cache key when the prompt template changes (include a version hash). For semantic cache, invalidate when the source-of-truth changes (e.g., when product catalog updates). Use TTL to bound staleness even without explicit invalidation.

Cost-benefit analysis

Caching makes sense when (request cost × hit rate) > (cache infra cost). At $0.001/req and 30% hit rate over 10M req/day = $9K/mo saved. Redis cluster + embedding service = $500/mo. ROI: 18x. Always cache.

Exact-match + semantic + prompt-prefix together. 50-80% API spend reduction is achievable. Start with provider-native prefix cache.