Caching Strategies for AI Systems
Executive Summary
Caching in AI systems operates at three distinct layers — prompt caching at the API level, semantic caching at the request level, and KV cache at the model inference level — each reducing latency and cost by different mechanisms. Without caching, production AI systems serving thousands of concurrent users incur redundant LLM API calls, inflate costs linearly with request volume, and impose unnecessary latency on requests that have been answered before. This chapter covers the architecture and implementation of multi-layer caching for enterprise AI deployments, with particular attention to the trade-offs between cache hit rate, staleness risk, and the special constraints that apply to clinical AI.
Learning Objectives
- Distinguish prompt caching, semantic caching, and KV cache, and identify when each applies
- Implement semantic caching with a vector similarity threshold appropriate to the use case
- Configure prompt caching on the Anthropic API and Azure OpenAI Service
- Design cache invalidation policies for knowledge bases that change (formularies, clinical guidelines)
- Identify when caching is inappropriate: non-idempotent queries, PHI-containing requests
Business Problem
In a Reference Healthcare Organization deploying a clinical decision support RAG system, the same clinical query patterns appear repeatedly: "What is the first-line treatment for community-acquired pneumonia?", "What are the ACLS chest compression guidelines?", "What are the formulary alternatives to [brand name drug]?". Answering each of these queries with a fresh LLM call costs API credits and adds 1–3 seconds of latency. At scale — thousands of clinical users, tens of thousands of queries per day — this produces significant costs and latency that could be avoided for semantically identical queries.
Why This Technology Exists
The original LLM API architecture is stateless: every request submits the full prompt, the API processes it, and returns a response. There is no mechanism for the API to recognize that a request is semantically identical to one it has recently processed.
Three independent caching innovations emerged to address this:
Prompt caching (provider-side): Anthropic and Azure OpenAI added cache_control parameters that allow the API to cache the KV computation for repeated prompt prefixes (system prompts, document chunks). This reduces cost and first-token latency for requests sharing a common prefix.
Semantic caching (application-side): Application-layer caches that store previous LLM responses indexed by the embedding of the query, allowing fuzzy retrieval of cached responses for semantically similar (not just identical) queries.
Model-level KV cache (inference-side): During inference, the attention mechanism computes key-value pairs for every token in the context. When running multiple requests with the same prefix, the KV pairs for that prefix can be reused (vLLM's prefix caching, PagedAttention).
Enterprise Considerations
Cache key design: The semantic cache key must include not just the query but also contextual dimensions: the knowledge base version in use, the user's department (if retrieval is department-scoped), and the retrieval date. A cached response from a 3-month-old knowledge base version is not valid for a query against the current version.
Cache hit rate monitoring: Track cache hit rate as an operational metric. A dropping cache hit rate may indicate query pattern shift, aggressive cache invalidation, or an error in the cache lookup logic. A very high hit rate (>90%) may indicate that the TTL is too long and stale content is being served.
Cold cache at startup: After a deployment, the semantic cache is empty. The first few hundred requests will all miss, temporarily increasing API costs and latency. Warm the cache before deploying by running a representative query set.
Common Mistakes
1. Setting similarity threshold too low. A threshold of 0.85 will return cached responses for semantically similar but meaningfully different queries ("What is the first-line antibiotic for CAP?" vs. "What is the first-line antibiotic for hospital-acquired pneumonia?"). For clinical use, set the threshold at 0.95 or above.
2. Not invalidating cache on knowledge base update. If the formulary is updated and the semantic cache is not invalidated, users may receive cached responses citing a removed drug as available. Implement event-driven cache invalidation tied to knowledge base update events.
3. Caching patient-specific queries. Any query that includes a patient identifier or patient-specific clinical context must not be cached. The risk of returning one patient's clinical context to a query from a different patient's context is a PHI breach.
4. Not recording cachehit in response metadata. If the client does not know whether a response came from cache, clinical users cannot assess its freshness. Return cache</em>hit: true and cached_at: <timestamp> in the response metadata.
Best Practices
- Set semantic similarity threshold at ≥0.95 for clinical queries
- Never cache patient-specific queries or any response derived from patient PHI
- Implement event-driven cache invalidation tied to knowledge base update events
- Return cache metadata (cachehit, cachedat, source_version) in every response
- Monitor cache hit rate as an operational metric
- Apply prompt caching to system prompts and retrieved context chunks that are stable across requests
Trade-offs
| Cache Type | Latency Reduction | Cost Reduction | Risk | Staleness Risk |
|---|---|---|---|---|
| Semantic Cache | High (returns in ms) | High (no LLM call) | Incorrect similarity match | TTL expiry policy |
| Prompt Cache (API) | Medium (faster TTFT) | Medium (prefix tokens) | Low | None (per-request context) |
| KV Cache (Model) | Low-Medium | None (cost is per token) | Low | None (in-request) |
Interview Questions
Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?
Category: System Design Difficulty: Senior Role: AI Architect
Answer Framework:
Threshold selection: For clinical use, the similarity threshold must be set high enough that semantically different queries cannot retrieve the same cached response. I would start at 0.95 and evaluate against a test query set that includes pairs of clinically similar but meaningfully different questions. A 0.85 threshold appropriate for a customer support chatbot is not appropriate for a system making clinical recommendations.
Cache invalidation: Event-driven invalidation is the correct pattern. When the knowledge base pipeline completes re-indexing a document, it publishes a knowledge<em>base.document.updated event containing the document ID. A cache invalidation handler subscribes to this event and removes all cached responses whose source</em>documents list includes that document ID. This requires maintaining a reverse index from document ID to cache keys at write time. TTL-based expiry is a fallback, not the primary invalidation mechanism.
Key Points to Hit:
- High threshold for clinical safety; justify with the failure mode (incorrect recommendation)
- Event-driven invalidation as the primary mechanism; TTL as safety net
- Reverse index (sourceid → cachekeys) for efficient invalidation
- Exclusion of patient-specific queries from the cache
Key Takeaways
- Semantic caching, prompt caching, and KV cache operate at different layers and solve different problems; all three are relevant in a production AI system
- The similarity threshold for a clinical semantic cache must be high (≥0.95) because the failure mode is returning an incorrect clinical recommendation
- Never cache patient-specific queries — PHI in queries makes caching a patient data breach risk
- Cache invalidation must be event-driven (tied to knowledge base updates) for clinical domains; TTL alone is not sufficient
- Return cache metadata in every response so clients can assess freshness
Further Reading
- Vector Databases — pgvector used as the semantic cache store
- LLM Serving Infrastructure — KV cache at the model serving layer
- Cloud AI Platforms — Prompt caching availability per provider