Caching Strategies for AI Systems
Common Mistakes
1. Setting the similarity threshold too low. A threshold of 0.85 will return cached responses for queries that are semantically similar but clinically different ("What is the first-line antibiotic for community-acquired pneumonia (CAP)?" vs. "What is the first-line antibiotic for hospital-acquired pneumonia?"). For clinical use, set the threshold at 0.95 or above.
2. Not invalidating cache on knowledge base update. If the formulary is updated and the semantic cache is not invalidated, users may receive cached responses citing a removed drug as available. Implement event-driven cache invalidation tied to knowledge base update events.
3. Caching patient-specific queries. Any query that includes a patient identifier or patient-specific clinical context must not be cached. Returning one patient's clinical context in response to a query about a different patient is a PHI breach.
4. Not recording cache_hit in response metadata. If the client does not know whether a response came from cache, clinical users cannot assess its freshness. Return cache_hit: true and cached_at: <timestamp> in the response metadata.
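Mistakes 1 and 4 can be illustrated together with a minimal sketch of a threshold-gated lookup that always reports cache metadata. This is an assumption-laden toy, not a production design: the `SemanticCache` class, the linear scan, and the hand-written two-dimensional embeddings are hypothetical, and a real system would use an embedding model and a vector index.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    embedding: list[float]
    response: str
    cached_at: float       # Unix timestamp recorded at write time
    source_version: str    # knowledge base version the answer was built from


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical in-memory cache; linear scan for clarity only."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: list[CacheEntry] = []

    def put(self, embedding: list[float], response: str, source_version: str) -> None:
        self.entries.append(CacheEntry(embedding, response, time.time(), source_version))

    def lookup(self, embedding: list[float]) -> dict:
        best, best_score = None, -1.0
        for entry in self.entries:
            score = cosine_similarity(embedding, entry.embedding)
            if score > best_score:
                best, best_score = entry, score
        # Serve from cache only when the closest entry clears the threshold.
        if best is not None and best_score >= self.threshold:
            return {"response": best.response, "cache_hit": True,
                    "cached_at": best.cached_at, "source_version": best.source_version}
        # On a miss, still return the metadata fields so clients can rely on them.
        return {"response": None, "cache_hit": False,
                "cached_at": None, "source_version": None}
```

With these toy vectors, a query embedding at roughly 0.90 cosine similarity to a stored entry is served from cache at a 0.85 threshold but falls through to the model at 0.95, which is the behavior the threshold guidance above is after.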
Best Practices
- Set semantic similarity threshold at ≥0.95 for clinical queries
- Never cache patient-specific queries or any response derived from patient PHI
- Implement event-driven cache invalidation tied to knowledge base update events
- Return cache metadata (cache_hit, cached_at, source_version) in every response
- Monitor cache hit rate as an operational metric
- Apply prompt caching to system prompts and retrieved context chunks that are stable across requests
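Two of these practices, bypassing the cache for patient-specific requests and tracking hit rate, can be sketched in a few lines. The request fields `patient_id` and `patient_context` and the `CacheStats` helper are assumptions made for illustration; checking two fields is not a substitute for real PHI detection, which would also need to inspect free-text query content.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    bypassed: int = 0  # requests that were never eligible for caching

    @property
    def hit_rate(self) -> float:
        # Hit rate is computed over cache-eligible requests only.
        eligible = self.hits + self.misses
        return self.hits / eligible if eligible else 0.0


def is_cacheable(request: dict) -> bool:
    # Assumed request shape: any patient identifier or patient-specific
    # context makes the request ineligible for both cache reads and writes.
    return not (request.get("patient_id") or request.get("patient_context"))


def record_outcome(stats: CacheStats, request: dict, cache_hit: bool) -> str:
    if not is_cacheable(request):
        stats.bypassed += 1
        return "bypass"
    if cache_hit:
        stats.hits += 1
        return "hit"
    stats.misses += 1
    return "miss"
```

Counting bypassed requests separately keeps the hit rate meaningful: a rise in patient-specific traffic then shows up as more bypasses rather than as an apparent drop in cache effectiveness.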
Trade-offs
| Cache Type | Latency Reduction | Cost Reduction | Correctness Risk | Staleness Risk |
|---|---|---|---|---|
| Semantic Cache | High (returns in ms) | High (no LLM call) | Incorrect similarity match | Present; mitigate with invalidation and a TTL expiry policy |
| Prompt Cache (API) | Medium (faster TTFT) | Medium (prefix tokens) | Low | None (per-request context) |
| KV Cache (Model) | Low-Medium | None (cost is per token) | Low | None (in-request) |
Interview Questions
Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?
Category: System Design | Difficulty: Senior | Role: AI Architect
Answer Framework:
Threshold selection: For clinical use, the similarity threshold must be set high enough that clinically distinct queries cannot retrieve the same cached response. I would start at 0.95 and evaluate against a test query set that includes pairs of clinically similar but meaningfully different questions. A 0.85 threshold that may be acceptable for a customer support chatbot is not appropriate for a system making clinical recommendations.
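One way to make that evaluation concrete is to score each candidate threshold against labeled query pairs. The sketch below assumes the pairwise similarities have already been computed by whatever embedding model the system uses; the function names are hypothetical and the numbers are placeholder values for illustration, not measurements.

```python
from typing import Optional


def hit_fraction(similarities: list[float], threshold: float) -> float:
    """Fraction of query pairs that would be treated as a cache hit."""
    if not similarities:
        return 0.0
    return sum(s >= threshold for s in similarities) / len(similarities)


def lowest_safe_threshold(must_not_match: list[float],
                          candidates: list[float]) -> Optional[float]:
    """Lowest candidate at which no must-not-match pair is a cache hit."""
    for threshold in sorted(candidates):
        if hit_fraction(must_not_match, threshold) == 0.0:
            return threshold
    return None  # no candidate is safe; the test set needs a stricter range


# Placeholder similarities (illustrative only):
# pairs that are clinically different and must never share a response
must_not_match = [0.91, 0.88, 0.93]
# true paraphrases that ideally would still be served from cache
paraphrases = [0.97, 0.96, 0.94]

chosen = lowest_safe_threshold(must_not_match, [0.85, 0.90, 0.95])
retained = hit_fraction(paraphrases, chosen) if chosen is not None else 0.0
```

Reporting `retained` next to `chosen` shows the cost of the safety margin: how many legitimate paraphrases stop hitting the cache once the threshold is high enough to separate the must-not-match pairs.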
Cache invalidation: Event-driven invalidation is the correct pattern. When the knowledge base pipeline completes re-indexing a document, it publishes a knowledge_base.document.updated event containing the document ID. A cache invalidation handler subscribes to this event and removes all cached responses whose source_documents list includes that document ID. This requires maintaining a reverse index from document ID to cache keys at write time. TTL-based expiry is a fallback, not the primary invalidation mechanism.
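The reverse-index pattern described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the `InvalidatingCache` class is hypothetical, the event is modeled as a plain dict, and the `document_id` payload field is an assumed name. A real deployment would wire `on_document_updated` to a message bus subscription.

```python
from collections import defaultdict
from typing import Optional


class InvalidatingCache:
    """Hypothetical cache with a reverse index from document ID to cache keys."""

    def __init__(self) -> None:
        self._responses: dict[str, dict] = {}
        self._doc_to_keys: dict[str, set[str]] = defaultdict(set)

    def _remove(self, cache_key: str) -> bool:
        entry = self._responses.pop(cache_key, None)
        if entry is None:
            return False
        # Keep the reverse index consistent for every document the entry cited.
        for doc_id in entry["source_documents"]:
            self._doc_to_keys[doc_id].discard(cache_key)
        return True

    def put(self, cache_key: str, response: str, source_documents: list[str]) -> None:
        self._remove(cache_key)  # drop stale index entries on overwrite
        self._responses[cache_key] = {
            "response": response,
            "source_documents": list(source_documents),
        }
        # The reverse index is maintained at write time.
        for doc_id in source_documents:
            self._doc_to_keys[doc_id].add(cache_key)

    def get(self, cache_key: str) -> Optional[dict]:
        return self._responses.get(cache_key)

    def on_document_updated(self, event: dict) -> int:
        """Handle a knowledge_base.document.updated event; return entries removed."""
        doc_id = event["document_id"]  # assumed payload field name
        keys = list(self._doc_to_keys.get(doc_id, ()))
        return sum(self._remove(key) for key in keys)
```

Because the lookup goes from document ID straight to the affected keys, the handler does not have to scan every cached response when a guideline changes, and entries that cite several documents are removed as soon as any one of them is updated.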
Key Points to Hit:
- High threshold for clinical safety; justify with the failure mode (incorrect recommendation)
- Event-driven invalidation as the primary mechanism; TTL as safety net
- Reverse index (source_id → cache_keys) for efficient invalidation
- Exclusion of patient-specific queries from the cache
Key Takeaways
- Semantic caching, prompt caching, and KV cache operate at different layers and solve different problems; all three are relevant in a production AI system
- The similarity threshold for a clinical semantic cache must be high (≥0.95) because the failure mode is returning an incorrect clinical recommendation
- Never cache patient-specific queries — PHI in queries makes caching a patient data breach risk
- Cache invalidation must be event-driven (tied to knowledge base updates) for clinical domains; TTL alone is not sufficient
- Return cache metadata in every response so clients can assess freshness