Caching Strategies for AI Systems

Common Mistakes

1. Setting similarity threshold too low. A threshold of 0.85 will return cached responses for semantically similar but meaningfully different queries ("What is the first-line antibiotic for CAP?" vs. "What is the first-line antibiotic for hospital-acquired pneumonia?"). For clinical use, set the threshold at 0.95 or above.

2. Not invalidating cache on knowledge base update. If the formulary is updated and the semantic cache is not invalidated, users may receive cached responses citing a removed drug as available. Implement event-driven cache invalidation tied to knowledge base update events.

3. Caching patient-specific queries. Any query that includes a patient identifier or patient-specific clinical context must not be cached. The risk of returning one patient's clinical context to a query from a different patient's context is a PHI breach.

4. Not recording cachehit in response metadata. If the client does not know whether a response came from cache, clinical users cannot assess its freshness. Return cache</em>hit: true and cached_at: <timestamp> in the response metadata.

Best Practices

Set semantic similarity threshold at ≥0.95 for clinical queries

Never cache patient-specific queries or any response derived from patient PHI

Implement event-driven cache invalidation tied to knowledge base update events

Return cache metadata (cachehit, cachedat, source_version) in every response

Monitor cache hit rate as an operational metric

Apply prompt caching to system prompts and retrieved context chunks that are stable across requests

Trade-offs

Cache Type Latency Reduction Cost Reduction Risk Staleness Risk

Semantic Cache High (returns in ms) High (no LLM call) Incorrect similarity match TTL expiry policy

Prompt Cache (API) Medium (faster TTFT) Medium (prefix tokens) Low None (per-request context)

KV Cache (Model) Low-Medium None (cost is per token) Low None (in-request)

Interview Questions

Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?

Category: System Design Difficulty: Senior Role: AI Architect

Answer Framework:

Threshold selection: For clinical use, the similarity threshold must be set high enough that semantically different queries cannot retrieve the same cached response. I would start at 0.95 and evaluate against a test query set that includes pairs of clinically similar but meaningfully different questions. A 0.85 threshold appropriate for a customer support chatbot is not appropriate for a system making clinical recommendations.

Cache invalidation: Event-driven invalidation is the correct pattern. When the knowledge base pipeline completes re-indexing a document, it publishes a knowledge<em>base.document.updated event containing the document ID. A cache invalidation handler subscribes to this event and removes all cached responses whose source</em>documents list includes that document ID. This requires maintaining a reverse index from document ID to cache keys at write time. TTL-based expiry is a fallback, not the primary invalidation mechanism.

Key Points to Hit:

High threshold for clinical safety; justify with the failure mode (incorrect recommendation)

Event-driven invalidation as the primary mechanism; TTL as safety net

Reverse index (sourceid → cachekeys) for efficient invalidation

Exclusion of patient-specific queries from the cache

Key Takeaways

Semantic caching, prompt caching, and KV cache operate at different layers and solve different problems; all three are relevant in a production AI system

The similarity threshold for a clinical semantic cache must be high (≥0.95) because the failure mode is returning an incorrect clinical recommendation

Never cache patient-specific queries — PHI in queries makes caching a patient data breach risk

Cache invalidation must be event-driven (tied to knowledge base updates) for clinical domains; TTL alone is not sufficient

Return cache metadata in every response so clients can assess freshness

Cache Type	Latency Reduction	Cost Reduction	Risk	Staleness Risk
Semantic Cache	High (returns in ms)	High (no LLM call)	Incorrect similarity match	TTL expiry policy
Prompt Cache (API)	Medium (faster TTFT)	Medium (prefix tokens)	Low	None (per-request context)
KV Cache (Model)	Low-Medium	None (cost is per token)	Low	None (in-request)

Caching Strategies for AI Systems#

Common Mistakes#

Best Practices#

Trade-offs#

Interview Questions#

Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?#

Key Takeaways#

Caching Strategies for AI Systems

Common Mistakes

Best Practices

Trade-offs

Interview Questions

Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?

Key Takeaways