Caching Strategies for AI Systems

Executive Summary

Caching in AI systems operates at three distinct layers — prompt caching at the API level, semantic caching at the request level, and KV cache at the model inference level — each reducing latency and cost by different mechanisms. Without caching, production AI systems serving thousands of concurrent users incur redundant LLM API calls, inflate costs linearly with request volume, and impose unnecessary latency on requests that have been answered before. This chapter covers the architecture and implementation of multi-layer caching for enterprise AI deployments, with particular attention to the trade-offs between cache hit rate, staleness risk, and the special constraints that apply to clinical AI.

Learning Objectives

Distinguish prompt caching, semantic caching, and KV cache, and identify when each applies
Implement semantic caching with a vector similarity threshold appropriate to the use case
Configure prompt caching on the Anthropic API and Azure OpenAI Service
Design cache invalidation policies for knowledge bases that change (formularies, clinical guidelines)
Identify when caching is inappropriate: non-idempotent queries, PHI-containing requests

Business Problem

In a Reference Healthcare Organization deploying a clinical decision support RAG system, the same clinical query patterns appear repeatedly: "What is the first-line treatment for community-acquired pneumonia?", "What are the ACLS chest compression guidelines?", "What are the formulary alternatives to [brand name drug]?". Answering each of these queries with a fresh LLM call costs API credits and adds 1–3 seconds of latency. At scale — thousands of clinical users, tens of thousands of queries per day — this produces significant costs and latency that could be avoided for semantically identical queries.

Why This Technology Exists

The original LLM API architecture is stateless: every request submits the full prompt, the API processes it, and returns a response. There is no mechanism for the API to recognize that a request is semantically identical to one it has recently processed.

Three independent caching innovations emerged to address this:

Prompt caching (provider-side): Anthropic and Azure OpenAI added cache_control parameters that allow the API to cache the KV computation for repeated prompt prefixes (system prompts, document chunks). This reduces cost and first-token latency for requests sharing a common prefix.

Semantic caching (application-side): Application-layer caches that store previous LLM responses indexed by the embedding of the query, allowing fuzzy retrieval of cached responses for semantically similar (not just identical) queries.

Model-level KV cache (inference-side): During inference, the attention mechanism computes key-value pairs for every token in the context. When running multiple requests with the same prefix, the KV pairs for that prefix can be reused (vLLM's prefix caching, PagedAttention).

Enterprise Considerations

Cache key design: The semantic cache key must include not just the query but also contextual dimensions: the knowledge base version in use, the user's department (if retrieval is department-scoped), and the retrieval date. A cached response from a 3-month-old knowledge base version is not valid for a query against the current version.

Cache hit rate monitoring: Track cache hit rate as an operational metric. A dropping cache hit rate may indicate query pattern shift, aggressive cache invalidation, or an error in the cache lookup logic. A very high hit rate (>90%) may indicate that the TTL is too long and stale content is being served.

Cold cache at startup: After a deployment, the semantic cache is empty. The first few hundred requests will all miss, temporarily increasing API costs and latency. Warm the cache before deploying by running a representative query set.

Common Mistakes

1. Setting similarity threshold too low. A threshold of 0.85 will return cached responses for semantically similar but meaningfully different queries ("What is the first-line antibiotic for CAP?" vs. "What is the first-line antibiotic for hospital-acquired pneumonia?"). For clinical use, set the threshold at 0.95 or above.

2. Not invalidating cache on knowledge base update. If the formulary is updated and the semantic cache is not invalidated, users may receive cached responses citing a removed drug as available. Implement event-driven cache invalidation tied to knowledge base update events.

3. Caching patient-specific queries. Any query that includes a patient identifier or patient-specific clinical context must not be cached. The risk of returning one patient's clinical context to a query from a different patient's context is a PHI breach.

4. Not recording cachehit in response metadata. If the client does not know whether a response came from cache, clinical users cannot assess its freshness. Return cache</em>hit: true and cached_at: <timestamp> in the response metadata.

Best Practices

Set semantic similarity threshold at ≥0.95 for clinical queries

Never cache patient-specific queries or any response derived from patient PHI

Implement event-driven cache invalidation tied to knowledge base update events

Return cache metadata (cachehit, cachedat, source_version) in every response

Monitor cache hit rate as an operational metric

Apply prompt caching to system prompts and retrieved context chunks that are stable across requests

Trade-offs

Cache Type Latency Reduction Cost Reduction Risk Staleness Risk

Semantic Cache High (returns in ms) High (no LLM call) Incorrect similarity match TTL expiry policy

Prompt Cache (API) Medium (faster TTFT) Medium (prefix tokens) Low None (per-request context)

KV Cache (Model) Low-Medium None (cost is per token) Low None (in-request)

Interview Questions

Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?

Category: System Design Difficulty: Senior Role: AI Architect

Answer Framework:

Threshold selection: For clinical use, the similarity threshold must be set high enough that semantically different queries cannot retrieve the same cached response. I would start at 0.95 and evaluate against a test query set that includes pairs of clinically similar but meaningfully different questions. A 0.85 threshold appropriate for a customer support chatbot is not appropriate for a system making clinical recommendations.

Cache invalidation: Event-driven invalidation is the correct pattern. When the knowledge base pipeline completes re-indexing a document, it publishes a knowledge<em>base.document.updated event containing the document ID. A cache invalidation handler subscribes to this event and removes all cached responses whose source</em>documents list includes that document ID. This requires maintaining a reverse index from document ID to cache keys at write time. TTL-based expiry is a fallback, not the primary invalidation mechanism.

Key Points to Hit:

High threshold for clinical safety; justify with the failure mode (incorrect recommendation)

Event-driven invalidation as the primary mechanism; TTL as safety net

Reverse index (sourceid → cachekeys) for efficient invalidation

Exclusion of patient-specific queries from the cache

Key Takeaways

Semantic caching, prompt caching, and KV cache operate at different layers and solve different problems; all three are relevant in a production AI system

The similarity threshold for a clinical semantic cache must be high (≥0.95) because the failure mode is returning an incorrect clinical recommendation

Never cache patient-specific queries — PHI in queries makes caching a patient data breach risk

Cache invalidation must be event-driven (tied to knowledge base updates) for clinical domains; TTL alone is not sufficient

Return cache metadata in every response so clients can assess freshness

Further Reading

Vector Databases — pgvector used as the semantic cache store

LLM Serving Infrastructure — KV cache at the model serving layer

Cloud AI Platforms — Prompt caching availability per provider

Cache Type	Latency Reduction	Cost Reduction	Risk	Staleness Risk
Semantic Cache	High (returns in ms)	High (no LLM call)	Incorrect similarity match	TTL expiry policy
Prompt Cache (API)	Medium (faster TTFT)	Medium (prefix tokens)	Low	None (per-request context)
KV Cache (Model)	Low-Medium	None (cost is per token)	Low	None (in-request)

Caching Strategies for AI Systems#

Executive Summary#

Learning Objectives#

Business Problem#

Why This Technology Exists#

Enterprise Considerations#

Common Mistakes#

Best Practices#

Trade-offs#

Interview Questions#

Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?#

Key Takeaways#

Further Reading#

Caching Strategies for AI Systems

Executive Summary

Learning Objectives

Business Problem

Why This Technology Exists

Enterprise Considerations

Common Mistakes

Best Practices

Trade-offs

Interview Questions

Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?

Key Takeaways

Further Reading