Caching Strategies for AI Systems

Executive Summary

Caching in AI systems operates at three distinct layers — prompt caching at the API level, semantic caching at the request level, and KV cache at the model inference level — each reducing latency and cost by different mechanisms. Without caching, production AI systems serving thousands of concurrent users incur redundant LLM API calls, inflate costs linearly with request volume, and impose unnecessary latency on requests that have been answered before. This chapter covers the architecture and implementation of multi-layer caching for enterprise AI deployments, with particular attention to the trade-offs between cache hit rate, staleness risk, and the special constraints that apply to clinical AI.

Learning Objectives

  • Distinguish prompt caching, semantic caching, and KV cache, and identify when each applies
  • Implement semantic caching with a vector similarity threshold appropriate to the use case
  • Configure prompt caching on the Anthropic API and Azure OpenAI Service
  • Design cache invalidation policies for knowledge bases that change (formularies, clinical guidelines)
  • Identify when caching is inappropriate: non-idempotent queries, PHI-containing requests

Business Problem

In a Reference Healthcare Organization deploying a clinical decision support RAG system, the same clinical query patterns appear repeatedly: "What is the first-line treatment for community-acquired pneumonia?", "What are the ACLS chest compression guidelines?", "What are the formulary alternatives to [brand name drug]?". Answering each of these queries with a fresh LLM call costs API credits and adds 1–3 seconds of latency. At scale — thousands of clinical users, tens of thousands of queries per day — this produces significant costs and latency that could be avoided for semantically identical queries.

Why This Technology Exists

The original LLM API architecture is stateless: every request submits the full prompt, the API processes it, and returns a response. There is no mechanism for the API to recognize that a request is semantically identical to one it has recently processed.

Three independent caching innovations emerged to address this:

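Prompt caching (provider-side): Providers added the ability to cache the KV computation for repeated prompt prefixes (system prompts, document chunks). Anthropic exposes this through explicit cache_control markers on content blocks; Azure OpenAI applies prefix caching automatically to sufficiently long prompts rather than through a request parameter (verify current behavior in each provider's documentation). Either way, the result is lower cost and first-token latency for requests sharing a common prefix.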

Semantic caching (application-side): Application-layer caches that store previous LLM responses indexed by the embedding of the query, allowing fuzzy retrieval of cached responses for semantically similar (not just identical) queries.

Model-level KV cache (inference-side): During inference, the attention mechanism computes key-value pairs for every token in the context. When running multiple requests with the same prefix, the KV pairs for that prefix can be reused (vLLM's prefix caching, PagedAttention).
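
To make the prefix-reuse idea concrete, the sketch below counts how many leading tokens of a new request could reuse cached KV entries when reuse happens in whole blocks. It is a toy model and not vLLM's implementation: the whitespace "tokenizer", the block size of 4, and the function name reusable_prefix_tokens are all assumptions made for illustration.

```python
# Toy model of prefix KV reuse. Block size and names are illustrative assumptions,
# not taken from any inference engine.
from typing import Sequence

BLOCK_SIZE = 4  # hypothetical block size


def reusable_prefix_tokens(
    cached: Sequence[str],
    incoming: Sequence[str],
    block_size: int = BLOCK_SIZE,
) -> int:
    """Count leading tokens shared by both requests, rounded down to whole blocks."""
    shared = 0
    for cached_token, incoming_token in zip(cached, incoming):
        if cached_token != incoming_token:
            break
        shared += 1
    # Only complete blocks can be reused; a partially matching block is recomputed.
    return (shared // block_size) * block_size


# Whitespace splitting stands in for a real tokenizer.
system = "You are a clinical reference assistant . Cite sources .".split()
req_a = system + "What is first-line treatment for CAP ?".split()
req_b = system + "What are ACLS compression guidelines ?".split()

reused = reusable_prefix_tokens(req_a, req_b)
print(f"{reused} of {len(req_b)} tokens can reuse cached KV blocks")
```

The point of the block rounding is that any divergence inside a block forces that block, and everything after it, to be recomputed, which is why stable content belongs at the start of the prompt.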

Architecture

Semantic Caching

Semantic caching stores previous query-response pairs indexed by the query embedding. On a new query, if the cosine similarity between the new query embedding and a stored embedding exceeds a threshold, the cached response is returned without calling the LLM.

python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
import hashlib
import json
import redis.asyncio as redis  # async client: the methods below await Redis calls
import numpy as np

# Educational example — not for clinical use

@dataclass
class CacheEntry:
    query: str
    query_embedding: list[float]
    response: str
    created_at: datetime
    ttl_seconds: int
    source_documents: list[str]     # For citation tracking


class SemanticCache:
    """
    Semantic cache for LLM query-response pairs.
    
    Uses Redis for storage with pgvector for similarity search,
    or in-process numpy for small deployments.
    
    Educational example — not for clinical use.
    NOT appropriate for patient-specific queries.
    """
    
    def __init__(
        self,
        similarity_threshold: float = 0.95,   # Very high threshold for clinical use
        default_ttl_seconds: int = 86400,     # 1 day
        embedding_client=None,
        redis_client=None,
    ):
        self.threshold = similarity_threshold
        self.default_ttl = default_ttl_seconds
        self.embedding_client = embedding_client
        self.redis = redis_client
    
    async def lookup(self, query: str) -> Optional[CacheEntry]:
        """
        Look up a query in the semantic cache.
        Returns a CacheEntry if a similar query was found above threshold,
        else returns None (cache miss).
        """
        query_embedding = await self.embedding_client.embed(query)
        
        # Retrieve candidate embeddings from store
        candidates = await self._get_candidates(query_embedding, top_k=10)
        
        for entry in candidates:
            similarity = self._cosine_similarity(
                np.array(query_embedding),
                np.array(entry.query_embedding)
            )
            if similarity >= self.threshold:
                # Check TTL
                if self._is_expired(entry):
                    await self._delete(entry.query)
                    continue
                return entry
        
        return None
    
    async def store(
        self,
        query: str,
        query_embedding: list[float],
        response: str,
        source_documents: list[str],
        ttl_seconds: Optional[int] = None,
    ) -> None:
        """Store a query-response pair in the semantic cache."""
        entry = CacheEntry(
            query=query,
            query_embedding=query_embedding,
            response=response,
            created_at=datetime.utcnow(),
            ttl_seconds=ttl_seconds or self.default_ttl,
            source_documents=source_documents,
        )
        
        key = self._cache_key(query)
        value = json.dumps({
            "query": entry.query,
            "query_embedding": entry.query_embedding,
            "response": entry.response,
            "created_at": entry.created_at.isoformat(),
            "ttl_seconds": entry.ttl_seconds,
            "source_documents": entry.source_documents,
        })
        
        await self.redis.setex(key, entry.ttl_seconds, value)
    
    async def invalidate_by_source(self, source_document_id: str) -> int:
        """
        Invalidate all cached responses that cite a specific source document.
        
        Call this when a clinical guideline is updated to prevent stale
        recommendations from being served from cache.
        """
        # This requires scanning all cached entries — expensive at scale.
        # Production implementation: maintain a reverse index (source_id → cache_keys).
        invalidated = 0
        
        async for key in self.redis.scan_iter("semantic_cache:*"):
            value = await self.redis.get(key)
            if value:
                entry_data = json.loads(value)
                if source_document_id in entry_data.get("source_documents", []):
                    await self.redis.delete(key)
                    invalidated += 1
        
        return invalidated
    
    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    def _is_expired(self, entry: CacheEntry) -> bool:
        age = (datetime.utcnow() - entry.created_at).total_seconds()
        return age > entry.ttl_seconds
    
    def _cache_key(self, query: str) -> str:
        content_hash = hashlib.sha256(query.encode()).hexdigest()[:16]
        return f"semantic_cache:{content_hash}"
    
    async def _get_candidates(self, embedding: list[float], top_k: int) -> list[CacheEntry]:
        # In production: use pgvector ANN search on the stored embeddings
        # This example shows simplified key-scan (suitable only for small caches)
        query_vec = np.array(embedding)
        entries = []
        async for key in self.redis.scan_iter("semantic_cache:*"):
            value = await self.redis.get(key)
            if value:
                data = json.loads(value)
                entry = CacheEntry(
                    query=data["query"],
                    query_embedding=data["query_embedding"],
                    response=data["response"],
                    created_at=datetime.fromisoformat(data["created_at"]),
                    ttl_seconds=data["ttl_seconds"],
                    source_documents=data["source_documents"],
                )
                entries.append(entry)
        # Rank by similarity so the closest matches are returned,
        # not simply the first keys the scan happened to yield
        entries.sort(
            key=lambda e: self._cosine_similarity(query_vec, np.array(e.query_embedding)),
            reverse=True,
        )
        return entries[:top_k]
    
    async def _delete(self, query: str) -> None:
        await self.redis.delete(self._cache_key(query))
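
The comment in invalidate_by_source mentions a reverse index from source document ID to cache keys. A minimal in-memory sketch of that idea follows, assuming plain dictionaries in place of Redis; the class name ReverseIndexCache and its methods are hypothetical. In a Redis deployment the same mapping could plausibly be kept in one set per source document.

```python
# In-memory sketch of a reverse index (source document ID -> cache keys).
# ReverseIndexCache and its method names are hypothetical.
from collections import defaultdict
from typing import Optional


class ReverseIndexCache:
    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}
        self._by_source: dict[str, set[str]] = defaultdict(set)

    def store(self, key: str, response: str, source_documents: list[str]) -> None:
        self.delete(key)  # clear index rows left by an earlier entry under this key
        self._entries[key] = {
            "response": response,
            "source_documents": list(source_documents),
        }
        for doc_id in source_documents:
            self._by_source[doc_id].add(key)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        return entry["response"] if entry else None

    def delete(self, key: str) -> None:
        entry = self._entries.pop(key, None)
        if entry is None:
            return
        for doc_id in entry["source_documents"]:
            self._by_source[doc_id].discard(key)
            if not self._by_source[doc_id]:
                del self._by_source[doc_id]

    def invalidate_by_source(self, source_document_id: str) -> int:
        """Remove every entry citing the document without scanning the whole cache."""
        keys = list(self._by_source.get(source_document_id, ()))
        for key in keys:
            self.delete(key)
        return len(keys)


cache = ReverseIndexCache()
cache.store("k1", "answer 1", ["guideline-cap", "formulary-v12"])
cache.store("k2", "answer 2", ["formulary-v12"])
cache.store("k3", "answer 3", ["protocol-7"])
removed = cache.invalidate_by_source("formulary-v12")
print(f"Invalidated {removed} entries")
```

Invalidation cost now scales with the number of entries citing the document rather than with total cache size. The price is that the index must be kept consistent whenever an entry is written, overwritten, or expires.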

Prompt Caching (Anthropic API)

Anthropic's prompt caching allows the model to cache the KV computation for marked prompt sections. System prompts and retrieved context chunks are prime candidates.

python
import anthropic

# Educational example — not for clinical use

client = anthropic.Anthropic()

def query_with_prompt_caching(
    system_prompt: str,
    retrieved_chunks: list[str],
    user_query: str,
) -> str:
    """
    Query the Anthropic API using prompt caching for system prompt
    and retrieved context chunks.
    
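    Prompt caching reduces cost and time-to-first-token for requests
    sharing a common prefix. The system prompt and retrieved chunks
    are cached after the first request; subsequent requests that reuse
    the exact same prefix are billed at a reduced cached-read rate for
    those tokens and at the normal rate for the new query tokens.
    The context block only hits the cache when the same chunks are
    retrieved in the same order.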
    
    Note: Verify cache_control parameter availability in current SDK docs.
    """
    
    # Format retrieved chunks as a single cached context block
    context_text = "\n\n---\n\n".join([
        f"Source Document {i+1}:\n{chunk}"
        for i, chunk in enumerate(retrieved_chunks)
    ])
    
    response = client.messages.create(
        model="claude-opus-4-8",    # verify current model IDs
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache the system prompt
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Retrieved context:\n\n{context_text}",
                        "cache_control": {"type": "ephemeral"},  # Cache context chunks
                    },
                    {
                        "type": "text",
                        "text": f"\nQuery: {user_query}",
                        # No cache_control — this changes per request
                    }
                ]
            }
        ]
    )
    
    # Inspect cache usage in response
    usage = response.usage
    if hasattr(usage, 'cache_creation_input_tokens'):
        print(f"Cache created: {usage.cache_creation_input_tokens} tokens")
    if hasattr(usage, 'cache_read_input_tokens'):
        print(f"Cache read: {usage.cache_read_input_tokens} tokens")
    
    return response.content[0].text
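
The cost effect of prefix caching can be estimated with simple arithmetic. The sketch below works in relative units (one uncached input token costs 1.0) and assumes a cache write is billed at 1.25x and a cache read at 0.10x the base input rate. Both multipliers are illustrative assumptions, as is the premise that every request arrives while the cache entry is still alive. Check the provider's current pricing before relying on the numbers.

```python
# Relative input-token cost with and without prefix caching.
# The multipliers are assumptions for illustration, not quoted prices.


def relative_input_cost(
    prefix_tokens: int,
    query_tokens: int,
    requests: int,
    write_multiplier: float = 1.25,  # assumed surcharge for writing the cache
    read_multiplier: float = 0.10,   # assumed discount for reading the cache
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in units of one base input token."""
    uncached = float(requests * (prefix_tokens + query_tokens))
    cached = (
        prefix_tokens * write_multiplier                     # first request writes the prefix
        + (requests - 1) * prefix_tokens * read_multiplier   # later requests read it
        + requests * query_tokens                            # query tokens are never cached
    )
    return uncached, cached


uncached, cached = relative_input_cost(prefix_tokens=4000, query_tokens=50, requests=100)
print(f"uncached={uncached:.0f} cached={cached:.0f} saving={1 - cached / uncached:.1%}")
```

The saving shrinks quickly when the prefix changes between requests, since each new prefix pays the write surcharge again. Comparing cache_creation_input_tokens against cache_read_input_tokens in the response usage is one way to see which regime a workload is in.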

Cache Invalidation Policy

The hardest problem in caching is knowing when to invalidate. For clinical knowledge bases:

python
from enum import Enum
from dataclasses import dataclass

class DocumentType(Enum):
    CLINICAL_GUIDELINE = "clinical_guideline"
    FORMULARY = "formulary"
    HOSPITAL_PROTOCOL = "hospital_protocol"
    DRUG_INTERACTION = "drug_interaction"

# Document type → cache TTL policy
CACHE_TTL_POLICY = {
    DocumentType.CLINICAL_GUIDELINE: {
        "ttl_seconds": 7 * 86400,       # 7 days
        "rationale": "Guidelines updated quarterly; short TTL provides safety margin"
    },
    DocumentType.FORMULARY: {
        "ttl_seconds": 86400,           # 1 day
        "rationale": "Formulary changes carry a 5-business-day SLA; a 1-day TTL keeps cached answers well inside it"
    },
    DocumentType.HOSPITAL_PROTOCOL: {
        "ttl_seconds": 3 * 86400,       # 3 days
        "rationale": "Protocols updated infrequently but are safety-critical"
    },
    DocumentType.DRUG_INTERACTION: {
        "ttl_seconds": 4 * 3600,        # 4 hours
        "rationale": "Drug interaction data updated continuously; aggressive TTL"
    },
}
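
A cached response often cites documents of more than one type, so a per-type policy still needs a rule for combining TTLs. One conservative option, sketched below, is to take the shortest TTL among the cited types. The table repeats the values from CACHE_TTL_POLICY keyed by string so the example runs on its own. The function name effective_ttl and the one-hour fallback for unknown types are assumptions.

```python
# Combine per-type TTLs for a response that cites several document types.
# effective_ttl and FALLBACK_TTL_SECONDS are hypothetical names and values.

TTL_SECONDS_BY_TYPE = {
    "clinical_guideline": 7 * 86400,
    "formulary": 86400,
    "hospital_protocol": 3 * 86400,
    "drug_interaction": 4 * 3600,
}

FALLBACK_TTL_SECONDS = 3600  # assumed conservative default for unknown types


def effective_ttl(source_types: list[str]) -> int:
    """Return the shortest TTL among the cited document types."""
    if not source_types:
        return FALLBACK_TTL_SECONDS
    return min(TTL_SECONDS_BY_TYPE.get(t, FALLBACK_TTL_SECONDS) for t in source_types)


ttl = effective_ttl(["clinical_guideline", "drug_interaction"])
print(f"TTL for mixed-source response: {ttl} seconds")
```

Taking the minimum means one fast-changing source shortens the life of the whole response, which errs toward freshness at some cost in hit rate.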

When NOT to Cache

Semantic caching is inappropriate for:

  • Patient-specific queries: "What are the medications for patient John Doe?" contains PHI and produces a patient-specific answer. A semantic match against a similar query about a different patient could return the wrong patient's medical information, and storing the response places PHI in the cache itself. Never cache patient-specific responses.
  • Queries where recency is critical: "Are there any new drug alerts for [medication] in the past 24 hours?" Caching would suppress real new alerts.
  • Queries with implicit temporal context: "Who is the on-call pharmacist today?" The answer changes daily.
  • Non-idempotent operations: Any query that triggers an EHR write, order, or notification must not be cached.
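
The exclusions above can be enforced with a gate that runs before any cache lookup or store. The sketch below is deliberately crude: the regular expressions are hypothetical and will miss many real cases, so they illustrate only where the check sits and are no substitute for vetted PHI detection. The function name cache_decision and the reason strings are also assumptions.

```python
# Crude pre-cache gate. The patterns are illustrative assumptions and are NOT
# a reliable PHI detector.
import re

_PATIENT_PATTERNS = [
    re.compile(r"\bpatient\s+[A-Z][a-z]+"),        # e.g. "patient John"
    re.compile(r"\bMRN\b", re.IGNORECASE),         # medical record number mention
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),    # date-like strings such as a DOB
]

_TEMPORAL_PATTERN = re.compile(
    r"\b(today|tonight|right now|currently|latest|past \d+ (?:hours?|days?))\b",
    re.IGNORECASE,
)


def cache_decision(query: str, triggers_side_effect: bool = False) -> tuple[bool, str]:
    """Return (cacheable, reason). Checks run from most to least severe."""
    if triggers_side_effect:
        return False, "non_idempotent"
    if any(pattern.search(query) for pattern in _PATIENT_PATTERNS):
        return False, "patient_specific"
    if _TEMPORAL_PATTERN.search(query):
        return False, "time_sensitive"
    return True, "cacheable"


for q in (
    "What is the first-line treatment for community-acquired pneumonia?",
    "What are the medications for patient John Doe?",
    "Who is the on-call pharmacist today?",
):
    print(cache_decision(q), q)
```

Because pattern matching of this kind fails open, a safer design might route only queries from explicitly non-patient contexts (for example, a reference-lookup screen) to the cache at all.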

Enterprise Considerations

Cache key design: The semantic cache key must include not just the query but also contextual dimensions: the knowledge base version in use, the user's department (if retrieval is department-scoped), and the retrieval date. A cached response from a 3-month-old knowledge base version is not valid for a query against the current version.
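
One way to fold those dimensions into the key is sketched below. The function name scoped_cache_key, the parameter names, and the light query normalization are assumptions, and the earlier _cache_key helper hashes the query alone.

```python
# Cache key that includes knowledge base version and retrieval scope.
# scoped_cache_key and its parameters are hypothetical.
import hashlib
from typing import Optional


def scoped_cache_key(
    query: str,
    kb_version: str,
    department: Optional[str] = None,
    retrieval_date: Optional[str] = None,
) -> str:
    # Collapse case and whitespace so trivially different spellings share a key.
    normalized_query = " ".join(query.lower().split())
    scope = [kb_version, department or "all", retrieval_date or "any"]
    material = "|".join(scope + [normalized_query])
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]
    # Keeping the version readable in the key allows bulk deletion by key pattern.
    return f"semantic_cache:{kb_version}:{digest}"


key_current = scoped_cache_key("What is the first-line treatment for CAP?", "kb-2026-06")
key_next = scoped_cache_key("What is the first-line treatment for CAP?", "kb-2026-07")
print(key_current)
print(key_next)
```

A hashed key only separates exact-match storage. For the similarity lookup itself, the same scope fields would need to be applied as filters on the candidate search so that an entry from another version or department is never considered a match.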

Cache hit rate monitoring: Track cache hit rate as an operational metric. A falling cache hit rate may indicate a shift in query patterns, overly aggressive cache invalidation, or an error in the cache lookup logic. A very high hit rate (>90%) may indicate that the TTL is too long and stale content is being served, or that the similarity threshold is loose enough to match queries that should have missed.
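
A rolling-window tracker is one simple way to surface both conditions. In the sketch below the class name HitRateMonitor, the window size, and the 10% lower bound are assumptions, and only the 90% upper bound comes from the text above.

```python
# Rolling-window cache hit rate tracker. Names, window size, and the lower
# bound are illustrative assumptions.
from collections import deque
from typing import Optional


class HitRateMonitor:
    def __init__(self, window: int = 1000, low: float = 0.10, high: float = 0.90) -> None:
        self._events: deque = deque(maxlen=window)  # oldest events fall off automatically
        self.low = low
        self.high = high

    def record(self, hit: bool) -> None:
        self._events.append(bool(hit))

    def hit_rate(self) -> Optional[float]:
        if not self._events:
            return None
        return sum(self._events) / len(self._events)

    def status(self) -> str:
        rate = self.hit_rate()
        if rate is None:
            return "no_data"
        if rate > self.high:
            return "suspiciously_high"  # possible stale content or loose threshold
        if rate < self.low:
            return "low"                # possible pattern shift or lookup bug
        return "ok"


monitor = HitRateMonitor(window=10)
for outcome in (True, True, False, True, False):
    monitor.record(outcome)
print(monitor.hit_rate(), monitor.status())
```

In practice the status would feed an existing metrics or alerting system and would not be printed. The thresholds are best tuned from an observed baseline for the deployment.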

Cold cache at startup: After a deployment, the semantic cache is empty. Early requests will all miss, temporarily increasing API costs and latency. Warm the cache before routing production traffic to the new deployment by running a representative query set.

Common Mistakes

1. Setting similarity threshold too low. A threshold of 0.85 will return cached responses for semantically similar but meaningfully different queries ("What is the first-line antibiotic for CAP?" vs. "What is the first-line antibiotic for hospital-acquired pneumonia?"). For clinical use, set the threshold at 0.95 or above.

2. Not invalidating cache on knowledge base update. If the formulary is updated and the semantic cache is not invalidated, users may receive cached responses citing a removed drug as available. Implement event-driven cache invalidation tied to knowledge base update events.

3. Caching patient-specific queries. Any query that includes a patient identifier or patient-specific clinical context must not be cached. Returning a response generated from one patient's clinical context in answer to a query about a different patient is a PHI breach.

4. Not recording cache_hit in response metadata. If the client does not know whether a response came from cache, clinical users cannot assess its freshness. Return cache_hit: true and cached_at: <timestamp> in the response metadata.
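
The threshold in the first item is best chosen against labeled query pairs and not by intuition alone. The sketch below counts two kinds of error at a given threshold: false hits (different-answer pairs that would share a cached response) and missed hits (same-answer pairs that would not). The similarity scores are placeholder values invented for the example, not measurements from any embedding model, and the names LabeledPair and evaluate_threshold are hypothetical.

```python
# Evaluate a similarity threshold against labeled query pairs.
# Scores below are placeholders, not measured similarities.
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    similarity: float   # cosine similarity between the two query embeddings
    same_answer: bool   # True if both queries should receive the same response


def evaluate_threshold(pairs: list[LabeledPair], threshold: float) -> dict[str, int]:
    false_hits = sum(1 for p in pairs if p.similarity >= threshold and not p.same_answer)
    missed_hits = sum(1 for p in pairs if p.similarity < threshold and p.same_answer)
    return {"false_hits": false_hits, "missed_hits": missed_hits}


pairs = [
    LabeledPair(0.97, True),
    LabeledPair(0.96, True),
    LabeledPair(0.93, True),
    LabeledPair(0.91, False),  # e.g. CAP vs. hospital-acquired pneumonia
    LabeledPair(0.88, False),
]

for threshold in (0.85, 0.95):
    print(threshold, evaluate_threshold(pairs, threshold))
```

For a clinical cache a false hit is the costly error, so a reasonable policy is to pick the lowest threshold that yields zero false hits on the test set and accept the resulting missed hits.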

Best Practices

  • Set semantic similarity threshold at ≥0.95 for clinical queries
  • Never cache patient-specific queries or any response derived from patient PHI
  • Implement event-driven cache invalidation tied to knowledge base update events
  • Return cache metadata (cache_hit, cached_at, source_version) in every response
  • Monitor cache hit rate as an operational metric
  • Apply prompt caching to system prompts and retrieved context chunks that are stable across requests
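
The cache metadata practice can be made concrete with a small response envelope. The sketch below assumes a JSON-style dictionary response, and the field names cache_hit, cached_at, and source_version follow the metadata named in this chapter. The helper build_envelope and the dataclass name are hypothetical.

```python
# Response envelope carrying cache metadata. build_envelope and
# ResponseEnvelope are hypothetical names.
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ResponseEnvelope:
    answer: str
    cache_hit: bool
    cached_at: Optional[str]   # ISO 8601 timestamp, None for fresh responses
    source_version: str        # knowledge base version the answer was built from


def build_envelope(
    answer: str,
    source_version: str,
    cached_at: Optional[datetime] = None,
) -> dict:
    """Wrap an answer; passing cached_at marks the response as served from cache."""
    return asdict(
        ResponseEnvelope(
            answer=answer,
            cache_hit=cached_at is not None,
            cached_at=cached_at.isoformat() if cached_at else None,
            source_version=source_version,
        )
    )


fresh = build_envelope("Answer text", "kb-2026-06")
hit = build_envelope(
    "Answer text",
    "kb-2026-06",
    cached_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(fresh)
print(hit)
```

Deriving cache_hit from the presence of cached_at keeps the two fields from disagreeing, which matters if clients display freshness to clinical users.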

Trade-offs

| Cache Type | Latency Reduction | Cost Reduction | Risk | Staleness Risk |
|---|---|---|---|---|
| Semantic Cache | High (returns in ms) | High (no LLM call) | Incorrect similarity match | TTL expiry policy |
| Prompt Cache (API) | Medium (faster TTFT) | Medium (prefix tokens) | Low | None (per-request context) |
| KV Cache (Model) | Low-Medium | None (cost is per token) | Low | None (in-request) |

Interview Questions

Q: You are deploying a semantic cache for a clinical RAG system. What similarity threshold would you set, and how would you handle cache invalidation when a clinical guideline is updated?

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

Threshold selection: For clinical use, the similarity threshold must be set high enough that semantically different queries cannot retrieve the same cached response. I would start at 0.95 and evaluate against a test query set that includes pairs of clinically similar but meaningfully different questions. A 0.85 threshold appropriate for a customer support chatbot is not appropriate for a system making clinical recommendations.

Cache invalidation: Event-driven invalidation is the correct pattern. When the knowledge base pipeline completes re-indexing a document, it publishes a knowledge_base.document.updated event containing the document ID. A cache invalidation handler subscribes to this event and removes all cached responses whose source_documents list includes that document ID. This requires maintaining a reverse index from document ID to cache keys at write time. TTL-based expiry is a fallback, not the primary invalidation mechanism.

Key Points to Hit:

  • High threshold for clinical safety; justify with the failure mode (incorrect recommendation)
  • Event-driven invalidation as the primary mechanism; TTL as safety net
  • Reverse index (source_id → cache_keys) for efficient invalidation
  • Exclusion of patient-specific queries from the cache

Key Takeaways

  • Semantic caching, prompt caching, and KV cache operate at different layers and solve different problems; all three are relevant in a production AI system
  • The similarity threshold for a clinical semantic cache must be high (≥0.95) because the failure mode is returning an incorrect clinical recommendation
  • Never cache patient-specific queries — PHI in queries makes caching a patient data breach risk
  • Cache invalidation must be event-driven (tied to knowledge base updates) for clinical domains; TTL alone is not sufficient
  • Return cache metadata in every response so clients can assess freshness

Glossary

Semantic Cache: An application-layer cache that stores LLM query-response pairs indexed by query embedding, enabling cache retrieval for semantically similar (not just identical) queries.

Prompt Caching: A provider-side feature that caches the KV computation for repeated prompt prefixes, reducing cost and first-token latency for subsequent requests sharing the same prefix.

KV Cache: The key-value attention cache maintained during model inference. Prefix caching reuses KV pairs for the shared prefix across multiple concurrent requests.

Cache Invalidation: The process of removing or expiring cached entries when the underlying data changes. The hardest problem in caching strategy.

TTL (Time to Live): The maximum duration a cached entry is considered valid before it must be refreshed from the source.

Further Reading

Enterprise AI Technical Reference — v1.0 Updated July 02, 2026