Retrieval-Augmented Generation

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate


Trade-offs and Considerations

RAG Failure Modes

| Failure Mode | Cause | Detection | Mitigation |
|---|---|---|---|
| Wrong retrieval | Low relevance threshold, poor embeddings | RAGAS context recall metric | Better chunking, domain-specific embedder |
| Ignored retrieval | LLM uses parametric memory instead of context | RAGAS faithfulness metric | Stricter system prompt: "only use provided context" |
| Hallucinated citations | LLM fabricates source attribution | Citation verification pipeline | Validate that cited source contains claimed content |
| Context too long | Top-k too large, context budget exceeded | Monitor context window usage | Reduce top_k; use re-ranker to select best 5, not top 20 |
| Stale knowledge | Index not updated when documents change | Monitor document update timestamps | Incremental re-indexing pipeline with change detection |
| PHI leakage | Clinical chunks contain patient identifiers | PHI scanner on ingested documents | De-identify before indexing OR enforce strict access control on collection |
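
The "change detection" mitigation for stale knowledge can be sketched with content hashes. This is a minimal illustration, not a prescribed design: `content_hash` and `plan_reindex` are hypothetical helper names, and it assumes the indexer stores one hash per document ID at indexing time.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    indexed_hashes: dict[str, str], current_docs: dict[str, str]
) -> dict[str, list[str]]:
    """Compare stored hashes with current documents and list what to re-index.

    indexed_hashes: doc_id -> hash recorded when the document was last indexed
    current_docs:   doc_id -> current document text
    """
    current_hashes = {doc_id: content_hash(t) for doc_id, t in current_docs.items()}
    added = sorted(d for d in current_hashes if d not in indexed_hashes)
    changed = sorted(
        d
        for d in current_hashes
        if d in indexed_hashes and current_hashes[d] != indexed_hashes[d]
    )
    deleted = sorted(d for d in indexed_hashes if d not in current_hashes)
    return {"added": added, "changed": changed, "deleted": deleted}
```

Only the documents listed under "added" and "changed" need re-chunking and re-embedding; those under "deleted" should have their chunks removed from the index.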

Advanced RAG Patterns

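Corrective RAG (CRAG): After initial retrieval, a retrieval evaluator (a lightweight grader model, or the LLM itself in simpler implementations) scores the relevance of each retrieved document. If retrieval quality is low, the system falls back to web search or a broader retrieval before generating.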
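
The CRAG control flow can be sketched as follows. This is an illustrative outline only: `corrective_retrieve`, its callables, and the `min_score` and `min_relevant` parameters are hypothetical names and values, and the grader is left pluggable so that either a small model or an LLM prompt could fill that role.

```python
from typing import Callable


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], float],
    fallback: Callable[[str], list[str]],
    min_score: float = 0.5,   # assumed relevance cutoff
    min_relevant: int = 1,    # assumed minimum number of relevant documents
) -> list[str]:
    """Grade retrieved documents and fall back when too few are relevant."""
    docs = retrieve(query)
    relevant = [d for d in docs if grade(query, d) >= min_score]
    if len(relevant) >= min_relevant:
        return relevant
    # Retrieval judged too weak: supplement with web search or broader retrieval.
    return relevant + fallback(query)
```

The generator then receives only the documents this function returns, so irrelevant chunks never reach the prompt.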

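Self-RAG: The LLM decides whether retrieval is needed at all and, after generation, assesses whether the generated text is supported by the retrieved evidence. It does this by emitting special reflection tokens: [Retrieve] to trigger retrieval, and critique tokens ([IsRel], [IsSup], [IsUse]) that mark whether a passage is relevant and whether the output is supported by it and useful.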

RAG Fusion: Generate several rephrasings of the user's query, run each against the retrieval system, then fuse the ranked result lists using Reciprocal Rank Fusion (RRF). This increases recall at the cost of an extra query-generation step and additional retrieval latency.
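
A minimal sketch of the RRF step, assuming each rephrased query has already produced a ranked list of document IDs. Each document scores the sum of 1 / (k + rank) over the lists it appears in; k = 60 is the commonly used default, and the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Three rephrasings of one query, each returning its own ranking.
fused = reciprocal_rank_fusion(
    [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
)
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'c', 'd']
```

Because RRF uses only ranks, it can combine lists whose raw scores are not comparable, which is also why it is a common choice for fusing dense and BM25 results.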

These patterns improve quality but add complexity. Implement them only after the baseline RAG system has been evaluated and found insufficient.


Comparison Table

| Retrieval Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Dense-only (vector search) | Simple, captures semantics | Misses exact clinical term matches | General clinical questions |
| Sparse-only (BM25) | Exact match, drug names/ICD codes | No semantic understanding | Formulary lookup, code search |
| Hybrid (dense + sparse) | Best of both | Slightly more complex | Production clinical RAG |
| Hybrid + Re-ranking | Highest precision | 50–100 ms latency overhead | High-stakes clinical decisions |
| HyDE | Improves recall for complex queries | Extra LLM call for hypothesis | Complex diagnostic questions |

Interview Questions

Q1: Design a RAG system for a hospital's clinical decision support platform

Category: System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

I'd begin by clarifying scope: what clinical questions does this answer? What's the scale (concurrent users, query volume)? What's the latency SLA? What EHR system is in use? I'll assume: clinical staff Q&A over institutional guidelines and formulary, 300 concurrent users, p95 < 3 seconds, Epic EHR.

Architecture: Two separate pipelines — indexing (offline) and query (real-time). For indexing: document ingest from an EHR's FHIR DocumentReference API and document management systems (SharePoint, Confluence, or equivalent) → format-specific parsers → semantic chunking at 512 tokens with 64 token overlap → embedding with a domain-appropriate clinical embedding model (e.g., voyage-medical-2 or equivalent — verify current availability) → storage in a vector database (Pinecone, Weaviate, or equivalent) with metadata in PostgreSQL (section hierarchy, document type, last updated, access level).
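
The 512-token / 64-token-overlap arithmetic can be sketched as below. Note the simplification: this is a fixed-size sliding window, not semantic chunking (which would also respect section and sentence boundaries). It assumes the text has already been tokenized, and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this far after the last
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

With these defaults each new chunk starts 448 tokens after the previous one, so a 1,000-token document yields three chunks, the last of which is shorter than 512 tokens.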

For the query pipeline: a clinician submits a question through a Claude Desktop or web interface → query classification (diagnostic vs. pharmacology vs. procedure vs. administrative) routes to the appropriate vector database collection → hybrid retrieval (dense + BM25) retrieves the top-20 candidates → a cross-encoder re-ranks them to the top 8 → patient context from the Epic FHIR API is appended → Claude Opus generates a response citing sources → citation validation confirms that each cited source actually contains the cited content → the response is delivered with citations.

HIPAA architecture: the entire pipeline runs in the hospital's AWS account, with the model accessed through Amazon Bedrock over a VPC endpoint under a BAA, so no PHI leaves the hospital's VPC. Access control is enforced at the collection level: payer policy documents, for example, are accessible only to care coordinators, not to all clinical staff.
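
Collection-level access control can be sketched as a role-to-collection check applied before any retrieval call. The collection names, role names, and the `COLLECTION_ROLES` mapping are hypothetical; in practice the mapping would come from the hospital's identity provider rather than a hard-coded dictionary.

```python
# Hypothetical mapping: collection name -> roles allowed to search it.
COLLECTION_ROLES: dict[str, set[str]] = {
    "clinical_guidelines": {"physician", "nurse", "pharmacist", "care_coordinator"},
    "formulary": {"physician", "nurse", "pharmacist"},
    "payer_policies": {"care_coordinator"},
}


def allowed_collections(role: str) -> list[str]:
    """Collections a given role may query."""
    return sorted(name for name, roles in COLLECTION_ROLES.items() if role in roles)


def authorize_search(role: str, collection: str) -> None:
    """Raise before retrieval if the role may not search the collection."""
    if collection not in allowed_collections(role):
        raise PermissionError(f"role {role!r} may not search {collection!r}")
```

Enforcing the check before retrieval, rather than filtering the generated answer afterwards, keeps restricted chunks out of the prompt entirely.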

Evaluation: RAGAS metrics run weekly against a 200-question golden dataset (built with the clinical informatics team), with targets of faithfulness > 0.85, context recall > 0.80, and answer relevancy > 0.75. An alert fires if any metric degrades by more than 10%.
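
The threshold and degradation checks can be sketched as a small gate that runs after each weekly evaluation. It assumes the metric scores have already been computed (for example by RAGAS) and are passed in as plain dictionaries, and it reads the 10% rule as a relative drop against the previous run, which is one interpretation. `check_metrics` is an illustrative name.

```python
THRESHOLDS = {"faithfulness": 0.85, "context_recall": 0.80, "answer_relevancy": 0.75}
MAX_RELATIVE_DROP = 0.10  # assumption: 10% relative to the previous run


def check_metrics(current: dict[str, float], previous: dict[str, float]) -> list[str]:
    """Return alert messages for metrics below target or sharply degraded."""
    alerts: list[str] = []
    for name, floor in THRESHOLDS.items():
        score = current[name]
        if score <= floor:  # targets are strict: the score must exceed the floor
            alerts.append(f"{name} below threshold: {score:.2f} <= {floor:.2f}")
        prev = previous.get(name)
        if prev and (prev - score) / prev > MAX_RELATIVE_DROP:
            alerts.append(f"{name} dropped from {prev:.2f} to {score:.2f}")
    return alerts
```

An empty list means the run passed; any returned message can be forwarded to whatever alerting channel the team already uses.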

Key Points to Hit:

  • Separate indexing and query pipelines
  • Hybrid retrieval (dense + sparse) — not just vector search
  • Cross-encoder re-ranking for quality
  • Patient context from EHR for patient-specific answers
  • HIPAA-compliant architecture (within VPC, BAA)
  • RAGAS evaluation pipeline with specific metrics and thresholds
  • Citation validation to prevent fabricated source attribution

Q2: What causes "hallucinated citations" in RAG systems and how do you prevent them?

Category: Technical Depth | Difficulty: Senior | Role: AI Architect

Answer Framework:

A hallucinated citation occurs when the LLM attributes a claim to a source that either doesn't exist or doesn't contain the claimed information. This is distinct from factual hallucination: the claim itself may be correct, but the attribution is fabricated.

The cause: LLMs are trained to produce coherent, well-structured text, so when instructed to cite sources they tend to produce citations that look credible. If the retrieved sources are not explicitly present in the context, the model will generate plausible-sounding citations from its parametric memory, and these may be fabricated. Even when the sources are present, the model may attribute a claim to the wrong source if the sources are numerous or similar.

Prevention has two layers: structural and validation. Structurally, the system prompt should explicitly list retrieved sources with numbered identifiers ([Source 1], [Source 2]) and instruct the model to use only these identifiers. This makes a citation to a non-existent source trivially detectable: if only 8 sources were retrieved, any reference to [Source 17] can be rejected automatically.
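
The structural layer can be sketched in two small helpers: one that numbers the retrieved sources for the prompt, and one that flags any cited identifier outside the supplied range. The function names and the exact `[Source N]` pattern are illustrative assumptions.

```python
import re

CITATION_RE = re.compile(r"\[Source (\d+)\]")


def build_context(sources: list[str]) -> str:
    """Render retrieved sources with numbered identifiers for the prompt."""
    return "\n\n".join(
        f"[Source {i}] {text}" for i, text in enumerate(sources, start=1)
    )


def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that were never supplied to the model."""
    cited = {int(n) for n in CITATION_RE.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)
```

A non-empty result from `invalid_citations` is a cheap signal to regenerate the answer before running any more expensive content check.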

Validation: a post-generation step verifies each citation. For each claim-citation pair, retrieve the content of the cited source and use a lightweight model to verify that the cited text actually supports the claim. If it doesn't, either regenerate with stricter instructions or flag for human review. This verification step adds ~200–500ms but is non-negotiable for clinical applications where incorrect attribution could misdirect clinical decision-making.
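
The validation loop can be sketched as below, assuming claim-citation pairs have already been extracted from the answer. The `supports` callable stands in for the lightweight verifier model; `keyword_supports` is only a toy word-overlap placeholder so the example runs without a model, and would not be adequate for clinical use. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Claim:
    text: str
    source_id: int


def verify_citations(
    claims: list[Claim],
    sources: dict[int, str],
    supports: Callable[[str, str], bool],
) -> list[Claim]:
    """Return claims whose cited source is missing or does not support them."""
    failed: list[Claim] = []
    for claim in claims:
        source_text = sources.get(claim.source_id)
        if source_text is None or not supports(claim.text, source_text):
            failed.append(claim)
    return failed


def keyword_supports(claim: str, source: str, min_overlap: float = 0.6) -> bool:
    """Toy stand-in for a verifier model: share of claim words found in source."""
    def words(text: str) -> set[str]:
        return {w.strip(".,;:").lower() for w in text.split()}

    claim_words = {w for w in words(claim) if len(w) > 3}
    if not claim_words:
        return False
    return len(claim_words & words(source)) / len(claim_words) >= min_overlap
```

Claims returned by `verify_citations` are the ones to regenerate or route to human review, as described above.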