Retrieval-Augmented Generation

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate

Trade-offs and Considerations

RAG Failure Modes

Failure Mode	Cause	Detection	Mitigation
Wrong retrieval	Low relevance threshold, poor embeddings	RAGAS context precision and context recall metrics	Better chunking, domain-specific embedder
Ignored retrieval	LLM uses parametric memory instead of context	RAGAS faithfulness metric	Stricter system prompt: "only use provided context"
Hallucinated citations	LLM fabricates source attribution	Citation verification pipeline	Validate that cited source contains claimed content
Context too long	Top-k too large, context budget exceeded	Monitor context window usage	Reduce top_k; use a re-ranker to keep the best 5 of the top 20 candidates
Stale knowledge	Index not updated when documents change	Monitor document update timestamps	Incremental re-indexing pipeline with change detection
PHI leakage	Clinical chunks contain patient identifiers	PHI scanner on ingested documents	De-identify before indexing OR enforce strict access control on collection
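
The "Stale knowledge" mitigation (incremental re-indexing with change detection) can be sketched with content hashes. This is a minimal illustration, not a full pipeline: the function names and the idea of storing one hash per document ID alongside the index are assumptions made for the example.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs: dict[str, str],
                 indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare live documents against the hashes recorded at last indexing.

    current_docs:   doc_id -> current text (hypothetical source-of-truth view)
    indexed_hashes: doc_id -> hash stored when the document was last indexed
    """
    # New or changed documents must be re-chunked and re-embedded.
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Documents that no longer exist must be removed from the index.
    to_delete = sorted(doc_id for doc_id in indexed_hashes
                       if doc_id not in current_docs)
    return {"upsert": to_upsert, "delete": to_delete}
```

Only the documents listed under "upsert" need to be re-embedded, which keeps the re-indexing cost proportional to what actually changed.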

Advanced RAG Patterns

Corrective RAG (CRAG): After initial retrieval, the LLM grades the relevance of each retrieved document. If retrieval quality is low, it triggers web search or a broader retrieval before generating.
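
The CRAG control flow can be sketched as follows. The retriever, grader, and fallback are passed in as plain callables because their real implementations (a vector search, an LLM relevance grader, a web or broader search) are outside the scope of this example; the `threshold` and `min_relevant` values are illustrative assumptions, not recommended settings.

```python
from typing import Callable


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], float],
    fallback: Callable[[str], list[str]],
    threshold: float = 0.5,   # assumed relevance cut-off
    min_relevant: int = 2,    # assumed minimum number of usable documents
) -> list[str]:
    """Grade each retrieved document; broaden the search if too few pass."""
    docs = retrieve(query)
    relevant = [d for d in docs if grade(query, d) >= threshold]
    if len(relevant) >= min_relevant:
        return relevant
    # Retrieval quality is low: trigger the fallback (e.g., web search or a
    # broader retrieval) and keep only fallback documents that also pass.
    extra = [
        d for d in fallback(query)
        if d not in relevant and grade(query, d) >= threshold
    ]
    return relevant + extra
```

In a real system `grade` would typically be an LLM call returning a relevance score per document, which is where the added latency of this pattern comes from.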

Self-RAG: The LLM decides whether retrieval is needed at all and, after generation, assesses whether the generated text is supported by the retrieved evidence, using special reflection tokens (e.g., [Retrieve], [Supported], [Unsupported]).

RAG Fusion: Generate several rephrased versions of the user's query, run each against the retrieval system, then fuse and re-rank the result lists using Reciprocal Rank Fusion (RRF). This increases recall at the cost of retrieval latency.
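
A minimal sketch of the fusion step: each document's RRF score is the sum of 1 / (k + rank) over every ranked list it appears in. The constant k = 60 is the value commonly used for RRF and is an assumption here, as is the choice to break ties alphabetically by document ID.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: one ranked list per query variant, best result first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Three rephrasings of one question, each returning its own top-3.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

Documents that appear near the top of several lists (here "b") rise above documents that rank first in only one list, which is why the technique rewards agreement between query variants.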

These patterns improve quality but add complexity. Implement them only after the baseline RAG system has been evaluated and found insufficient.

Comparison Table

Retrieval Strategy	Pros	Cons	Best For
Dense-only (vector search)	Simple, captures semantics	Misses exact clinical term matches	General clinical questions
Sparse-only (BM25)	Exact match, drug names/ICD codes	No semantic understanding	Formulary lookup, code search
Hybrid (dense + sparse)	Best of both	Slightly more complex	Production clinical RAG
Hybrid + Re-ranking	Highest precision	50–100ms latency overhead	High-stakes clinical decisions
HyDE	Improves recall for complex queries	Extra LLM call for hypothesis	Complex diagnostic questions

Interview Questions

Q1: Design a RAG system for a hospital's clinical decision support platform

Category: System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

I'd begin by clarifying scope: what clinical questions does this answer? What's the scale (concurrent users, query volume)? What's the latency SLA? What EHR system is in use? I'll assume: clinical staff Q&A over institutional guidelines and formulary, 300 concurrent users, p95 < 3 seconds, Epic EHR.

Architecture: two separate pipelines, indexing (offline) and query (real-time). For indexing: document ingest from the EHR's FHIR DocumentReference API and document management systems (SharePoint, Confluence, or equivalent) → format-specific parsers → semantic chunking at 512 tokens with a 64-token overlap → embedding with a domain-appropriate clinical embedding model (e.g., voyage-medical-2 or equivalent; verify current availability) → storage in a vector database (Pinecone, Weaviate, or equivalent) with metadata in PostgreSQL (section hierarchy, document type, last updated, access level).
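
The chunk-size arithmetic (512 tokens, 64-token overlap) can be illustrated with a fixed-size sliding window. Note the substitution: this is plain windowing over an already-tokenized list, not semantic chunking, which would additionally respect section and sentence boundaries. The function name is hypothetical, and the tokens are assumed to come from whatever tokenizer matches the embedding model.

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, overlap: int = 64) -> list[list]:
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 448 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# A 1000-token document yields windows starting at 0, 448, and 896.
example_chunks = chunk_tokens(list(range(1000)))
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing roughly 14% more tokens.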

For the query pipeline: a clinician submits a question through a Claude Desktop or web interface → query classification (diagnostic vs. pharmacology vs. procedure vs. administrative) routes to the appropriate vector collection → hybrid retrieval (dense + BM25) retrieves top-20 candidates → cross-encoder re-ranks to top-8 → patient context from Epic FHIR API appended → Claude Opus generates response citing sources → citation validation confirms each citation's source actually contains the cited content → response with citations delivered.
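
The re-ranking stage (top-20 candidates narrowed to top-8) reduces to scoring each query-document pair and keeping the best few. In this sketch the cross-encoder is represented by a caller-supplied `score` function, since the actual model and its API are not specified here; the function name is hypothetical.

```python
from typing import Callable


def rerank(query: str,
           candidates: list[str],
           score: Callable[[str, str], float],
           top_n: int = 8) -> list[str]:
    """Order candidates by pairwise relevance score and keep the top_n.

    `score(query, doc)` stands in for a cross-encoder forward pass.
    Python's sort is stable, so equally scored candidates keep the
    order the hybrid retriever gave them.
    """
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]
```

Because the scorer runs once per candidate, its cost grows linearly with the candidate count, which is the reason to cap hybrid retrieval at a modest number such as 20 before re-ranking.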

HIPAA architecture: the entire pipeline runs in the hospital's AWS account with a Bedrock endpoint. No PHI leaves the hospital's VPC. Access control enforced at the collection level — payer policy documents accessible only to care coordinators, not to all clinical staff.
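
Collection-level access control can be sketched as a role check applied before any retrieval call, so a restricted collection is never searched on behalf of an unauthorized user. The collection names and role names below are hypothetical; only the rule that payer policy documents are limited to care coordinators comes from the design above.

```python
# Hypothetical mapping of collection -> roles allowed to search it.
COLLECTION_ROLES: dict[str, set[str]] = {
    "clinical_guidelines": {"physician", "nurse", "pharmacist", "care_coordinator"},
    "formulary": {"physician", "nurse", "pharmacist"},
    "payer_policies": {"care_coordinator"},  # restricted per the design above
}


def searchable_collections(user_roles: set[str]) -> list[str]:
    """Return the collections this user may query, in a stable order."""
    return sorted(
        name for name, allowed in COLLECTION_ROLES.items()
        if user_roles & allowed  # at least one of the user's roles is allowed
    )
```

Filtering at this point, rather than filtering results after retrieval, means restricted chunks never reach the re-ranker or the prompt.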

Evaluation: RAGAS metrics run weekly against a 200-question golden dataset (built with the clinical informatics team), with target thresholds of faithfulness > 0.85, context recall > 0.80, and answer relevancy > 0.75. An alert fires if any metric drops by more than 10%.

Key Points to Hit:

Separate indexing and query pipelines
Hybrid retrieval (dense + sparse) — not just vector search
Cross-encoder re-ranking for quality
Patient context from EHR for patient-specific answers
HIPAA-compliant architecture (within VPC, BAA)
RAGAS evaluation pipeline with specific metrics and thresholds
Citation validation to prevent fabricated source attribution

Q2: What causes "hallucinated citations" in RAG systems and how do you prevent them?

Category: Technical Depth | Difficulty: Senior | Role: AI Architect

Answer Framework:

A hallucinated citation occurs when the LLM attributes a factual claim to a retrieved source that either doesn't exist or doesn't contain the claimed information. This is distinct from factual hallucination — it is hallucination specifically of the attribution layer.

The cause: LLMs are trained to produce coherent, well-structured text. When instructed to cite sources, they are rewarded for producing citations that look credible. If the actual retrieved sources are not explicitly available in the context, the model will generate plausible-sounding citations from its parametric memory — which may be fabricated. Even when the retrieved sources are available, the model may attribute a claim to the wrong source within the context if the sources are numerous or similar.

Prevention has two layers: structural and validation. Structurally, the system prompt should explicitly list retrieved sources with numbered identifiers ([Source 1], [Source 2]) and instruct the model to use only these identifiers. This makes a citation to a non-existent source easy to catch: there is no [Source 17] if only 8 were retrieved, so any out-of-range identifier can be rejected automatically.
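
The structural layer can be sketched in two small pieces: formatting the retrieved chunks with numbered identifiers, and checking the generated answer for identifiers outside the retrieved range. The function names are hypothetical, and the exact [Source N] format is assumed to match whatever the system prompt instructs the model to emit.

```python
import re

CITATION_RE = re.compile(r"\[Source (\d+)\]")


def build_context(sources: list[str]) -> str:
    """Label each retrieved chunk with a numbered identifier for the prompt."""
    return "\n\n".join(
        f"[Source {i}] {text}" for i, text in enumerate(sources, start=1)
    )


def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that do not correspond to a retrieved chunk."""
    cited = {int(n) for n in CITATION_RE.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)


# With 8 retrieved sources, a reference to [Source 17] is flagged.
bad = invalid_citations(
    "Start at 5 mg daily [Source 2]. Monitor INR weekly [Source 17].", 8
)
```

This check is cheap because it needs no model call, so it can run on every response before the slower content-level validation.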

Validation: a post-generation step verifies each citation. For each claim-citation pair, retrieve the content of the cited source and use a lightweight model to verify that the cited text actually supports the claim. If it doesn't, either regenerate with stricter instructions or flag for human review. This verification step adds ~200–500ms but is non-negotiable for clinical applications where incorrect attribution could misdirect clinical decision-making.
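
The validation layer can be sketched as a loop over claim-citation pairs with a pluggable support check. Here `keyword_support` is a deliberately crude stand-in for the lightweight verification model and should not be mistaken for a usable clinical verifier; all names are hypothetical, and the claims are assumed to have been extracted from the answer upstream.

```python
from typing import Callable


def keyword_support(claim: str, source_text: str) -> bool:
    """Toy stand-in for a verifier model: every longer word must appear."""
    terms = {w.lower().strip(".,") for w in claim.split() if len(w) > 3}
    text = source_text.lower()
    return all(term in text for term in terms)


def unsupported_claims(
    claims: list[tuple[str, int]],
    sources: list[str],
    supports: Callable[[str, str], bool] = keyword_support,
) -> list[tuple[str, int]]:
    """Return (claim, source_number) pairs whose cited source does not back them.

    Source numbers are 1-based, matching [Source N] identifiers.
    """
    flagged = []
    for claim, source_no in claims:
        if not 1 <= source_no <= len(sources):
            flagged.append((claim, source_no))  # cites a source never retrieved
        elif not supports(claim, sources[source_no - 1]):
            flagged.append((claim, source_no))  # source exists but lacks support
    return flagged
```

Any flagged pair would then follow one of the two paths described above: regenerate with stricter instructions, or route the response to human review.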
