Retrieval-Augmented Generation
Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate
Executive Summary
Retrieval-Augmented Generation (RAG) is the dominant architectural pattern for grounding LLM responses in organizational knowledge — it is the mechanism by which a general-purpose language model becomes a domain-specific expert on your clinical guidelines, your formulary, your policy documents. Every enterprise AI architect must be able to design, evaluate, and optimize RAG systems. This chapter covers the complete RAG architecture, indexing pipeline design, retrieval strategies, evaluation, and the failure modes that most commonly sink production RAG systems.
Prerequisites
- LLM Fundamentals — Context windows, inference mechanics
- Embeddings and Vector Spaces — How semantic search works
Core Concepts
Why RAG Exists
LLMs have two fundamental limitations that RAG addresses:
Knowledge cutoff: LLMs are trained on data up to a fixed date. Clinical guidelines are updated quarterly. Drug interactions are discovered continuously. A model trained in early 2025 does not know about drugs approved or guidelines updated in late 2025. In a clinical context, this staleness can cause harm.
Knowledge capacity: Even a 70-billion parameter model can only memorize a fraction of the world's clinical knowledge during training. Rare conditions, niche formulary restrictions, institution-specific protocols, and patient-specific history are simply not in the model's parametric memory.
RAG addresses both: at inference time, it retrieves the specific relevant knowledge from an up-to-date, institution-specific knowledge base and injects it into the LLM's context window. The model generates its answer from this retrieved evidence rather than from its parametric memory.
The core RAG insight: Don't train the model to know everything — train the model to reason over retrieved evidence. The knowledge lives in the retrieval system; the reasoning lives in the model.
RAG vs. Fine-Tuning — The Fundamental Choice
Before building a RAG system, this question must be answered: why not fine-tune the model to know the specific knowledge?
Fine-tuning teaches the model new knowledge by updating its weights. RAG provides the model with knowledge at inference time via context injection. The trade-offs:
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge freshness | Real-time — update the index, not the model | Stale — requires retraining for updates |
| Citation / source traceability | Natural — sources are in the context | Difficult — knowledge is baked into weights |
| Hallucination risk | Lower (grounded in retrieved text) | Higher (may confabulate fine-tuned knowledge) |
| Domain coverage | Any content indexed | Only what was in fine-tuning set |
| Implementation cost | Moderate (indexing pipeline + retrieval) | High (data curation + training compute + evaluation) |
| Inference cost | Slightly higher (larger context) | Same as base model |
For enterprise clinical AI, RAG is almost always the right choice. Fine-tuning is appropriate for behavioral changes (making a model respond in a specific clinical documentation format, follow a specific reasoning structure) but RAG is superior for knowledge injection.
Architecture
Complete RAG System Architecture
The Indexing Pipeline
The indexing pipeline runs offline (on a schedule or triggered by document updates) and produces the searchable knowledge base.
Stage 1: Document Parsing Raw documents arrive in heterogeneous formats: PDF clinical guidelines, DOCX policy documents, FHIR DocumentReference resources, CSV drug tables. Each format requires a parser that extracts clean text while preserving structure (section headers, tables, lists).
Critical for clinical documents: preserve the section hierarchy as metadata. Knowing that a chunk came from the "Contraindications" section of a drug monograph is more valuable than knowing it came from "FDA Drug Label" generically.
Stage 2: Chunking Covered in depth in Embeddings and Vector Spaces. For clinical RAG:
- Clinical guidelines: recursive character splitting at 512 tokens, 64 token overlap
- Drug formulary entries: structured chunks preserving each drug's full entry (do not split mid-entry)
- Clinical protocols: semantic chunking to preserve step-by-step procedure integrity
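As a rough illustration of the fixed-size-with-overlap idea behind the guideline setting above, here is a minimal sketch. It is a plain sliding window, which is simpler than recursive character splitting, and it approximates tokens with whitespace-separated words. Both are assumptions; a real pipeline would count tokens with the embedding model's tokenizer. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of `chunk_size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace words stand in for model tokens
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of indexing some text twice.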
Stage 3: Embedding Each chunk is embedded using a clinical-domain embedding model. The embedding is stored alongside the chunk text and metadata in the vector store.
Stage 4: Metadata Storage Alongside the vector store, maintain a relational store (PostgreSQL) with metadata for each chunk:
- Source document name, version, last updated date
- Section hierarchy (e.g., "Chapter 3 > Antibiotic Selection > Penicillin-Allergic Patients")
- Content type (guideline, formulary, protocol, research)
- Applicable specialties or departments
- PHI flag (for access control)
This enables metadata filtering — retrieving only from relevant document types before or after vector search.
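A minimal sketch of what such a metadata record and a pre-filter over it might look like. The class name, field names, and `filter_chunks` helper are hypothetical; in practice the same predicate would usually be expressed as a SQL `WHERE` clause or a vector-store filter rather than evaluated in Python.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMetadata:
    chunk_id: str
    source_document: str
    section_path: str            # e.g. "Chapter 3 > Antibiotic Selection"
    content_type: str            # guideline, formulary, protocol, research
    specialties: tuple[str, ...]
    last_updated: date
    contains_phi: bool           # drives access control


def filter_chunks(
    chunks: list[ChunkMetadata],
    *,
    content_type: str | None = None,
    specialty: str | None = None,
    allow_phi: bool = False,
) -> list[ChunkMetadata]:
    """Keep only chunks that match the requested type/specialty and PHI policy."""
    kept = []
    for chunk in chunks:
        if content_type is not None and chunk.content_type != content_type:
            continue
        if specialty is not None and specialty not in chunk.specialties:
            continue
        if chunk.contains_phi and not allow_phi:
            continue  # PHI-flagged chunks are excluded unless explicitly allowed
        kept.append(chunk)
    return kept
```

Filtering before vector search shrinks the candidate set and keeps restricted chunks out of the ranking entirely; filtering afterwards is simpler but can leave fewer than top-K results.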
The Query Pipeline
Stage 1: Query Preprocessing The raw clinical query is preprocessed:
- Intent classification: Is this a drug dosing question? A diagnostic question? A procedure question? Routes to the appropriate document collection.
- Query expansion: Adds synonyms and related terms using a medical thesaurus (SNOMED/UMLS). "Chest pain" → ["chest pain", "angina", "substernal discomfort", "cardiac chest pain"].
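- HyDE (Hypothetical Document Embeddings): Generate a hypothetical ideal answer to the query, then embed that hypothetical answer for retrieval. This is counterintuitive but often effective: a passage-style answer resembles the stored chunks in length and vocabulary far more than a short question does, so its embedding tends to land closer to the relevant chunks, even when the hypothetical answer itself contains errors.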
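The expansion and HyDE steps above can be sketched as follows. The `SYNONYMS` table is a hypothetical stand-in for a SNOMED/UMLS lookup, and `generate` and `embed` are injected callables that would wrap whatever LLM and embedding model the system uses; none of these names come from a real library.

```python
from typing import Callable

# Hypothetical stand-in for a SNOMED/UMLS thesaurus; entries are illustrative only.
SYNONYMS: dict[str, list[str]] = {
    "chest pain": ["angina", "substernal discomfort", "cardiac chest pain"],
}


def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the query plus synonyms for every thesaurus term it contains."""
    expanded = [query]
    lowered = query.lower()
    for term, alternatives in synonyms.items():
        if term in lowered:
            expanded.extend(alt for alt in alternatives if alt not in expanded)
    return expanded


def hyde_embedding(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], list[float]],
) -> list[float]:
    """HyDE: embed a generated hypothetical answer instead of the raw query."""
    hypothetical_answer = generate(
        f"Write a short passage that answers this clinical question:\n{query}"
    )
    return embed(hypothetical_answer)
```

Because HyDE adds an LLM call before retrieval, it is usually reserved for queries where plain query embedding retrieves poorly.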
Stage 2: Hybrid Retrieval Production RAG systems use hybrid retrieval: combining dense embedding search with sparse BM25 keyword search, then fusing the results.
Dense search captures semantic similarity. Sparse BM25 captures exact keyword matches (critical for clinical terms, drug names, and ICD codes that must match exactly). Neither alone is sufficient:
Hybrid Score = α × dense_score + (1 - α) × sparse_score
where α is typically 0.7 for clinical text (more weight on semantic similarity, some weight on exact match). Cosine similarities and BM25 scores live on different scales, so normalize both (for example, min-max over the candidate set) before combining them.
Stage 3: Re-ranking The top-K candidates from retrieval (e.g., top 20) are re-ranked using a cross-encoder model. Unlike the bi-encoder used for initial retrieval (which embeds query and document independently), a cross-encoder processes the query and document together, capturing fine-grained relevance signals.
Cross-encoder re-ranking is the single highest-ROI quality improvement for a mature RAG system, typically improving precision@5 by 15–25% at the cost of 50–100ms additional latency.
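To make the Stage 2 fusion formula concrete, here is a minimal sketch of weighted score fusion. Min-max normalization is one common choice rather than the only one, and the function names are hypothetical. A document missing from one retriever's results is treated as scoring 0 there, which is also an assumption.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_scores(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.7,
) -> list[tuple[str, float]]:
    """Combine per-document scores as alpha * dense + (1 - alpha) * sparse."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    combined = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in dense_n.keys() | sparse_n.keys()
    }
    # Highest score first; ties broken by document id for a stable order.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
```

The fused list is what would be handed to the cross-encoder for re-ranking.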
Stage 4: Context Building The top re-ranked chunks (typically 5–10) are assembled into the LLM context along with:
- The system prompt (defines the AI's role and output format)
- Patient-specific context (if available — encounter context from EHR)
- The original query
- Explicit instructions to cite sources and only use provided context
Stage 5: Generation and Validation The LLM generates a response grounded in the provided context. A validation layer verifies:
- Each factual claim is attributed to a retrieved source
- Cited sources actually contain the claimed information (prevents fabricated citations)
- The response does not contain direct PHI if the output channel is not PHI-authorized
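A minimal sketch of the first two validation checks, assuming the model cites sources with numbered labels such as `[Source 2]`. The lexical-overlap score is a deliberately crude stand-in for the model-based verification a production system would use, and both function names are hypothetical.

```python
import re

CITATION_PATTERN = re.compile(r"\[Source (\d+)")


def find_invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that do not correspond to a retrieved chunk."""
    cited = {int(n) for n in CITATION_PATTERN.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_support(claim: str, source_text: str) -> float:
    """Fraction of the claim's longer words that also appear in the source."""
    claim_words = {w for w in _words(claim) if len(w) > 3}
    if not claim_words:
        return 0.0
    return len(claim_words & _words(source_text)) / len(claim_words)
```

A low `lexical_support` score does not prove a citation is wrong (paraphrases score poorly), so it is better suited to flagging claims for a stronger check than to rejecting answers outright.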
Implementation Patterns
Pattern 1: Complete Clinical RAG Pipeline
"""
Clinical RAG pipeline for HMS knowledge retrieval.

Context: Clinician at a Reference Healthcare Organization asks a clinical question about
antibiotic selection for a specific patient scenario. The system retrieves from
the organization's clinical guidelines and drug formulary.

Educational Example — Illustrative Workflow. Not intended for clinical decision making.

Dependencies:
    anthropic>=0.49.0
    chromadb>=0.6.0
    python-dotenv>=1.0.0
"""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable

import anthropic
import chromadb
from dotenv import load_dotenv

load_dotenv()  # loads ANTHROPIC_API_KEY (and any embedding-provider key) from .env

CLINICAL_SYSTEM_PROMPT = """You are a clinical knowledge assistant for a Reference Healthcare Organization.
This is an educational example of a clinical RAG system prompt.
Not intended for clinical decision making.

Your role:
- Answer clinical questions based ONLY on the provided retrieved context
- Cite specific sources for every factual claim (format: [Source: {document_name}, Section: {section}])
- If the provided context does not contain enough information to answer safely, say so explicitly
- Never extrapolate beyond what is in the provided context
- All responses require physician review before clinical application

If you cannot find the answer in the provided context, respond:
"The provided clinical guidelines do not contain sufficient information to answer this question.
Please consult the relevant specialist or current institutional protocols."

Output format:
1. Direct answer (2-3 sentences)
2. Supporting details with citations
3. Relevant cautions or contraindications from the sources
4. Recommendation to verify with current protocols"""


@dataclass
class RetrievedChunk:
    content: str
    source: str
    section: str
    similarity: float
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class ClinicalRAGResponse:
    answer: str
    retrieved_chunks: list[RetrievedChunk]
    query: str
    model: str
    input_tokens: int
    output_tokens: int


class ClinicalRAGPipeline:
    """
    RAG pipeline for HMS clinical knowledge retrieval.
    Implements dense vector retrieval with metadata filtering; hybrid
    retrieval and re-ranking are left out to keep the example short.
    """

    def __init__(
        self,
        anthropic_client: anthropic.Anthropic,
        chroma_client: chromadb.Client,
        embed_fn: Callable[[str], list[float]],
        collection_name: str = "clinical_guidelines",
        model: str = "claude-opus-4-8",
        top_k_retrieve: int = 20,
        top_k_context: int = 8,
        similarity_threshold: float = 0.60,
    ) -> None:
        self.anthropic = anthropic_client
        self.collection = chroma_client.get_collection(collection_name)
        # The Anthropic SDK has no embeddings endpoint, so the embedder is
        # injected. embed_fn must wrap the same embedding model that was used
        # to index the collection (e.g. a clinical-domain embedding model).
        self.embed_fn = embed_fn
        self.model = model
        self.top_k_retrieve = top_k_retrieve
        self.top_k_context = top_k_context
        self.similarity_threshold = similarity_threshold

    def _embed_query(self, query: str) -> list[float]:
        return self.embed_fn(query)

    def _retrieve(
        self,
        query: str,
        document_type_filter: str | None = None,
    ) -> list[RetrievedChunk]:
        """Retrieve relevant chunks using vector similarity search."""
        query_embedding = self._embed_query(query)

        # Optional metadata filter: restrict the search to one content type.
        where_filter = None
        if document_type_filter:
            where_filter = {"content_type": {"$eq": document_type_filter}}

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=self.top_k_retrieve,
            where=where_filter,
            include=["documents", "metadatas", "distances"],
        )

        chunks = []
        for doc, meta, distance in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        ):
            # Valid only if the collection was created with cosine distance;
            # Chroma's default metric is squared L2, where 1 - distance is
            # not a similarity.
            similarity = 1.0 - distance
            if similarity >= self.similarity_threshold:
                chunks.append(RetrievedChunk(
                    content=doc,
                    source=meta.get("document", "Unknown"),
                    section=meta.get("section", "General"),
                    similarity=round(similarity, 4),
                    metadata=meta,
                ))
        return sorted(chunks, key=lambda c: c.similarity, reverse=True)

    def _build_context(self, chunks: list[RetrievedChunk]) -> str:
        """Format retrieved chunks into LLM context with clear source attribution."""
        context_parts = []
        for i, chunk in enumerate(chunks[: self.top_k_context], start=1):
            context_parts.append(
                f"[Source {i}: {chunk.source} | Section: {chunk.section} "
                f"| Relevance: {chunk.similarity:.2f}]\n{chunk.content}"
            )
        return "\n\n---\n\n".join(context_parts)

    def query(
        self,
        clinical_question: str,
        patient_context: str | None = None,
    ) -> ClinicalRAGResponse:
        """
        Process a clinical question through the RAG pipeline.

        patient_context: Optional structured patient context from EHR
        (allergies, current medications, relevant diagnoses). This is
        combined with retrieved knowledge to enable patient-specific answers.
        """
        chunks = self._retrieve(clinical_question)
        context = self._build_context(chunks)

        user_message = f"Clinical Question: {clinical_question}"
        if patient_context:
            user_message = (
                f"Patient Context:\n{patient_context}\n\n"
                f"Clinical Question: {clinical_question}"
            )
        full_user_message = (
            f"{user_message}\n\n"
            f"Retrieved Clinical Knowledge:\n\n{context}"
        )

        response = self.anthropic.messages.create(
            model=self.model,
            max_tokens=1536,
            system=CLINICAL_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": full_user_message}],
        )
        return ClinicalRAGResponse(
            answer=response.content[0].text,
            retrieved_chunks=chunks[: self.top_k_context],
            query=clinical_question,
            model=self.model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
        )

Healthcare Context — HMS Scenario
Clinical RAG Knowledge Base Architecture
A Reference Healthcare Organization's clinical RAG knowledge base may contain multiple document collections, each with different chunking, retrieval, and access control requirements. An example structure:
| Collection | Content | Chunk Strategy | Update Frequency | Access |
|---|---|---|---|---|
| clinical_guidelines | UpToDate, hospital protocols, clinical pathways | Semantic, 512 tokens | Quarterly | All clinical staff |
| drug_formulary | HMS formulary, drug monographs, interactions | Per-drug entry | Monthly | Clinical + pharmacy |
| research_literature | Filtered PubMed abstracts, key clinical trials | Fixed, 384 tokens | Weekly | Physicians + APPs |
| payer_policies | Insurance prior auth criteria, coverage policies | Fixed, 512 tokens | Monthly | Care coordinators |
| regulatory_docs | HIPAA policies, Joint Commission standards | Fixed, 256 tokens | Annual | Compliance + legal |
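One way to enforce the Access column is to resolve a user's role to the collections they may search before any retrieval happens. The sketch below assumes a simple role-to-collection mapping; the role names, and which roles count as "clinical staff", are illustrative assumptions rather than part of the design above.

```python
# Hypothetical role names; the mapping mirrors the Access column of the table.
COLLECTION_ACCESS: dict[str, set[str]] = {
    "clinical_guidelines": {"physician", "app", "nurse", "pharmacist"},
    "drug_formulary": {"physician", "app", "nurse", "pharmacist"},
    "research_literature": {"physician", "app"},
    "payer_policies": {"care_coordinator"},
    "regulatory_docs": {"compliance", "legal"},
}


def allowed_collections(role: str) -> list[str]:
    """Return the collections a role may query, in a stable order."""
    return sorted(
        name for name, roles in COLLECTION_ACCESS.items() if role in roles
    )


def authorize(role: str, collection: str) -> None:
    """Raise before retrieval if the role may not search the collection."""
    if collection not in allowed_collections(role):
        raise PermissionError(f"role {role!r} may not query {collection!r}")
```

Checking access before the vector search, rather than filtering results afterwards, keeps restricted text out of the LLM context entirely.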
Patient Context Injection
The highest-value clinical RAG systems combine retrieved knowledge with patient-specific EHR context:
# extract_allergies, extract_active_medications, and extract_active_diagnoses
# are assumed helpers (not shown) that return lists of dicts parsed from the
# bundle's AllergyIntolerance, MedicationRequest, and Condition resources.
def build_patient_context_for_rag(
    patient_fhir_bundle: dict,
) -> str:
    """
    Extract clinically relevant context from FHIR R4 bundle
    for injection alongside retrieved knowledge.
    PHI is authorized for use within the clinical application boundary.
    """
    context_parts = []

    # Active allergies — critical safety context
    allergies = extract_allergies(patient_fhir_bundle)
    if allergies:
        context_parts.append(
            f"ALLERGIES (verify before any medication order): "
            f"{', '.join(a['substance'] + ' → ' + a['reaction'] for a in allergies)}"
        )

    # Active medications — drug interaction context
    medications = extract_active_medications(patient_fhir_bundle)
    if medications:
        context_parts.append(
            f"ACTIVE MEDICATIONS: {', '.join(m['name'] + ' ' + m['dose'] for m in medications)}"
        )

    # Relevant diagnoses — indication context
    diagnoses = extract_active_diagnoses(patient_fhir_bundle)
    if diagnoses:
        context_parts.append(
            f"ACTIVE DIAGNOSES: {', '.join(d['display'] for d in diagnoses)}"
        )

    return "\n".join(context_parts)

This patient context injection enables answers like: "Azithromycin is first-line for atypical community-acquired pneumonia. However, the patient has an active QT prolongation diagnosis and is on methadone; azithromycin is contraindicated in this patient due to additive QT prolongation risk. Doxycycline would be an alternative."
Without patient context, the RAG system can only answer the general question. With patient context, it can answer the patient-specific question — the difference between a reference tool and clinical decision support.
Trade-offs and Considerations
RAG Failure Modes
| Failure Mode | Cause | Detection | Mitigation |
|---|---|---|---|
| Wrong retrieval | Low relevance threshold, poor embeddings | RAGAS context recall metric | Better chunking, domain-specific embedder |
| Ignored retrieval | LLM uses parametric memory instead of context | RAGAS faithfulness metric | Stricter system prompt: "only use provided context" |
| Hallucinated citations | LLM fabricates source attribution | Citation verification pipeline | Validate that cited source contains claimed content |
| Context too long | Top-k too large, context budget exceeded | Monitor context window usage | Reduce top_k; use re-ranker to select best 5 not top 20 |
| Stale knowledge | Index not updated when documents change | Monitor document update timestamps | Incremental re-indexing pipeline with change detection |
| PHI leakage | Clinical chunks contain patient identifiers | PHI scanner on ingested documents | De-identify before indexing OR enforce strict access control on collection |
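For the stale-knowledge row, change detection can be as simple as comparing content hashes between the index and the current document set. A minimal sketch, with hypothetical function names and plain dictionaries standing in for the document store and the index's bookkeeping table:

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    indexed_hashes: dict[str, str],
    current_docs: dict[str, str],
) -> tuple[list[str], list[str]]:
    """
    Compare stored hashes against current document text.

    Returns (ids to embed and upsert, ids to delete from the index).
    """
    to_index = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(
        doc_id for doc_id in indexed_hashes if doc_id not in current_docs
    )
    return to_index, to_delete
```

Only the documents in `to_index` need re-chunking and re-embedding, which keeps scheduled refreshes cheap when most of the corpus is unchanged.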
Advanced RAG Patterns
Corrective RAG (CRAG): After initial retrieval, the LLM grades the relevance of each retrieved document. If retrieval quality is low, it triggers web search or a broader retrieval before generating.
Self-RAG: The LLM decides whether retrieval is needed at all and, after generation, assesses whether the generated text is supported by the retrieved evidence. It signals these decisions with special reflection tokens (for example [Retrieve], [Supported], [Unsupported]).
RAG Fusion: Generate several rephrasings of the query, run each against the retrieval system, then fuse and re-rank the result lists using Reciprocal Rank Fusion (RRF). This increases recall at the cost of retrieval latency.
These patterns improve quality but add complexity. Implement them only after the baseline RAG system has been evaluated and found insufficient.
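Reciprocal Rank Fusion, mentioned under RAG Fusion, is small enough to sketch in full: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The constant k = 60 is the value commonly used in the literature; the function name is hypothetical.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because RRF uses only ranks, it needs no score normalization, which makes it a convenient alternative to weighted score fusion when dense and BM25 scores are hard to calibrate.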
Comparison Table
| Retrieval Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Dense-only (vector search) | Simple, captures semantics | Misses exact clinical term matches | General clinical questions |
| Sparse-only (BM25) | Exact match, drug names/ICD codes | No semantic understanding | Formulary lookup, code search |
| Hybrid (dense + sparse) | Best of both | Slightly more complex | Production clinical RAG |
| Hybrid + Re-ranking | Highest precision | 50–100ms latency overhead | High-stakes clinical decisions |
| HyDE | Improves recall for complex queries | Extra LLM call for hypothesis | Complex diagnostic questions |
Interview Questions
Q1: Design a RAG system for a hospital's clinical decision support platform
Category: System Design | Difficulty: Principal | Role: AI Architect
Answer Framework:
I'd begin by clarifying scope: what clinical questions does this answer? What's the scale (concurrent users, query volume)? What's the latency SLA? What EHR system is in use? I'll assume: clinical staff Q&A over institutional guidelines and formulary, 300 concurrent users, p95 < 3 seconds, Epic EHR.
Architecture: Two separate pipelines — indexing (offline) and query (real-time). For indexing: document ingest from an EHR's FHIR DocumentReference API and document management systems (SharePoint, Confluence, or equivalent) → format-specific parsers → semantic chunking at 512 tokens with 64 token overlap → embedding with a domain-appropriate clinical embedding model (e.g., voyage-medical-2 or equivalent — verify current availability) → storage in a vector database (Pinecone, Weaviate, or equivalent) with metadata in PostgreSQL (section hierarchy, document type, last updated, access level).
For the query pipeline: clinical staff submits a question through a Claude Desktop or web interface → query classification (diagnostic vs. pharmacology vs. procedure vs. administrative) routes to the appropriate Pinecone collection → hybrid retrieval (dense + BM25) retrieves top-20 candidates → cross-encoder re-ranks to top-8 → patient context from Epic FHIR API appended → Claude Opus generates response citing sources → citation validation confirms each citation's source actually contains the cited content → response with citations delivered.
HIPAA architecture: the entire pipeline runs in the hospital's AWS account and calls Claude through Amazon Bedrock over a VPC endpoint, under a signed BAA, so PHI never crosses the public internet or leaves the hospital's AWS boundary. Access control is enforced at the collection level: payer policy documents are accessible only to care coordinators, not to all clinical staff.
Evaluation: RAGAS metrics run weekly against a 200-question golden dataset (built with the clinical informatics team), with targets of faithfulness > 0.85, context recall > 0.80, and answer relevancy > 0.75. An alert fires if any metric drops by more than 10% from its previous value.
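The threshold and alerting logic can be sketched independently of how the metrics are computed. The version below assumes the weekly run produces a dictionary of metric values, and it interprets the 10% drop as relative to the previous run; both the function name and that interpretation are assumptions.

```python
# Targets quoted in the answer above.
THRESHOLDS: dict[str, float] = {
    "faithfulness": 0.85,
    "context_recall": 0.80,
    "answer_relevancy": 0.75,
}
MAX_RELATIVE_DROP = 0.10  # assumption: "drops > 10%" means relative to the last run


def evaluate_run(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return human-readable alerts for metrics below target or dropping sharply."""
    alerts = []
    for metric, target in THRESHOLDS.items():
        value = current[metric]
        if value <= target:
            alerts.append(f"{metric} {value:.2f} is not above the {target:.2f} target")
        previous = baseline.get(metric)
        if previous and (previous - value) / previous > MAX_RELATIVE_DROP:
            alerts.append(f"{metric} dropped from {previous:.2f} to {value:.2f}")
    return alerts
```

Tracking the relative drop as well as the absolute target catches regressions in a metric that is still above its floor.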
Key Points to Hit:
- Separate indexing and query pipelines
- Hybrid retrieval (dense + sparse) — not just vector search
- Cross-encoder re-ranking for quality
- Patient context from EHR for patient-specific answers
- HIPAA-compliant architecture (within VPC, BAA)
- RAGAS evaluation pipeline with specific metrics and thresholds
- Citation validation to prevent fabricated source attribution
Q2: What causes "hallucinated citations" in RAG systems and how do you prevent them?
Category: Technical Depth | Difficulty: Senior | Role: AI Architect
Answer Framework:
A hallucinated citation occurs when the LLM attributes a factual claim to a retrieved source that either doesn't exist or doesn't contain the claimed information. This is distinct from factual hallucination — it is hallucination specifically of the attribution layer.
The cause: LLMs are trained to produce coherent, well-structured text. When instructed to cite sources, they are rewarded for producing citations that look credible. If the actual retrieved sources are not explicitly available in the context, the model will generate plausible-sounding citations from its parametric memory — which may be fabricated. Even when the retrieved sources are available, the model may attribute a claim to the wrong source within the context if the sources are numerous or similar.
Prevention has two layers: structural and validation. Structurally, the prompt should present the retrieved sources with numbered identifiers ([Source 1], [Source 2]) and instruct the model to cite only these identifiers. The model can still emit an identifier that was never provided, but such a citation is now trivial to detect: there is no [Source 17] if only 8 were retrieved.
Validation: a post-generation step verifies each citation. For each claim-citation pair, retrieve the content of the cited source and use a lightweight model to verify that the cited text actually supports the claim. If it doesn't, either regenerate with stricter instructions or flag for human review. This verification step adds ~200–500ms but is non-negotiable for clinical applications where incorrect attribution could misdirect clinical decision-making.
Further Reading
In This Repository:
- Evaluation and Benchmarking — How to measure RAG system quality with RAGAS
- Clinical RAG — [PLANNED] Deep dive on clinical-specific RAG architecture
- examples/rag/ — Working RAG implementations
- architecture/mermaid/01-rag-pipeline-architecture.mmd — [PLANNED] Standalone diagram source
Summary
Key Takeaways:
- RAG grounds LLM responses in up-to-date, institution-specific knowledge: it addresses the knowledge cutoff and limited knowledge capacity of the model, and it lowers (but does not eliminate) hallucination risk in enterprise domains
- The indexing pipeline (parse → chunk → embed → store) runs offline; the query pipeline (retrieve → rerank → augment → generate) runs at inference time
- Hybrid retrieval (dense semantic + sparse BM25) outperforms either alone for clinical text — use both
- Cross-encoder re-ranking is the highest-ROI quality improvement for mature RAG systems
- Patient context injection from EHR elevates RAG from reference tool to patient-specific clinical decision support
- RAG failure mode #1: hallucinated citations — prevent with structured citation format + post-generation validation
- Evaluate RAG continuously with RAGAS: faithfulness, context recall, answer relevancy — don't guess at quality
The one thing to remember: RAG is a retrieval problem as much as an LLM problem. The quality of the answer is bounded by the quality of retrieval. Every engineering hour spent on chunking, embedding model selection, and hybrid retrieval pays dividends in answer quality that prompt engineering alone cannot buy.
Next: Prompt Engineering | Previous: Embeddings and Vector Spaces