Retrieval-Augmented Generation

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate

Executive Summary

Retrieval-Augmented Generation (RAG) is the dominant architectural pattern for grounding LLM responses in organizational knowledge — it is the mechanism by which a general-purpose language model becomes a domain-specific expert on your clinical guidelines, your formulary, your policy documents. Every enterprise AI architect must be able to design, evaluate, and optimize RAG systems. This chapter covers the complete RAG architecture, indexing pipeline design, retrieval strategies, evaluation, and the failure modes that most commonly sink production RAG systems.

Prerequisites

LLM Fundamentals — Context windows, inference mechanics
Embeddings and Vector Spaces — How semantic search works

Core Concepts

Why RAG Exists

LLMs have two fundamental limitations that RAG addresses:

Knowledge cutoff: LLMs are trained on data up to a fixed date. Clinical guidelines are updated quarterly. Drug interactions are discovered continuously. A model trained in early 2025 does not know about drugs approved or guidelines updated in late 2025. In a clinical context, this staleness can cause harm.

Knowledge capacity: Even a 70-billion parameter model can only memorize a fraction of the world's clinical knowledge during training. Rare conditions, niche formulary restrictions, institution-specific protocols, and patient-specific history are simply not in the model's parametric memory.

RAG addresses both: at inference time, it retrieves the specific relevant knowledge from an up-to-date, institution-specific knowledge base and injects it into the LLM's context window. The model generates its answer from this retrieved evidence rather than from its parametric memory.

The core RAG insight: Don't train the model to know everything; have it reason over retrieved evidence. The knowledge lives in the retrieval system; the reasoning lives in the model.

RAG vs. Fine-Tuning — The Fundamental Choice

Before building a RAG system, this question must be answered: why not fine-tune the model to know the specific knowledge?

Fine-tuning teaches the model new knowledge by updating its weights. RAG provides the model with knowledge at inference time via context injection. The trade-offs:

Dimension	RAG	Fine-Tuning
Knowledge freshness	Real-time — update the index, not the model	Stale — requires retraining for updates
Citation / source traceability	Natural — sources are in the context	Difficult — knowledge is baked into weights
Hallucination risk	Lower (grounded in retrieved text)	Higher (may confabulate fine-tuned knowledge)
Domain coverage	Any content indexed	Only what was in fine-tuning set
Implementation cost	Moderate (indexing pipeline + retrieval)	High (data curation + training compute + evaluation)
Inference cost	Slightly higher (larger context)	Same as base model

For enterprise clinical AI, RAG is almost always the right choice. Fine-tuning is appropriate for behavioral changes (making a model respond in a specific clinical documentation format, follow a specific reasoning structure) but RAG is superior for knowledge injection.

Architecture

Complete RAG System Architecture

graph TD
    %% Title: Production RAG Architecture - HMS Clinical Knowledge Base
    subgraph "Data Sources"
        DS1["Clinical Guidelines\n(PDF/DOCX)"]
        DS2["Drug Formulary\n(CSV/JSON)"]
        DS3["EHR Protocols\n(FHIR DocRef)"]
        DS4["Research Articles\n(PubMed FHIR)"]
    end
    subgraph "Indexing Pipeline (offline)"
        P1["Document Parser\n& Extractor"]
        P2["Text Chunker\n(semantic/recursive)"]
        P3["Embedding Model\n(voyage-medical-2)"]
        P4["Vector Store\n(Pinecone / Weaviate)"]
        P5["Metadata Store\n(PostgreSQL)"]
    end
    subgraph "Query Pipeline (real-time)"
        Q1["Clinician Query"]
        Q2["Query Preprocessor\n(intent + expansion)"]
        Q3["Retriever\n(dense + sparse hybrid)"]
        Q4["Re-ranker\n(cross-encoder)"]
        Q5["Context Builder\n(+ patient context)"]
        Q6["LLM\n(Claude Opus)"]
        Q7["Response Validator\n(citation check)"]
        Q8["Clinical Response\n+ Citations"]
    end
    DS1 & DS2 & DS3 & DS4 --> P1
    P1 --> P2 --> P3 --> P4
    P1 --> P5
    Q1 --> Q2 --> Q3
    P4 -->|"ANN search"| Q3
    P5 -->|"Metadata filter"| Q3
    Q3 --> Q4 --> Q5 --> Q6 --> Q7 --> Q8

The Indexing Pipeline

The indexing pipeline runs offline (on a schedule or triggered by document updates) and produces the searchable knowledge base.

Stage 1: Document Parsing

Raw documents arrive in heterogeneous formats: PDF clinical guidelines, DOCX policy documents, FHIR DocumentReference resources, CSV drug tables. Each format requires a parser that extracts clean text while preserving structure (section headers, tables, lists).

Critical for clinical documents: preserve the section hierarchy as metadata. Knowing that a chunk came from the "Contraindications" section of a drug monograph is more valuable than knowing it came from "FDA Drug Label" generically.

Stage 2: Chunking

Covered in depth in Embeddings and Vector Spaces. For clinical RAG:

Clinical guidelines: recursive character splitting at 512 tokens, 64 token overlap
Drug formulary entries: structured chunks preserving each drug's full entry (do not split mid-entry)
Clinical protocols: semantic chunking to preserve step-by-step procedure integrity
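
The guideline setting above (512-token windows sharing 64 tokens) can be sketched as a plain sliding window. This is a simplification rather than true recursive splitting: it treats whitespace-separated words as tokens, whereas a real pipeline would count tokens with the embedding model's tokenizer and prefer paragraph and sentence boundaries. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of `chunk_size` tokens that share `overlap` tokens.

    Assumption: whitespace-separated words stand in for model tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, each chunk begins 448 tokens after the previous one, so the last 64 tokens of one chunk are repeated at the start of the next.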

Stage 3: Embedding

Each chunk is embedded using a clinical-domain embedding model. The embedding is stored alongside the chunk text and metadata in the vector store.

Stage 4: Metadata Storage

Alongside the vector store, maintain a relational store (PostgreSQL) with metadata for each chunk:

Source document name, version, last updated date
Section hierarchy (e.g., "Chapter 3 > Antibiotic Selection > Penicillin-Allergic Patients")
Content type (guideline, formulary, protocol, research)
Applicable specialties or departments
PHI flag (for access control)

This enables metadata filtering — retrieving only from relevant document types before or after vector search.
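
As a rough illustration of the metadata fields listed above and of a pre-retrieval filter, the sketch below models one record per chunk in memory. The class name, field names, and the `allowed_chunks` helper are assumptions for illustration; the source specifies only the kinds of metadata, not a schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk metadata record mirroring the list above."""
    document: str
    version: str
    last_updated: date
    section_path: str          # e.g. "Chapter 3 > Antibiotic Selection"
    content_type: str          # guideline, formulary, protocol, research
    specialties: tuple[str, ...]
    contains_phi: bool


def allowed_chunks(
    records: list[ChunkMetadata],
    content_type: str,
    phi_authorized: bool,
) -> list[ChunkMetadata]:
    """Keep records of one content type, dropping PHI-flagged ones for unauthorized callers."""
    return [
        r for r in records
        if r.content_type == content_type and (phi_authorized or not r.contains_phi)
    ]
```

In a deployed system the same predicate would more likely be expressed as a SQL `WHERE` clause or a vector-store metadata filter than as a Python list comprehension.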

The Query Pipeline

Stage 1: Query Preprocessing

The raw clinical query is preprocessed:

Intent classification: Is this a drug dosing question? A diagnostic question? A procedure question? Routes to the appropriate document collection.
Query expansion: Adds synonyms and related terms using a medical thesaurus (SNOMED/UMLS). "Chest pain" → ["chest pain", "angina", "substernal discomfort", "cardiac chest pain"].
HyDE (Hypothetical Document Embeddings): Generate a hypothetical ideal answer to the query, then embed that hypothetical answer for retrieval. This is counterintuitive but often effective: a hypothetical answer resembles real answer passages in length and vocabulary, so its embedding tends to land closer to them than the embedding of a short question does.
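
The query expansion step can be sketched with a small lookup table. The `SYNONYMS` dictionary is a hypothetical stand-in for a SNOMED/UMLS lookup, and only its "chest pain" entry comes from the example above; `expand_query` is likewise an illustrative name.

```python
# Hypothetical stand-in for a medical thesaurus lookup (SNOMED/UMLS).
SYNONYMS: dict[str, list[str]] = {
    "chest pain": ["angina", "substernal discomfort", "cardiac chest pain"],
}


def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the normalized query followed by variants with known terms swapped for synonyms."""
    normalized = " ".join(query.lower().split())
    expanded = [normalized]
    for term, alternatives in synonyms.items():
        if term not in normalized:
            continue
        for alternative in alternatives:
            variant = normalized.replace(term, alternative)
            if variant not in expanded:
                expanded.append(variant)
    return expanded
```

Each variant can then be sent to the retriever separately, or the variants can be joined into one expanded keyword query for the sparse search.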

Stage 2: Hybrid Retrieval

Production RAG systems use hybrid retrieval: combining dense embedding search with sparse BM25 keyword search, then fusing the results.

Dense search captures semantic similarity. Sparse BM25 captures exact keyword matches (critical for clinical terms, drug names, and ICD codes that must match exactly). Neither alone is sufficient:

text

Hybrid Score = α × dense_score + (1 - α) × sparse_score

Where α is typically 0.7 for clinical text
(more weight on semantic, some weight on exact match)
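
A minimal sketch of the weighted fusion above follows. One detail the formula leaves implicit is that BM25 scores are unbounded while cosine similarities are not, so the two score sets need to be brought onto a common scale before they are mixed. Min-max normalization is used here as one reasonable choice, not the only one; the function names are illustrative.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map each to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_scores(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.7,
) -> dict[str, float]:
    """Compute alpha * dense + (1 - alpha) * sparse over normalized scores.

    A document missing from one result list contributes 0 for that component.
    """
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
```

Rank-based fusion such as Reciprocal Rank Fusion avoids the normalization question entirely and is a common alternative when the two score distributions are hard to compare.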

Stage 3: Re-ranking

The top-K candidates from retrieval (e.g., top 20) are re-ranked using a cross-encoder model. Unlike the bi-encoder used for initial retrieval (which embeds query and document independently), a cross-encoder processes the query and document together, capturing fine-grained relevance signals.

Cross-encoder re-ranking is the single highest-ROI quality improvement for a mature RAG system, typically improving precision@5 by 15–25% at the cost of 50–100ms additional latency.
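
The shape of the re-ranking step can be sketched independently of any particular model: score each (query, document) pair jointly, sort, and keep the best few. In the sketch below `score_pair` is an injected callable; in practice it would wrap a trained cross-encoder, while the `token_overlap` scorer is only a toy placeholder so the example runs without a model. Both names are assumptions.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 8,
) -> list[str]:
    """Score every (query, candidate) pair jointly and return the top_k candidates."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    # Python's sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def token_overlap(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0
```

Because the scorer sees the query and the document together, it is called once per candidate at query time, which is why re-ranking is applied to a short list (such as the top 20) rather than the whole index.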

Stage 4: Context Building

The top re-ranked chunks (typically 5–10) are assembled into the LLM context along with:

The system prompt (defines the AI's role and output format)
Patient-specific context (if available — encounter context from EHR)
The original query
Explicit instructions to cite sources and only use provided context

Stage 5: Generation and Validation

The LLM generates a response grounded in the provided context. A validation layer verifies:

Each factual claim is attributed to a retrieved source
Cited sources actually contain the claimed information (prevents fabricated citations)
The response does not contain direct PHI if the output channel is not PHI-authorized
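
The cheapest part of this validation is structural: confirming that every cited identifier refers to a chunk that was actually placed in the context. The sketch below assumes answers cite sources as `[Source N]`, matching numbered labels in the context; that citation format and the function name are assumptions. It does not check whether the cited text supports the claim, which needs a separate semantic comparison.

```python
import re

# Assumed citation format: "[Source 3]".
CITATION_PATTERN = re.compile(r"\[Source (\d+)\]")


def find_invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that fall outside 1..num_sources, sorted ascending."""
    cited = {int(number) for number in CITATION_PATTERN.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)
```

A non-empty result is a signal to regenerate the answer or route it for human review before it is shown.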

Implementation Patterns

Pattern 1: Complete Clinical RAG Pipeline

python

"""
Clinical RAG pipeline for HMS knowledge retrieval.
Context: Clinician at a Reference Healthcare Organization asks a clinical question about
antibiotic selection for a specific patient scenario. The system retrieves from
the organization's clinical guidelines and drug formulary.

Educational Example — Illustrative Workflow. Not intended for clinical decision making.

Dependencies:
    anthropic>=0.49.0
    chromadb>=0.6.0
    python-dotenv>=1.0.0
"""
from __future__ import annotations
import os
from dataclasses import dataclass, field
from typing import Any
import anthropic
import chromadb
from dotenv import load_dotenv

load_dotenv()

CLINICAL_SYSTEM_PROMPT = """You are a clinical knowledge assistant for a Reference Healthcare Organization.
This is an educational example of a clinical RAG system prompt.
Not intended for clinical decision making.

Your role:
- Answer clinical questions based ONLY on the provided retrieved context
- Cite specific sources for every factual claim (format: [Source: {document_name}, Section: {section}])
- If the provided context does not contain enough information to answer safely, say so explicitly
- Never extrapolate beyond what is in the provided context
- All responses require physician review before clinical application

If you cannot find the answer in the provided context, respond:
"The provided clinical guidelines do not contain sufficient information to answer this question. 
Please consult the relevant specialist or current institutional protocols."

Output format:
1. Direct answer (2-3 sentences)
2. Supporting details with citations
3. Relevant cautions or contraindications from the sources
4. Recommendation to verify with current protocols"""


@dataclass
class RetrievedChunk:
    content: str
    source: str
    section: str
    similarity: float
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class ClinicalRAGResponse:
    answer: str
    retrieved_chunks: list[RetrievedChunk]
    query: str
    model: str
    input_tokens: int
    output_tokens: int


class ClinicalRAGPipeline:
    """
    Illustrative RAG pipeline for HMS clinical knowledge retrieval.
    Implements dense vector retrieval with optional metadata filtering;
    sparse (BM25) search and re-ranking are omitted for brevity.
    """

    def __init__(
        self,
        anthropic_client: anthropic.Anthropic,
        chroma_client: chromadb.Client,
        collection_name: str = "clinical_guidelines",
        model: str = "claude-opus-4-8",
        top_k_retrieve: int = 20,
        top_k_context: int = 8,
        similarity_threshold: float = 0.60,
    ) -> None:
        self.anthropic = anthropic_client
        self.collection = chroma_client.get_collection(collection_name)
        self.model = model
        self.top_k_retrieve = top_k_retrieve
        self.top_k_context = top_k_context
        self.similarity_threshold = similarity_threshold

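    def _embed_query(self, query: str) -> list[float]:
        # The Anthropic SDK has no embeddings endpoint. This sketch assumes the
        # separate `voyageai` package, with VOYAGE_API_KEY set in the environment.
        # The model name is kept from the architecture diagram; verify that it
        # is currently available before relying on it.
        import voyageai

        voyage_client = voyageai.Client()
        result = voyage_client.embed(
            [query],
            model="voyage-medical-2",
            input_type="query",
        )
        return result.embeddings[0]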

    def _retrieve(
        self,
        query: str,
        document_type_filter: str | None = None,
    ) -> list[RetrievedChunk]:
        """Retrieve relevant chunks using vector similarity search."""
        query_embedding = self._embed_query(query)

        where_filter = {}
        if document_type_filter:
            where_filter = {"content_type": {"$eq": document_type_filter}}

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=self.top_k_retrieve,
            where=where_filter if where_filter else None,
            include=["documents", "metadatas", "distances"],
        )

        chunks = []
        for doc, meta, distance in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        ):
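            # Assumes the collection uses cosine distance (for example, created
            # with metadata {"hnsw:space": "cosine"}). Chroma's default is squared
            # L2, for which 1 - distance is not a similarity score.
            similarity = 1.0 - distance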
            if similarity >= self.similarity_threshold:
                chunks.append(RetrievedChunk(
                    content=doc,
                    source=meta.get("document", "Unknown"),
                    section=meta.get("section", "General"),
                    similarity=round(similarity, 4),
                    metadata=meta,
                ))

        return sorted(chunks, key=lambda c: c.similarity, reverse=True)

    def _build_context(self, chunks: list[RetrievedChunk]) -> str:
        """Format retrieved chunks into LLM context with clear source attribution."""
        context_parts = []
        for i, chunk in enumerate(chunks[: self.top_k_context], start=1):
            context_parts.append(
                f"[Source {i}: {chunk.source} | Section: {chunk.section} "
                f"| Relevance: {chunk.similarity:.2f}]\n{chunk.content}"
            )
        return "\n\n---\n\n".join(context_parts)

    def query(
        self,
        clinical_question: str,
        patient_context: str | None = None,
    ) -> ClinicalRAGResponse:
        """
        Process a clinical question through the RAG pipeline.

        patient_context: Optional structured patient context from EHR
        (allergies, current medications, relevant diagnoses). This is
        combined with retrieved knowledge to enable patient-specific answers.
        """
        chunks = self._retrieve(clinical_question)
        context = self._build_context(chunks)

        user_message = f"Clinical Question: {clinical_question}"
        if patient_context:
            user_message = (
                f"Patient Context:\n{patient_context}\n\n"
                f"Clinical Question: {clinical_question}"
            )

        full_user_message = (
            f"{user_message}\n\n"
            f"Retrieved Clinical Knowledge:\n\n{context}"
        )

        response = self.anthropic.messages.create(
            model=self.model,
            max_tokens=1536,
            system=CLINICAL_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": full_user_message}],
        )

        return ClinicalRAGResponse(
            answer=response.content[0].text,
            retrieved_chunks=chunks[: self.top_k_context],
            query=clinical_question,
            model=self.model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
        )

Healthcare Context — HMS Scenario

Clinical RAG Knowledge Base Architecture

A Reference Healthcare Organization's clinical RAG knowledge base may contain multiple document collections, each with different chunking, retrieval, and access control requirements. An example structure:

Collection	Content	Chunk Strategy	Update Frequency	Access
`clinical_guidelines`	UpToDate, hospital protocols, clinical pathways	Semantic 512T	Quarterly	All clinical staff
`drug_formulary`	HMS formulary, drug monographs, interactions	Per-drug entry	Monthly	Clinical + pharmacy
`research_literature`	Filtered PubMed abstracts, key clinical trials	Fixed 384T	Weekly	Physicians + APPs
`payer_policies`	Insurance prior auth criteria, coverage policies	Fixed 512T	Monthly	Care coordinators
`regulatory_docs`	HIPAA policies, Joint Commission standards	Fixed 256T	Annual	Compliance + legal

Patient Context Injection

The highest-value clinical RAG systems combine retrieved knowledge with patient-specific EHR context:

python

def build_patient_context_for_rag(
    patient_fhir_bundle: dict,
) -> str:
    """
    Extract clinically relevant context from FHIR R4 bundle
    for injection alongside retrieved knowledge.
    PHI is authorized for use within the clinical application boundary.
    """
    context_parts = []

    # Active allergies — critical safety context
    allergies = extract_allergies(patient_fhir_bundle)
    if allergies:
        context_parts.append(
            f"ALLERGIES (verify before any medication order): "
            f"{', '.join(a['substance'] + ' → ' + a['reaction'] for a in allergies)}"
        )

    # Active medications — drug interaction context
    medications = extract_active_medications(patient_fhir_bundle)
    if medications:
        context_parts.append(
            f"ACTIVE MEDICATIONS: {', '.join(m['name'] + ' ' + m['dose'] for m in medications)}"
        )

    # Relevant diagnoses — indication context
    diagnoses = extract_active_diagnoses(patient_fhir_bundle)
    if diagnoses:
        context_parts.append(
            f"ACTIVE DIAGNOSES: {', '.join(d['display'] for d in diagnoses)}"
        )

    return "\n".join(context_parts)
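
The function above relies on helpers such as `extract_allergies` that are not shown. A possible sketch of that one helper follows, assuming a standard FHIR R4 Bundle whose entries include AllergyIntolerance resources; the helper's behavior (treating a missing clinical status as not active, falling back to "unknown" and "unspecified") is an illustrative assumption, not the source's implementation.

```python
def _concept_text(concept: dict) -> str:
    """Prefer CodeableConcept.text, then the first coding display, else 'unknown'."""
    if concept.get("text"):
        return concept["text"]
    for coding in concept.get("coding", []):
        if coding.get("display"):
            return coding["display"]
    return "unknown"


def extract_allergies(bundle: dict) -> list[dict[str, str]]:
    """Collect active AllergyIntolerance entries as {'substance', 'reaction'} dicts."""
    allergies: list[dict[str, str]] = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "AllergyIntolerance":
            continue
        status_codes = [
            coding.get("code")
            for coding in resource.get("clinicalStatus", {}).get("coding", [])
        ]
        if "active" not in status_codes:
            continue  # assumption: skip resolved, inactive, or unstated status
        manifestations = [
            _concept_text(manifestation)
            for reaction in resource.get("reaction", [])
            for manifestation in reaction.get("manifestation", [])
        ]
        allergies.append({
            "substance": _concept_text(resource.get("code", {})),
            "reaction": ", ".join(manifestations) or "unspecified",
        })
    return allergies
```

The returned keys match what `build_patient_context_for_rag` reads (`a['substance']` and `a['reaction']`); the medication and diagnosis helpers would follow the same pattern over MedicationRequest and Condition resources.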

This patient context injection enables answers like: "Azithromycin is first-line for atypical community-acquired pneumonia. However, the patient has an active QT prolongation diagnosis and is on methadone — azithromycin is contraindicated in this patient due to additive QT prolongation risk. Doxycycline would be an alternative."

Without patient context, the RAG system can only answer the general question. With patient context, it can answer the patient-specific question — the difference between a reference tool and clinical decision support.

Trade-offs and Considerations

RAG Failure Modes

Failure Mode	Cause	Detection	Mitigation
Wrong retrieval	Low relevance threshold, poor embeddings	RAGAS context recall metric	Better chunking, domain-specific embedder
Ignored retrieval	LLM uses parametric memory instead of context	RAGAS faithfulness metric	Stricter system prompt: "only use provided context"
Hallucinated citations	LLM fabricates source attribution	Citation verification pipeline	Validate that cited source contains claimed content
Context too long	Top-k too large, context budget exceeded	Monitor context window usage	Reduce top_k; use re-ranker to select best 5 not top 20
Stale knowledge	Index not updated when documents change	Monitor document update timestamps	Incremental re-indexing pipeline with change detection
PHI leakage	Clinical chunks contain patient identifiers	PHI scanner on ingested documents	De-identify before indexing OR enforce strict access control on collection

Advanced RAG Patterns

Corrective RAG (CRAG): After initial retrieval, the LLM grades the relevance of each retrieved document. If retrieval quality is low, it triggers web search or a broader retrieval before generating.

Self-RAG: The LLM decides whether retrieval is needed at all and, after generation, assesses whether the generated text is supported by the retrieved evidence. It does this by emitting special reflection tokens (in the original paper: Retrieve, IsRel, IsSup, and IsUse).

RAG Fusion: Generate several rephrased versions of the query, run each against the retrieval system, then fuse the ranked result lists using Reciprocal Rank Fusion (RRF). Increases recall at the cost of retrieval latency.

These patterns improve quality but add complexity. Implement them only after the baseline RAG system has been evaluated and found insufficient.
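
Of the three, the fusion step in RAG Fusion is the easiest to show concretely. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in, with ranks starting at 1. The constant k = 60 is the conventional default; the function name and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses only ranks, it can combine result lists whose raw scores are not comparable, such as one list per rephrased query or one dense and one sparse list.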

Comparison Table

Retrieval Strategy	Pros	Cons	Best For
Dense-only (vector search)	Simple, captures semantics	Misses exact clinical term matches	General clinical questions
Sparse-only (BM25)	Exact match, drug names/ICD codes	No semantic understanding	Formulary lookup, code search
Hybrid (dense + sparse)	Best of both	Slightly more complex	Production clinical RAG
Hybrid + Re-ranking	Highest precision	50–100ms latency overhead	High-stakes clinical decisions
HyDE	Improves recall for complex queries	Extra LLM call for hypothesis	Complex diagnostic questions

Interview Questions

Q1: Design a RAG system for a hospital's clinical decision support platform

Category: System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

I'd begin by clarifying scope: what clinical questions does this answer? What's the scale (concurrent users, query volume)? What's the latency SLA? What EHR system is in use? I'll assume: clinical staff Q&A over institutional guidelines and formulary, 300 concurrent users, p95 < 3 seconds, Epic EHR.

Architecture: Two separate pipelines — indexing (offline) and query (real-time). For indexing: document ingest from an EHR's FHIR DocumentReference API and document management systems (SharePoint, Confluence, or equivalent) → format-specific parsers → semantic chunking at 512 tokens with 64 token overlap → embedding with a domain-appropriate clinical embedding model (e.g., voyage-medical-2 or equivalent — verify current availability) → storage in a vector database (Pinecone, Weaviate, or equivalent) with metadata in PostgreSQL (section hierarchy, document type, last updated, access level).

For the query pipeline: a clinician submits a question through Claude Desktop or a web interface → query classification (diagnostic vs. pharmacology vs. procedure vs. administrative) routes to the appropriate vector-store collection → hybrid retrieval (dense + BM25) retrieves top-20 candidates → cross-encoder re-ranks to top-8 → patient context from Epic FHIR API appended → Claude Opus generates response citing sources → citation validation confirms each citation's source actually contains the cited content → response with citations delivered.

HIPAA architecture: the entire pipeline runs in the hospital's AWS account with a Bedrock endpoint. No PHI leaves the hospital's VPC. Access control enforced at the collection level — payer policy documents accessible only to care coordinators, not to all clinical staff.

Evaluation: RAGAS metrics run weekly against a 200-question golden dataset (built with the clinical informatics team), with targets of faithfulness > 0.85, context recall > 0.80, and answer relevancy > 0.75. An alert fires if any metric drops by more than 10%.
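
The evaluation gate can be sketched as a small check over the weekly metric values. The thresholds come from the answer above; the dictionary keys, the function name, and the reading of "drops by more than 10%" as a relative drop against the previous run are assumptions.

```python
# Targets from the answer above; key names are illustrative.
THRESHOLDS: dict[str, float] = {
    "faithfulness": 0.85,
    "context_recall": 0.80,
    "answer_relevancy": 0.75,
}


def evaluation_alerts(
    current: dict[str, float],
    previous: dict[str, float],
    thresholds: dict[str, float] = THRESHOLDS,
    max_relative_drop: float = 0.10,
) -> list[str]:
    """Return alert messages for metrics below target or sharply down from the last run."""
    alerts: list[str] = []
    for metric, minimum in thresholds.items():
        value = current[metric]
        if value <= minimum:
            alerts.append(f"{metric}={value:.2f} is not above the {minimum:.2f} target")
        baseline = previous.get(metric)
        if baseline and (baseline - value) / baseline > max_relative_drop:
            alerts.append(
                f"{metric} dropped more than {max_relative_drop:.0%} from {baseline:.2f}"
            )
    return alerts
```

An empty list means the run passed; anything else would typically page the owning team or block a deployment.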

Key Points to Hit:

Separate indexing and query pipelines
Hybrid retrieval (dense + sparse) — not just vector search
Cross-encoder re-ranking for quality
Patient context from EHR for patient-specific answers
HIPAA-compliant architecture (within VPC, BAA)
RAGAS evaluation pipeline with specific metrics and thresholds
Citation validation to prevent fabricated source attribution

Q2: What causes "hallucinated citations" in RAG systems and how do you prevent them?

Category: Technical Depth | Difficulty: Senior | Role: AI Architect

Answer Framework:

A hallucinated citation occurs when the LLM attributes a factual claim to a retrieved source that either doesn't exist or doesn't contain the claimed information. This is distinct from factual hallucination — it is hallucination specifically of the attribution layer.

The cause: LLMs are trained to produce coherent, well-structured text. When instructed to cite sources, they are rewarded for producing citations that look credible. If the actual retrieved sources are not explicitly available in the context, the model will generate plausible-sounding citations from its parametric memory — which may be fabricated. Even when the retrieved sources are available, the model may attribute a claim to the wrong source within the context if the sources are numerous or similar.

Prevention has two layers: structural and validation. Structurally, the prompt should present the retrieved sources with numbered identifiers ([Source 1], [Source 2]) and instruct the model to cite only these identifiers. This does not make a citation to a non-existent source impossible, but it makes one trivially detectable: there is no [Source 17] if only 8 chunks were retrieved.

Validation: a post-generation step verifies each citation. For each claim-citation pair, retrieve the content of the cited source and use a lightweight model to verify that the cited text actually supports the claim. If it doesn't, either regenerate with stricter instructions or flag for human review. This verification step adds ~200–500ms but is non-negotiable for clinical applications where incorrect attribution could misdirect clinical decision-making.

Summary

Key Takeaways:

RAG grounds LLM responses in up-to-date, institution-specific knowledge; it addresses the knowledge cutoff and limited parametric capacity of LLMs, and it reduces (but does not eliminate) hallucination in enterprise domains
The indexing pipeline (parse → chunk → embed → store) runs offline; the query pipeline (retrieve → rerank → augment → generate) runs at inference time
Hybrid retrieval (dense semantic + sparse BM25) outperforms either alone for clinical text — use both
Cross-encoder re-ranking is the highest-ROI quality improvement for mature RAG systems
Patient context injection from EHR elevates RAG from reference tool to patient-specific clinical decision support
RAG failure mode #1: hallucinated citations — prevent with structured citation format + post-generation validation
Evaluate RAG continuously with RAGAS: faithfulness, context recall, answer relevancy — don't guess at quality

The one thing to remember: RAG is a retrieval problem as much as an LLM problem. The quality of the answer is bounded by the quality of retrieval. Every engineering hour spent on chunking, embedding model selection, and hybrid retrieval pays dividends in answer quality that prompt engineering alone cannot buy.

