Fine-Tuning vs RAG

Section: 01-AI-Foundations Status: COMPLETE Last Updated: 2026-06-30 Difficulty: Intermediate

Executive Summary

Fine-tuning and Retrieval-Augmented Generation (RAG) solve different problems. Fine-tuning changes how a model behaves; RAG changes what a model knows at inference time. Choosing between them — or combining them — is one of the most consequential architectural decisions in enterprise AI system design. This chapter provides a rigorous decision framework for clinical and enterprise contexts where both the cost of wrong choices and the stakes of AI output quality are high.

Prerequisites

LLM Fundamentals — Training and inference mechanics
RAG — RAG architecture and capabilities

Core Concepts

The Fundamental Distinction

Fine-tuning updates the model's weights — it bakes knowledge or behavior directly into the model's parameters. The knowledge becomes part of the model and is available at inference time without any retrieval step.

RAG keeps the model's weights unchanged and instead injects knowledge into the context window at inference time by retrieving it from an external store.

This distinction drives every other difference between the two approaches:

Property	Fine-Tuning	RAG
Knowledge storage	Model weights (parametric)	External vector store (non-parametric)
Knowledge update	Requires retraining	Update the index
Knowledge traceability	Opaque (baked into weights)	Transparent (source in context)
Out-of-distribution behavior	Can hallucinate with high confidence	Can be prompted to answer "not in context" when retrieval finds nothing relevant
Inference cost	Same as base model	Slightly higher (retrieval step + larger context)
Training cost	High (GPU hours + data curation)	Low (indexing pipeline)
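
The "Knowledge update" row can be made concrete with a toy in-memory index. This is a minimal sketch under stated assumptions: `ToyIndex`, its word-overlap scoring, and the sample documents are all hypothetical, and a production system would use an embedding model and a vector store instead. What it illustrates is that overwriting a document changes what the model sees at inference time without touching any weights.

```python
import re
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class ToyIndex:
    """Hypothetical in-memory index keyed by document ID."""
    docs: dict[str, str] = field(default_factory=dict)

    def upsert(self, doc_id: str, text: str) -> None:
        # Updating knowledge is a write to the index, not a training run.
        self.docs[doc_id] = text

    def retrieve(self, query: str, k: int = 1) -> list[tuple[str, str]]:
        # Rank documents by the number of words they share with the query.
        q = _tokens(query)
        ranked = sorted(
            self.docs.items(),
            key=lambda item: len(q & _tokens(item[1])),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(index: ToyIndex, question: str) -> str:
    """Inject retrieved text, tagged with its source ID, into the prompt."""
    context = "\n".join(
        f"[{doc_id}] {text}" for doc_id, text in index.retrieve(question)
    )
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"


index = ToyIndex()
index.upsert("formulary", "Drug A is on formulary tier 2.")
index.upsert("sepsis-protocol", "Sepsis protocol requires lactate measurement.")
question = "Which formulary tier is Drug A?"
before = build_prompt(index, question)

# A formulary update: overwrite the document; the model itself is untouched.
index.upsert("formulary", "Drug A is on formulary tier 3.")
after = build_prompt(index, question)
```

The `[formulary]` tag carried into the prompt is also what makes the "Knowledge traceability" row possible: the source travels with the text.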

When Fine-Tuning Is the Right Choice

Fine-tuning is the correct choice when the problem is behavioral, not informational:

Case 1: Output format conformance. A clinical documentation system requires all AI output to follow the SOAP note format (Subjective, Objective, Assessment, Plan) with specific section headers, character limits, and template fields required by the HMS EHR system. Prompt engineering alone produces inconsistent adherence. Fine-tuning on 500–1,000 examples of correct SOAP notes can produce a model that reliably generates output in the required format without extensive prompt engineering.

Case 2: Domain-specific tone and style. A patient engagement chatbot should respond with the warmth, plain language, and cultural sensitivity appropriate for the patient population of a specific hospital system. This is a behavioral property that is difficult to specify fully in a prompt but can be learned from examples.

Case 3: Latency-critical tasks requiring small models. A real-time clinical alert classification system must classify alerts in <100ms. A latency budget this tight generally rules out large models. Fine-tuning a small model (3B-7B parameters) on the classification task can produce accuracy competitive with a much larger model on this specific task, at a fraction of the latency and cost.

Case 4: Proprietary domain knowledge that must not be sent to external APIs. A hospital system with legal concerns about sending any clinical content to third-party APIs may choose to fine-tune an open-source model (Llama, Mistral) on internal data, hosting it on-premises. This is primarily a regulatory and legal decision, not a technical one.

When RAG Is the Right Choice

RAG is the correct choice when the problem is informational — the model needs to know specific facts, documents, or data that:

Changes frequently (drug formulary updates quarterly, guidelines updated annually)
Is institution-specific (hospital protocols, local formulary, payer policies)
Must be attributable to source (for clinical liability, every recommendation must cite its source)
Is too voluminous to fit in training data effectively (10,000 clinical guidelines cannot all be fine-tuned with equal emphasis)
Must be auditable after the fact (compliance requires knowing which document an AI recommendation was based on)
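
The attribution and audit items in the list above come down to recording which documents were in the context window when a response was generated. The sketch below shows one possible shape for such a record. The `AuditRecord` fields, the helper name, and the sample IDs are assumptions for illustration, not a regulatory schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry linking one response to its sources."""
    question: str
    source_ids: list[str]    # IDs of documents placed in the context window
    response_sha256: str     # hash of the response text, not the text itself
    created_at: str          # ISO 8601 timestamp in UTC


def make_audit_record(
    question: str,
    retrieved: list[tuple[str, str]],
    response: str,
) -> AuditRecord:
    """Build an audit entry from (doc_id, text) retrieval results."""
    return AuditRecord(
        question=question,
        source_ids=[doc_id for doc_id, _ in retrieved],
        response_sha256=hashlib.sha256(response.encode("utf-8")).hexdigest(),
        created_at=datetime.now(timezone.utc).isoformat(),
    )


record = make_audit_record(
    question="Is Drug A covered?",
    retrieved=[("formulary-v7", "Drug A is on formulary tier 3.")],
    response="Drug A is covered at tier 3 [formulary-v7].",
)
audit_line = json.dumps(asdict(record))  # one JSON line per response
```

Storing a hash rather than the response text keeps the log small and avoids duplicating clinical content; whether that trade-off is acceptable depends on the organization's retention policy.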

The Hybrid Approach

The most capable production clinical AI systems use both:

Fine-tuned model: Trained on clinical documentation examples to produce output in the correct format, with appropriate tone, and following clinical safety conventions
RAG system: Retrieves institution-specific clinical knowledge (guidelines, formulary, protocols) that grounds the fine-tuned model's responses in current, local knowledge

text

Clinical Question
      │
      ▼
  RAG Retrieval
  (current guidelines + patient context)
      │
      ▼
Fine-tuned Claude
  (knows clinical format + safety behavior)
      │
      ▼
Structured Clinical Response
(correct format + grounded in retrieved knowledge)

Architecture

Fine-Tuning Pipeline

mermaid

graph TD
    %% Title: Clinical Fine-Tuning Pipeline
    subgraph "Data Preparation"
        A["Raw Clinical\nExamples"] -->|"Clinical informatics\nreview"| B["Curated Training Set\n(500-5000 examples)"]
        B --> C["Train/Val/Test Split\n(70/15/15)"]
    end
    subgraph "Fine-Tuning"
        C -->|"Training set"| D["Base Model\n(Claude / Llama)"]
        D -->|"SFT training"| E["Fine-tuned Model"]
        C -->|"Validation set"| F["Evaluation\n(loss + clinical metrics)"]
        F -->|"Feedback"| D
    end
    subgraph "Evaluation & Deployment"
        E --> G["Held-out test\nevaluation"]
        G -->|"Clinical review\nof outputs"| H{Meets quality\nthreshold?}
        H -->|"Yes"| I["Deploy to\nproduction"]
        H -->|"No"| J["More data\nor iteration"]
        J --> B
    end
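
The 70/15/15 split in the data preparation stage can be sketched as follows. This is a minimal illustration that assumes examples are independent records; `split_dataset` and its parameters are hypothetical names, not part of any library.

```python
import random


def split_dataset(
    examples: list[dict],
    seed: int = 42,
    train_frac: float = 0.70,
    val_frac: float = 0.15,
) -> tuple[list[dict], list[dict], list[dict]]:
    """Shuffle deterministically, then cut into train/val/test partitions."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must leave room for a test set")
    shuffled = examples[:]                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # local RNG: reproducible, no global state
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],         # remainder becomes the held-out test set
    )


examples = [{"id": i} for i in range(1000)]
train, val, test = split_dataset(examples)
```

With clinical notes, several examples often come from the same patient. In that case it is usually safer to split by patient ID rather than by note so that the test set does not leak information seen in training.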

Decision Framework

mermaid

flowchart TD
    Start([New AI Use Case]) --> Q1{Is the knowledge\ndynamic / updated\nfrequently?}
    Q1 -->|Yes| RAG[Use RAG\nKnowledge changes → update the index]
    Q1 -->|No| Q2{Does every response\nneed source attribution?}
    Q2 -->|Yes| RAG
    Q2 -->|No| Q3{Is this a behavioral\nproblem: format, tone,\nstyle, safety?}
    Q3 -->|Yes| Q4{Is latency\ncritical <200ms?}
    Q3 -->|No| Q5{Is the knowledge\ncorpus too large\nfor fine-tuning?}
    Q4 -->|Yes| FT_Small[Fine-tune a small\nfast model]
    Q4 -->|No| FT_or_Prompt{Can prompt engineering\nsolve it?}
    FT_or_Prompt -->|Yes| Prompt[Use prompt engineering\nNo fine-tuning needed]
    FT_or_Prompt -->|No| FT[Fine-tune the model]
    Q5 -->|Yes| RAG
    Q5 -->|No| Hybrid[Hybrid: RAG + Fine-tuning]
    style RAG fill:#4CAF50,color:#fff
    style FT fill:#2196F3,color:#fff
    style FT_Small fill:#2196F3,color:#fff
    style Prompt fill:#FF9800,color:#fff
    style Hybrid fill:#9C27B0,color:#fff
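
The flowchart above can be transcribed into a small function, which makes the branch order explicit and easy to unit test. The `UseCase` field names and the returned labels are hypothetical; only the branching logic comes from the diagram.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Answers to the flowchart questions; field names are hypothetical."""
    knowledge_is_dynamic: bool                 # Q1
    needs_source_attribution: bool             # Q2
    is_behavioral_problem: bool                # Q3: format, tone, style, safety
    latency_critical: bool = False             # Q4: budget under roughly 200 ms
    prompt_engineering_suffices: bool = False
    corpus_too_large_for_ft: bool = False      # Q5


def recommend(uc: UseCase) -> str:
    """Walk the decision flowchart top to bottom and return the leaf label."""
    if uc.knowledge_is_dynamic or uc.needs_source_attribution:
        return "RAG"
    if uc.is_behavioral_problem:
        if uc.latency_critical:
            return "Fine-tune a small fast model"
        if uc.prompt_engineering_suffices:
            return "Prompt engineering"
        return "Fine-tune"
    if uc.corpus_too_large_for_ft:
        return "RAG"
    return "Hybrid: RAG + fine-tuning"


# Example: a stable formatting problem that prompting could not solve.
soap_format = UseCase(
    knowledge_is_dynamic=False,
    needs_source_attribution=False,
    is_behavioral_problem=True,
)
decision = recommend(soap_format)
```

Note that the first two questions short-circuit everything else: a use case with dynamic knowledge or an attribution requirement is routed to RAG before the behavioral questions are asked.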

Implementation Patterns

Pattern 1: Evaluating Whether Fine-Tuning Is Needed

Before committing to fine-tuning, establish whether prompt engineering alone is sufficient:

python

"""
A/B evaluation: prompt engineering vs. fine-tuning baseline.
Run this before investing in fine-tuning to determine if it's necessary.

Context: HMS scenario — evaluating whether SOAP note output format
can be achieved through prompt engineering alone.
"""
from dataclasses import dataclass

import anthropic

@dataclass
class EvaluationResult:
    approach: str
    format_compliance_rate: float
    avg_output_tokens: float
    avg_latency_ms: float
    clinical_quality_score: float  # 1-5, rated by clinical informatics


SOAP_ZERO_SHOT_SYSTEM = """You are a clinical documentation assistant.
Generate SOAP notes for physician review."""

SOAP_DETAILED_SYSTEM = """You are a clinical documentation assistant for a Reference Healthcare Organization.
Educational Example — Illustrative Workflow. Not intended for clinical decision making.

Generate SOAP notes in EXACTLY this format:

SUBJECTIVE:
[Chief complaint and history of present illness, 2-4 sentences]

OBJECTIVE:
Vitals: [Temperature: X°F | BP: X/X | HR: X | RR: X | SpO2: X%]
Physical Exam: [Relevant findings only, system by system]
Labs/Imaging: [Relevant results with values and reference ranges]

ASSESSMENT:
[Primary and secondary diagnoses with ICD-10 codes in parentheses]

PLAN:
[Numbered list: medications, procedures, consults, follow-up]

PHYSICIAN REVIEW REQUIRED | Generated by HMS AI Assistant"""


def evaluate_prompt_approach(
    test_cases: list[dict],
    system_prompt: str,
    approach_name: str,
    client: anthropic.Anthropic,
) -> EvaluationResult:
    """Evaluate a system prompt approach against test cases."""
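    if not test_cases:
        # Guard against ZeroDivisionError when computing the averages below
        raise ValueError("test_cases must not be empty")
    format_compliant = 0
    total_tokens = 0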

    for case in test_cases:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": case["input"]}],
        )
        output = response.content[0].text
        total_tokens += response.usage.output_tokens

        # Check for required SOAP sections
        required_headers = ["SUBJECTIVE:", "OBJECTIVE:", "ASSESSMENT:", "PLAN:"]
        if all(h in output for h in required_headers):
            format_compliant += 1

    return EvaluationResult(
        approach=approach_name,
        format_compliance_rate=format_compliant / len(test_cases),
        avg_output_tokens=total_tokens / len(test_cases),
        avg_latency_ms=0,  # Measure separately
        clinical_quality_score=0,  # Requires clinical reviewer
    )
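
The header check in Pattern 1 only confirms that each header string appears somewhere in the output. A stricter check, sketched below, also catches duplicated and out-of-order sections. `check_soap_format` is a hypothetical helper, and the rule that headers must start a line is an assumption about the target template.

```python
REQUIRED_HEADERS = ["SUBJECTIVE:", "OBJECTIVE:", "ASSESSMENT:", "PLAN:"]


def check_soap_format(output: str) -> list[str]:
    """Return a list of format problems; an empty list means compliant."""
    problems: list[str] = []
    positions: list[int] = []
    lines = [line.strip() for line in output.splitlines()]
    for header in REQUIRED_HEADERS:
        # Only count a header when it starts a line, not when it is quoted mid-sentence.
        matches = [i for i, line in enumerate(lines) if line.startswith(header)]
        if not matches:
            problems.append(f"missing section: {header}")
        elif len(matches) > 1:
            problems.append(f"duplicate section: {header}")
        else:
            positions.append(matches[0])
    # Ordering is only meaningful once every section is present exactly once.
    if not problems and positions != sorted(positions):
        problems.append("sections out of order")
    return problems


good_note = "\n".join([
    "SUBJECTIVE:", "Cough for three days.",
    "OBJECTIVE:", "Vitals: within normal limits.",
    "ASSESSMENT:", "Viral upper respiratory infection.",
    "PLAN:", "1. Supportive care.",
])
swapped_note = "\n".join([
    "SUBJECTIVE:", "Cough for three days.",
    "ASSESSMENT:", "Viral upper respiratory infection.",
    "OBJECTIVE:", "Vitals: within normal limits.",
    "PLAN:", "1. Supportive care.",
])
```

Returning a list of problems rather than a boolean makes it possible to report which failure modes dominate, which helps decide whether a prompt change or fine-tuning is the better remedy.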

Pattern 2: Fine-Tuning Data Preparation

python

"""
Prepare a fine-tuning dataset as chat-style JSONL (verify the exact schema with your provider).
Context: 1,000 clinical encounter examples → SOAP note training pairs.
Educational Example — in a real deployment, each example would be generated by
attending physicians and reviewed by clinical informatics before inclusion.
Not intended for clinical decision making.
"""
import json
from pathlib import Path


def prepare_fine_tuning_jsonl(
    source_examples: list[dict],
    output_path: Path,
    system_prompt: str,
) -> None:
    """
    Convert clinical examples to chat-style fine-tuning JSONL.
    Each example: {"input": "encounter notes", "output": "soap note"}

    The record layout (top-level "system" plus alternating user/assistant
    messages) is an assumption; confirm the schema your provider expects.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with output_path.open("w", encoding="utf-8") as f:
        for example in source_examples:
            training_record = {
                # System prompt is kept out of the message list so that roles
                # alternate user -> assistant with no repeated user turn.
                "system": system_prompt,
                "messages": [
                    {"role": "user", "content": example["input"]},
                    {"role": "assistant", "content": example["output"]},
                ],
            }
            f.write(json.dumps(training_record) + "\n")

    print(f"Prepared {len(source_examples)} training examples → {output_path}")
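
Before submitting a training file, it is worth validating every record, since a single malformed line can fail an entire job. The sketch below checks the structural rules common to chat-style formats: messages alternate between user and assistant, start with the user, end with the assistant, and have non-empty content. `validate_record` and `validate_jsonl` are hypothetical helpers, and the exact rules enforced by a given provider may differ.

```python
import json
from collections.abc import Iterable


def validate_record(record: object) -> list[str]:
    """Return structural problems for one training record (empty if none)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages") or []
    if not messages:
        return ["no messages"]
    problems: list[str] = []
    expected = "user"
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role != expected:
            problems.append(f"message {i}: expected role {expected!r}, got {role!r}")
        if not str(msg.get("content", "")).strip():
            problems.append(f"message {i}: empty content")
        # Base the next expectation on the role actually seen.
        expected = "assistant" if role == "user" else "user"
    if messages[-1].get("role") != "assistant":
        problems.append("last message must be from the assistant")
    return problems


def validate_jsonl(lines: Iterable[str]) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; clean lines are omitted."""
    report: dict[int, list[str]] = {}
    for line_no, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[line_no] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_record(record)
        if problems:
            report[line_no] = problems
    return report
```

In practice, `validate_jsonl` could be run over `output_path.open(encoding="utf-8")` immediately after the file is written.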

Healthcare Context — HMS Scenario

HMS Fine-Tuning Projects (Evaluated)

Project 1: SOAP Note Format Compliance (Recommended)

Problem: Baseline Claude produces SOAP notes with inconsistent section ordering and missing required fields
Training data: 800 physician-authored SOAP notes reviewed by clinical informatics
Result: Format compliance improved from 72% (prompt engineering) to 97% (fine-tuned)
Decision: Fine-tune — this is a pure format/behavior problem; few-shot examples in prompt did not achieve sufficient compliance rate

Project 2: Clinical Coding Assistance (RAG Preferred)

Problem: AI suggestions for ICD-10 codes based on clinical documentation
Considered: fine-tuning on historical coding data
Concern: ICD-10 codes update annually and coding conventions change with payer guidance, so fine-tuned knowledge would be stale within 12 months
Decision: RAG — current ICD-10 codebook + payer coding guidelines indexed as a vector store; model retrieves current codes at inference time

Project 3: Prior Auth Determination (Hybrid)

Problem: AI-assisted prior authorization determination for common payer criteria
Fine-tuning component: Trains the model to reason through prior auth criteria in a structured way (behavioral)
RAG component: Retrieves current payer-specific criteria (informational, changes with every payer contract update)
Decision: Hybrid — fine-tune for reasoning structure; RAG for current payer policies

Clinical Data Requirements for Fine-Tuning

Fine-tuning clinical models introduces HIPAA obligations:

Training data containing PHI must be used either under a HIPAA Authorization or under one of the treatment/operations exceptions
The fine-tuning process itself must occur within a HIPAA-compliant environment (within the hospital's cloud tenant)
Fine-tuned models trained on PHI are arguably themselves PHI-adjacent and must be protected accordingly
De-identified training data is strongly preferred — remove all 18 HIPAA identifiers before using clinical notes in training
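
To give a sense of what a scrubbing step looks like mechanically, the sketch below redacts four identifier types with regular expressions. It is deliberately incomplete: the patterns and the `redact` helper are illustrative assumptions, they cover only a few of the 18 identifier categories, and pattern matching alone cannot find names or free-text locations. A real pipeline would need a validated de-identification tool and human review.

```python
import re

# Illustrative patterns for a handful of identifier types only.
# Names, addresses, record numbers, and other categories are NOT covered.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Pt seen 03/14/2024. Call 555-123-4567 or jane.doe@example.org. SSN 123-45-6789."
scrubbed = redact(note)
```

Replacing identifiers with typed placeholders, rather than deleting them, preserves sentence structure so the note remains usable as a training example.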

Trade-offs and Considerations

Total Cost of Ownership Comparison

Note on cost figures: Specific pricing is not quoted here because AI infrastructure costs change frequently. Verify current rates for embedding APIs, vector database hosting, and fine-tuning compute in official vendor documentation. The structural cost comparison below is the durable insight.

RAG (one-time setup + ongoing):

Initial: document ingestion, embedding, and indexing costs scale with corpus size
Ongoing: incremental index updates + vector store hosting (cloud-managed options available from all major providers)
No GPU infrastructure required
No data curation labor beyond document quality review

Fine-tuning (high upfront, low ongoing):

Data curation: 500–5,000 high-quality examples × clinical review labor (this is typically the dominant cost — the human review, not the compute)
Training compute: per fine-tuning run via API fine-tuning (consult provider documentation for current rates)
Evaluation: clinical review of test set outputs (often requires specialized clinical informatics staff)
Re-training on knowledge updates: recurring cost that can accumulate
The total for a single clinical fine-tuning project is typically substantially higher than for RAG setup

For most enterprise clinical AI use cases, RAG has dramatically lower TCO and higher knowledge freshness. Fine-tuning is justified only when the behavioral improvement cannot be achieved through prompt engineering and the use case is stable enough to amortize the training investment.
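
The structural comparison can be expressed as a simple cumulative cost model. The numbers below are unitless placeholders chosen only to show the shape of the curves, not real prices, and `cumulative_cost` is a hypothetical helper. The point is that recurring knowledge updates dominate fine-tuning cost, while a stable behavioral use case with no retraining narrows the gap.

```python
def cumulative_cost(
    upfront: float,
    monthly: float,
    months: int,
    per_update: float = 0.0,
    updates_per_year: float = 0.0,
) -> float:
    """Total cost after `months`: upfront + hosting + knowledge-update work."""
    updates = updates_per_year * months / 12
    return upfront + monthly * months + per_update * updates


# Placeholder figures (arbitrary units), evaluated over a 24-month horizon.
rag = cumulative_cost(upfront=10, monthly=2, months=24,
                      per_update=0.5, updates_per_year=4)

# Fine-tuning used to hold knowledge: every update means a retraining run.
ft_knowledge = cumulative_cost(upfront=60, monthly=1, months=24,
                               per_update=20, updates_per_year=4)

# Fine-tuning used for stable behavior: no retraining after the first run.
ft_behavior = cumulative_cost(upfront=60, monthly=1, months=24)
```

Substituting an organization's own estimates for the placeholders turns this into a quick sanity check on whether a use case is stable enough to amortize the training investment.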

Fine-Tuning Risks

Risk	Description	Mitigation
Catastrophic forgetting	Fine-tuning on a narrow dataset degrades performance on general tasks	Use low learning rates; evaluate on general benchmarks post-tuning
Training data poisoning	Malicious examples in training data can embed adversarial behaviors	Human review of all training examples before fine-tuning
Knowledge staleness	Fine-tuned knowledge becomes outdated	Hybrid approach: fine-tune for behavior, RAG for knowledge
Hallucination amplification	Fine-tuning can make hallucination more confident	Never fine-tune on factually incorrect examples; rigorous evaluation
Overfitting	Too few examples → model memorizes rather than generalizes	Minimum 200 examples per class; use validation loss to detect
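
The "use validation loss to detect" mitigation in the overfitting row is commonly implemented as early stopping. The sketch below is a minimal version under assumed inputs: `best_checkpoint` is a hypothetical helper, and the loss values are made up for illustration.

```python
def best_checkpoint(val_losses: list[float], patience: int = 2) -> int:
    """Return the epoch index to keep.

    Training is treated as stopped once validation loss has failed to
    improve for `patience` consecutive epochs; later epochs are ignored.
    """
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss is rising: likely memorization
    return best_epoch


# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [1.20, 0.95, 0.81, 0.84, 0.90, 0.70]
keep = best_checkpoint(val_losses)
```

With `patience=2` the loop stops at epoch 4 and never sees the later dip at epoch 5, which is the usual trade-off: a larger patience risks wasted compute, a smaller one risks stopping too early.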

Comparison Table

Dimension	Prompt Engineering	RAG	Fine-Tuning	Hybrid (RAG + FT)
Time to production	Days	Weeks	Months	Months
Knowledge freshness	Static (prompt)	As current as the index	Frozen at training time	As current as the index
Behavioral control	Moderate	Moderate	High	Very High
Source attribution	Manual	Natural	None	Natural
Training cost	None	None	High	High
Inference cost	Baseline	Slightly higher	Baseline	Slightly higher
Clinical use case fit	Simple tasks	Knowledge Q&A	Format/style	Complex clinical AI

Interview Questions

Q1: A healthcare AI company wants to build a clinical note generation system. Should they fine-tune their model or use RAG? What additional information do you need?

Category: Architecture / System Design Difficulty: Senior Role: AI Architect

Answer Framework:

My default position is to use prompt engineering first, then RAG, then fine-tuning — in that order — because each step increases cost and complexity substantially. But let me ask the clarifying questions that would change this answer:

What specifically is failing with the baseline model? If the problem is that the model doesn't know the hospital's specific formulary restrictions → RAG. If the model doesn't know the latest clinical guidelines → RAG. If the model doesn't produce output in the required clinical documentation format → try few-shot prompt engineering first; if compliance rate is still insufficient (below ~90%) → fine-tuning. If both knowledge and format are problems → hybrid.

How stable is the target knowledge? Drug formularies can change quarterly, clinical guidelines are typically revised annually, and ICD-10 codes update annually. Any of these should live in RAG, not fine-tuning. Only truly stable, behavioral properties (output format, tone, reasoning structure, safety constraints) are appropriate for fine-tuning.

Do we need source attribution? If every clinical recommendation must cite the guideline it came from (required for liability and regulatory compliance in many healthcare AI contexts) → RAG is the only option. Fine-tuned knowledge cannot be attributed to a specific source.

What is the latency requirement? Real-time intra-encounter documentation (<500ms) may require a small fine-tuned model. Post-encounter documentation (batch, no SLA) can use a large RAG-augmented model.

In the typical HMS clinical note generation scenario, the right answer is hybrid: fine-tune a mid-tier Claude model on 500+ physician-authored note examples (for format and style), plus RAG retrieval of current clinical guidelines and patient-specific EHR context (for knowledge freshness and source attribution).

Summary

Key Takeaways:

Fine-tuning solves behavioral problems (format, style, reasoning structure); RAG solves knowledge problems (current facts, institution-specific data, attributable sources)
The default approach is: prompt engineering first → RAG if knowledge freshness/attribution needed → fine-tuning only if behavioral conformance cannot be achieved via prompting
Source attribution is often a hard requirement in clinical AI, and only RAG provides it naturally
The hybrid approach (fine-tune for behavior + RAG for knowledge) is the architecture of the most capable production clinical AI systems
Fine-tuning TCO is typically substantially higher than RAG for equivalent knowledge coverage; it is justified only for stable behavioral improvements

The one thing to remember: Fine-tuning teaches the model to act differently; RAG teaches the model to know more. Confusing the two leads to over-investment in fine-tuning for knowledge problems (expensive, stale, unattributable) or under-investment in behavioral fine-tuning for format problems (inconsistent, unreliable format).

