Fine-Tuning vs RAG
Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate
Executive Summary
Fine-tuning and Retrieval-Augmented Generation (RAG) solve different problems. Fine-tuning changes how a model behaves; RAG changes what a model knows at inference time. Choosing between them — or combining them — is one of the most consequential architectural decisions in enterprise AI system design. This chapter provides a rigorous decision framework for clinical and enterprise contexts where both the cost of wrong choices and the stakes of AI output quality are high.
Prerequisites
- LLM Fundamentals — Training and inference mechanics
- RAG — RAG architecture and capabilities
Core Concepts
The Fundamental Distinction
Fine-tuning updates the model's weights — it bakes knowledge or behavior directly into the model's parameters. The knowledge becomes part of the model and is available at inference time without any retrieval step.
RAG keeps the model's weights unchanged and instead injects knowledge into the context window at inference time by retrieving it from an external store.
This distinction drives every other difference between the two approaches:
| Property | Fine-Tuning | RAG |
|---|---|---|
| Knowledge storage | Model weights (parametric) | External vector store (non-parametric) |
| Knowledge update | Requires retraining | Update the index |
| Knowledge traceability | Opaque (baked into weights) | Transparent (source in context) |
| Out-of-distribution behavior | Can hallucinate with high confidence | Can be prompted to answer "not in context" when retrieval returns nothing relevant |
| Inference cost | Same as base model | Slightly higher (retrieval step + larger context) |
| Training cost | High (GPU hours + data curation) | Low (indexing pipeline) |
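To make the RAG column concrete, here is a minimal sketch of injecting retrieved knowledge into the context window. The `Passage` type, the word-overlap `retrieve` function, and the prompt wording are illustrative assumptions, not part of any specific library; a production system would typically rank passages by embedding similarity over a vector store.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source_id: str  # hypothetical document identifier
    text: str


def retrieve(query: str, store: list[Passage], top_k: int = 2) -> list[Passage]:
    """Toy retriever: rank passages by how many query words they share."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(p.text.lower().split())), p) for p in store
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop passages with no overlap so irrelevant text never reaches the prompt.
    return [p for score, p in ranked[:top_k] if score > 0]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Inject retrieved text into the prompt; the model weights never change."""
    if passages:
        context = "\n".join(f"[{p.source_id}] {p.text}" for p in passages)
    else:
        context = "(no relevant passages found)"
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above and cite source IDs."
    )
```

Updating what the model "knows" here means editing `store`, which is why the table lists "Update the index" rather than retraining.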
When Fine-Tuning Is the Right Choice
Fine-tuning is the correct choice when the problem is behavioral, not informational:
Case 1: Output format conformance. A clinical documentation system requires all AI output to follow the SOAP note format (Subjective, Objective, Assessment, Plan) with specific section headers, character limits, and template fields required by the HMS EHR system. Prompt engineering alone produces inconsistent adherence. Fine-tuning on 500–1,000 examples of correct SOAP notes can produce a model that reliably generates compliant output without extensive prompt engineering.
Case 2: Domain-specific tone and style. A patient engagement chatbot should respond with the warmth, plain language, and cultural sensitivity appropriate for the patient population of a specific hospital system. This is a behavioral property that is difficult to specify fully in a prompt but can be learned from examples.
Case 3: Latency-critical tasks requiring small models. A real-time clinical alert classification system must classify alerts in <100ms. A budget this tight usually rules out large general-purpose models and calls for a small, fast one. Fine-tuning a small model (3B-7B parameters) on the classification task can produce accuracy competitive with a much larger model on this specific task, at a fraction of the latency and cost.
Case 4: Proprietary domain knowledge that must not be sent to external APIs. A hospital system with legal concerns about sending any clinical content to third-party APIs may choose to fine-tune an open-source model (Llama, Mistral) on internal data, hosting it on-premises. This is primarily a regulatory and legal decision, not a technical one, and a self-hosted model can still be paired with an on-premises RAG index.
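For Case 3, the latency claim is something to measure rather than assume. The sketch below checks a candidate classifier against the 100ms budget using a nearest-rank percentile. The helper names and the choice of the 95th percentile are assumptions made for illustration; `classify` stands in for whatever model call is being evaluated.

```python
import math
import time
from typing import Callable


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def measure_latencies_ms(
    classify: Callable[[str], str], alerts: list[str]
) -> list[float]:
    """Time each classifier call and return per-call latency in milliseconds."""
    latencies = []
    for alert in alerts:
        start = time.perf_counter()
        classify(alert)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def meets_budget(
    latencies_ms: list[float], budget_ms: float = 100.0, pct: float = 95.0
) -> bool:
    """True when the chosen percentile stays under the latency budget."""
    return percentile(latencies_ms, pct) < budget_ms
```

Checking a high percentile instead of the mean matters for alerting workloads, because a few slow calls can breach the budget even when the average looks fine.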
When RAG Is the Right Choice
RAG is the correct choice when the problem is informational: the model needs specific facts, documents, or data that:
- Change frequently (drug formulary updates quarterly, guidelines updated annually)
- Are institution-specific (hospital protocols, local formulary, payer policies)
- Must be attributable to a source (for clinical liability, every recommendation must cite its source)
- Are too voluminous to fit in training data effectively (10,000 clinical guidelines cannot all be fine-tuned with equal emphasis)
- Must be auditable after the fact (compliance requires knowing which document an AI recommendation was based on)
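The attribution and auditability points above come down to recording which documents were in the context window for each answer. A minimal sketch of such a record follows; the `AuditRecord` fields and the `index_version` label are hypothetical names chosen for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    question: str
    answer: str
    source_ids: list[str]  # documents placed in the context window
    index_version: str     # hypothetical label for the index snapshot used
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_audit_line(record: AuditRecord) -> str:
    """Serialize one record as a JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)
```

A fine-tuned model offers no equivalent of `source_ids`, which is the practical meaning of "opaque" in the comparison table.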
The Hybrid Approach
Production clinical AI systems that need both behavioral control and current knowledge typically combine the two:
- Fine-tuned model: Trained on clinical documentation examples to produce output in the correct format, with appropriate tone, and following clinical safety conventions
- RAG system: Retrieves institution-specific clinical knowledge (guidelines, formulary, protocols) that grounds the fine-tuned model's responses in current, local knowledge
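The division of labor between the two components can be sketched as a small pipeline. Everything here is an assumption for illustration: the `Retriever` and `Generator` type aliases, the function name, and the stand-in callables that replace a real vector store and a real fine-tuned model.

```python
from typing import Callable

Retriever = Callable[[str], list[tuple[str, str]]]  # question -> [(source_id, text)]
Generator = Callable[[str], str]                    # prompt -> model output


def answer_with_hybrid(
    question: str, retrieve: Retriever, generate: Generator
) -> dict:
    """Retrieve current knowledge, then let the fine-tuned model shape the answer."""
    passages = retrieve(question)
    context = "\n".join(f"[{sid}] {text}" for sid, text in passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return {
        "response": generate(prompt),               # format and tone come from the model
        "sources": [sid for sid, _ in passages],    # knowledge comes from retrieval
    }


# Stand-ins so the sketch runs without a model or a vector store.
def fake_retrieve(question: str) -> list[tuple[str, str]]:
    return [("protocol-001", "Placeholder guideline text.")]


def fake_generate(prompt: str) -> str:
    return "ASSESSMENT:\n(placeholder)\nPLAN:\n(placeholder)"
```

Keeping the two callables separate means the index can be refreshed without retraining, and the model can be retrained without re-indexing.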
                    Clinical Question
                            │
                            ▼
                      RAG Retrieval
          (current guidelines + patient context)
                            │
                            ▼
                    Fine-tuned Claude
        (knows clinical format + safety behavior)
                            │
                            ▼
               Structured Clinical Response
    (correct format + grounded in retrieved knowledge)

Architecture
Fine-Tuning Pipeline
Decision Framework
Implementation Patterns
Pattern 1: Evaluating Whether Fine-Tuning Is Needed
Before committing to fine-tuning, establish whether prompt engineering alone is sufficient:

    """
    A/B evaluation: prompt engineering vs. fine-tuning baseline.
    Run this before investing in fine-tuning to determine if it's necessary.

    Context: HMS scenario, evaluating whether SOAP note output format
    can be achieved through prompt engineering alone.
    """
    import time
    from dataclasses import dataclass

    import anthropic


    @dataclass
    class EvaluationResult:
        approach: str
        format_compliance_rate: float
        avg_output_tokens: float
        avg_latency_ms: float
        clinical_quality_score: float  # 1-5, rated by clinical informatics


    SOAP_ZERO_SHOT_SYSTEM = """You are a clinical documentation assistant.
    Generate SOAP notes for physician review."""

    SOAP_DETAILED_SYSTEM = """You are a clinical documentation assistant for a Reference Healthcare Organization.
    Educational Example: Illustrative Workflow. Not intended for clinical decision making.

    Generate SOAP notes in EXACTLY this format:

    SUBJECTIVE:
    [Chief complaint and history of present illness, 2-4 sentences]

    OBJECTIVE:
    Vitals: [Temperature: X°F | BP: X/X | HR: X | RR: X | SpO2: X%]
    Physical Exam: [Relevant findings only, system by system]
    Labs/Imaging: [Relevant results with values and reference ranges]

    ASSESSMENT:
    [Primary and secondary diagnoses with ICD-10 codes in parentheses]

    PLAN:
    [Numbered list: medications, procedures, consults, follow-up]

    PHYSICIAN REVIEW REQUIRED | Generated by HMS AI Assistant"""

    # Section headers every compliant note must contain.
    REQUIRED_HEADERS = ["SUBJECTIVE:", "OBJECTIVE:", "ASSESSMENT:", "PLAN:"]


    def evaluate_prompt_approach(
        test_cases: list[dict],
        system_prompt: str,
        approach_name: str,
        client: anthropic.Anthropic,
    ) -> EvaluationResult:
        """Evaluate a system prompt approach against test cases."""
        if not test_cases:
            raise ValueError("test_cases must not be empty")

        format_compliant = 0
        total_tokens = 0
        total_latency_ms = 0.0

        for case in test_cases:
            start = time.perf_counter()
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system=system_prompt,
                messages=[{"role": "user", "content": case["input"]}],
            )
            total_latency_ms += (time.perf_counter() - start) * 1000

            output = response.content[0].text
            total_tokens += response.usage.output_tokens

            # Presence check only: section order and field limits are not verified.
            if all(h in output for h in REQUIRED_HEADERS):
                format_compliant += 1

        n = len(test_cases)
        return EvaluationResult(
            approach=approach_name,
            format_compliance_rate=format_compliant / n,
            avg_output_tokens=total_tokens / n,
            avg_latency_ms=total_latency_ms / n,  # Wall-clock, includes network time
            clinical_quality_score=0,  # Requires clinical reviewer
        )

Pattern 2: Fine-Tuning Data Preparation

    """
    Prepare a fine-tuning dataset as chat-style JSONL.
    Context: 1,000 clinical encounter examples -> SOAP note training pairs.

    Educational Example: in a real deployment, each example would be generated by
    attending physicians and reviewed by clinical informatics before inclusion.
    Not intended for clinical decision making.
    """
    import json
    from pathlib import Path


    def prepare_fine_tuning_jsonl(
        source_examples: list[dict],
        output_path: Path,
        system_prompt: str,
    ) -> None:
        """
        Convert clinical examples to fine-tuning JSONL.
        Each example: {"input": "encounter notes", "output": "soap note"}

        Assumed record schema: a top-level "system" string plus alternating
        user/assistant messages. Schemas differ between providers, so verify
        against the fine-tuning documentation for the platform you use.
        """
        output_path.parent.mkdir(parents=True, exist_ok=True)

        with output_path.open("w", encoding="utf-8") as f:
            for example in source_examples:
                training_record = {
                    # Keep the system prompt out of the message list: two
                    # consecutive "user" turns would break role alternation.
                    "system": system_prompt,
                    "messages": [
                        {"role": "user", "content": example["input"]},
                        {"role": "assistant", "content": example["output"]},
                    ],
                }
                f.write(json.dumps(training_record, ensure_ascii=False) + "\n")

        print(f"Prepared {len(source_examples)} training examples → {output_path}")

Healthcare Context: HMS Scenario
HMS Fine-Tuning Projects (Evaluated)
Project 1: SOAP Note Format Compliance (Recommended)
- Problem: Baseline Claude produces SOAP notes with inconsistent section ordering and missing required fields
- Training data: 800 physician-authored SOAP notes reviewed by clinical informatics
- Result: Format compliance improved from 72% (prompt engineering) to 97% (fine-tuned)
- Decision: Fine-tune — this is a pure format/behavior problem; few-shot examples in prompt did not achieve sufficient compliance rate
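Whether a jump from 72% to 97% is meaningful depends on how many test cases were scored, which this chapter does not state. The sketch below computes a Wilson score interval for each compliance rate so the two can be compared; the helper names are illustrative, and the sample size of 100 used in the note after the code is a hypothetical value.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def clearly_better(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when interval b lies entirely above interval a (no overlap)."""
    return b[0] > a[1]
```

With a hypothetical 100 test cases per approach, `wilson_interval(72, 100)` and `wilson_interval(97, 100)` do not overlap. With only 10 cases per approach, the intervals for 7/10 and 10/10 do overlap, so the same percentages would not justify a fine-tuning investment on their own.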
Project 2: Clinical Coding Assistance (RAG Preferred)
- Problem: AI suggestions for ICD-10 codes based on clinical documentation
- Option considered: fine-tuning on historical coding data
- Concern: ICD-10 codes update annually and coding conventions change with payer guidance, so fine-tuned knowledge would be stale within 12 months
- Decision: RAG — current ICD-10 codebook + payer coding guidelines indexed as a vector store; model retrieves current codes at inference time
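The reason RAG suits this project is that an annual code update becomes a data operation instead of a training run. The toy index below illustrates that; `CodeIndex` is a hypothetical in-memory stand-in for a vector store, it uses substring matching where a real system would use embedding similarity, and the version labels and the two sample entries are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class CodeIndex:
    """Toy stand-in for a vector store: code -> description, plus a version label."""
    version: str
    entries: dict[str, str] = field(default_factory=dict)

    def upsert(self, code: str, description: str) -> None:
        """Add a new code or overwrite the description of an existing one."""
        self.entries[code] = description

    def remove(self, code: str) -> None:
        """Drop a retired code; a missing code is ignored."""
        self.entries.pop(code, None)

    def search(self, term: str) -> list[tuple[str, str]]:
        """Case-insensitive substring match over descriptions, sorted by code."""
        term = term.lower()
        return sorted(
            (code, desc)
            for code, desc in self.entries.items()
            if term in desc.lower()
        )


index = CodeIndex(version="release-A")  # hypothetical version label
index.upsert("E11.9", "Type 2 diabetes mellitus without complications")
index.upsert("I10", "Essential (primary) hypertension")
```

When a new codebook is published, the update is a series of `upsert` and `remove` calls plus a new `version` label; the model itself is untouched.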
Project 3: Prior Auth Determination (Hybrid)
- Problem: AI-assisted prior authorization determination for common payer criteria
- Fine-tuning component: Trains the model to reason through prior auth criteria in a structured way (behavioral)
- RAG component: Retrieves current payer-specific criteria (informational, changes with every payer contract update)
- Decision: Hybrid — fine-tune for reasoning structure; RAG for current payer policies
Clinical Data Requirements for Fine-Tuning
Fine-tuning clinical models introduces HIPAA obligations:
- Training data containing PHI must be used under a HIPAA authorization or fall under a permitted use such as treatment or health care operations
- The fine-tuning process itself must occur within a HIPAA-compliant environment (for example, within the hospital's cloud tenant)
- Models fine-tuned on PHI can memorize and reproduce training text, so they should be treated as containing PHI and protected accordingly
- De-identified training data is strongly preferred: remove all 18 HIPAA Safe Harbor identifiers before using clinical notes in training
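As a rough illustration of the de-identification step, the sketch below redacts four identifier types with regular expressions. This is deliberately incomplete and is not a Safe Harbor implementation: names, addresses, record numbers, and most of the other identifier categories are not handled, and the patterns are assumptions written for this example. Real pipelines generally combine dedicated de-identification tooling with human review.

```python
import re

# Patterns for a handful of identifier types; deliberately incomplete.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed label such as [PHONE]."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Replacing identifiers with typed labels instead of deleting them keeps the sentence structure intact, which tends to matter when the text is later used as training data.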
Trade-offs and Considerations
Total Cost of Ownership Comparison
Note on cost figures: Specific pricing is not quoted here because AI infrastructure costs change frequently. Verify current rates for embedding APIs, vector database hosting, and fine-tuning compute in official vendor documentation. The structural cost comparison below is the durable insight.
RAG (one-time setup + ongoing):
- Initial: document ingestion, embedding, and indexing costs scale with corpus size
- Ongoing: incremental index updates + vector store hosting (cloud-managed options available from all major providers)
- No GPU infrastructure required when embeddings and the vector store are consumed as managed services
- No data curation labor beyond document quality review
Fine-tuning (high upfront, low ongoing):
- Data curation: 500–5,000 high-quality examples × clinical review labor (this is typically the dominant cost — the human review, not the compute)
- Training compute: per fine-tuning run via API fine-tuning (consult provider documentation for current rates)
- Evaluation: clinical review of test set outputs (often requires specialized clinical informatics staff)
- Re-training on knowledge updates: recurring cost that can accumulate
- Total for a single clinical fine-tuning project is substantially higher than RAG setup
For most enterprise clinical AI use cases, RAG has dramatically lower TCO and higher knowledge freshness. Fine-tuning is justified only when the behavioral improvement cannot be achieved through prompt engineering and the use case is stable enough to amortize the training investment.
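The structural comparison above can be written as two small cost functions. This is a sketch of the shape of the calculation only: the field names are assumptions, and any numbers passed in should come from current vendor pricing and internal labor rates, which this chapter intentionally does not quote.

```python
from dataclasses import dataclass


@dataclass
class FineTuningCosts:
    examples: int                   # curated training examples
    review_cost_per_example: float  # clinical review labor per example
    cost_per_training_run: float
    evaluation_cost_per_run: float
    retrains_per_year: int          # driven by how often the target changes


@dataclass
class RagCosts:
    initial_indexing: float
    hosting_per_year: float
    index_updates_per_year: float


def fine_tuning_tco(c: FineTuningCosts, years: int) -> float:
    """One-time curation plus training and evaluation for every run."""
    curation = c.examples * c.review_cost_per_example
    per_run = c.cost_per_training_run + c.evaluation_cost_per_run
    runs = 1 + c.retrains_per_year * years  # initial run plus retrains
    return curation + per_run * runs


def rag_tco(c: RagCosts, years: int) -> float:
    """One-time indexing plus recurring hosting and index updates."""
    return c.initial_indexing + (c.hosting_per_year + c.index_updates_per_year) * years
```

The `retrains_per_year` term is where unstable targets hurt: each retrain repeats both the training and the clinical evaluation cost, which is why stability is a precondition for amortizing the investment.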
Fine-Tuning Risks
| Risk | Description | Mitigation |
|---|---|---|
| Catastrophic forgetting | Fine-tuning on a narrow dataset degrades performance on general tasks | Use low learning rates; evaluate on general benchmarks post-tuning |
| Training data poisoning | Malicious examples in training data can embed adversarial behaviors | Human review of all training examples before fine-tuning |
| Knowledge staleness | Fine-tuned knowledge becomes outdated | Hybrid approach: fine-tune for behavior, RAG for knowledge |
| Hallucination amplification | Fine-tuning can make hallucination more confident | Never fine-tune on factually incorrect examples; rigorous evaluation |
| Overfitting | Too few examples → model memorizes rather than generalizes | Aim for at least ~200 examples per class as a rule of thumb; watch for validation loss diverging from training loss |
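The validation-loss check in the overfitting row can be sketched as follows. The function name, the `patience` default, and the rule itself (validation loss rising while training loss keeps falling for several consecutive epochs) are one reasonable heuristic assumed for illustration, not the only way to detect overfitting.

```python
from typing import Optional


def overfitting_epoch(
    train_losses: list[float], val_losses: list[float], patience: int = 2
) -> Optional[int]:
    """Return the last epoch index before validation loss began rising for
    `patience` consecutive epochs while training loss kept falling, or None."""
    if len(train_losses) != len(val_losses):
        raise ValueError("loss histories must have the same length")
    streak = 0
    for epoch in range(1, len(val_losses)):
        val_up = val_losses[epoch] > val_losses[epoch - 1]
        train_down = train_losses[epoch] < train_losses[epoch - 1]
        streak = streak + 1 if (val_up and train_down) else 0
        if streak >= patience:
            return epoch - patience  # checkpoint to keep
    return None
```

The returned index identifies the checkpoint worth keeping, so the same function can double as a simple early-stopping rule.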
Comparison Table
| Dimension | Prompt Engineering | RAG | Fine-Tuning | Hybrid (RAG + FT) |
|---|---|---|---|---|
| Time to production | Days | Weeks | Months | Months |
| Knowledge freshness | Static (prompt) | As fresh as the index | Frozen at training time | As fresh as the index |
| Behavioral control | Moderate | Moderate | High | Very High |
| Source attribution | Manual | Natural | None | Natural |
| Training cost | None | None (indexing only) | High | High |
| Inference cost | Baseline | Slightly higher | Baseline | Slightly higher |
| Clinical use case fit | Simple tasks | Knowledge Q&A | Format/style | Complex clinical AI |
Interview Questions
Q1: A healthcare AI company wants to build a clinical note generation system. Should they fine-tune their model or use RAG? What additional information do you need?
Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect
Answer Framework:
My default position is to use prompt engineering first, then RAG, then fine-tuning — in that order — because each step increases cost and complexity substantially. But let me ask the clarifying questions that would change this answer:
What specifically is failing with the baseline model? If the problem is that the model doesn't know the hospital's specific formulary restrictions → RAG. If the model doesn't know the latest clinical guidelines → RAG. If the model doesn't produce output in the required clinical documentation format → try few-shot prompt engineering first; if compliance rate is still insufficient (below ~90%) → fine-tuning. If both knowledge and format are problems → hybrid.
How stable is the target knowledge? Drug formularies can change quarterly or more often. Clinical guidelines are typically revised annually. ICD-10 codes update annually. Any of these should live in RAG, not fine-tuning. Only truly stable, behavioral properties (output format, tone, reasoning structure, safety constraints) are appropriate for fine-tuning.
Do we need source attribution? If every clinical recommendation must cite the guideline it came from (required for liability and regulatory compliance in many healthcare AI contexts) → RAG is the practical choice. Fine-tuned knowledge cannot be reliably attributed to a specific source.
What is the latency requirement? Real-time intra-encounter documentation (<500ms) may require a small fine-tuned model. Post-encounter documentation (batch, no SLA) can use a large RAG-augmented model.
In the typical HMS clinical note generation scenario, the right answer is hybrid: fine-tune a mid-tier Claude model on 500+ physician-authored note examples (for format and style), plus RAG retrieval of current clinical guidelines and patient-specific EHR context (for knowledge freshness and source attribution).
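The clarifying questions in this answer can be condensed into a small decision function. It is a simplification under stated assumptions: the `Requirements` fields are hypothetical names, the 0.90 default mirrors the ~90% compliance threshold mentioned above, and latency constraints are left out.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    knowledge_gap: bool              # model lacks facts it needs
    needs_attribution: bool          # answers must cite their sources
    behavior_gap: bool               # format, tone, or reasoning structure is off
    prompt_compliance: float         # measured compliance with the best prompt, 0-1
    compliance_target: float = 0.90  # threshold below which prompting is insufficient


def recommend(req: Requirements) -> str:
    """Apply the ordering: prompt engineering first, then RAG, then fine-tuning."""
    needs_rag = req.knowledge_gap or req.needs_attribution
    needs_fine_tuning = (
        req.behavior_gap and req.prompt_compliance < req.compliance_target
    )
    if needs_rag and needs_fine_tuning:
        return "hybrid"
    if needs_rag:
        return "rag"
    if needs_fine_tuning:
        return "fine-tuning"
    return "prompt engineering"
```

Note that a behavioral gap alone does not trigger fine-tuning; it only does so when measured prompt compliance falls short of the target.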
Further Reading
In This Repository:
- Evaluation and Benchmarking — How to measure whether fine-tuning improved things
- Enterprise AI Strategy — [PLANNED] Build vs. buy and make vs. integrate frameworks
- labs/experiments/ — Planned experiment on format compliance: prompt engineering vs. fine-tuning
Summary
Key Takeaways:
- Fine-tuning solves behavioral problems (format, style, reasoning structure); RAG solves knowledge problems (current facts, institution-specific data, attributable sources)
- The default approach is: prompt engineering first → RAG if knowledge freshness/attribution needed → fine-tuning only if behavioral conformance cannot be achieved via prompting
- Source attribution is often a hard requirement in clinical AI, and only RAG provides it naturally
- The hybrid approach (fine-tune for behavior + RAG for knowledge) fits systems that need both behavioral control and current, attributable knowledge
- Fine-tuning TCO is typically substantially higher than RAG for equivalent knowledge coverage; it is justified only for stable behavioral improvements
The one thing to remember: Fine-tuning teaches the model to act differently; RAG teaches the model to know more. Confusing the two leads to over-investment in fine-tuning for knowledge problems (expensive, stale, unattributable) or under-investment in behavioral fine-tuning for format problems (inconsistent, unreliable format).