Fine-Tuning vs RAG
Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate
Executive Summary
Fine-tuning and Retrieval-Augmented Generation (RAG) solve different problems. Fine-tuning changes how a model behaves; RAG changes what a model knows at inference time. Choosing between them — or combining them — is one of the most consequential architectural decisions in enterprise AI system design. This chapter provides a rigorous decision framework for clinical and enterprise contexts where both the cost of wrong choices and the stakes of AI output quality are high.
Trade-offs and Considerations
Total Cost of Ownership Comparison
Note on cost figures: Specific pricing is not quoted here because AI infrastructure costs change frequently. Verify current rates for embedding APIs, vector database hosting, and fine-tuning compute in official vendor documentation. The structural cost comparison below is the durable insight.
RAG (one-time setup + ongoing):
- Initial: document ingestion, embedding, and indexing costs scale with corpus size
- Ongoing: incremental index updates + vector store hosting (cloud-managed options available from all major providers)
- No GPU infrastructure required
- No data curation labor beyond document quality review
Fine-tuning (high upfront, low ongoing):
- Data curation: 500–5,000 high-quality examples × clinical review labor (this is typically the dominant cost — the human review, not the compute)
- Training compute: billed per training run when using a provider's fine-tuning API (consult provider documentation for current rates)
- Evaluation: clinical review of test set outputs (often requires specialized clinical informatics staff)
- Re-training on knowledge updates: each update to the underlying knowledge requires a new training and evaluation cycle, so this recurring cost accumulates
- Total: a single clinical fine-tuning project typically costs substantially more than the initial RAG setup
For most enterprise clinical AI use cases, RAG has a much lower TCO and keeps knowledge fresher. Fine-tuning is justified only when the behavioral improvement cannot be achieved through prompt engineering and the use case is stable enough to amortize the training investment.
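The structural comparison above can be sketched as a small cost model. This is a minimal illustration, not a pricing tool: the `CostProfile` fields, the `total_cost` function, and every number below are hypothetical placeholders in arbitrary cost units, chosen only to show how re-training on knowledge updates changes the picture.

```python
from dataclasses import dataclass


@dataclass
class CostProfile:
    """Structural cost profile in arbitrary units (placeholders, not real prices)."""
    upfront: float               # one-time setup: ingestion/indexing or data curation/training
    monthly: float               # steady-state hosting or serving cost per month
    per_knowledge_update: float  # cost incurred each time the underlying knowledge changes


def total_cost(profile: CostProfile, months: int, updates_per_year: float) -> float:
    """Cumulative cost over a planning horizon of `months`."""
    updates = updates_per_year * months / 12
    return (
        profile.upfront
        + profile.monthly * months
        + profile.per_knowledge_update * updates
    )


# Hypothetical values: RAG updates are cheap index refreshes,
# fine-tuning updates are full re-training plus clinical re-evaluation.
rag = CostProfile(upfront=10, monthly=2, per_knowledge_update=0.5)
fine_tuning = CostProfile(upfront=60, monthly=0.5, per_knowledge_update=20)

for updates_per_year in (0, 4, 12):
    r = total_cost(rag, months=24, updates_per_year=updates_per_year)
    f = total_cost(fine_tuning, months=24, updates_per_year=updates_per_year)
    print(f"updates/year={updates_per_year:>2}  RAG={r:.1f}  fine-tuning={f:.1f}")
```

With these placeholder inputs the gap between the two options widens as the update frequency rises, which is the amortization argument in code form. Replace the values with figures from your own vendor quotes and labor estimates before drawing conclusions.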
Fine-Tuning Risks
| Risk | Description | Mitigation |
|---|---|---|
| Catastrophic forgetting | Fine-tuning on a narrow dataset degrades performance on general tasks | Use low learning rates; evaluate on general benchmarks post-tuning |
| Training data poisoning | Malicious examples in training data can embed adversarial behaviors | Human review of all training examples before fine-tuning |
| Knowledge staleness | Fine-tuned knowledge becomes outdated | Hybrid approach: fine-tune for behavior, RAG for knowledge |
| Hallucination amplification | Fine-tuning can make the model state hallucinated content with greater confidence | Never fine-tune on factually incorrect examples; rigorous evaluation |
| Overfitting | Too few examples → model memorizes rather than generalizes | Minimum 200 examples per class; monitor validation loss to detect it |
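The overfitting mitigation in the last row can be sketched as two pre- and post-training checks. This is a minimal sketch under stated assumptions: the helper names `underrepresented_classes` and `best_epoch_before_overfitting` are hypothetical, the 200-example floor comes from the table, and the `patience` value is an illustrative default rather than a recommendation.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def underrepresented_classes(labels: Sequence[str], min_per_class: int = 200) -> Dict[str, int]:
    """Return classes whose example count falls below the minimum."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_per_class}


def best_epoch_before_overfitting(val_losses: Sequence[float], patience: int = 2) -> Optional[int]:
    """Return the epoch with the lowest validation loss once the loss has
    failed to improve for `patience` consecutive epochs, else None."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch  # validation loss stopped improving here
    return None  # no sustained rise observed


# Illustrative inputs only: made-up labels and a made-up loss curve.
labels = ["discharge_summary"] * 250 + ["progress_note"] * 50
print(underrepresented_classes(labels))
print(best_epoch_before_overfitting([1.0, 0.8, 0.7, 0.72, 0.75, 0.8]))
```

A returned epoch index suggests keeping the checkpoint from that epoch and stopping training there; `None` means the curve gave no sign of overfitting within the observed epochs.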
Interview Questions
Q1: A healthcare AI company wants to build a clinical note generation system. Should they fine-tune their model or use RAG? What additional information do you need?
Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect
Answer Framework:
My default position is to use prompt engineering first, then RAG, then fine-tuning — in that order — because each step increases cost and complexity substantially. But let me ask the clarifying questions that would change this answer:
What specifically is failing with the baseline model?
- The model doesn't know the hospital's specific formulary restrictions → RAG.
- The model doesn't know the latest clinical guidelines → RAG.
- The model doesn't produce output in the required clinical documentation format → try few-shot prompt engineering first; if the compliance rate is still insufficient (below ~90%) → fine-tuning.
- Both knowledge and format are problems → hybrid.
How stable is the target knowledge? Drug formularies change monthly. Clinical guidelines change quarterly. ICD-10 codes update annually. Any of these should live in RAG, not fine-tuning. Only truly stable, behavioral properties (output format, tone, reasoning structure, safety constraints) are appropriate for fine-tuning.
Do we need source attribution? If every clinical recommendation must cite the guideline it came from (required for liability and regulatory compliance in many healthcare AI contexts) → RAG is the only option. Fine-tuned knowledge cannot be attributed to a specific source.
What is the latency requirement? Real-time intra-encounter documentation (<500ms) may require a small fine-tuned model. Post-encounter documentation (batch, no SLA) can use a large RAG-augmented model.
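The clarifying questions above can be condensed into a rough decision function. This is a simplified sketch, not a complete framework: the `Requirements` fields and the `recommend` function are hypothetical names, the 0.90 threshold mirrors the approximate figure given earlier, and the latency question is deliberately left out because it affects model size rather than the RAG versus fine-tuning choice.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    knowledge_gap: bool             # model lacks facts such as formulary rules or guidelines
    needs_source_attribution: bool  # outputs must cite the guideline they rely on
    format_gap: bool                # baseline output misses the required documentation format
    few_shot_compliance: float      # measured format-compliance rate with few-shot prompting, 0..1


def recommend(req: Requirements, compliance_threshold: float = 0.90) -> str:
    """Map requirements to an approach, preferring the cheapest option that works."""
    needs_rag = req.knowledge_gap or req.needs_source_attribution
    # Fine-tune for format only if few-shot prompting has been tried and falls short.
    needs_fine_tuning = req.format_gap and req.few_shot_compliance < compliance_threshold

    if needs_rag and needs_fine_tuning:
        return "hybrid"
    if needs_rag:
        return "rag"
    if needs_fine_tuning:
        return "fine-tuning"
    return "prompt engineering"


print(recommend(Requirements(True, True, True, 0.80)))
print(recommend(Requirements(False, False, True, 0.95)))
```

The ordering of the checks encodes the default position stated earlier: prompt engineering is the fallback when neither a knowledge gap nor a persistent format gap has been demonstrated.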
In the typical HMS clinical note generation scenario, the right answer is hybrid: fine-tune a mid-tier Claude model on 500+ physician-authored note examples (for format and style), plus RAG retrieval of current clinical guidelines and patient-specific EHR context (for knowledge freshness and source attribution).
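The RAG half of that hybrid design can be sketched as retrieval plus prompt assembly with source identifiers attached. Everything here is an illustrative assumption: the `Passage` type, the word-overlap `retrieve` function (a stand-in for a real embedding search), the source ids, and the placeholder passage text. The call to the fine-tuned model is omitted because it depends on the provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    source_id: str  # hypothetical identifier, e.g. a guideline name and version
    text: str


def retrieve(query: str, corpus: List[Passage], k: int = 2) -> List[Passage]:
    """Toy retriever: rank passages by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.text.lower().split())), p) for p in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [p for score, p in ranked[:k] if score > 0]


def build_prompt(question: str, passages: List[Passage]) -> str:
    """Attach source ids to each passage so the model can cite them."""
    context = "\n".join(f"[{p.source_id}] {p.text}" for p in passages)
    return (
        "Answer using only the sources below and cite the source id in brackets.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )


# Placeholder corpus; the text is not real clinical guidance.
corpus = [
    Passage("guideline-A-v3", "placeholder guidance on anticoagulation dosing review"),
    Passage("formulary-2026-06", "placeholder formulary restrictions for anticoagulation agents"),
    Passage("style-guide", "placeholder note formatting conventions"),
]
question = "anticoagulation dosing restrictions"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)  # this prompt would then be sent to the fine-tuned model
```

Keeping the source id next to each retrieved passage is what makes attribution possible: the citation comes from the retrieved context, while the fine-tuned model contributes only format and style.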
Further Reading
In This Repository:
- Evaluation and Benchmarking — How to measure whether fine-tuning improved things
- Enterprise AI Strategy — [PLANNED] Build vs. buy and make vs. integrate frameworks
- labs/experiments/ — Planned experiment on format compliance: prompt engineering vs. fine-tuning