Fine-Tuning vs RAG

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate


Executive Summary

Fine-tuning and Retrieval-Augmented Generation (RAG) solve different problems. Fine-tuning changes how a model behaves; RAG changes what a model knows at inference time. Choosing between them, or combining them, is one of the most consequential architectural decisions in enterprise AI system design. This chapter provides a decision framework for clinical and enterprise contexts, where a wrong architectural choice is expensive and the quality of AI output carries high stakes.


Trade-offs and Considerations

Total Cost of Ownership Comparison

ℹ Note

Note on cost figures: Specific pricing is not quoted here because AI infrastructure costs change frequently. Verify current rates for embedding APIs, vector database hosting, and fine-tuning compute in official vendor documentation. The structural cost comparison below is the durable insight.

RAG (one-time setup + ongoing):

  • Initial: document ingestion, embedding, and indexing costs scale with corpus size
  • Ongoing: incremental index updates + vector store hosting (cloud-managed options available from all major providers)
  • No GPU infrastructure required
  • No data curation labor beyond document quality review

Fine-tuning (high upfront, low ongoing):

  • Data curation: 500–5,000 high-quality examples × clinical review labor (typically the dominant cost is the human review, not the compute)
  • Training compute: billed per fine-tuning run when using API-based fine-tuning (consult provider documentation for current rates)
  • Evaluation: clinical review of test set outputs (often requires specialized clinical informatics staff)
  • Re-training on knowledge updates: a recurring cost incurred each time the underlying knowledge changes
  • Total: a single clinical fine-tuning project typically costs substantially more than a RAG setup

For most enterprise clinical AI use cases, RAG has a substantially lower TCO and keeps knowledge fresher. Fine-tuning is justified only when the behavioral improvement cannot be achieved through prompt engineering and the use case is stable enough to amortize the training investment.
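The structural comparison above can be sketched as a small cost model. This is a minimal illustration, not a pricing tool: the class names, fields, and the assumption that each re-train repeats the training run and the evaluation pass (but not the original curation) are all hypothetical, and the numbers in the usage note are unitless placeholders rather than vendor prices.

```python
from dataclasses import dataclass


@dataclass
class RagCosts:
    setup: float            # one-time ingestion, embedding, and indexing
    monthly_hosting: float  # vector store hosting plus incremental index updates


@dataclass
class FineTuneCosts:
    curation: float           # example curation and clinical review labor
    training_run: float       # compute for a single fine-tuning run
    evaluation: float         # clinical review of test set outputs
    retrains_per_year: float  # how often knowledge updates force a re-train


def rag_tco(costs: RagCosts, months: int) -> float:
    """Total RAG cost over the given horizon."""
    return costs.setup + costs.monthly_hosting * months


def fine_tune_tco(costs: FineTuneCosts, months: int) -> float:
    """Total fine-tuning cost over the given horizon.

    Assumption: every re-train repeats the training run and the evaluation,
    while the original curation effort is paid only once.
    """
    initial = costs.curation + costs.training_run + costs.evaluation
    retrains = costs.retrains_per_year * months / 12
    return initial + retrains * (costs.training_run + costs.evaluation)


# Placeholder units only; substitute figures from current vendor documentation.
rag = RagCosts(setup=10, monthly_hosting=2)
fine_tune = FineTuneCosts(curation=100, training_run=5, evaluation=20, retrains_per_year=4)
```

With these placeholder values, `rag_tco(rag, 12)` and `fine_tune_tco(fine_tune, 12)` can be compared over the same horizon; setting `retrains_per_year` to 0 models a use case stable enough to amortize the upfront investment.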

Fine-Tuning Risks

| Risk | Description | Mitigation |
| --- | --- | --- |
| Catastrophic forgetting | Fine-tuning on a narrow dataset degrades performance on general tasks | Use low learning rates; evaluate on general benchmarks post-tuning |
| Training data poisoning | Malicious examples in training data can embed adversarial behaviors | Human review of all training examples before fine-tuning |
| Knowledge staleness | Fine-tuned knowledge becomes outdated | Hybrid approach: fine-tune for behavior, RAG for knowledge |
| Hallucination amplification | Fine-tuning can make hallucination more confident | Never fine-tune on factually incorrect examples; rigorous evaluation |
| Overfitting | Too few examples → model memorizes rather than generalizes | Minimum 200 examples per class; use validation loss to detect |
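The overfitting mitigation can be sketched as two small checks: one flags classes with fewer examples than the stated minimum, the other looks for the usual overfitting signature in logged losses (validation loss rising while training loss keeps falling). The function names and the `patience` parameter are illustrative assumptions, not part of any particular training framework.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def classes_below_minimum(labels: Sequence[str], minimum: int = 200) -> Dict[str, int]:
    """Return each class whose example count is under `minimum`, with its count."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < minimum}


def overfitting_started_at(
    train_loss: Sequence[float],
    val_loss: Sequence[float],
    patience: int = 2,
) -> Optional[int]:
    """Return the evaluation index where validation loss began a run of
    `patience` consecutive increases while training loss kept decreasing.
    Returns None if no such run exists.
    """
    streak = 0
    for i in range(1, min(len(train_loss), len(val_loss))):
        val_rising = val_loss[i] > val_loss[i - 1]
        train_falling = train_loss[i] < train_loss[i - 1]
        if val_rising and train_falling:
            streak += 1
            if streak >= patience:
                return i - patience + 1  # first index of the rising run
        else:
            streak = 0
    return None
```

A checkpoint saved just before the returned index is a reasonable candidate to keep; a larger `patience` tolerates noisier validation curves at the cost of later detection.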

Interview Questions

Q1: A healthcare AI company wants to build a clinical note generation system. Should they fine-tune their model or use RAG? What additional information do you need?

Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

My default position is to try prompt engineering first, then RAG, then fine-tuning, in that order, because each step adds substantial cost and complexity. Several clarifying questions could change that answer:

What specifically is failing with the baseline model?

  • The model doesn't know the hospital's specific formulary restrictions → RAG.
  • The model doesn't know the latest clinical guidelines → RAG.
  • The model doesn't produce output in the required clinical documentation format → try few-shot prompt engineering first; if the compliance rate is still insufficient (below ~90%) → fine-tuning.
  • Both knowledge and format are problems → hybrid.

How stable is the target knowledge? Drug formularies change monthly. Clinical guidelines change quarterly. ICD-10 codes update annually. Any of these should live in RAG, not fine-tuning. Only truly stable, behavioral properties (output format, tone, reasoning structure, safety constraints) are appropriate for fine-tuning.

Do we need source attribution? If every clinical recommendation must cite the guideline it came from (required for liability and regulatory compliance in many healthcare AI contexts) → RAG is the only option. Fine-tuned knowledge cannot be attributed to a specific source.

What is the latency requirement? Real-time intra-encounter documentation (under 500 ms) may require a small fine-tuned model. Post-encounter documentation (batch, no SLA) can use a large RAG-augmented model.
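The four clarifying questions can be read as a small decision procedure. The sketch below encodes them under stated assumptions: the `Requirements` fields and the `recommend` function are hypothetical names, and the two constants reuse the approximate 90% compliance figure and the 500 ms bound from the text as adjustable defaults rather than fixed rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

COMPLIANCE_THRESHOLD = 0.90  # the "~90%" figure from the text; adjust per project
REALTIME_BUDGET_MS = 500     # the intra-encounter latency bound from the text


@dataclass
class Requirements:
    knowledge_gap: bool             # model lacks formulary or guideline knowledge
    volatile_knowledge: bool        # target knowledge changes monthly/quarterly/annually
    format_gap: bool                # output misses the required documentation format
    few_shot_compliance: float      # measured format compliance with few-shot prompts, 0..1
    needs_source_attribution: bool  # recommendations must cite their source
    latency_budget_ms: Optional[float] = None  # None means batch, no SLA


def recommend(req: Requirements) -> Tuple[str, List[str]]:
    """Map the clarifying questions to an approach plus advisory notes."""
    needs_rag = req.knowledge_gap or req.volatile_knowledge or req.needs_source_attribution
    needs_fine_tuning = req.format_gap and req.few_shot_compliance < COMPLIANCE_THRESHOLD

    if needs_rag and needs_fine_tuning:
        approach = "hybrid"
    elif needs_rag:
        approach = "rag"
    elif needs_fine_tuning:
        approach = "fine-tuning"
    else:
        approach = "prompt engineering"

    notes: List[str] = []
    if req.format_gap and not needs_fine_tuning:
        notes.append("few-shot prompting already meets the format compliance threshold")
    if req.latency_budget_ms is not None and req.latency_budget_ms < REALTIME_BUDGET_MS:
        notes.append("real-time budget: a small fine-tuned model may be required")
    return approach, notes
```

Latency is returned as a note rather than folded into the approach because the text treats it as a constraint on model size, not as a reason to choose between knowledge and behavior mechanisms.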

In the typical HMS clinical note generation scenario, the right answer is hybrid: fine-tune a mid-tier Claude model on 500+ physician-authored note examples (for format and style), plus RAG retrieval of current clinical guidelines and patient-specific EHR context (for knowledge freshness and source attribution).
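The hybrid shape described here can be sketched end to end with a toy retriever and a stubbed model. Everything in this block is an assumption for illustration: `Passage`, `retrieve`, `build_prompt`, and `generate_note` are hypothetical names, the keyword-overlap ranking stands in for a real embedding search, and `model` is any callable that takes a prompt string, standing in for a call to the fine-tuned model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Passage:
    source: str  # guideline or EHR document identifier, kept for attribution
    text: str


def retrieve(query: str, corpus: List[Passage], k: int = 2) -> List[Passage]:
    """Toy retriever: rank passages by words shared with the query.

    A production system would use embeddings and a vector index instead.
    """
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.text.lower().split())), p) for p in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(encounter: str, passages: List[Passage]) -> str:
    """Number each passage so the model can cite it as [n]."""
    context = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Context passages:\n" + context + "\n\n"
        "Encounter summary:\n" + encounter + "\n\n"
        "Write the clinical note. Cite passages as [n]."
    )


def generate_note(
    encounter: str,
    corpus: List[Passage],
    model: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Retrieval supplies knowledge and sources; the model supplies format and style."""
    passages = retrieve(encounter, corpus)
    note = model(build_prompt(encounter, passages))
    return note, [p.source for p in passages]
```

Returning the source list alongside the note keeps attribution outside the model's output, so citations can be checked against what was actually retrieved.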


Further Reading

In This Repository: