Fine-Tuning vs RAG

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate


Trade-offs and Considerations

Total Cost of Ownership Comparison

ℹ Note

Note on cost figures: Specific pricing is not quoted here because AI infrastructure costs change frequently. Verify current rates for embedding APIs, vector database hosting, and fine-tuning compute in official vendor documentation. The structural cost comparison below is the durable insight.

RAG (one-time setup + ongoing):

  • Initial: document ingestion, embedding, and indexing costs scale with corpus size
  • Ongoing: incremental index updates + vector store hosting (cloud-managed options available from all major providers)
  • No GPU infrastructure required
  • No data curation labor beyond document quality review

Fine-tuning (high upfront, low ongoing):

  • Data curation: 500–5,000 high-quality examples × clinical review labor (this is typically the dominant cost — the human review, not the compute)
  • Training compute: per fine-tuning run via API fine-tuning (consult provider documentation for current rates)
  • Evaluation: clinical review of test set outputs (often requires specialized clinical informatics staff)
  • Re-training on knowledge updates: recurring cost that can accumulate
  • Total: a single clinical fine-tuning project typically costs substantially more than a RAG setup

For most enterprise clinical AI use cases, RAG has a substantially lower TCO and keeps knowledge fresher, because updating an index is cheaper than re-curating data and re-training a model. Fine-tuning is justified only when the required behavioral improvement cannot be achieved through prompt engineering and the use case is stable enough to amortize the training investment.
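The structural comparison above can be sketched as a small cumulative-cost model. This is an illustration only: the class names, fields, and the unit values in the usage lines are hypothetical placeholders in arbitrary cost units (not vendor prices). It assumes RAG cost grows linearly with hosting months, while fine-tuning cost grows with the number of re-training cycles.

```python
from dataclasses import dataclass


@dataclass
class RagCosts:
    """Hypothetical RAG cost inputs, in arbitrary cost units."""
    initial_indexing: float  # ingestion + embedding + indexing (scales with corpus size)
    monthly_hosting: float   # vector store hosting
    monthly_updates: float   # incremental index updates

    def total(self, months: int) -> float:
        return self.initial_indexing + months * (self.monthly_hosting + self.monthly_updates)


@dataclass
class FineTuningCosts:
    """Hypothetical fine-tuning cost inputs, in arbitrary cost units."""
    examples: int                   # curated training examples
    review_cost_per_example: float  # clinical review labor per example
    training_run: float             # compute for one fine-tuning run
    evaluation: float               # clinical review of test-set outputs, per run

    def curation(self) -> float:
        return self.examples * self.review_cost_per_example

    def total(self, retrains: int) -> float:
        # One initial run plus `retrains` re-training cycles. Curation is counted
        # once, which understates cost if each update needs newly reviewed examples.
        runs = 1 + retrains
        return self.curation() + runs * (self.training_run + self.evaluation)


# Placeholder values chosen only to exercise the model.
rag = RagCosts(initial_indexing=10.0, monthly_hosting=2.0, monthly_updates=1.0)
ft = FineTuningCosts(examples=500, review_cost_per_example=0.5,
                     training_run=20.0, evaluation=30.0)

rag_year = rag.total(months=12)           # 10 + 12 * 3
ft_year = ft.total(retrains=4)            # 250 + 5 * 50
curation_share = ft.curation() / ft_year  # fraction of spend that is human review
```

Substituting current vendor rates and local labor costs for the placeholders shows where the two totals diverge for a given corpus size and update cadence.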

Fine-Tuning Risks

| Risk | Description | Mitigation |
|---|---|---|
| Catastrophic forgetting | Fine-tuning on a narrow dataset degrades performance on general tasks | Use low learning rates; evaluate on general benchmarks post-tuning |
| Training data poisoning | Malicious examples in training data can embed adversarial behaviors | Human review of all training examples before fine-tuning |
| Knowledge staleness | Fine-tuned knowledge becomes outdated | Hybrid approach: fine-tune for behavior, RAG for knowledge |
| Hallucination amplification | Fine-tuning can make hallucination more confident | Never fine-tune on factually incorrect examples; rigorous evaluation |
| Overfitting | Too few examples → model memorizes rather than generalizes | Minimum 200 examples per class; use validation loss to detect |
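The overfitting mitigations in the last row can be checked mechanically before and during a run. The sketch below is a minimal illustration. The function names and the `patience` parameter are assumptions introduced here, and the stall rule (no improvement for `patience` consecutive epochs) is one common heuristic, not a prescribed method.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def classes_below_minimum(labels: Sequence[str], minimum: int = 200) -> Dict[str, int]:
    """Return {class: count} for every class with fewer than `minimum` examples."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < minimum}


def overfitting_epoch(val_losses: Sequence[float], patience: int = 2) -> Optional[int]:
    """Return the index of the best epoch once validation loss has failed to
    improve for `patience` consecutive epochs; return None if that never happens."""
    best_idx = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx
    return None


# Hypothetical label set: one class is under the 200-example floor.
labels = ["discharge"] * 250 + ["referral"] * 120
short_classes = classes_below_minimum(labels)

# Hypothetical validation-loss curve: improves, then rises.
stall = overfitting_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8], patience=2)
```

A returned epoch index suggests keeping the checkpoint from that epoch. A non-empty `short_classes` result suggests collecting more examples before training.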

Comparison Table

| Dimension | Prompt Engineering | RAG | Fine-Tuning | Hybrid (RAG + FT) |
|---|---|---|---|---|
| Time to production | Days | Weeks | Months | Months |
| Knowledge freshness | Static (prompt) | Real-time | Stale | Real-time |
| Behavioral control | Moderate | Moderate | High | Very High |
| Source attribution | Manual | Natural | None | Natural |
| Training cost | None | None | High | High |
| Inference cost | Baseline | Slightly higher | Baseline | Slightly higher |
| Clinical use case fit | Simple tasks | Knowledge Q&A | Format/style | Complex clinical AI |

Interview Questions

Q1: A healthcare AI company wants to build a clinical note generation system. Should they fine-tune their model or use RAG? What additional information do you need?

Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

My default position is to use prompt engineering first, then RAG, then fine-tuning — in that order — because each step increases cost and complexity substantially. But let me ask the clarifying questions that would change this answer:

What specifically is failing with the baseline model?

  • The model doesn't know the hospital's specific formulary restrictions → RAG.
  • The model doesn't know the latest clinical guidelines → RAG.
  • The model doesn't produce output in the required clinical documentation format → try few-shot prompt engineering first; if the compliance rate is still insufficient (below ~90%) → fine-tuning.
  • Both knowledge and format are problems → hybrid.
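The failure-mode triage above can be written as a small decision function. This is a sketch under stated assumptions: the function name, parameter names, and return labels are hypothetical, and the 0.90 default mirrors the approximate threshold mentioned in the answer. It also assumes fine-tuning is recommended only after a few-shot attempt has been measured and found short, so an unmeasured format gap falls back to prompt engineering.

```python
from typing import Optional


def recommend_approach(
    knowledge_gap: bool,
    format_gap: bool,
    few_shot_compliance: Optional[float] = None,
    compliance_target: float = 0.90,
) -> str:
    """Map observed baseline failures to a starting approach.

    few_shot_compliance is the measured format-compliance rate (0.0 to 1.0)
    after trying few-shot prompting, or None if that has not been tried yet.
    """
    # Fine-tuning is only on the table once few-shot prompting has been
    # tried and measured below the target.
    needs_fine_tuning = (
        format_gap
        and few_shot_compliance is not None
        and few_shot_compliance < compliance_target
    )
    if knowledge_gap and needs_fine_tuning:
        return "hybrid"
    if needs_fine_tuning:
        return "fine-tuning"
    if knowledge_gap:
        return "rag"
    # Either no gap at all, or a format gap that few-shot prompting
    # has not yet been shown to miss.
    return "prompt-engineering"
```

For example, `recommend_approach(knowledge_gap=True, format_gap=True, few_shot_compliance=0.82)` returns `"hybrid"`, while the same call without a measured compliance rate returns `"rag"`, reflecting the cheaper step being tried first.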

How stable is the target knowledge? Drug formularies change monthly. Clinical guidelines change quarterly. ICD-10 codes update annually. Any of these should live in RAG, not fine-tuning. Only truly stable, behavioral properties (output format, tone, reasoning structure, safety constraints) are appropriate for fine-tuning.

Do we need source attribution? If every clinical recommendation must cite the guideline it came from (required for liability and regulatory compliance in many healthcare AI contexts) → RAG, alone or as part of a hybrid, is the only option, because knowledge absorbed through fine-tuning cannot be attributed to a specific source.

What is the latency requirement? Real-time intra-encounter documentation (<500ms) may require a small fine-tuned model. Post-encounter documentation (batch, no SLA) can use a large RAG-augmented model.

In the typical HMS clinical note generation scenario, the right answer is hybrid: fine-tune a mid-tier Claude model on 500+ physician-authored note examples (for format and style), plus RAG retrieval of current clinical guidelines and patient-specific EHR context (for knowledge freshness and source attribution).
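The retrieval half of that hybrid design can be sketched in a few lines. Everything here is a hypothetical illustration: `GuidelineChunk`, `retrieve`, `build_prompt`, the source IDs, and the placeholder corpus text are invented for the example, and the toy word-overlap ranking stands in for the embedding search a real system would use. The call to the fine-tuned model is deliberately omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuidelineChunk:
    source_id: str  # hypothetical identifier used for citation
    text: str


def retrieve(query: str, corpus: List[GuidelineChunk], k: int = 2) -> List[GuidelineChunk]:
    """Toy retriever: rank chunks by the number of lowercase words shared with
    the query. A real system would use embeddings and a vector store."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(c.text.lower().split())), c) for c in corpus]
    # Drop chunks with no overlap; the stable sort keeps corpus order on ties.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [c for _, c in ranked[:k]]


def build_prompt(encounter_summary: str, chunks: List[GuidelineChunk]) -> str:
    """Assemble the prompt for the fine-tuned model. Note format and style are
    left to the model; retrieved text carries source IDs so output can cite them."""
    context = "\n".join(f"[{c.source_id}] {c.text}" for c in chunks)
    return (
        "Retrieved guidelines (cite by ID):\n"
        f"{context}\n\n"
        "Encounter summary:\n"
        f"{encounter_summary}\n\n"
        "Write the clinical note and cite a guideline ID for each recommendation."
    )


# Placeholder corpus; the text is filler, not clinical guidance.
corpus = [
    GuidelineChunk("G-001", "placeholder guidance on hypertension follow-up intervals"),
    GuidelineChunk("G-002", "placeholder guidance on diabetes foot screening"),
    GuidelineChunk("G-003", "placeholder guidance on hypertension medication review"),
]
hits = retrieve("hypertension follow-up visit", corpus)
prompt = build_prompt("Placeholder encounter summary.", hits)
```

The split mirrors the division of labor described above: the fine-tuned model supplies format and style, while the retrieved chunks supply current knowledge along with the IDs needed for attribution.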