AI Foundations — Quick Reference

Last Updated: 2026-06-30
Full Chapters: docs/01-AI-Foundations/


LLMs — Core Facts

Property | Value | Notes
Generation method | Autoregressive, one token at a time | Latency scales with output length
Tokens vs. words | 1 token ≈ 0.75 English words | Clinical text: 1 token ≈ 0.6–0.7 words
Cost driver | Input and output tokens are priced separately | Input tokens cost less than output tokens
Claude Opus context | 200,000 tokens ≈ 150,000 words | ~300 pages
Hallucination cause | Statistical next-token prediction, not knowledge lookup | Mitigate with RAG + validation
Temperature = 0 | Near-deterministic (greedy: always the top token) | Use for classification, extraction
Temperature = 1 | Model's natural distribution | Use for creative drafting
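
The token ratios and the separate input/output pricing above can be combined into a rough cost estimate. The sketch below is illustrative only: the ratios come from the table, while the function names and the per-million-token prices in the usage note are hypothetical placeholders, not published rates.

```python
# Words-per-token ratios taken from the table above.
WORDS_PER_TOKEN_GENERAL = 0.75
WORDS_PER_TOKEN_CLINICAL = 0.65  # assumed midpoint of the 0.6-0.7 range


def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN_GENERAL):
    """Rough token count for a text of `word_count` words."""
    if words_per_token <= 0:
        raise ValueError("words_per_token must be positive")
    return round(word_count / words_per_token)


def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_mtok, output_price_per_mtok):
    """Cost of one request when input and output tokens are priced separately.

    Prices are expressed in USD per million tokens (hypothetical values).
    """
    total = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return total / 1_000_000
```

For example, `estimate_tokens(150_000)` returns 200000, which matches the context-window row. With placeholder prices of 3.0 and 15.0 USD per million tokens, a request with 10,000 input tokens and 1,000 output tokens would cost about 0.045 USD.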

Embeddings — Core Facts

Property | Value | Notes
What they are | Fixed-length dense vectors (768–3,072 dims) | Text → geometry
Similarity metric | Cosine similarity | Range: -1 to +1; use 0.60+ for retrieval
Clinical threshold | ≥ 0.60 relevance; ≥ 0.90 near-duplicate | Calibrate per domain
Chunk size sweet spot | 512 tokens, 64-token overlap | Balance precision vs. context
Domain model win | Clinical models (e.g., voyage-medical-2) vs. general: +10–20% recall | Evaluate domain-specific models for clinical use cases
Model lock-in | Ingestion and query model MUST match | Upgrade = full re-embed
1M chunks storage | ~6 GB at 1,536 dims × 4 bytes | HNSW index adds ~20% overhead
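
As a quick check on the similarity thresholds and the storage figure, here is a minimal pure-Python sketch. The function names are illustrative, and a real system would normally compute similarity inside the vector store rather than in application code.

```python
import math

# Thresholds from the table above; calibrate per domain.
RELEVANCE_THRESHOLD = 0.60
NEAR_DUPLICATE_THRESHOLD = 0.90


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to +1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def label_match(score):
    """Map a similarity score onto the clinical thresholds."""
    if score >= NEAR_DUPLICATE_THRESHOLD:
        return "near-duplicate"
    if score >= RELEVANCE_THRESHOLD:
        return "relevant"
    return "not relevant"


def raw_vector_storage_gb(num_vectors, dims, bytes_per_value=4):
    """Raw float32 storage in decimal GB, before any index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9
```

`raw_vector_storage_gb(1_000_000, 1536)` gives 6.144, the "~6 GB" in the table; adding the ~20% HNSW overhead brings it to roughly 7.4 GB.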

RAG — Architecture at a Glance

INDEXING (offline):
Documents → Parse → Chunk → Embed → Vector Store + Metadata DB

QUERY (real-time, target: <3s total):
Query
  → Preprocess (intent + synonym expansion)       ~50ms
  → Hybrid retrieval (dense + BM25, top-20)      ~100ms
  → Cross-encoder re-rank (top-8)                ~150ms
  → Context build (+ patient EHR context)         ~50ms
  → LLM generate (Claude Opus, streamed)          ~1–2s
  → Citation validate                             ~100ms
  → Response with citations
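
The Chunk step in the indexing pipeline can be sketched as a sliding window over a token list, using the 512-token size and 64-token overlap quoted in the embeddings section as defaults. This is a minimal sketch: `chunk_tokens` is a hypothetical helper, and it assumes the text has already been tokenized with whatever tokenizer the embedding model uses.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens.

    Each chunk starts `chunk_size - overlap` tokens after the previous one,
    so the tail of one chunk is repeated at the head of the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once this chunk has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, a 1,000-token document yields three chunks starting at tokens 0, 448, and 896; the last chunk is shorter than 512 tokens. Splitting on sentence or section boundaries instead of fixed offsets is a common refinement that this sketch does not attempt.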

RAG Failure Modes — Quick Reference

Failure | Symptom | Fix
Wrong retrieval | Irrelevant sources cited | Better chunking; domain embedding model
Ignored context | Model uses own knowledge instead | Stricter system prompt: "ONLY use provided context"
Hallucinated citations | Source doesn't contain the claim | Post-generation citation validation
Stale knowledge | Outdated answers | Incremental index with change detection
Low faithfulness | RAGAS faithfulness < 0.85 | Reduce top-k; improve re-ranking
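
Post-generation citation validation can be sketched as a check that every cited source was actually retrieved and that it shares enough vocabulary with the claim. The lexical-overlap test below is a deliberately naive stand-in chosen for illustration; the function names and the 0.5 overlap cutoff are assumptions, and a production system might use an entailment model or an LLM judge instead.

```python
import re


def _content_words(text):
    """Lowercased alphanumeric words longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def validate_citations(claims, retrieved, min_overlap=0.5):
    """Return the (claim, source_id, reason) triples that fail validation.

    claims:    list of (claim_text, source_id) pairs parsed from the answer
    retrieved: dict mapping source_id -> retrieved chunk text
    """
    failures = []
    for claim_text, source_id in claims:
        source_text = retrieved.get(source_id)
        if source_text is None:
            # The model cited something that was never retrieved.
            failures.append((claim_text, source_id, "unknown source"))
            continue
        claim_words = _content_words(claim_text)
        if not claim_words:
            continue  # nothing checkable in this claim
        shared = claim_words & _content_words(source_text)
        if len(shared) / len(claim_words) < min_overlap:
            failures.append((claim_text, source_id, "low overlap"))
    return failures
```

An empty result means every citation passed this coarse check; any failure can be routed to regeneration or human review.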

Prompt Engineering — Essential Patterns

System Prompt Anatomy

1. Role definition (who is the model + operational context)
2. Knowledge scope (only use provided context / retrieved content)
3. Hard constraints (never do X; always include Y)
4. Output format (JSON schema / section headers / exact structure)
5. Edge case handling (what to do when uncertain)
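
The five-part anatomy above can be assembled programmatically so that every prompt keeps the same order. This is a sketch under assumptions: the section labels and the `build_system_prompt` helper are illustrative, not a required format.

```python
def build_system_prompt(role, knowledge_scope, hard_constraints,
                        output_format, edge_cases):
    """Assemble a system prompt in the five-part order listed above.

    `hard_constraints` is a list of strings; the other arguments are
    plain strings. Section labels are illustrative.
    """
    sections = [
        ("ROLE", role),
        ("KNOWLEDGE SCOPE", knowledge_scope),
        ("HARD CONSTRAINTS", "\n".join(f"- {c}" for c in hard_constraints)),
        ("OUTPUT FORMAT", output_format),
        ("EDGE CASES", edge_cases),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


example_prompt = build_system_prompt(
    role="You are a clinical guideline assistant for hospital staff.",
    knowledge_scope="ONLY use the provided context.",
    hard_constraints=["Never give a diagnosis.", "Always cite the source."],
    output_format="Answer in two sections: Summary, Citations.",
    edge_cases="If the context does not answer the question, say so.",
)
```

Keeping the builder in one place also makes the prompt easy to cache, since the assembled text only changes when one of the five inputs does.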

Temperature Selection

Task Type | Temperature | Reason
Classification, extraction | 0.0 | Deterministic; no variation needed
Clinical documentation | 0.2–0.3 | Low variation for medical accuracy
Summarization | 0.3–0.5 | Some flexibility in phrasing
Creative drafting | 0.7–1.0 | Variation is desirable
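
To see why lower temperatures reduce variation, it helps to look at how temperature rescales the next-token distribution. The sketch below applies a temperature-scaled softmax to a toy list of logits; the logits and the function name are made up for illustration, and hosted APIs may still show small run-to-run differences even at temperature 0.

```python
import math


def token_probabilities(logits, temperature):
    """Softmax over logits divided by temperature.

    temperature == 0 is treated as greedy decoding: all probability
    mass goes to the highest-scoring token.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]

    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For logits `[2.0, 1.0, 0.0]`, the top token gets about 0.67 of the probability at temperature 1.0 and about 0.99 at temperature 0.2, so sampling at low temperature almost always picks the same token.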

Few-Shot: When to Use

Add 2–5 examples in the system prompt when:

  • Output format is complex and hard to describe in rules alone
  • Task requires judgment (what counts as "significant"?)
  • Zero-shot compliance rate is below ~85%

Fine-Tuning vs. RAG — Decision Table

Use This | When
Prompt Engineering | Format is achievable; knowledge is stable; few-shot compliance ≥ 85%
RAG | Knowledge is dynamic or fresh; source citation required; institution-specific facts; knowledge too large for fine-tuning
Fine-Tuning | Behavioral conformance needed (format, tone, style); latency critical (small fast model); knowledge is stable and static
Hybrid | Both behavioral AND knowledge problems; highest-stakes production clinical AI

Evaluation — RAGAS Metrics

Metric | Measures | Clinical Threshold
Faithfulness | Are answer claims in the retrieved context? | ≥ 0.88
Answer Relevancy | Does the answer address the question? | ≥ 0.80
Context Precision | Were retrieved docs relevant? | ≥ 0.75
Context Recall | Were all relevant docs retrieved? | ≥ 0.75

Alert threshold: raise a degradation alert if any metric drops more than 10% from its baseline.
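
A monitoring check for these thresholds might look like the sketch below. It assumes the scores have already been computed elsewhere and arrive as plain dictionaries, and it reads "10% from baseline" as a relative drop; both the metric key names and that interpretation are assumptions.

```python
# Clinical thresholds from the table above (key names are illustrative).
CLINICAL_THRESHOLDS = {
    "faithfulness": 0.88,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.75,
}
MAX_RELATIVE_DROP = 0.10  # assumed to mean a relative drop from baseline


def check_metrics(current, baseline):
    """Return alert messages for threshold breaches and baseline drops."""
    alerts = []
    for name, minimum in CLINICAL_THRESHOLDS.items():
        score = current[name]
        if score < minimum:
            alerts.append(
                f"{name} below threshold: {score:.2f} < {minimum:.2f}"
            )
        base = baseline[name]
        if base > 0 and (base - score) / base > MAX_RELATIVE_DROP:
            alerts.append(
                f"{name} dropped more than 10% from baseline {base:.2f}"
            )
    return alerts
```

A metric can trigger both kinds of alert at once, for example when faithfulness falls from a 0.95 baseline to 0.84.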


Context Window Management

Budget Template (200K Claude)

Component | Budget | Priority
System prompt | 2K | Cached; write once
Patient EHR context | 3–15K | Critical, always include
Retrieved knowledge (RAG) | 8–12K | Place highest-relevance chunks first and last
Conversation history | 5K (last 5 turns) | Summarize older turns
Output reservation | 2–4K | Based on expected response length
Total used | ~20–38K | ≥ 160K headroom
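
Summing the component budgets gives the total range and the remaining headroom, and the "first + last" placement rule can be expressed as a small reordering step. Both helpers below are illustrative sketches; in particular, alternating chunks between the front and the back is one assumed way to keep the strongest chunks at the edges of the context.

```python
CONTEXT_WINDOW = 200_000

# (low, high) token budgets per component, from the table above.
BUDGET = {
    "system_prompt": (2_000, 2_000),
    "patient_ehr_context": (3_000, 15_000),
    "retrieved_knowledge": (8_000, 12_000),
    "conversation_history": (5_000, 5_000),
    "output_reservation": (2_000, 4_000),
}


def total_range(budget):
    """Sum the low and high ends of every component budget."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def order_for_context(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context.

    Input is sorted most-relevant first. Chunks alternate between the
    front and the back, so the weakest ones end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

`total_range(BUDGET)` returns `(20000, 38000)`, which leaves at least 162,000 tokens of headroom in a 200K window. For chunks ranked 1 (best) to 5, `order_for_context([1, 2, 3, 4, 5])` returns `[1, 3, 5, 4, 2]`.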

Long Document Strategies

  • Fits in context (<150K tokens): Direct inclusion — include document then query
  • Doesn't fit, need specific facts: RAG — retrieve relevant chunks
  • Doesn't fit, need full understanding: Map-Reduce — process chunks in parallel, synthesize
  • Sequential document with dependencies: Hierarchical summarization — level-by-level compression
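
The four strategies above can be read as a small decision function, followed here by a generic map-reduce helper. This is a sketch under assumptions: the order in which the conditions are checked is one interpretation of the list, and `map_fn` and `reduce_fn` stand in for whatever per-chunk and synthesis calls a real pipeline would make.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_strategy(doc_tokens, needs_full_understanding=False,
                    has_sequential_dependencies=False,
                    direct_limit=150_000):
    """Pick a long-document strategy from the list above."""
    if doc_tokens < direct_limit:
        return "direct_inclusion"
    if has_sequential_dependencies:
        return "hierarchical_summarization"
    if needs_full_understanding:
        return "map_reduce"
    return "rag"


def map_reduce(chunks, map_fn, reduce_fn, max_workers=4):
    """Apply map_fn to every chunk in parallel, then combine the results.

    pool.map preserves input order, so reduce_fn sees the partial
    results in document order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_fn, chunks))
    return reduce_fn(partials)
```

In practice `map_fn` would summarize one chunk with an LLM call and `reduce_fn` would synthesize the partial summaries; threads suit this because the work is dominated by network waits.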

Multimodal — Quick Reference

Input Type | Best API | PHI Handling | Regulatory Status
Printed PDF lab report | Claude vision or AWS Textract + text LLM | Strip metadata; BAA required | Not SaMD
Handwritten referral | Claude vision (best for handwriting) | BAA required | Not SaMD
DICOM radiology | Use cleared radiology AI + Claude for knowledge support | Strip DICOM headers; no diagnostic use | SaMD requires 510(k)
Clinical photos (non-diagnostic) | Claude vision | BAA required | Not SaMD

Never use an LLM to diagnose from medical images without FDA 510(k) clearance.


Model Selection Guide

Use Case | Model Tier | Why
Clinical reasoning, complex documentation | Claude Opus 4.8 | Frontier reasoning required
RAG response generation, summarization | Claude Sonnet 4.6 | Sufficient quality, 50% cheaper
Classification, routing, extraction | Claude Haiku 4.5 | 20x cheaper, <500ms latency
Conversation history summarization | Claude Haiku 4.5 | Internal step, not user-facing
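
The selection guide can be encoded as a lookup so that each pipeline step requests a tier by use case instead of hard-coding a model. The use-case keys below are hypothetical, and the values are the tier names from the table, not API model identifiers, which this reference does not list.

```python
# Use-case keys are illustrative; values are tier names from the table.
MODEL_TIERS = {
    "clinical_reasoning": "Claude Opus 4.8",
    "complex_documentation": "Claude Opus 4.8",
    "rag_generation": "Claude Sonnet 4.6",
    "summarization": "Claude Sonnet 4.6",
    "classification": "Claude Haiku 4.5",
    "routing": "Claude Haiku 4.5",
    "extraction": "Claude Haiku 4.5",
    "history_summarization": "Claude Haiku 4.5",
}


def select_model(use_case):
    """Return the model tier configured for a use case."""
    try:
        return MODEL_TIERS[use_case]
    except KeyError:
        raise ValueError(
            f"No model tier configured for use case: {use_case!r}"
        ) from None
```

Failing loudly on an unknown use case avoids silently falling back to the most expensive tier.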

Common Interview Questions

  1. "Why do LLMs hallucinate and how do you mitigate it in a clinical system?" → RAG + citation grounding + validation + human review
  2. "When would you choose RAG over fine-tuning?" → Knowledge is dynamic, citation required, institution-specific → RAG
  3. "Design a RAG system for a hospital's clinical guidelines" → See full system design in interview/architecture/01-design-enterprise-rag.md [PLANNED]
  4. "What is RAGAS faithfulness and why does it matter for clinical AI?" → Measures whether claims are grounded in retrieved context; most safety-critical metric
  5. "How does the context window affect RAG architecture?" → Creates retrieval budget; "lost in the middle" → order matters

See Also