AI Foundations — Quick Reference

Last Updated: 2026-06-30
Full Chapters: docs/01-AI-Foundations/

LLMs — Core Facts

Property	Value	Notes
Generation method	Autoregressive — one token at a time	Latency scales with OUTPUT length
Token ≈ words	1 token ≈ 0.75 English words	Clinical text: 1 token ≈ 0.6–0.7 words
Cost driver	Input and output tokens are priced separately	Output tokens typically cost more per token than input tokens
Claude Opus context	200,000 tokens ≈ 150,000 words	~300 pages
Hallucination cause	Statistical next-token prediction, not knowledge lookup	Mitigate with RAG + validation
Temperature = 0	Near-deterministic (greedy: always picks the top token)	Use for classification, extraction
Temperature = 1	Model's natural distribution	Use for creative drafting
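
The token and cost rows above reduce to simple arithmetic. A minimal sketch, assuming the words-per-token ratios from the table; the function names are hypothetical, and prices are caller-supplied placeholders, not real rates.

```python
def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a word count (1 token ~ 0.75 English words)."""
    return round(word_count / words_per_token)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float,
                  output_price_per_mtok: float) -> float:
    """Total cost; both prices are per million tokens and supplied by the caller."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# 150,000 English words at 0.75 words per token
print(estimate_tokens(150_000))        # 200000
# The same word count of clinical text at 0.65 words per token needs more tokens
print(estimate_tokens(150_000, 0.65))  # 230769
```

Because clinical text tokenizes less efficiently, the same document consumes more of the context window and costs more than the general-English ratio suggests.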

Embeddings — Core Facts

Property	Value	Notes
What they are	Fixed-length dense vectors (768–3,072 dims)	Text → geometry
Similarity metric	Cosine similarity	Range: -1 to +1; use 0.60+ for retrieval
Clinical threshold	≥ 0.60 relevance; ≥ 0.90 near-duplicate	Calibrate per domain
Chunk size sweet spot	512 tokens, 64 token overlap	Balance precision vs. context
Domain model win	Clinical models (e.g., voyage-medical-2) vs. general: +10–20% recall	Evaluate domain-specific models for clinical use cases
Model lock-in	Ingestion and query model MUST match	Upgrade = full re-embed
1M chunks storage	~6 GB at 1,536 dims × 4 bytes	HNSW index adds ~20% overhead
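
The similarity, chunking, and storage rows above can be sketched in plain Python. This is an illustration under stated assumptions: all function names are hypothetical, the thresholds are the table's defaults and should be calibrated per domain, and the chunker works on an already-tokenized list rather than calling a real tokenizer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to +1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def classify_match(score, relevant=0.60, near_duplicate=0.90):
    """Apply the table's thresholds to a similarity score."""
    if score >= near_duplicate:
        return "near-duplicate"
    if score >= relevant:
        return "relevant"
    return "not relevant"


def chunk_tokens(tokens, size=512, overlap=64):
    """Sliding-window chunks; consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


def raw_vector_bytes(n_vectors, dims, bytes_per_value=4):
    """Storage for float32 vectors only, excluding any index overhead."""
    return n_vectors * dims * bytes_per_value


print(raw_vector_bytes(1_000_000, 1536) / 1e9)  # 6.144
```

A 1,000-token document with the default settings yields three chunks (512, 512, and 104 tokens), each starting 448 tokens after the previous one.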

RAG — Architecture at a Glance

INDEXING (offline):
Documents → Parse → Chunk → Embed → Vector Store + Metadata DB

QUERY (real-time, target: <3s total):
Query
  → Preprocess (intent + synonym expansion)       ~50ms
  → Hybrid retrieval (dense + BM25, top-20)      ~100ms
  → Cross-encoder re-rank (top-8)                ~150ms
  → Context build (+ patient EHR context)         ~50ms
  → LLM generate (Claude Opus, streamed)          ~1–2s
  → Citation validate                             ~100ms
  → Response with citations
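
The per-stage latencies above can be summed to check them against the 3-second target. A small sketch; the stage keys are hypothetical labels for the steps in the diagram, and each value is a (best, worst) pair in milliseconds taken from the figures shown.

```python
STAGE_LATENCY_MS = {
    "preprocess": (50, 50),
    "hybrid_retrieval": (100, 100),
    "rerank": (150, 150),
    "context_build": (50, 50),
    "llm_generate": (1000, 2000),
    "citation_validate": (100, 100),
}
TARGET_MS = 3000


def total_latency_ms(stages):
    """Return (best_case, worst_case) end-to-end latency in milliseconds."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_latency_ms(STAGE_LATENCY_MS)
print(best, worst, TARGET_MS - worst)  # 1450 2450 550
```

Generation dominates the budget, so the roughly 550 ms of worst-case slack is mostly there to absorb variance in output length.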

RAG Failure Modes — Quick Reference

Failure	Symptom	Fix
Wrong retrieval	Irrelevant sources cited	Better chunking; domain embedding model
Ignored context	Model uses own knowledge instead	Stricter system prompt: "ONLY use provided context"
Hallucinated citations	Source doesn't contain the claim	Post-generation citation validation
Stale knowledge	Outdated answers	Incremental index with change detection
Low faithfulness	RAGAS faithfulness below the 0.88 clinical threshold	Reduce top-k; improve re-ranking
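
Post-generation citation validation, the fix listed for hallucinated citations, could look roughly like the sketch below. It is a crude lexical heuristic, not a method described in this reference: `validate_citations`, the input shapes, and the 0.6 overlap cutoff are all assumptions, and a production system would more likely use an entailment model or an LLM judge.

```python
import re


def _words(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def validate_citations(claims, sources, min_overlap=0.6):
    """Flag citations whose source is missing or shares too few words with the claim.

    claims:  list of (claim_text, source_id) pairs
    sources: dict mapping source_id to the retrieved source text
    Returns a list of (claim_text, source_id, is_supported) tuples.
    """
    results = []
    for claim_text, source_id in claims:
        source_text = sources.get(source_id)
        if source_text is None:
            # The cited source was never retrieved
            results.append((claim_text, source_id, False))
            continue
        claim_words = _words(claim_text)
        if claim_words:
            overlap = len(claim_words & _words(source_text)) / len(claim_words)
        else:
            overlap = 0.0
        results.append((claim_text, source_id, overlap >= min_overlap))
    return results
```

Word overlap catches citations to sources that were never retrieved or that are clearly off-topic, but it cannot detect a claim that reuses the source's vocabulary while reversing its meaning.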

Prompt Engineering — Essential Patterns

System Prompt Anatomy

1. Role definition (who is the model + operational context)
2. Knowledge scope (only use provided context / retrieved content)
3. Hard constraints (never do X; always include Y)
4. Output format (JSON schema / section headers / exact structure)
5. Edge case handling (what to do when uncertain)
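
One way to keep the five parts in a fixed order is to assemble the prompt from named fields. A minimal sketch; `build_system_prompt` and the section labels are hypothetical, and the example values are illustrative only.

```python
def build_system_prompt(role, knowledge_scope, hard_constraints,
                        output_format, edge_cases):
    """Assemble a system prompt in the five-part order listed above."""
    sections = [
        ("ROLE", role),
        ("KNOWLEDGE SCOPE", knowledge_scope),
        ("HARD CONSTRAINTS", "\n".join(f"- {c}" for c in hard_constraints)),
        ("OUTPUT FORMAT", output_format),
        ("EDGE CASES", edge_cases),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


prompt = build_system_prompt(
    role="You are a clinical guideline assistant for hospital staff.",
    knowledge_scope="ONLY use the provided context.",
    hard_constraints=["Never give dosing without a citation.",
                      "Always include source IDs."],
    output_format="JSON with keys: answer, citations.",
    edge_cases="If the context is insufficient, say so and stop.",
)
print(prompt)
```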

Temperature Selection

Task Type	Temperature	Reason
Classification, extraction	0.0	Deterministic — no variation needed
Clinical documentation	0.2–0.3	Low variation for medical accuracy
Summarization	0.3–0.5	Some flexibility in phrasing
Creative drafting	0.7–1.0	Variation is desirable
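
The effect of temperature can be seen by applying it to a toy set of logits. A sketch of standard temperature-scaled softmax; treating temperature 0 as a pure argmax is a common convention assumed here, and real APIs may still show small run-to-run variation at that setting.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy selection
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.0, 0.2, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, which is why the low end of the table suits extraction and clinical documentation.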

Few-Shot: When to Use

Add 2–5 examples in the system prompt when:

Output format is complex and hard to describe in rules alone
Task requires judgment (what counts as "significant"?)
Zero-shot compliance rate is below ~85%

Fine-Tuning vs. RAG — Decision Table

Use This	When
Prompt Engineering	Format is achievable; knowledge is stable; few-shot compliance ≥ 85%
RAG	Knowledge is dynamic or fresh; source citation required; institution-specific facts; knowledge too large for fine-tuning
Fine-Tuning	Behavioral conformance needed (format, tone, style); latency critical (small fast model); knowledge is stable and static
Hybrid	Both behavioral AND knowledge problems; highest-stakes production clinical AI

Evaluation — RAGAS Metrics

Metric	Measures	Clinical Threshold
Faithfulness	Are answer claims in the retrieved context?	≥ 0.88
Answer Relevancy	Does the answer address the question?	≥ 0.80
Context Precision	Were retrieved docs relevant?	≥ 0.75
Context Recall	Were all relevant docs retrieved?	≥ 0.75

Alert threshold: raise a degradation alert if any metric drops more than 10% from its baseline.
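
The thresholds and the alert rule can be combined into one check. A minimal sketch, assuming the 10% drop is relative to the baseline value (an interpretation, not stated above); `check_metrics` and the metric keys are hypothetical names, not part of the RAGAS library.

```python
THRESHOLDS = {
    "faithfulness": 0.88,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.75,
}
MAX_RELATIVE_DROP = 0.10  # assumed to mean a drop relative to baseline


def check_metrics(current, baseline):
    """Return alert messages for threshold breaches and baseline regressions."""
    alerts = []
    for name, minimum in THRESHOLDS.items():
        value = current[name]
        if value < minimum:
            alerts.append(f"{name}: {value:.2f} is below threshold {minimum:.2f}")
        base = baseline[name]
        if base > 0 and (base - value) / base > MAX_RELATIVE_DROP:
            alerts.append(f"{name}: dropped more than 10% from baseline {base:.2f}")
    return alerts
```

The two checks are independent: a metric can stay above its clinical threshold and still trigger an alert because it regressed sharply from its baseline.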

Context Window Management

Budget Template (200K Claude)

Component	Budget	Priority
System prompt	2K	Cached — write once
Patient EHR context	3–15K	Critical, always include
Retrieved knowledge (RAG)	8–12K	Place highest-relevance chunks first and last
Conversation history	5K (last 5 turns)	Summarize older turns
Output reservation	2–4K	Based on expected response length
Total used	~20–38K	≥ 160K headroom
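
Summing the component ranges is a quick way to sanity-check a budget like this one. A small sketch; the dictionary keys are hypothetical labels for the table rows, with values as (low, high) in thousands of tokens.

```python
BUDGET_K = {
    "system_prompt": (2, 2),
    "patient_ehr_context": (3, 15),
    "retrieved_knowledge": (8, 12),
    "conversation_history": (5, 5),
    "output_reservation": (2, 4),
}
CONTEXT_WINDOW_K = 200


def budget_range_k(budget):
    """Return (low, high) total usage in thousands of tokens."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = budget_range_k(BUDGET_K)
print(low, high)                                        # 20 38
print(CONTEXT_WINDOW_K - high, CONTEXT_WINDOW_K - low)  # 162 180
```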

Long Document Strategies

Fits in context (<150K tokens): Direct inclusion — include document then query
Doesn't fit, need specific facts: RAG — retrieve relevant chunks
Doesn't fit, need full understanding: Map-Reduce — process chunks in parallel, synthesize
Sequential document with dependencies: Hierarchical summarization — level-by-level compression
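
The four cases above can be written as a small chooser, with map-reduce shown as a generic helper. A sketch under assumptions: the function names and the order in which the conditions are checked are hypothetical, and `map_fn` / `reduce_fn` stand in for whatever LLM calls summarize a chunk and synthesize the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_strategy(doc_tokens, needs_full_understanding,
                    sequential_dependencies=False, fit_limit=150_000):
    """Pick a long-document strategy; the precedence of checks is an assumption."""
    if doc_tokens < fit_limit:
        return "direct_inclusion"
    if sequential_dependencies:
        return "hierarchical_summarization"
    if needs_full_understanding:
        return "map_reduce"
    return "rag"


def map_reduce(chunks, map_fn, reduce_fn, max_workers=4):
    """Apply map_fn to each chunk in parallel, then combine with reduce_fn."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_fn, chunks))  # preserves chunk order
    return reduce_fn(partials)


# Stand-in functions: count words per chunk, then add up the counts
print(map_reduce(["a b", "c d e"], lambda c: len(c.split()), sum))  # 5
```

Threads suit this pattern because each map step is normally an I/O-bound API call rather than CPU-bound work.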

Multimodal — Quick Reference

Input Type	Best API	PHI Handling	Regulatory Status
Printed PDF lab report	Claude vision or AWS Textract + text LLM	Strip metadata; BAA required	Not SaMD
Handwritten referral	Claude vision (best for handwriting)	BAA required	Not SaMD
DICOM radiology	Use cleared radiology AI + Claude for knowledge support	Strip DICOM headers; no diagnostic use	SaMD requires 510(k)
Clinical photos (non-diagnostic)	Claude vision	BAA required	Not SaMD

Never use an LLM to diagnose from medical images unless the system has the appropriate FDA clearance (e.g., 510(k)).

Model Selection Guide

Use Case	Model Tier	Why
Clinical reasoning, complex documentation	Claude Opus 4.8	Frontier reasoning required
RAG response generation, summarization	Claude Sonnet 4.6	Sufficient quality, 50% cheaper
Classification, routing, extraction	Claude Haiku 4.5	20x cheaper, <500ms latency
Conversation history summarization	Claude Haiku 4.5	Internal step, not user-facing

Common Interview Questions

"Why do LLMs hallucinate and how do you mitigate it in a clinical system?" → RAG + citation grounding + validation + human review
"When would you choose RAG over fine-tuning?" → Knowledge is dynamic, citation required, institution-specific → RAG
"Design a RAG system for a hospital's clinical guidelines" → See full system design in interview/architecture/01-design-enterprise-rag.md [PLANNED]
"What is RAGAS faithfulness and why does it matter for clinical AI?" → Measures whether claims are grounded in retrieved context; most safety-critical metric
"How does the context window affect RAG architecture?" → Creates retrieval budget; "lost in the middle" → order matters
