AI Foundations — Quick Reference#
Last Updated: 2026-06-30
Full Chapters: docs/01-AI-Foundations/
LLMs — Core Facts#
| Property | Value | Notes |
|---|---|---|
| Generation method | Autoregressive: one token at a time | Latency scales with output length |
| Token ≈ words | 1 token ≈ 0.75 English words | Clinical text: 1 token ≈ 0.6–0.7 words |
| Cost driver | Input and output tokens are priced separately | Input tokens cost less than output tokens |
| Claude Opus context | 200,000 tokens ≈ 150,000 words | ~300 pages |
| Hallucination cause | Statistical next-token prediction, not knowledge lookup | Mitigate with RAG + validation |
| Temperature = 0 | Greedy decoding (always the top token); effectively deterministic | Use for classification, extraction |
| Temperature = 1 | Model's natural distribution | Use for creative drafting |
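The token ratios in the table can be turned into a rough budgeting helper. The following is a minimal sketch: the function names are hypothetical, and the per-million-token prices are caller-supplied placeholders, not published rates.

```python
import math

# Words-per-token ratios taken from the table above.
WORDS_PER_TOKEN_GENERAL = 0.75
WORDS_PER_TOKEN_CLINICAL = 0.65  # assumed midpoint of the 0.6-0.7 range


def estimate_tokens(word_count: int,
                    words_per_token: float = WORDS_PER_TOKEN_GENERAL) -> int:
    """Rough token count for a given number of words, rounded up."""
    return math.ceil(word_count / words_per_token)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float,
                  output_price_per_mtok: float) -> float:
    """Cost of one call; input and output tokens are priced separately.

    Prices are per million tokens and must be supplied by the caller.
    """
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
```

For real billing or truncation decisions, a tokenizer-based count is more reliable than a word ratio; the ratio is only useful for quick sizing.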
Embeddings — Core Facts#
| Property | Value | Notes |
|---|---|---|
| What they are | Fixed-length dense vectors (768–3,072 dims) | Text → geometry |
| Similarity metric | Cosine similarity | Range: -1 to +1; use 0.60+ for retrieval |
| Clinical threshold | ≥ 0.60 relevance; ≥ 0.90 near-duplicate | Calibrate per domain |
| Chunk size sweet spot | 512 tokens, 64-token overlap | Balance precision vs. context |
| Domain model win | Clinical models (e.g., voyage-medical-2) vs. general: +10–20% recall | Evaluate domain-specific models for clinical use cases |
| Model lock-in | Ingestion and query model MUST match | Upgrade = full re-embed |
| 1M chunks storage | ~6 GB at 1,536 dims × 4 bytes | HNSW index adds ~20% overhead |
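Three of the rows above reduce to short calculations: cosine similarity, fixed-size chunking with overlap, and raw vector storage. The sketch below uses only the standard library; the function names are hypothetical, and it assumes text has already been tokenized into a sequence of token IDs.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; result lies in [-1, +1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def chunk_tokens(tokens: Sequence[int], size: int = 512,
                 overlap: int = 64) -> List[List[int]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


def raw_vector_storage_gb(num_vectors: int, dims: int = 1536,
                          bytes_per_value: int = 4) -> float:
    """Raw float storage in GB (10^9 bytes), excluding any index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9
```

With the defaults, `raw_vector_storage_gb(1_000_000)` gives about 6.1 GB, which matches the "~6 GB" row before the HNSW overhead is added.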
RAG — Architecture at a Glance#
INDEXING (offline):
Documents → Parse → Chunk → Embed → Vector Store + Metadata DB
QUERY (real-time, target: <3s total):
Query
→ Preprocess (intent + synonym expansion) ~50ms
→ Hybrid retrieval (dense + BM25, top-20) ~100ms
→ Cross-encoder re-rank (top-8) ~150ms
→ Context build (+ patient EHR context) ~50ms
→ LLM generate (Claude Opus, streamed) ~1–2s
→ Citation validate ~100ms
→ Response with citations
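The hybrid retrieval step has to merge the dense and BM25 result lists into one top-20 list. The pipeline above does not say how that merge is done; reciprocal rank fusion (RRF) is one common choice and is shown here purely as an assumption. The function name and the constant `k = 60` (a conventional default) are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60,
                           top_n: int = 20) -> List[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. This is an assumed fusion method, not
    necessarily the one used in the pipeline described above.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for stable output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Usage: the fused list would then be passed to the cross-encoder re-ranker.
dense_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

RRF only needs ranks, so it avoids having to normalize cosine scores against BM25 scores, which live on different scales.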
RAG Failure Modes — Quick Reference#
| Failure | Symptom | Fix |
|---|---|---|
| Wrong retrieval | Irrelevant sources cited | Better chunking; domain embedding model |
| Ignored context | Model uses own knowledge instead | Stricter system prompt: "ONLY use provided context" |
| Hallucinated citations | Source doesn't contain the claim | Post-generation citation validation |
| Stale knowledge | Outdated answers | Incremental index with change detection |
| Low faithfulness | RAGAS faithfulness < 0.85 | Reduce top-k; improve re-ranking |
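Post-generation citation validation can be approximated with a cheap lexical check before anything heavier is applied. The sketch below flags a claim when too few of its content words appear in the cited source. It is a deliberately crude heuristic: the function names and the 0.6 overlap cutoff are assumptions, and a production system would more likely use an entailment model or an LLM judge.

```python
import re
from typing import Dict, List, Set, Tuple


def _content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric words longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def validate_citations(claims: List[Tuple[str, str]],
                       sources: Dict[str, str],
                       min_overlap: float = 0.6) -> List[Tuple[str, str]]:
    """Return the (claim, source_id) pairs that fail the overlap check.

    `claims` pairs each generated claim with the ID of the source it cites.
    A missing source ID is treated as empty text, so the claim fails.
    """
    failures: List[Tuple[str, str]] = []
    for claim, source_id in claims:
        claim_words = _content_words(claim)
        if not claim_words:
            continue  # nothing checkable in this claim
        source_words = _content_words(sources.get(source_id, ""))
        overlap = len(claim_words & source_words) / len(claim_words)
        if overlap < min_overlap:
            failures.append((claim, source_id))
    return failures
```

A lexical check like this catches citations to the wrong document but not paraphrased contradictions, so it works best as a first filter.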
Prompt Engineering — Essential Patterns#
System Prompt Anatomy#
1. Role definition (who is the model + operational context)
2. Knowledge scope (only use provided context / retrieved content)
3. Hard constraints (never do X; always include Y)
4. Output format (JSON schema / section headers / exact structure)
5. Edge case handling (what to do when uncertain)
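The five parts above can be assembled programmatically so that every prompt keeps the same order. This is a minimal sketch; the function name, section titles, and example wording are illustrative and not a prescribed template.

```python
from typing import List


def build_system_prompt(role: str, knowledge_scope: str,
                        hard_constraints: List[str], output_format: str,
                        edge_cases: str) -> str:
    """Join the five anatomy sections in a fixed order."""
    constraint_lines = "\n".join(f"- {c}" for c in hard_constraints)
    sections = [
        ("ROLE", role),
        ("KNOWLEDGE SCOPE", knowledge_scope),
        ("HARD CONSTRAINTS", constraint_lines),
        ("OUTPUT FORMAT", output_format),
        ("EDGE CASES", edge_cases),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


# Hypothetical example content for a guideline-lookup assistant.
prompt = build_system_prompt(
    role="You are a clinical guideline assistant for hospital staff.",
    knowledge_scope="ONLY use the provided context passages.",
    hard_constraints=["Never give dosing advice without a citation.",
                      "Always include source IDs."],
    output_format="JSON with keys: answer, citations.",
    edge_cases="If the context is insufficient, say so and stop.",
)
```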
Temperature Selection#
| Task Type | Temperature | Reason |
|---|---|---|
| Classification, extraction | 0.0 | Deterministic; no variation needed |
| Clinical documentation | 0.2–0.3 | Low variation for medical accuracy |
| Summarization | 0.3–0.5 | Some flexibility in phrasing |
| Creative drafting | 0.7–1.0 | Variation is desirable |
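The table maps naturally onto a lookup that keeps temperature choices in one place. The sketch below is illustrative: the task-type keys and function name are hypothetical, and defaulting to the low end of each range is an assumption.

```python
from typing import Dict, Tuple

# (low, high) temperature ranges copied from the table above.
TEMPERATURE_RANGES: Dict[str, Tuple[float, float]] = {
    "classification": (0.0, 0.0),
    "extraction": (0.0, 0.0),
    "clinical_documentation": (0.2, 0.3),
    "summarization": (0.3, 0.5),
    "creative_drafting": (0.7, 1.0),
}


def pick_temperature(task_type: str) -> float:
    """Return the low end of the recommended range for a task type."""
    try:
        low, _high = TEMPERATURE_RANGES[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
    return low
```

Raising on unknown task types, instead of falling back to a default, keeps a new task from silently inheriting a temperature nobody chose.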
Few-Shot: When to Use#
Add 2–5 examples in the system prompt when:
- Output format is complex and hard to describe in rules alone
- Task requires judgment (what counts as "significant"?)
- Zero-shot compliance rate is below ~85%
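The compliance criterion needs a measurement. One simple way to get it, assuming the expected output is a JSON object with known keys, is to run the zero-shot prompt over a sample and count how many outputs parse correctly. The function names below are hypothetical, and other output formats would need their own checker.

```python
import json
from typing import List, Sequence


def is_compliant(output: str, required_keys: Sequence[str]) -> bool:
    """True if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def compliance_rate(outputs: List[str], required_keys: Sequence[str]) -> float:
    """Fraction of sampled outputs that pass the format check."""
    if not outputs:
        raise ValueError("need at least one output to measure compliance")
    passed = sum(is_compliant(o, required_keys) for o in outputs)
    return passed / len(outputs)


def needs_few_shot(outputs: List[str], required_keys: Sequence[str],
                   threshold: float = 0.85) -> bool:
    """Apply the ~85% rule of thumb from the list above."""
    return compliance_rate(outputs, required_keys) < threshold
```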
Fine-Tuning vs. RAG — Decision Table#
| Use This | When |
|---|---|
| Prompt Engineering | Format is achievable; knowledge is stable; few-shot compliance ≥ 85% |
| RAG | Knowledge is dynamic or fresh; source citation required; institution-specific facts; knowledge too large for fine-tuning |
| Fine-Tuning | Behavioral conformance needed (format, tone, style); latency critical (small fast model); knowledge is stable and static |
| Hybrid | Both behavioral AND knowledge problems; highest-stakes production clinical AI |
Evaluation — RAGAS Metrics#
| Metric | Measures | Clinical Threshold |
|---|---|---|
| Faithfulness | Are answer claims in the retrieved context? | ≥ 0.88 |
| Answer Relevancy | Does the answer address the question? | ≥ 0.80 |
| Context Precision | Were retrieved docs relevant? | ≥ 0.75 |
| Context Recall | Were all relevant docs retrieved? | ≥ 0.75 |
Alert threshold: raise a degradation alert if any metric drops more than 10% from its baseline.
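Both checks, the absolute clinical thresholds and the drop from baseline, can run in one monitoring function. The sketch below takes metric scores as plain dictionaries and does not call the RAGAS library itself. It assumes the 10% drop is relative to the baseline value; the function name and metric keys are hypothetical.

```python
from typing import Dict, List

# Minimum acceptable scores, copied from the table above.
CLINICAL_THRESHOLDS: Dict[str, float] = {
    "faithfulness": 0.88,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.75,
}


def check_metrics(current: Dict[str, float], baseline: Dict[str, float],
                  max_relative_drop: float = 0.10) -> List[str]:
    """Return alert messages for threshold breaches and baseline drops."""
    alerts: List[str] = []
    for name, floor in CLINICAL_THRESHOLDS.items():
        score = current[name]
        if score < floor:
            alerts.append(
                f"{name}: {score:.2f} is below clinical threshold {floor:.2f}")
        base = baseline[name]
        # Relative drop (assumption): 0.90 -> 0.80 is a drop of about 11%.
        if base > 0 and (base - score) / base > max_relative_drop:
            alerts.append(
                f"{name}: dropped more than {max_relative_drop:.0%} "
                f"from baseline {base:.2f}")
    return alerts
```

If the 10% is meant as absolute points instead, the comparison becomes `base - score > 0.10`.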
Context Window Management#
Budget Template (200K Claude)#
| Component | Budget | Priority / Notes |
|---|---|---|
| System prompt | 2K | Cached; write once |
| Patient EHR context | 3–15K | Critical, always include |
| Retrieved knowledge (RAG) | 8–12K | Place the highest-relevance chunks first and last |
| Conversation history | 5K (last 5 turns) | Summarize older turns |
| Output reservation | 2–4K | Based on expected response length |
| Total used | ~20–38K | At least ~160K headroom |
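Summing the component rows gives the range of tokens actually consumed and the headroom left in the window. The following sketch does that arithmetic and checks a concrete request against the window; the dictionary keys and function names are hypothetical.

```python
from typing import Dict, Tuple

CONTEXT_WINDOW = 200_000

# (min, max) token budgets per component, copied from the rows above.
BUDGET: Dict[str, Tuple[int, int]] = {
    "system_prompt": (2_000, 2_000),
    "patient_ehr_context": (3_000, 15_000),
    "retrieved_knowledge": (8_000, 12_000),
    "conversation_history": (5_000, 5_000),
    "output_reservation": (2_000, 4_000),
}


def budget_totals(budget: Dict[str, Tuple[int, int]],
                  window: int = CONTEXT_WINDOW) -> Dict[str, int]:
    """Sum the low and high ends of every component budget."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return {"min_used": low, "max_used": high,
            "min_headroom": window - high, "max_headroom": window - low}


def fits_window(actual_tokens: Dict[str, int],
                window: int = CONTEXT_WINDOW) -> bool:
    """True if the measured token counts of a request fit in the window."""
    return sum(actual_tokens.values()) <= window
```

With the budgets above, the components sum to between 20K and 38K tokens, leaving at least 162K of a 200K window unused.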
Long Document Strategies#
Fits in context (<150K tokens): Direct inclusion — include document then query
Doesn't fit, need specific facts: RAG — retrieve relevant chunks
Doesn't fit, need full understanding: Map-Reduce — process chunks in parallel, synthesize
Sequential document with dependencies: Hierarchical summarization — level-by-level compression
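The map-reduce strategy can be sketched with a thread pool: summarize each chunk in parallel, then summarize the combined partial summaries. In this sketch `summarize` is a hypothetical caller-supplied function that wraps whatever LLM call is in use; nothing here depends on a specific API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_summarize(chunks: List[str],
                         summarize: Callable[[str], str],
                         max_workers: int = 4) -> str:
    """Map: summarize each chunk in parallel. Reduce: summarize the results.

    `summarize` is an assumed callable (text in, summary out), typically a
    thin wrapper around an LLM request.
    """
    if not chunks:
        return ""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        partials = list(pool.map(summarize, chunks))
    return summarize("\n\n".join(partials))
```

Threads suit this workload because the map step is dominated by network-bound model calls. If the combined partial summaries are still too long for one call, the reduce step can be applied recursively, which is the hierarchical summarization case from the list above.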
Multimodal — Quick Reference#
| Input Type | Best API | PHI Handling | Regulatory Status |
|---|---|---|---|
| Printed PDF lab report | Claude vision or AWS Textract + text LLM | Strip metadata; BAA required | Not SaMD |
| Handwritten referral | Claude vision (best for handwriting) | BAA required | Not SaMD |
| DICOM radiology | Use cleared radiology AI + Claude for knowledge support | Strip DICOM headers; no diagnostic use | SaMD requires 510(k) |
| Clinical photos (non-diagnostic) | Claude vision | BAA required | Not SaMD |
Never use an LLM to diagnose from medical images without FDA 510(k) clearance.
Model Selection Guide#
| Use Case | Model Tier | Why |
|---|---|---|
| Clinical reasoning, complex documentation | Claude Opus 4.8 | Frontier reasoning required |
| RAG response generation, summarization | Claude Sonnet 4.6 | Sufficient quality, 50% cheaper |
| Classification, routing, extraction | Claude Haiku 4.5 | 20x cheaper, <500ms latency |
| Conversation history summarization | Claude Haiku 4.5 | Internal step, not user-facing |
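The guide above can be encoded as a small routing table so that model choice is made per task, not hard-coded per call site. The task keys and function name are hypothetical, and the strings are the tier names from the table, not verified API model identifiers.

```python
from typing import Dict

# Tier names copied from the table; map these to real API model IDs in config.
MODEL_BY_TASK: Dict[str, str] = {
    "clinical_reasoning": "Claude Opus 4.8",
    "complex_documentation": "Claude Opus 4.8",
    "rag_generation": "Claude Sonnet 4.6",
    "summarization": "Claude Sonnet 4.6",
    "classification": "Claude Haiku 4.5",
    "routing": "Claude Haiku 4.5",
    "extraction": "Claude Haiku 4.5",
    "history_summarization": "Claude Haiku 4.5",
}


def select_model(task_type: str) -> str:
    """Return the model tier for a task, failing loudly on unknown tasks."""
    try:
        return MODEL_BY_TASK[task_type]
    except KeyError:
        raise ValueError(f"no model configured for task: {task_type!r}") from None
```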
Common Interview Questions#
"Why do LLMs hallucinate and how do you mitigate it in a clinical system?" → RAG + citation grounding + validation + human review
"When would you choose RAG over fine-tuning?" → Knowledge is dynamic, citation required, institution-specific → RAG
"Design a RAG system for a hospital's clinical guidelines" → See full system design in interview/architecture/01-design-enterprise-rag.md [PLANNED]
"What is RAGAS faithfulness and why does it matter for clinical AI?" → Measures whether claims are grounded in retrieved context; most safety-critical metric
"How does the context window affect RAG architecture?" → Creates retrieval budget; "lost in the middle" → order matters