LLM Fundamentals

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Foundational


Trade-offs and Considerations

Model Selection: Capability vs. Cost

| Model Tier | Examples | Token Cost | Latency | Best For |
| --- | --- | --- | --- | --- |
| Frontier (large) | Claude Opus 4.8, GPT-4o, Gemini 2.5 Pro | $$$$ | 2–5s | Complex reasoning, clinical documentation, multi-step analysis |
| Mid-tier | Claude Sonnet 4.6, GPT-4o-mini | $$ | 0.5–2s | Retrieval-augmented generation (RAG), summarization, structured extraction |
| Small/Fast | Claude Haiku 4.5, GPT-4o-mini | $ | 0.1–0.5s | Classification, routing, simple Q&A, high-volume tasks |

Rule of thumb: Use the smallest model that achieves acceptable quality on your specific task. Always measure quality empirically — don't assume larger is better.

Failure Modes

| Failure Mode | Description | Enterprise Mitigation |
| --- | --- | --- |
| Hallucination | Model generates plausible-sounding but incorrect information | RAG + citation grounding + output validation |
| Sycophancy | Model agrees with incorrect statements from the user | Calibrated prompting; don't provide the "answer" in the prompt |
| Context degradation | Quality drops in very long contexts ("lost in the middle") | Keep critical information at prompt start and end |
| Prompt injection | Malicious content overrides system instructions | Input sanitization; see docs/06-Security/02-prompt-injection-defense.md |
| Inference latency spikes | Provider-side latency variability | Timeouts + retry with exponential backoff + circuit breaker |
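
The latency mitigation (timeouts, retry with exponential backoff, circuit breaker) can be sketched roughly as follows. This is a minimal illustration, not a production client: `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry` are hypothetical names, the thresholds are arbitrary, and `fn` is assumed to enforce its own timeout and raise `TimeoutError` when it expires.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retry(fn, breaker, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call fn() with exponential backoff plus jitter, guarded by the breaker.

    Assumption: fn applies its own timeout and raises TimeoutError.
    """
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("provider circuit is open")
        try:
            result = fn()
        except TimeoutError:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, plus jitter.
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
        else:
            breaker.record_success()
            return result
```

The jitter spreads retries from many clients so they do not hit a recovering provider at the same instant; the breaker stops retries entirely once failures are clearly systemic rather than transient.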

Comparison Table

| Dimension | Claude (Anthropic) | GPT-4o (OpenAI) | Gemini 2.5 Pro (Google) |
| --- | --- | --- | --- |
| Context window | 200K tokens | 128K tokens | 1M tokens |
| Healthcare BAA | Yes (via AWS Bedrock) | Yes (Azure OpenAI) | Yes (Vertex AI) |
| Tool use / function calling | Yes | Yes | Yes |
| Extended thinking | Yes (Claude 3.7+) | Yes (o1/o3 series) | Yes (thinking mode) |
| Multimodal | Yes (vision + docs) | Yes (vision) | Yes (vision + audio) |
| Best for HMS | Clinical reasoning, documentation | Code generation, structured output | Long-document processing |

Interview Questions

Q1: Explain why LLMs hallucinate and what architectural patterns mitigate this in a clinical system

Category: Technical Depth | Difficulty: Senior | Role: AI Architect

Answer Framework:

Hallucination is not a bug — it is an inherent consequence of how LLMs work. During pre-training, the model learns to predict the next token based on statistical patterns in training data. When the model encounters a question whose answer was rare or absent in training data, it generates a plausible-sounding continuation rather than a factual one, because "plausible continuation" is exactly what the training objective optimized for.

There are three categories of hallucination: (1) factual fabrication — making up drug interactions or clinical facts that don't exist; (2) confident speculation — stating uncertain information as fact; (3) source confabulation — citing papers or guidelines that don't exist. In clinical settings, all three are dangerous.

Architectural mitigations: First, Retrieval-Augmented Generation (RAG) grounds the model's response in retrieved, authoritative documents, which makes it much less likely (though not impossible) that the model asserts a clinical fact absent from the retrieved context. Second, structured output with source citations forces the model to attribute each claim to a specific retrieved document, enabling downstream validation. Third, output validation pipelines can check generated clinical facts against authoritative databases (drug databases, formularies) before they are presented to clinicians. Fourth, human-in-the-loop review is non-negotiable for any AI output that influences a clinical decision.
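
The citation and validation steps can be illustrated with a small check. This is a sketch under stated assumptions: the model is assumed to return structured claims carrying a `source_id` and a supporting `quote` (a hypothetical schema), and the verbatim-substring test is a deliberately crude stand-in for a real validation pipeline. The document text in the usage example is a placeholder, not clinical guidance.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str        # the statement the model wants to present
    source_id: str   # ID of the retrieved document the model cites
    quote: str       # span the model says supports the claim


def validate_claims(claims, retrieved_docs):
    """Split claims into (accepted, rejected).

    A claim is accepted only if it cites a document that was actually
    retrieved and its quote appears (case-insensitively) in that document.
    retrieved_docs maps document ID to document text.
    """
    accepted, rejected = [], []
    for claim in claims:
        doc_text = retrieved_docs.get(claim.source_id)
        supported = (
            doc_text is not None
            and bool(claim.quote)
            and claim.quote.lower() in doc_text.lower()
        )
        (accepted if supported else rejected).append(claim)
    return accepted, rejected


# Placeholder content for illustration only.
docs = {"doc-1": "Placeholder guideline: Drug A must not be combined with Drug B."}
claims = [
    Claim("A and B should not be co-prescribed", "doc-1",
          "Drug A must not be combined with Drug B"),
    Claim("A is safe with C", "doc-9", "safe with C"),  # cites a doc never retrieved
]
accepted, rejected = validate_claims(claims, docs)
```

Rejected claims would be dropped or routed to human review rather than shown to a clinician. A check like this catches source confabulation; it does not prove that the quote actually supports the claim.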

Key Points to Hit:

  • Hallucination is architectural, not a fixable bug
  • RAG is the primary mitigation — ground output in retrieved facts
  • Citations are essential — every clinical claim should cite its source
  • Validation pipelines can catch fabricated facts before clinician exposure
  • Human review is required for any output that influences clinical decisions

Follow-up Questions:

  • "Can you eliminate hallucination completely?" (No — mitigate to acceptable levels)
  • "How do you measure hallucination rate in your clinical system?"

Red Flags:

  • "We'll just use a bigger model" — model size does not eliminate hallucination
  • "We'll fine-tune it to not hallucinate" — fine-tuning does not fix hallucination

Q2: How does the context window affect the architecture of a RAG system for clinical knowledge retrieval?

Category: Architecture | Difficulty: Senior | Role: AI Architect

Answer Framework:

The context window creates a hard constraint that shapes every RAG design decision. Consider a 500-bed hospital's knowledge base: clinical guidelines, drug formularies, formulary restrictions, and clinical protocols may total 50GB of text. The model's context window, even at 200K tokens (roughly 150K words, or under 1MB of text), can hold on the order of 0.002% of this knowledge at inference time. RAG's job is to select the right 0.002%.

This constraint drives several architectural decisions. Chunk size must balance two competing forces: smaller chunks increase retrieval precision (you retrieve the exact paragraph that answers the question) but may lack the surrounding context needed for coherent answers. Larger chunks provide more context but reduce precision. The sweet spot for clinical documents is typically 512–1024 tokens with meaningful overlap.
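
A minimal sketch of fixed-size chunking with overlap, to make the trade-off concrete. `chunk_tokens` is a hypothetical helper; it operates on an already-tokenized list, and the 512/64 defaults are illustrative choices within the range mentioned above, not recommended values.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Integers stand in for real tokenizer output.
document = list(range(1000))
chunks = chunk_tokens(document, chunk_size=512, overlap=64)
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of indexing some tokens twice. In practice, splitting on section or paragraph boundaries often works better for clinical documents than fixed windows; this sketch shows only the size and overlap mechanics.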

Retrieved chunks compete for context window space with the system prompt, conversation history, and output reservation. For a 200K context window, roughly: 2K system prompt + 5K conversation history + 10 chunks × 1K = 17K of input, leaving 183K for output and overhead. The context window is therefore a budget to manage actively, not just a limit.
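
The budgeting idea can be expressed as a small calculation. This is a sketch: `ContextBudget` is a hypothetical helper, and the 8K output reservation in the usage example is an assumed figure that does not come from the text above.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window: int          # total context window, in tokens
    system_prompt: int   # tokens used by the system prompt
    history: int         # tokens used by conversation history
    output_reserve: int  # tokens held back for the model's response

    def room_for_chunks(self):
        """Tokens still available for retrieved chunks."""
        return self.window - self.system_prompt - self.history - self.output_reserve

    def max_chunks(self, chunk_tokens):
        """How many chunks of the given size fit in the remaining room."""
        return max(0, self.room_for_chunks() // chunk_tokens)

    def fits(self, n_chunks, chunk_tokens):
        """True if n_chunks chunks of the given size fit in the remaining room."""
        return n_chunks * chunk_tokens <= self.room_for_chunks()


budget = ContextBudget(window=200_000, system_prompt=2_000,
                       history=5_000, output_reserve=8_000)
print(budget.room_for_chunks(), budget.fits(10, 1_000))
```

Running the check before every request, rather than once at design time, catches the case where a long conversation history quietly crowds out retrieved evidence.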

The "lost in the middle" phenomenon means that information placed in the center of a long context is used less reliably by the model than information near the start or end. In clinical RAG systems, place the most relevant retrieved chunks first or last in the context block, not sandwiched between less relevant ones.
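
One simple way to apply this placement rule is to deal the ranked chunks alternately to the front and the back of the context block. The sketch below assumes the retriever returns chunks sorted most-relevant first; `order_for_context` is a hypothetical helper name.

```python
def order_for_context(chunks_by_relevance):
    """Arrange chunks so the most relevant sit at the start and end.

    Input must be sorted most-relevant first. Chunks are dealt alternately
    to the front and the back, which pushes the least relevant ones toward
    the middle of the context block.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["r1", "r2", "r3", "r4", "r5"]  # r1 is the most relevant
print(order_for_context(ranked))
# → ['r1', 'r3', 'r5', 'r4', 'r2']
```

The top two chunks land at the two edges and the weakest chunk lands in the center, where degraded recall costs the least.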

Key Points to Hit:

  • Context window creates a retrieval budget — RAG's job is to select correctly
  • Chunk size trade-off: precision vs. context coherence
  • Context window is a budget to manage, not just a limit
  • "Lost in the middle" — position of retrieved content affects quality

Q3: Why would you choose a smaller model over a frontier model for a high-volume clinical task?

Category: Architecture | Difficulty: Mid-Level | Role: AI Architect / ML Engineer

Answer Framework:

Cost and latency. At enterprise scale, the economics of model selection are decisive. A clinical intake triage system that classifies 10,000 patient messages daily using a frontier model incurs meaningfully higher costs than the same task running on a small, fast model, often by an order of magnitude. (Verify current pricing in official vendor documentation; specific rates are not quoted here because they change frequently.) Over a year, the cumulative difference is significant for a task that doesn't require frontier reasoning capability.

The right selection process: first, define the minimum acceptable quality threshold for the task (e.g., 95% classification accuracy, measured on a held-out test set). Then evaluate the cheapest model that meets that threshold. Classification, entity extraction, routing, and simple Q&A tasks rarely require frontier model capability. Complex multi-step clinical reasoning, nuanced documentation generation, and tasks requiring broad medical knowledge are where frontier models justify their cost.
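
The selection process above (set a threshold, then take the cheapest model that clears it) can be sketched as follows. The names are hypothetical, each candidate is represented by a plain `predict` callable rather than a real vendor client, and `relative_cost` is whatever cost figure you measure yourself.

```python
def accuracy(predict, test_set):
    """Fraction of (text, label) pairs in test_set that predict gets right."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)


def cheapest_acceptable(candidates, test_set, threshold=0.95):
    """Return the name of the cheapest candidate meeting the threshold.

    candidates: list of (name, relative_cost, predict_fn) tuples.
    Returns None if no candidate reaches the threshold.
    """
    for name, _cost, predict in sorted(candidates, key=lambda c: c[1]):
        if accuracy(predict, test_set) >= threshold:
            return name
    return None
```

Because candidates are tried in cost order, evaluation stops at the first model that is good enough, which mirrors the rule of choosing the smallest model that achieves acceptable quality on the specific task.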

A two-tier architecture is the practical pattern: a fast, cheap model for high-volume routing and triage, with expensive frontier model invocation reserved for tasks that require it, triggered by the routing decision.
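
A minimal sketch of the two-tier pattern, with the models passed in as plain callables. Everything here is hypothetical: `classify` is assumed to return a `(label, confidence)` pair from the small model, and escalating on low confidence is an added design assumption rather than something stated above.

```python
def handle_message(message, classify, small_model, frontier_model,
                   escalate_labels=frozenset({"complex"}), min_confidence=0.8):
    """Route a message to the small or frontier model.

    classify(message) -> (label, confidence), produced by the cheap model.
    Returns (tier_used, response).
    """
    label, confidence = classify(message)
    # Escalate when the router flags the task as complex, or when it is
    # not confident enough in its own routing decision.
    if label in escalate_labels or confidence < min_confidence:
        return "frontier", frontier_model(message)
    return "small", small_model(message)
```

Returning the tier alongside the response makes it easy to log what fraction of traffic escalates, which is the number that determines whether the two-tier design actually saves money.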

Red Flags:

  • "We always use the best model available for patient safety" — conflates model size with reliability; small models can be as reliable as large ones for in-distribution tasks