LLM Fundamentals

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Foundational


Executive Summary

Large Language Models (LLMs) are the engine behind most enterprise AI initiatives today. Understanding how they generate text, not just that they do, is what separates architects who can make sound technology decisions from those who treat LLMs as black boxes. This chapter covers the transformer architecture, the training process, inference mechanics, tokenization, and context windows with the depth required to reason about production system design.


Prerequisites

  • Familiarity with REST APIs and Python
  • General software architecture concepts
  • No prior ML knowledge required

Core Concepts

Why Understanding LLM Internals Matters

An architect who does not understand how an LLM generates text cannot reason about:

  • Why hallucinations occur and what system-level mitigations are possible
  • Why context window size affects cost and latency at specific token thresholds
  • Why temperature affects output variability and how to set it for clinical applications
  • Why some tasks require larger models and others can use smaller, cheaper models
  • Why inference latency scales with output length, not just input length

None of these are theoretical concerns. They manifest as production incidents, cost overruns, and architectural dead ends at enterprise scale.

The Transformer Architecture

Every major LLM — Claude, GPT-4, Gemini, Llama — is built on the transformer architecture introduced in the 2017 paper Attention Is All You Need. The transformer has two critical properties that make it useful for language:

Self-attention: Every token in a sequence can attend to every other token, allowing the model to capture long-range dependencies. When a clinical note says "the patient was prescribed [drug] but refused due to [side effect she experienced three paragraphs earlier]," self-attention is what allows the model to connect these distant tokens.

Parallelism during training: Unlike recurrent networks (RNNs), transformers process all tokens in parallel during training. This enables training on the massive datasets required for LLMs to develop broad knowledge.

The transformer consists of stacked encoder-decoder layers (for translation-style tasks) or decoder-only layers (for language generation, which is what all frontier LLMs use). Each layer applies:

  1. Multi-head self-attention — Allows the model to attend to different positions simultaneously, each head potentially capturing different types of relationships (syntactic, semantic, positional)
  2. Feed-forward network — Per-token transformation that adds capacity for factual recall and reasoning
  3. Layer normalization and residual connections — Training stability mechanisms
text
Input Tokens → Embedding Layer → [Attention → FFN] × N layers → Output Logits

The output at each position is a probability distribution over the vocabulary — the model's belief about which token should come next.
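To make the attention step concrete, here is a minimal sketch of single-head scaled dot-product attention with a causal mask, written in plain Python. The function names and the toy 2-dimensional vectors are illustrative assumptions; real models use learned projection matrices, many heads, and far larger dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head attention where position i may only attend to positions 0..i."""
    dim = len(queries[0])
    outputs, weights_per_position = [], []
    for i, query in enumerate(queries):
        # Similarity between this token's query and each visible token's key.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        mixed = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        weights_per_position.append(weights)
    return outputs, weights_per_position


# Toy vectors for three tokens (made-up numbers, used as queries, keys, and values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = causal_self_attention(vectors, vectors, vectors)
print(weights[2])  # the third token weights tokens 1 and 3 above token 2
```

The same `softmax` function is what converts the final-layer logits into the next-token probability distribution described above.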

How Text Generation Works: Autoregressive Decoding

LLMs generate text one token at a time, in a left-to-right process called autoregressive decoding:

  1. The input tokens (prompt) are processed by all N transformer layers
  2. The output logits for the next token position are computed
  3. A token is sampled from the output distribution (or the highest-probability token is selected)
  4. The selected token is appended to the sequence
  5. The model processes the extended sequence and generates the next token
  6. Repeat until a stop token is generated or the maximum length is reached

This has a critical implication: generation time scales roughly linearly with output length, because every output token requires its own pass through the model. A 500-token response takes approximately 5x longer to generate than a 100-token response, all else being equal. Processing the prompt adds a separate up-front cost before the first token appears. For enterprise applications with latency SLAs, this means output length must be managed explicitly.
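The decoding loop can be sketched with a toy stand-in for the model. `TOY_MODEL` and `next_token_distribution` are hypothetical: a real LLM conditions on the whole sequence and a vocabulary of tens of thousands of tokens, whereas this lookup table only looks at the last token. The point is the loop shape and the count of model calls.

```python
import random

# Hypothetical stand-in for a model: last token -> next-token probabilities.
TOY_MODEL = {
    "<start>": {"patient": 0.9, "the": 0.1},
    "the": {"patient": 1.0},
    "patient": {"improved": 0.7, "stable": 0.3},
    "improved": {"<stop>": 1.0},
    "stable": {"<stop>": 1.0},
}


def next_token_distribution(sequence):
    """Return next-token probabilities for the sequence so far."""
    return TOY_MODEL[sequence[-1]]


def generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    """Autoregressive decoding; returns the sequence and the number of model calls."""
    rng = random.Random(seed)
    sequence = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        dist = next_token_distribution(sequence)  # one model call per new token
        forward_passes += 1
        if greedy:
            token = max(dist, key=dist.get)  # pick the highest-probability token
        else:
            token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<stop>":  # stop token ends generation
            break
        sequence.append(token)  # extend the sequence and loop again
    return sequence, forward_passes


sequence, passes = generate(["<start>"])
print(sequence, passes)  # ['<start>', 'patient', 'improved'] 3
```

Because each new token costs one model call, capping `max_new_tokens` is the most direct control over generation time.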

Tokenization

LLMs do not operate on characters or words — they operate on tokens, which are subword units produced by algorithms like Byte Pair Encoding (BPE). Understanding tokenization has direct architectural implications:

Token counts are not word counts. Common English words are typically 1 token. Uncommon words, clinical terminology, and proper nouns may be 2–4 tokens. Code can be highly token-efficient or inefficient depending on structure.

Clinical terminology is token-expensive. Medical terms like "thrombocytopenia," "methylprednisolone," or "hemoglobinopathy" are rare enough that they may be split into 3–5 tokens each. A clinical discharge summary that would be ~800 words in plain English may consume 1,400–2,000 tokens. This affects cost and context window usage.

Non-English text uses more tokens. Languages not well-represented in training data consume more tokens per word. For healthcare systems serving non-English-speaking populations, this increases API costs.

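Rough rule of thumb: 1 token ≈ 0.75 words in general English prose. Clinical text is denser in rare terms and yields fewer words per token; the discharge summary example above works out to roughly 0.4–0.6 words per token. Always measure actual token counts for your specific data domain.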
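A toy version of BPE shows why frequent words end up as single tokens while rare terms fragment. This is a simplified sketch: the three-word corpus and its frequencies are made up, and production tokenizers add byte-level fallback, pre-tokenization rules, and vocabularies learned from far larger corpora.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(word_frequencies, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    vocab = {tuple(word): freq for word, freq in word_frequencies.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:  # every word is already a single symbol
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {tuple(apply_merge(list(s), best)): f for s, f in vocab.items()}
    return merges


def tokenize(word, merges):
    """Split into characters, then apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Made-up corpus: common words only, no clinical vocabulary.
merges = learn_bpe_merges({"patient": 50, "the": 40, "pain": 10}, num_merges=50)
print(tokenize("patient", merges))           # a single token
print(tokenize("thrombocytopenia", merges))  # many small pieces
```

For real cost estimates, count tokens with the tokenizer or token-counting endpoint of the model you actually plan to use, since each provider's vocabulary differs.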

The Context Window

The context window is the total number of tokens an LLM can process in a single inference call — input tokens plus output tokens combined.

Context windows have grown dramatically:

  • GPT-3 (2020): 2K tokens
  • GPT-4 (2023): 8K–32K tokens
  • Claude 3 (2024): 200K tokens
  • Claude 4 / Gemini 2.5 (2025–2026): 1M+ tokens

Architectural Implication: A larger context window does not mean larger is always better. LLMs exhibit the "lost in the middle" phenomenon — information placed in the middle of a very long context is retrieved less reliably than information at the beginning or end. For RAG systems, this means the position of retrieved chunks within the context matters, not just their presence.
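One simple way to act on this in a RAG pipeline is to reorder retrieved chunks so the strongest ones sit at the edges of the context. The function below is an illustrative sketch, not a standard library API; it assumes the retriever already returns chunks sorted from most to least relevant.

```python
def order_for_long_context(chunks_by_relevance):
    """Put the most relevant chunks at the start and end, the weakest in the middle.

    Input is sorted most relevant first. Ranks alternate between the front
    and the back of the final ordering.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(order_for_long_context(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']
```

Whether this reordering helps for a given model and document set should be measured rather than assumed.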

For the HMS scenario: a 500-bed hospital's clinical note database spans terabytes. The context window, however large, can hold only a fraction of this data at inference time. RAG exists because the context window cannot hold all relevant knowledge simultaneously.

Training: Pre-training and Alignment

Pre-training produces the base model:

  • Trained on hundreds of billions to trillions of tokens from web text, books, code, and scientific papers
  • Objective: predict the next token (self-supervised — no human labels required)
  • Produces a model with broad knowledge but no instruction-following behavior
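The next-token objective is usually expressed as a cross-entropy loss: the average negative log-probability the model assigned to the token that actually came next. A minimal sketch with made-up probabilities (the two-entry distributions are illustrative, not real model output):

```python
import math


def next_token_loss(predicted_distributions, target_tokens):
    """Average negative log-likelihood of the tokens that actually came next."""
    losses = [
        -math.log(dist[target])
        for dist, target in zip(predicted_distributions, target_tokens)
    ]
    return sum(losses) / len(losses)


# Training text: "the patient improved". The text itself supplies the labels.
predictions = [
    {"patient": 0.5, "doctor": 0.5},       # model's guess after "the"
    {"improved": 0.25, "declined": 0.75},  # model's guess after "the patient"
]
targets = ["patient", "improved"]

loss = next_token_loss(predictions, targets)
print(round(loss, 4))  # 1.0397
```

Training lowers this number by shifting probability toward the observed next tokens, which is why no human labels are needed.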

Supervised Fine-Tuning (SFT) teaches the model to follow instructions:

  • Trained on human-written examples of question-answer pairs
  • Transforms the raw pre-trained model into an assistant

Reinforcement Learning from Human Feedback (RLHF) aligns the model with human preferences:

  • Human raters compare model outputs and provide preference signals
  • The model learns to produce outputs that humans prefer (more helpful, less harmful)

Constitutional AI (CAI) — Anthropic's approach used in Claude:

  • Rather than only human raters, uses an AI system trained on explicit principles to critique outputs
  • Produces more consistent alignment with clearly defined values

Enterprise Implication: Pre-trained base models are not safe for enterprise deployment. They have no instruction-following or refusal training, so they will continue any prompt, including harmful ones. Always deploy aligned models (SFT + RLHF/CAI). For internal model hosting, never deploy raw base models to production without alignment.


Architecture

LLM Inference Architecture

In enterprise deployment, LLMs are accessed through one of three patterns:

| Pattern | Cost | Latency | Data Control | HIPAA | Operational Complexity |
|---|---|---|---|---|---|
| API-based | Per-token | Low | Provider-dependent | BAA required | Very Low |
| Cloud-managed | Per-token + cloud | Low-Medium | Within cloud tenant | BAA available | Low |
| Self-hosted | GPU CapEx + OpEx | Variable | Complete | Full control | Very High |

For the HMS scenario: Cloud-managed (AWS Bedrock with Claude, or Azure OpenAI) is the recommended pattern. It provides HIPAA BAA coverage, keeps PHI within the hospital's existing cloud environment, and avoids the GPU infrastructure burden that a 500-bed hospital's IT organization cannot sustain.


Implementation Patterns

Pattern 1: Basic Inference with Streaming

Streaming is the right default for user-facing applications. It sharply reduces perceived latency: users see the first tokens shortly after the request is sent instead of waiting for the full response to finish generating.

python
"""
Basic LLM inference with streaming — HMS discharge note assistant
Context: A hospitalist requests a draft assessment and plan section
for a patient being discharged after a 3-day pneumonia admission.
"""
import anthropic
from typing import Iterator

client = anthropic.Anthropic()


def generate_assessment_draft(
    diagnosis: str,
    hospital_course: str,
    discharge_medications: list[str],
) -> Iterator[str]:
    """Stream an AI-drafted assessment and plan for physician review."""
    with client.messages.stream(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=(
            "You are a clinical documentation assistant. Generate concise, "
            "structured clinical documentation drafts for physician review and editing. "
            "Always indicate that output requires physician review before use. "
            "Do not include specific dosing recommendations."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Draft an Assessment and Plan section.\n"
                f"Primary diagnosis: {diagnosis}\n"
                f"Hospital course: {hospital_course}\n"
                f"Discharge medications: {', '.join(discharge_medications)}"
            ),
        }],
    ) as stream:
        for text in stream.text_stream:
            yield text


if __name__ == "__main__":
    for chunk in generate_assessment_draft(
        diagnosis="Community-acquired pneumonia, right lower lobe",
        hospital_course="3-day inpatient course with IV antibiotics, improving O2 saturation",
        discharge_medications=["Amoxicillin-clavulanate 875mg BID x 5 days", "Azithromycin 500mg daily x 3 days"],
    ):
        print(chunk, end="", flush=True)
    print("\n\n[PHYSICIAN REVIEW REQUIRED BEFORE USE IN MEDICAL RECORD]")

Pattern 2: Controlling Output with Temperature

python
"""
Temperature selection for clinical vs. creative tasks.
Temperature controls the randomness of token sampling:
- 0.0 = always select the highest-probability token (near-deterministic;
  providers do not guarantee identical outputs across calls)
- 1.0 = sample proportionally to the model's probability distribution
- >1.0 = increase randomness beyond the model's natural distribution
  (not accepted by every API; the Anthropic API takes 0.0 to 1.0)
"""
import json

import anthropic

client = anthropic.Anthropic()


def classify_clinical_intent(note_text: str) -> dict:
    """
    Clinical classification: use low temperature for stable output.
    For structured clinical tasks, we want the model's most confident answer,
    not creative variation.
    """
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Haiku is sufficient for classification
        max_tokens=128,
        temperature=0.0,  # Greedy decoding: classification should not vary
        system=(
            "Classify the clinical note. Respond with JSON only: "
            '{"intent": "admission|discharge|progress|consult|procedure", "confidence": "high|medium|low"}'
        ),
        messages=[{"role": "user", "content": note_text}],
    )
    # Raises json.JSONDecodeError if the model returns anything other than bare JSON.
    return json.loads(response.content[0].text)
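What temperature does to the sampling distribution can be shown without calling any API. The sketch below divides the logits by the temperature before the softmax; the logit values are made up, and treating `temperature <= 0` as greedy selection is a convention chosen here for illustration.

```python
import math


def apply_temperature(logits, temperature):
    """Return next-token probabilities after temperature scaling."""
    if temperature <= 0:
        # Convention for this sketch: zero temperature means pick the top token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token; higher temperatures flatten the distribution, which is why low settings suit classification and extraction.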

Healthcare Context — HMS Scenario

Token Cost Modeling for HMS

At a Reference Healthcare Organization processing thousands of daily encounters:

⊕ Healthcare Example

Educational Example — Illustrative Workflow. Not intended for clinical decision making.

Discharge summary generation (per encounter):

  • Input: ~3,000 tokens (encounter context, medications, labs, diagnoses)
  • Output: ~800 tokens (draft summary)
  • Total: ~3,800 tokens per encounter

Daily token volume (illustrative): 2,000 encounters × 3,800 tokens = 7.6M tokens/day

Monthly volume (illustrative): ~228M tokens/month

The cost calculation depends entirely on the chosen model, provider, and current pricing. As an illustrative example using frontier model rates (verify current pricing in official vendor documentation — rates change frequently):

  • At $X/1M input tokens: 2,000 × 3,000 × $X/1M × 30 days = monthly input cost
  • At $Y/1M output tokens: 2,000 × 800 × $Y/1M × 30 days = monthly output cost

The architectural principle — not the specific number — is what matters: frontier model API costs scale linearly with token volume, and at thousands of daily encounters this becomes a significant line item that must be modeled and optimized.
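The formulas above translate directly into a small cost model. The rates in the usage example are placeholders chosen only to make the arithmetic visible; they are not real prices for any model, and the function name is hypothetical.

```python
def monthly_cost_usd(
    encounters_per_day,
    input_tokens_per_encounter,
    output_tokens_per_encounter,
    input_rate_per_million,
    output_rate_per_million,
    days=30,
):
    """Monthly API cost when input and output tokens are billed at separate rates."""
    input_tokens = encounters_per_day * input_tokens_per_encounter * days
    output_tokens = encounters_per_day * output_tokens_per_encounter * days
    input_cost = input_tokens * input_rate_per_million / 1_000_000
    output_cost = output_tokens * output_rate_per_million / 1_000_000
    return input_cost + output_cost


# PLACEHOLDER rates (not real prices): $1 per 1M input tokens, $5 per 1M output tokens.
total = monthly_cost_usd(2_000, 3_000, 800, input_rate_per_million=1.0, output_rate_per_million=5.0)
print(total)  # 180M input tokens -> $180, 48M output tokens -> $240, total 420.0
```

Swapping in each candidate model's published rates gives a like-for-like comparison before any code is written against a provider.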

Cost Management Principle: Not every clinical task requires the highest-capability model. Triage and classification can use Claude Haiku, which is often around an order of magnitude cheaper per token than Opus (verify current pricing in official vendor documentation). Reserve Opus for complex clinical reasoning tasks. Model selection by use case is the single most impactful cost lever available.

Clinical Safety Properties

| LLM Property | Clinical Risk | Mitigation |
|---|---|---|
| Hallucination | Fabricated drug interactions or doses | RAG with authoritative clinical knowledge base + structured output validation |
| Stochasticity | Different answers to the same clinical question | Low temperature for clinical decisions; always show source citations |
| Knowledge cutoff | Drug withdrawals, guideline updates after training | RAG system with current clinical databases |
| Context window limits | Cannot hold full patient lifetime history | Structured retrieval of clinically relevant encounters only |

Trade-offs and Considerations

Model Selection: Capability vs. Cost

| Model Tier | Examples | Token Cost | Latency | Best For |
|---|---|---|---|---|
| Frontier (large) | Claude Opus 4.8, GPT-4o, Gemini 2.5 Pro | $$$$ | 2–5s | Complex reasoning, clinical documentation, multi-step analysis |
| Mid-tier | Claude Sonnet 4.6 | $$ | 0.5–2s | RAG augmented generation, summarization, structured extraction |
| Small/Fast | Claude Haiku 4.5, GPT-4o-mini | $ | 0.1–0.5s | Classification, routing, simple Q&A, high-volume tasks |

Rule of thumb: Use the smallest model that achieves acceptable quality on your specific task. Always measure quality empirically — don't assume larger is better.

Failure Modes

| Failure Mode | Description | Enterprise Mitigation |
|---|---|---|
| Hallucination | Model generates plausible-sounding but incorrect information | RAG + citation grounding + output validation |
| Sycophancy | Model agrees with incorrect statements from the user | Calibrated prompting; don't provide the "answer" in the prompt |
| Context degradation | Quality drops in very long contexts ("lost in the middle") | Keep critical information at prompt start and end |
| Prompt injection | Malicious content overrides system instructions | Input sanitization; see docs/06-Security/02-prompt-injection-defense.md |
| Inference latency spikes | Provider-side latency variability | Timeouts + retry with exponential backoff + circuit breaker |
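The retry-with-backoff mitigation for latency spikes can be sketched in a few lines of standard-library Python. The function name, defaults, and choice of retryable exceptions are assumptions for illustration; a production client would also enforce a per-call timeout and add a circuit breaker, both omitted here.

```python
import random
import time


def call_with_retries(
    operation,
    max_attempts=4,
    base_delay=0.5,
    max_delay=8.0,
    retry_on=(TimeoutError, ConnectionError),
    sleep=time.sleep,
):
    """Call `operation`, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Usage with a stand-in operation that times out twice, then succeeds.
attempts = []


def flaky_inference_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("simulated provider latency spike")
    return "ok"


result = call_with_retries(flaky_inference_call, sleep=lambda seconds: None)
print(result, len(attempts))  # ok 3
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.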

Comparison Table

| Dimension | Claude (Anthropic) | GPT-4o (OpenAI) | Gemini 2.5 Pro (Google) |
|---|---|---|---|
| Context window | 200K tokens | 128K tokens | 1M tokens |
| Healthcare BAA | Yes (via AWS Bedrock) | Yes (Azure OpenAI) | Yes (Vertex AI) |
| Tool use / function calling | Yes | Yes | Yes |
| Extended thinking | Yes (Claude 3.7+) | Yes (o1/o3 series) | Yes (thinking mode) |
| Multimodal | Yes (vision + docs) | Yes (vision) | Yes (vision + audio) |
| Best for HMS | Clinical reasoning, documentation | Code generation, structured output | Long-document processing |

Interview Questions

Q1: Explain why LLMs hallucinate and what architectural patterns mitigate this in a clinical system

Category: Technical Depth | Difficulty: Senior | Role: AI Architect

Answer Framework:

Hallucination is not a bug — it is an inherent consequence of how LLMs work. During pre-training, the model learns to predict the next token based on statistical patterns in training data. When the model encounters a question whose answer was rare or absent in training data, it generates a plausible-sounding continuation rather than a factual one, because "plausible continuation" is exactly what the training objective optimized for.

There are three categories of hallucination: (1) factual fabrication — making up drug interactions or clinical facts that don't exist; (2) confident speculation — stating uncertain information as fact; (3) source confabulation — citing papers or guidelines that don't exist. In clinical settings, all three are dangerous.

Architectural mitigations: First, Retrieval-Augmented Generation grounds the model's response in retrieved, authoritative documents. This makes fabricated clinical facts much less likely, although the model can still misread or go beyond the retrieved context. Second, structured output with source citations forces the model to attribute each claim to a specific retrieved document, enabling downstream validation. Third, output validation pipelines can check generated clinical facts against authoritative databases (drug databases, formularies) before presenting to clinicians. Fourth, human-in-the-loop review is non-negotiable for any AI output that influences a clinical decision.
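A first-pass citation check from the validation step might look like the sketch below. The `[doc-N]` citation format and the function name are assumptions made for illustration. The check only confirms that each sentence cites a document that was actually retrieved; it does not verify that the cited document supports the claim, which needs a separate validation step or human review.

```python
import re

# Assumed citation format for this sketch: [doc-1], [doc-2], ...
CITATION_PATTERN = re.compile(r"\[(doc-\d+)\]")


def find_unsupported_sentences(answer, retrieved_ids):
    """Return sentences that cite nothing, or cite a document that was not retrieved."""
    allowed = set(retrieved_ids)
    problems = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(CITATION_PATTERN.findall(sentence))
        if not cited or not cited <= allowed:
            problems.append(sentence)
    return problems


answer = "Claim one is supported [doc-1]. Claim two has no source. Claim three cites a ghost [doc-9]."
flagged = find_unsupported_sentences(answer, retrieved_ids={"doc-1", "doc-2"})
print(flagged)  # the second and third sentences are flagged
```

Flagged sentences can be dropped, regenerated, or routed to a reviewer, depending on the clinical risk of the workflow.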

Key Points to Hit:

  • Hallucination is architectural, not a fixable bug
  • RAG is the primary mitigation — ground output in retrieved facts
  • Citations are essential — every clinical claim should cite its source
  • Validation pipelines can catch fabricated facts before clinician exposure
  • Human review is required for any output that influences clinical decisions

Follow-up Questions:

  • "Can you eliminate hallucination completely?" (No — mitigate to acceptable levels)
  • "How do you measure hallucination rate in your clinical system?"

Red Flags:

  • "We'll just use a bigger model" — model size does not eliminate hallucination
  • "We'll fine-tune it to not hallucinate" — fine-tuning does not fix hallucination

Q2: How does the context window affect the architecture of a RAG system for clinical knowledge retrieval?

Category: Architecture | Difficulty: Senior | Role: AI Architect

Answer Framework:

The context window creates a hard constraint that shapes every RAG design decision. Consider a 500-bed hospital's knowledge base: clinical guidelines, drug formularies, formulary restrictions, and clinical protocols may total 50GB of text. The model's context window, even at 200K tokens (roughly 150K words, or about 0.8MB at ~4 bytes per token), can hold only around 0.002% of this knowledge at inference time. RAG's job is to select the right 0.002%.
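A back-of-envelope check of how little of the corpus fits in one call, assuming roughly 4 bytes of English text per token (an approximation that varies by tokenizer and content):

```python
BYTES_PER_TOKEN = 4  # assumption: rough average for English text

corpus_bytes = 50 * 10**9   # 50GB knowledge base
window_tokens = 200_000     # 200K-token context window

window_bytes = window_tokens * BYTES_PER_TOKEN  # about 0.8MB of text
fraction = window_bytes / corpus_bytes

print(f"{fraction:.4%}")  # 0.0016%
```

Under these assumptions the window holds a few thousandths of a percent of the corpus, so retrieval quality dominates answer quality.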

This constraint drives several architectural decisions. Chunk size must balance two competing forces: smaller chunks increase retrieval precision (you retrieve the exact paragraph that answers the question) but may lack the surrounding context needed for coherent answers. Larger chunks provide more context but reduce precision. The sweet spot for clinical documents is typically 512–1024 tokens with meaningful overlap.
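Fixed-size chunking with overlap can be sketched as a sliding window over a token list. The defaults below (512 tokens, 64 overlap) are illustrative picks within the range mentioned above, not recommendations, and real pipelines often split on section or sentence boundaries instead of raw token offsets.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Integers stand in for token IDs from a real tokenizer.
document_tokens = list(range(1000))
chunks = chunk_tokens(document_tokens)
print([len(c) for c in chunks])  # [512, 512, 104]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some tokens twice.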

Retrieved chunks compete for context window space with the system prompt, conversation history, and output reservation. For a 200K context window, roughly: 2K system prompt + 5K conversation history + 10 chunks × 1K = 17K of input, leaving 183K for output and overhead. This means the context window is a budget to be actively managed, not just a limit.
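Treating the window as a budget can be made explicit with a small check that runs before each call. The function name and the 4,000-token output reservation are assumptions for illustration; in practice the input counts would come from a real tokenizer.

```python
def context_budget(window, system_prompt, history, chunk_count, chunk_size, reserved_output):
    """Return (input tokens used, tokens left over) or raise if the request cannot fit."""
    used = system_prompt + history + chunk_count * chunk_size
    remaining = window - used - reserved_output
    if remaining < 0:
        raise ValueError(f"over budget by {-remaining} tokens")
    return used, remaining


used, remaining = context_budget(
    window=200_000,
    system_prompt=2_000,
    history=5_000,
    chunk_count=10,
    chunk_size=1_000,
    reserved_output=4_000,  # assumed output reservation
)
print(used, remaining)  # 17000 179000
```

Failing fast when the budget is exceeded is usually preferable to letting the provider reject or silently truncate the request.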

The "lost in the middle" phenomenon means retrieved chunks placed in the center of a long context are used less reliably by the model. In clinical RAG systems, place the most relevant retrieved chunks first or last in the context block, not sandwiched between less relevant ones.

Key Points to Hit:

  • Context window creates a retrieval budget — RAG's job is to select correctly
  • Chunk size trade-off: precision vs. context coherence
  • Context window is a budget to manage, not just a limit
  • "Lost in the middle" — position of retrieved content affects quality

Q3: Why would you choose a smaller model over a frontier model for a high-volume clinical task?

Category: Architecture | Difficulty: Mid-Level | Role: AI Architect / ML Engineer

Answer Framework:

Cost and latency. At enterprise scale, the economics of model selection are decisive. A clinical intake triage system that classifies 10,000 patient messages daily using a frontier model incurs meaningfully higher costs than the same task running on a small, fast model — often an order of magnitude difference. (Verify current pricing in official vendor documentation — specific rates are not quoted here because they change frequently.) Over a year, the compounding difference is significant for a task that doesn't require frontier reasoning capability.

The right selection process: first, define the minimum acceptable quality threshold for the task (e.g., 95% classification accuracy, measured on a held-out test set). Then evaluate the cheapest model that meets that threshold. Classification, entity extraction, routing, and simple Q&A tasks rarely require frontier model capability. Complex multi-step clinical reasoning, nuanced documentation generation, and tasks requiring broad medical knowledge are where frontier models justify their cost.

A two-tier architecture is the practical pattern: a fast, cheap model for high-volume routing and triage, with expensive frontier model invocation reserved for tasks that require it, triggered by the routing decision.
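The two-tier pattern reduces to a routing function. Everything named here is hypothetical: the intent labels, the tier names, and the keyword classifier that stands in for a call to a small, fast model.

```python
# Hypothetical intents that the small model can handle end to end.
SIMPLE_INTENTS = frozenset({"scheduling", "refill_status"})


def keyword_classifier(message):
    """Stand-in for a small-model classification call (illustration only)."""
    text = message.lower()
    if "appointment" in text:
        return "scheduling"
    if "refill" in text:
        return "refill_status"
    return "clinical_question"


def route_request(message, classify=keyword_classifier):
    """Classify with the cheap tier, escalate to the frontier tier only when needed."""
    intent = classify(message)
    tier = "small" if intent in SIMPLE_INTENTS else "frontier"
    return {"intent": intent, "tier": tier}


print(route_request("Can I move my appointment to Friday?"))
# {'intent': 'scheduling', 'tier': 'small'}
print(route_request("Is this new rash related to my medication?"))
# {'intent': 'clinical_question', 'tier': 'frontier'}
```

Defaulting unknown intents to the frontier tier is a deliberate choice in this sketch: a misrouted simple request costs money, while a misrouted complex one costs quality.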

Red Flags:

  • "We always use the best model available for patient safety" — conflates model size with reliability; small models can be as reliable as large ones for in-distribution tasks

Further Reading

External:

  • Attention Is All You Need (Vaswani et al., 2017) — The transformer architecture paper
  • Anthropic docs: docs.anthropic.com — Claude API reference, model capabilities

Summary

Key Takeaways:

  • LLMs generate text one token at a time, so inference latency scales with output length
  • Self-attention is what enables LLMs to capture long-range dependencies; it is why they can connect a medication mentioned on page 1 to a contraindication mentioned on page 5
  • The context window is a budget, not just a limit — manage it explicitly in every system design
  • Hallucination is architectural and cannot be eliminated; it must be mitigated through RAG, citations, validation, and human review
  • Model selection by task is the single most impactful cost management lever in enterprise AI
  • Cloud-managed LLM inference (AWS Bedrock, Azure OpenAI) is the correct pattern for HIPAA-covered clinical AI — it provides BAA coverage without GPU infrastructure burden

The one thing to remember: LLMs are next-token predictors, not knowledge databases. Every enterprise AI architecture decision should account for this fundamental property — hallucination, latency scaling, cost-per-token, and context limits all follow directly from it.

