LLM Fundamentals
Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Foundational
Executive Summary
Large Language Models (LLMs) are the engine powering every enterprise AI initiative today. Understanding how they generate text — not just that they do — is what separates architects who can make sound technology decisions from those who treat LLMs as black boxes. This chapter covers the transformer architecture, the training process, inference mechanics, tokenization, and context windows with the depth required to reason about production system design.
Prerequisites
- Familiarity with REST APIs and Python
- General software architecture concepts
- No prior ML knowledge required
Core Concepts
Why Understanding LLM Internals Matters
An architect who does not understand how an LLM generates text cannot reason about:
- Why hallucinations occur and what system-level mitigations are possible
- Why context window size affects cost and latency at specific token thresholds
- Why temperature affects output variability and how to set it for clinical applications
- Why some tasks require larger models and others can use smaller, cheaper models
- Why inference latency scales with output length, not just input length
None of these are theoretical concerns. They manifest as production incidents, cost overruns, and architectural dead ends at enterprise scale.
The Transformer Architecture
Every major LLM — Claude, GPT-4, Gemini, Llama — is built on the transformer architecture introduced in the 2017 paper Attention Is All You Need. The transformer has two critical properties that make it useful for language:
Self-attention: Every token in a sequence can attend to every other token, allowing the model to capture long-range dependencies. When a clinical note says "the patient was prescribed [drug] but refused due to [side effect she experienced three paragraphs earlier]," self-attention is what allows the model to connect these distant tokens.
Parallelism during training: Unlike recurrent networks (RNNs), transformers process all tokens in parallel during training. This enables training on the massive datasets required for LLMs to develop broad knowledge.
The transformer consists of stacked encoder-decoder layers (for translation-style tasks) or decoder-only layers (for language generation, which is what all frontier LLMs use). Each layer applies:
- Multi-head self-attention — Allows the model to attend to different positions simultaneously, each head potentially capturing different types of relationships (syntactic, semantic, positional)
- Feed-forward network — Per-token transformation that adds capacity for factual recall and reasoning
- Layer normalization and residual connections — Training stability mechanisms
Input Tokens → Embedding Layer → [Attention → FFN] × N layers → Output Logits
The output at each position is a probability distribution over the vocabulary: the model's belief about which token should come next.
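To make the attention step concrete, the following is a minimal sketch of causal scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not how production models are implemented: real transformers apply learned query, key, and value projections, use many heads, and run on batched tensors. The function names and the toy vectors are made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only positions <= i."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Score the current query against every key up to and including position i.
        scores = [
            sum(qx * kx for qx, kx in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        out = [
            sum(w * values[j][dim] for j, w in enumerate(weights))
            for dim in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy 3-token sequence with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(x, x, x)
```

The first position can only attend to itself, so its output equals its own value vector. Later positions blend information from every earlier token, which is the mechanism behind the long-range connections described above.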
How Text Generation Works: Autoregressive Decoding
LLMs generate text one token at a time, in a left-to-right process called autoregressive decoding:
- The input tokens (prompt) are processed by all N transformer layers
- The output logits for the next token position are computed
- A token is sampled from the output distribution (or the highest-probability token is selected)
- The selected token is appended to the sequence
- The model processes the extended sequence and generates the next token
- Repeat until a stop token is generated or the maximum length is reached
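The loop above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_LOGITS` is a hand-written lookup table that plays the role of a full forward pass, and the four-word vocabulary exists only for illustration. The sketch also shows where temperature enters: logits are divided by the temperature before sampling, and a temperature of 0 is treated as greedy selection.

```python
import math
import random

# Hypothetical toy "model": maps the last token to logits over a tiny vocabulary.
VOCAB = ["the", "patient", "improved", "<stop>"]
TOY_LOGITS = {
    "<start>": [3.0, 1.0, 0.0, -2.0],
    "the": [-2.0, 3.0, 0.5, -1.0],
    "patient": [-1.0, -2.0, 3.0, 0.0],
    "improved": [-2.0, -2.0, -1.0, 3.0],
}


def next_token_logits(sequence):
    """Stand-in for a forward pass through all N transformer layers."""
    return TOY_LOGITS[sequence[-1]]


def sample_token(logits, temperature, rng):
    """Greedy pick at temperature 0, otherwise sample from softmax(logits / T)."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(prompt, max_new_tokens=10, temperature=0.0, seed=0):
    """Autoregressive loop: predict, pick, append, repeat until stop or limit."""
    rng = random.Random(seed)
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(sequence)
        token = VOCAB[sample_token(logits, temperature, rng)]
        if token == "<stop>":
            break
        sequence.append(token)
    return sequence[len(prompt):]


print(generate(["<start>"]))  # greedy decoding: ['the', 'patient', 'improved']
```

Each pass through the loop produces exactly one token, and each pass depends on the token chosen in the previous one, so the steps cannot be run in parallel.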
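This has a critical implication: generation time grows roughly linearly with output length, because each output token requires its own sequential forward pass. A 500-token response takes approximately 5x longer to generate than a 100-token response, all else being equal (the initial prompt-processing time before the first token is a separate, input-dependent cost). For enterprise applications with latency SLAs, this means output length must be managed explicitly, for example by setting a maximum output token limit.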
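A back-of-the-envelope latency model can make this tangible. The sketch below assumes latency is a fixed time to first token plus a constant per-token generation time; both figures are hypothetical placeholders, and real values should be measured against your provider and model.

```python
def estimate_latency_seconds(output_tokens, time_to_first_token_s, seconds_per_output_token):
    """Simple latency model: a fixed startup cost plus a per-token decode cost."""
    return time_to_first_token_s + output_tokens * seconds_per_output_token


# Hypothetical figures: 0.5 s to first token, 20 ms per generated token.
short_reply = estimate_latency_seconds(100, 0.5, 0.02)  # about 2.5 s
long_reply = estimate_latency_seconds(500, 0.5, 0.02)   # about 10.5 s
```

Under these assumptions the decode portion is exactly 5x longer for the 500-token reply, while the end-to-end ratio is somewhat lower because the time to first token is paid once in both cases.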
Tokenization
LLMs do not operate on characters or words — they operate on tokens, which are subword units produced by algorithms like Byte Pair Encoding (BPE). Understanding tokenization has direct architectural implications:
Token counts are not word counts. Common English words are typically 1 token. Uncommon words, clinical terminology, and proper nouns may be 2–4 tokens. Code can be highly token-efficient or inefficient depending on structure.
Clinical terminology is token-expensive. Medical terms like "thrombocytopenia," "methylprednisolone," or "hemoglobinopathy" are rare enough that they may be split into 3–5 tokens each. A clinical discharge summary that would be ~800 words in plain English may consume 1,400–2,000 tokens. This affects cost and context window usage.
Non-English text uses more tokens. Languages not well-represented in training data consume more tokens per word. For healthcare systems serving non-English-speaking populations, this increases API costs.
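Rough rule of thumb: 1 token ≈ 0.75 words in general English text. Clinical text is denser: the discharge summary example above (~800 words consuming 1,400–2,000 tokens) works out to roughly 0.4–0.6 words per token. Always measure actual token counts for your specific data domain.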
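To show why rare terms cost more tokens, here is a toy Byte Pair Encoding sketch. It is a simplified illustration, not the tokenizer of any real model: production tokenizers work on bytes, apply pre-tokenization rules, and learn tens of thousands of merges from very large corpora. The function names and the four-word training corpus are invented for this example.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = [list(word) for word in words]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


merges = learn_bpe_merges(["the", "the", "the", "then"], num_merges=2)
common = tokenize("the", merges)               # frequent word: one token
rare = tokenize("thrombocytopenia", merges)    # unseen word: many small pieces
```

A word that was frequent in the training corpus collapses into a single token, while a term the tokenizer rarely or never saw falls back to many short fragments. That is the mechanism behind the higher token counts for clinical vocabulary.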
The Context Window
The context window is the total number of tokens an LLM can process in a single inference call — input tokens plus output tokens combined.
Context windows have grown dramatically:
- GPT-3 (2020): 2K tokens
- GPT-4 (2023): 8K–32K tokens
- Claude 3 (2024): 200K tokens
- Claude 4 / Gemini 2.5 (2025–2026): up to 1M tokens
Architectural Implication: A larger context window is not automatically better. LLMs exhibit the "lost in the middle" phenomenon: information placed in the middle of a very long context is used less reliably than information at the beginning or end. For retrieval-augmented generation (RAG) systems, this means the position of retrieved chunks within the context matters, not just their presence.
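One simple way to act on this is to reorder retrieved chunks so the strongest ones sit at the edges of the context block. The sketch below is one possible heuristic, not an established standard; the function name is hypothetical and it assumes the input list is already sorted from most to least relevant.

```python
def order_for_long_context(chunks_by_relevance):
    """Place the strongest chunks at the edges and the weakest in the middle.

    The input must already be sorted from most to least relevant.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: rank 0 goes first, rank 1 goes last, rank 2 second, and so on.
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant chunk
print(order_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Whether this ordering helps for a given model and corpus is an empirical question, so it is worth comparing against plain relevance order on your own evaluation set.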
For the HMS scenario: a 500-bed hospital's clinical note database spans terabytes. The context window, however large, can hold only a fraction of this data at inference time. RAG exists because the context window cannot hold all relevant knowledge simultaneously.
Training: Pre-training and Alignment
Pre-training produces the base model:
- Trained on hundreds of billions to trillions of tokens from web text, books, code, and scientific papers
- Objective: predict the next token (self-supervised — no human labels required)
- Produces a model with broad knowledge but no instruction-following behavior
Supervised Fine-Tuning (SFT) teaches the model to follow instructions:
- Trained on human-written examples of question-answer pairs
- Transforms the raw pre-trained model into an assistant
Reinforcement Learning from Human Feedback (RLHF) aligns the model with human preferences:
- Human raters compare model outputs and provide preference signals
- The model learns to produce outputs that humans prefer (more helpful, less harmful)
Constitutional AI (CAI) — Anthropic's approach used in Claude:
- Rather than only human raters, uses an AI system trained on explicit principles to critique outputs
- Produces more consistent alignment with clearly defined values
Enterprise Implication: Pre-trained base models are not safe for enterprise deployment. They have no instruction-following or safety training, so they will continue any prompt, including harmful ones, without refusing. Always deploy aligned models (SFT + RLHF/CAI). For internal model hosting, never deploy raw base models to production without alignment.
Architecture
LLM Inference Architecture
In enterprise deployment, LLMs are accessed through one of three patterns:
| Pattern | Cost | Latency | Data Control | HIPAA | Operational Complexity |
|---|---|---|---|---|---|
| API-based | Per-token | Low | Provider-dependent | BAA required | Very Low |
| Cloud-managed | Per-token + cloud | Low-Medium | Within cloud tenant | BAA available | Low |
| Self-hosted | GPU CapEx + OpEx | Variable | Complete | Full control | Very High |
For the HMS scenario: Cloud-managed (AWS Bedrock with Claude, or Azure OpenAI) is the recommended pattern. It provides HIPAA BAA coverage, keeps PHI within the hospital's existing cloud environment, and avoids the GPU infrastructure burden that a 500-bed hospital's IT organization cannot sustain.
Implementation Patterns
Pattern 1: Basic Inference with Streaming
Streaming is the correct default for all user-facing applications. It reduces perceived latency dramatically: users see the first tokens as soon as they are generated (typically within about a second) rather than waiting for the full response.
"""
Basic LLM inference with streaming: HMS discharge note assistant.
Context: A hospitalist requests a draft assessment and plan section
for a patient being discharged after a 3-day pneumonia admission.
"""
from typing import Iterator

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()


def generate_assessment_draft(
    diagnosis: str,
    hospital_course: str,
    discharge_medications: list[str],
) -> Iterator[str]:
    """Stream an AI-drafted assessment and plan for physician review."""
    with client.messages.stream(
        model="claude-opus-4-8",
        max_tokens=1024,  # Caps output length, which also bounds generation latency
        system=(
            "You are a clinical documentation assistant. Generate concise, "
            "structured clinical documentation drafts for physician review and editing. "
            "Always indicate that output requires physician review before use. "
            "Do not include specific dosing recommendations."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Draft an Assessment and Plan section.\n"
                f"Primary diagnosis: {diagnosis}\n"
                f"Hospital course: {hospital_course}\n"
                f"Discharge medications: {', '.join(discharge_medications)}"
            ),
        }],
    ) as stream:
        # Yield text fragments as they arrive instead of waiting for the full reply.
        for text in stream.text_stream:
            yield text


if __name__ == "__main__":
    for chunk in generate_assessment_draft(
        diagnosis="Community-acquired pneumonia, right lower lobe",
        hospital_course="3-day inpatient course with IV antibiotics, improving O2 saturation",
        discharge_medications=["Amoxicillin-clavulanate 875mg BID x 5 days", "Azithromycin 500mg daily x 3 days"],
    ):
        print(chunk, end="", flush=True)
    print("\n\n[PHYSICIAN REVIEW REQUIRED BEFORE USE IN MEDICAL RECORD]")
Pattern 2: Controlling Output with Temperature
"""
Temperature selection for clinical vs. creative tasks.
Temperature controls the randomness of token sampling:
- 0.0 = select the highest probability token (near-deterministic in practice;
  identical outputs across calls are not guaranteed)
- 1.0 = sample proportionally to the model's probability distribution
- >1.0 = increase randomness beyond the model's natural distribution
  (not accepted by the Anthropic API, whose range is 0.0 to 1.0)
"""
import json

import anthropic

client = anthropic.Anthropic()


def classify_clinical_intent(note_text: str) -> dict:
    """
    Clinical classification: use low temperature for consistent output.
    For structured clinical tasks, we want the model's most confident answer,
    not creative variation.
    """
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Haiku is sufficient for classification
        max_tokens=128,
        temperature=0.0,  # Classification should not vary between calls
        system=(
            "Classify the clinical note. Respond with JSON only: "
            '{"intent": "admission|discharge|progress|consult|procedure", "confidence": "high|medium|low"}'
        ),
        messages=[{"role": "user", "content": note_text}],
    )
    # json.loads raises json.JSONDecodeError if the model returns anything but JSON;
    # callers should catch it and treat the note as unclassified.
    return json.loads(response.content[0].text)
Healthcare Context: HMS Scenario
Token Cost Modeling for HMS
At a Reference Healthcare Organization processing thousands of daily encounters:
Educational Example — Illustrative Workflow. Not intended for clinical decision making.
Discharge summary generation (per encounter):
- Input: ~3,000 tokens (encounter context, medications, labs, diagnoses)
- Output: ~800 tokens (draft summary)
- Total: ~3,800 tokens per encounter
Daily token volume (illustrative): 2,000 encounters × 3,800 tokens = 7.6M tokens/day
Monthly volume (illustrative): ~228M tokens/month
The cost calculation depends entirely on the chosen model, provider, and current pricing. As an illustrative example using frontier model rates (verify current pricing in official vendor documentation — rates change frequently):
- At $X/1M input tokens: 2,000 × 3,000 × $X/1M × 30 days = monthly input cost
- At $Y/1M output tokens: 2,000 × 800 × $Y/1M × 30 days = monthly output cost
The architectural principle — not the specific number — is what matters: frontier model API costs scale linearly with token volume, and at thousands of daily encounters this becomes a significant line item that must be modeled and optimized.
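The formulas above translate directly into a small cost model. This is a sketch only: the per-million-token prices passed in below are placeholders chosen for easy arithmetic, not real vendor rates, and the function name is hypothetical.

```python
def monthly_cost_usd(encounters_per_day, input_tokens, output_tokens,
                     input_price_per_million, output_price_per_million, days=30):
    """Return (input_cost, output_cost) in USD for one month of traffic."""
    input_cost = (
        encounters_per_day * input_tokens * days * input_price_per_million / 1_000_000
    )
    output_cost = (
        encounters_per_day * output_tokens * days * output_price_per_million / 1_000_000
    )
    return input_cost, output_cost


# Placeholder prices only (X = 2.00, Y = 10.00 per 1M tokens), NOT real vendor rates.
input_cost, output_cost = monthly_cost_usd(2_000, 3_000, 800, 2.0, 10.0)
print(input_cost, output_cost)  # 360.0 480.0
```

With these placeholder rates, the 48M output tokens cost more than the 180M input tokens. Output tokens are typically priced higher than input tokens, so trimming output length can matter as much as trimming context.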
Cost Management Principle: Not every clinical task requires the highest-capability model. Triage and classification can use Claude Haiku, which is priced at a fraction of the Opus rate (often around an order of magnitude lower; verify current pricing). Reserve Opus for complex clinical reasoning tasks. Model selection by use case is the single most impactful cost lever available.
Clinical Safety Properties
| LLM Property | Clinical Risk | Mitigation |
|---|---|---|
| Hallucination | Fabricated drug interactions or doses | RAG with authoritative clinical knowledge base + structured output validation |
| Stochasticity | Different answers to the same clinical question | Low temperature for clinical decisions; always show source citations |
| Knowledge cutoff | Drug withdrawals, guideline updates after training | RAG system with current clinical databases |
| Context window limits | Cannot hold full patient lifetime history | Structured retrieval of clinically relevant encounters only |
Trade-offs and Considerations
Model Selection: Capability vs. Cost
| Model Tier | Examples | Token Cost | Latency | Best For |
|---|---|---|---|---|
| Frontier (large) | Claude Opus 4.8, GPT-4o, Gemini 2.5 Pro | $$$$ | 2–5s | Complex reasoning, clinical documentation, multi-step analysis |
| Mid-tier | Claude Sonnet 4.6 | $$ | 0.5–2s | RAG-augmented generation, summarization, structured extraction |
| Small/Fast | Claude Haiku 4.5, GPT-4o-mini | $ | 0.1–0.5s | Classification, routing, simple Q&A, high-volume tasks |
Rule of thumb: Use the smallest model that achieves acceptable quality on your specific task. Always measure quality empirically — don't assume larger is better.
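The rule of thumb can be expressed as a small selection routine. This is a sketch under the assumption that you have already measured each candidate's accuracy on your own held-out evaluation set; the tier names, costs, and accuracy figures below are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    relative_cost: float      # any consistent unit, e.g. cost per 1K requests
    measured_accuracy: float  # from your own held-out evaluation set


def cheapest_acceptable(candidates: list[Candidate], min_accuracy: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate that meets the quality bar, or None."""
    passing = [c for c in candidates if c.measured_accuracy >= min_accuracy]
    return min(passing, key=lambda c: c.relative_cost) if passing else None


# Made-up evaluation results for three hypothetical tiers.
candidates = [
    Candidate("small", relative_cost=1.0, measured_accuracy=0.93),
    Candidate("mid-tier", relative_cost=4.0, measured_accuracy=0.96),
    Candidate("frontier", relative_cost=20.0, measured_accuracy=0.98),
]
choice = cheapest_acceptable(candidates, min_accuracy=0.95)
print(choice.name)  # mid-tier
```

Returning `None` when nothing passes is deliberate: it signals that the task needs a different approach (better prompts, retrieval, or a revised threshold) rather than silently falling back to the most expensive model.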
Failure Modes
| Failure Mode | Description | Enterprise Mitigation |
|---|---|---|
| Hallucination | Model generates plausible-sounding but incorrect information | RAG + citation grounding + output validation |
| Sycophancy | Model agrees with incorrect statements from the user | Calibrated prompting; don't provide the "answer" in the prompt |
| Context degradation | Quality drops in very long contexts ("lost in the middle") | Keep critical information at prompt start and end |
| Prompt injection | Malicious content overrides system instructions | Input sanitization; see docs/06-Security/02-prompt-injection-defense.md |
| Inference latency spikes | Provider-side latency variability | Timeouts + retry with exponential backoff + circuit breaker |
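For the latency-spike row, a retry wrapper with exponential backoff might look like the sketch below. It is a generic illustration rather than a vendor-specific client: the function name and default values are assumptions, the exception types to retry on depend on the SDK in use, and a circuit breaker is deliberately left out to keep the example short.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay_s=0.5, max_delay_s=8.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Run `operation`, retrying transient failures with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (0.5 s, 1 s, 2 s, ...) up to max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter spreads retries out so many clients do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))


# Demo with a simulated flaky call; sleep is stubbed so the demo runs instantly.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated provider timeout")
    return "ok"


result = call_with_retries(flaky_call, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the wrapper easy to test. In production the per-request timeout itself should be set on the HTTP client or SDK, which this sketch does not cover.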
Comparison Table
| Dimension | Claude (Anthropic) | GPT-4o (OpenAI) | Gemini 2.5 Pro (Google) |
|---|---|---|---|
| Context window | 200K tokens (up to 1M on some models) | 128K tokens | 1M tokens |
| Healthcare BAA | Yes (via AWS Bedrock) | Yes (Azure OpenAI) | Yes (Vertex AI) |
| Tool use / function calling | Yes | Yes | Yes |
| Extended thinking | Yes (Claude 3.7+) | Yes (o1/o3 series) | Yes (thinking mode) |
| Multimodal | Yes (vision + docs) | Yes (vision) | Yes (vision + audio) |
| Best for HMS | Clinical reasoning, documentation | Code generation, structured output | Long-document processing |
Interview Questions
Q1: Explain why LLMs hallucinate and what architectural patterns mitigate this in a clinical system
Category: Technical Depth | Difficulty: Senior | Role: AI Architect
Answer Framework:
Hallucination is not a bug — it is an inherent consequence of how LLMs work. During pre-training, the model learns to predict the next token based on statistical patterns in training data. When the model encounters a question whose answer was rare or absent in training data, it generates a plausible-sounding continuation rather than a factual one, because "plausible continuation" is exactly what the training objective optimized for.
There are three categories of hallucination: (1) factual fabrication — making up drug interactions or clinical facts that don't exist; (2) confident speculation — stating uncertain information as fact; (3) source confabulation — citing papers or guidelines that don't exist. In clinical settings, all three are dangerous.
Architectural mitigations: First, Retrieval-Augmented Generation grounds the model's response in retrieved, authoritative documents, which makes the model far less likely to state a clinical fact that is absent from the retrieved context (it reduces hallucination but does not prevent it). Second, structured output with source citations forces the model to attribute each claim to a specific retrieved document, enabling downstream validation. Third, output validation pipelines can check generated clinical facts against authoritative databases (drug databases, formularies) before presenting to clinicians. Fourth, human-in-the-loop review is non-negotiable for any AI output that influences a clinical decision.
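As one illustration of the citation-based validation step, the sketch below flags sentences that carry no citation and citations that do not match any retrieved document. It assumes a hypothetical convention in which the model cites sources as bracketed IDs such as `[doc1]`; the function name, the ID format, and the sample text are all invented for this example. A check like this confirms that a citation exists, not that the cited document actually supports the claim.

```python
import re


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Flag sentences lacking a citation and citations matching no retrieved document."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    missing, unknown = [], []
    for sentence in sentences:
        cited = re.findall(r"\[([A-Za-z0-9_-]+)\]", sentence)
        if not cited:
            missing.append(sentence)
        unknown.extend(c for c in cited if c not in retrieved_ids)
    return {"missing_citation": missing, "unknown_source": unknown}


# Placeholder text with generic drug names; not clinical content.
draft = (
    "Drug A may interact with drug B [doc1]. "
    "Dose adjustment is advised. "
    "Monitor renal function [doc9]."
)
report = check_citations(draft, retrieved_ids={"doc1", "doc2"})
```

Here the second sentence is flagged for having no citation and `doc9` is flagged as a source that was never retrieved, which is the signature of source confabulation.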
Key Points to Hit:
- Hallucination is architectural, not a fixable bug
- RAG is the primary mitigation — ground output in retrieved facts
- Citations are essential — every clinical claim should cite its source
- Validation pipelines can catch fabricated facts before clinician exposure
- Human review is required for any output that influences clinical decisions
Follow-up Questions:
- "Can you eliminate hallucination completely?" (No — mitigate to acceptable levels)
- "How do you measure hallucination rate in your clinical system?"
Red Flags:
- "We'll just use a bigger model" — model size does not eliminate hallucination
- "We'll fine-tune it to not hallucinate" — fine-tuning does not fix hallucination
Q2: How does the context window affect the architecture of a RAG system for clinical knowledge retrieval?
Category: Architecture | Difficulty: Senior | Role: AI Architect
Answer Framework:
The context window creates a hard constraint that shapes every RAG design decision. Consider a 500-bed hospital's knowledge base: clinical guidelines, drug formularies, formulary restrictions, and clinical protocols may total 50GB of text. The model's context window, even at 200K tokens (roughly 150K words, or about 0.8MB of text at ~4 characters per token), can hold on the order of 0.002% of this knowledge at inference time. RAG's job is to select the right 0.002%.
This constraint drives several architectural decisions. Chunk size must balance two competing forces: smaller chunks increase retrieval precision (you retrieve the exact paragraph that answers the question) but may lack the surrounding context needed for coherent answers. Larger chunks provide more context but reduce precision. The sweet spot for clinical documents is typically 512–1024 tokens with meaningful overlap.
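Fixed-size chunking with overlap can be sketched in a few lines. The example operates on a list of tokens that has already been produced by some tokenizer; the function name and the default sizes are assumptions for illustration, and production pipelines often split on section or sentence boundaries instead of raw token offsets.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the document
    return chunks


# Integers stand in for token IDs from a real tokenizer.
document = list(range(1000))
chunks = chunk_tokens(document, chunk_size=512, overlap=64)
print([len(c) for c in chunks])  # [512, 512, 104]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the price of storing and embedding some tokens twice.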
Retrieved chunks compete for context window space with the system prompt, conversation history, and output reservation. For a 200K context window, roughly: 2K system prompt + 5K conversation history + 10 chunks × 1K = 17K tokens of context, leaving 183K for output and overhead. Even when the window is far from full, every input token is billed and processed on each call, so context window management is an active budget, not just a limit.
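The budget arithmetic is easy to encode as a check that runs before every call. This is a minimal sketch with a hypothetical function name; in a real system the token counts would come from the provider's tokenizer or token-counting endpoint, not from estimates.

```python
def context_budget(window, system_prompt, history, chunk_count, tokens_per_chunk):
    """Return (tokens_used, tokens_remaining) for one request."""
    used = system_prompt + history + chunk_count * tokens_per_chunk
    return used, window - used


# 200K window, 2K system prompt, 5K history, ten 1K chunks.
used, remaining = context_budget(200_000, 2_000, 5_000, 10, 1_000)
print(used, remaining)  # 17000 183000
```

A negative remainder, or one smaller than the planned maximum output length, is a signal to retrieve fewer chunks or trim conversation history before sending the request.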
The "lost in the middle" phenomenon means retrieved chunks placed in the center of a long context are used less reliably by the model. In clinical RAG systems, place the most relevant retrieved chunks first or last in the context block, not sandwiched between less relevant ones.
Key Points to Hit:
- Context window creates a retrieval budget — RAG's job is to select correctly
- Chunk size trade-off: precision vs. context coherence
- Context window is a budget to manage, not just a limit
- "Lost in the middle" — position of retrieved content affects quality
Q3: Why would you choose a smaller model over a frontier model for a high-volume clinical task?
Category: Architecture | Difficulty: Mid-Level | Role: AI Architect / ML Engineer
Answer Framework:
Cost and latency. At enterprise scale, the economics of model selection are decisive. A clinical intake triage system that classifies 10,000 patient messages daily using a frontier model incurs meaningfully higher costs than the same task running on a small, fast model — often an order of magnitude difference. (Verify current pricing in official vendor documentation — specific rates are not quoted here because they change frequently.) Over a year, the compounding difference is significant for a task that doesn't require frontier reasoning capability.
The right selection process: first, define the minimum acceptable quality threshold for the task (e.g., 95% classification accuracy, measured on a held-out test set). Then evaluate the cheapest model that meets that threshold. Classification, entity extraction, routing, and simple Q&A tasks rarely require frontier model capability. Complex multi-step clinical reasoning, nuanced documentation generation, and tasks requiring broad medical knowledge are where frontier models justify their cost.
A two-tier architecture is the practical pattern: a fast, cheap model for high-volume routing and triage, with expensive frontier model invocation reserved for tasks that require it, triggered by the routing decision.
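A minimal sketch of the two-tier pattern follows. The routing policy, the `ROUTINE_INTENTS` set, and the stub functions are all hypothetical; in a real system `classify` would wrap a call to the small model and the two handlers would wrap the respective model clients.

```python
from typing import Callable

# Hypothetical policy: these intents are handled by the small model when confidence is high.
ROUTINE_INTENTS = {"admission", "discharge", "progress"}


def route_request(text: str,
                  classify: Callable[[str], dict],
                  small_model: Callable[[str], str],
                  frontier_model: Callable[[str], str]) -> str:
    """Send routine, high-confidence requests to the small model; escalate the rest."""
    label = classify(text)
    needs_frontier = (
        label["intent"] not in ROUTINE_INTENTS or label["confidence"] != "high"
    )
    handler = frontier_model if needs_frontier else small_model
    return handler(text)


# Stubs standing in for real model calls.
def fake_classify(text):
    if "complex" in text:
        return {"intent": "consult", "confidence": "medium"}
    return {"intent": "progress", "confidence": "high"}


routine = route_request("routine progress note", fake_classify,
                        lambda t: "small", lambda t: "frontier")
escalated = route_request("complex multi-system case", fake_classify,
                          lambda t: "small", lambda t: "frontier")
```

Escalating on low classifier confidence as well as on intent errs toward the more capable model whenever the cheap tier is unsure, which trades some cost for safety.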
Red Flags:
- "We always use the best model available for patient safety" — conflates model size with reliability; small models can be as reliable as large ones for in-distribution tasks
Further Reading
In This Repository:
- Embeddings and Vector Spaces — How LLMs produce representations used in RAG
- Retrieval-Augmented Generation — The primary enterprise pattern for grounding LLM responses
- Prompt Engineering — How to control LLM behavior through careful prompt design
- AI Infrastructure — Infrastructure for hosting and serving LLMs
External:
- Attention Is All You Need (Vaswani et al., 2017) — The transformer architecture paper
- Anthropic docs: docs.anthropic.com — Claude API reference, model capabilities
Summary
Key Takeaways:
- LLMs generate text one token at a time, so inference latency scales with output length
- Self-attention is what enables LLMs to capture long-range dependencies; it is why they can connect a medication mentioned on page 1 to a contraindication mentioned on page 5
- The context window is a budget, not just a limit — manage it explicitly in every system design
- Hallucination is architectural and cannot be eliminated; it must be mitigated through RAG, citations, validation, and human review
- Model selection by task is the single most impactful cost management lever in enterprise AI
- Cloud-managed LLM inference (AWS Bedrock, Azure OpenAI) is the correct pattern for HIPAA-covered clinical AI — it provides BAA coverage without GPU infrastructure burden
The one thing to remember: LLMs are next-token predictors, not knowledge databases. Every enterprise AI architecture decision should account for this fundamental property — hallucination, latency scaling, cost-per-token, and context limits all follow directly from it.