Context Window Management
Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate
Executive Summary
The context window is a finite, expensive resource that must be actively managed in every production LLM system. Poor management fails in one of two ways: the request exceeds the limit and errors out, or low-value content fills the window and crowds out high-value content. This chapter covers context window budgeting, the "lost in the middle" problem, conversation management strategies, long-document processing, and the techniques used in production clinical systems, where both context cost and retrieval quality directly affect clinical outcomes.
Prerequisites
- LLM Fundamentals — Context window mechanics, token costs
- RAG — Retrieval as a context management strategy
Core Concepts
The Context Window Budget
Every LLM call has a total context budget: the maximum number of tokens that can be processed in a single inference call. This budget is shared between:
Total Context Budget
├── System prompt (stable, cached)
├── Conversation history (grows with multi-turn interaction)
├── Retrieved context / RAG chunks
├── User message (current turn)
└── Output reservation (space for the response)
Budget Constraint:
system_prompt + history + retrieved + user_message + output_reservation
≤ max_context_window
For a 200K token Claude context window and a clinical RAG application:
| Component | Tokens | Notes |
|---|---|---|
| System prompt | 2,000 | Cached after first call |
| Conversation history (last 5 turns) | 5,000 | Grows with each turn |
| Patient EHR context | 3,000 | Current encounter data |
| Retrieved clinical guidelines | 8,192 | 8 chunks × 1,024 tokens |
| User question | 100 | Average clinical query |
| Output reservation | 2,048 | Sufficient for clinical response |
| Total used | ~20,340 | |
| Context remaining | ~179,660 | Available headroom |
In this configuration, context management is straightforward because there is ample headroom. It becomes critical in two scenarios: long multi-turn conversations, where history accumulates with every turn, and large document processing, where the content exceeds what the context can hold.
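The budget arithmetic in the table above can be checked mechanically before each call. The following is a minimal sketch, assuming token counts are already known for each component; the `ContextBudget` class and its field names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Token counts for each component of one LLM call (hypothetical helper)."""
    system_prompt: int
    history: int
    patient_context: int
    retrieved: int
    user_message: int
    output_reservation: int

    def total(self) -> int:
        """Sum of every component, including the space reserved for output."""
        return (
            self.system_prompt + self.history + self.patient_context
            + self.retrieved + self.user_message + self.output_reservation
        )

    def remaining(self, max_context_window: int) -> int:
        """Headroom left in the window; a negative value means the call does not fit."""
        return max_context_window - self.total()

    def fits(self, max_context_window: int) -> bool:
        return self.remaining(max_context_window) >= 0


# Values taken from the clinical RAG table above.
budget = ContextBudget(
    system_prompt=2_000,
    history=5_000,
    patient_context=3_000,
    retrieved=8_192,
    user_message=100,
    output_reservation=2_048,
)
print(budget.total())             # 20340
print(budget.remaining(200_000))  # 179660
```

Running a check like this before each request turns a context overflow into an explicit decision (trim history, retrieve fewer chunks) instead of an API error.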
The "Lost in the Middle" Problem
Research on long-context models (Liu et al., 2023) shows that LLMs use information at the beginning and end of the context window most reliably, while recall of information in the middle of very long contexts degrades significantly. In a 100K token context, a key fact placed at position 50K may be retrieved less reliably than the same fact placed at position 1K or 99K.
Context position quality map:
┌─────────────────────────────────────────────────────┐
│ HIGH ATTENTION ██████████░░░░░░░░░░░████████████ │
│ (early tokens) (degraded middle) (late tokens) │
└─────────────────────────────────────────────────────┘
Implications for RAG design: Place the most relevant retrieved chunks at the beginning and end of the retrieved context block, not in the middle. If using 8 retrieved chunks, put the two highest-similarity chunks first and last, with the less relevant chunks between them.
Implications for conversation design: In long conversations, the most recent turns (placed last in the context) are most reliably attended to. Summarize early conversation turns rather than dropping them entirely.
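The chunk-ordering advice can be sketched as a small reordering step applied after retrieval. This is one possible interleaving, assuming the retriever returns chunks sorted most relevant first; the function name `order_for_long_context` is hypothetical.

```python
def order_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place the strongest chunks at the edges and the weakest in the middle.

    Input must be sorted most relevant first. Ranks alternate between the
    front and the back of the output, so rank 1 ends up first and rank 2
    ends up last.
    """
    front: list[str] = []
    back: list[str] = []
    for rank, chunk in enumerate(chunks_by_relevance):
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so relevance rises again toward the end.
    return front + back[::-1]


ordered = order_for_long_context(["c1", "c2", "c3", "c4", "c5"])
print(ordered)  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

The least relevant chunk (`c5` here) lands in the center, which is the position where degraded recall costs the least.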
Architecture
Context Management Strategies
Strategy 1: Conversation History Management
In a multi-turn clinical consultation, conversation history grows with each turn. Without management, a long conversation will eventually exceed the context limit or degrade quality as early context dilutes later, more relevant turns.
Progressive summarization: After N turns (typically 10–15), summarize the early turns into a compact summary and replace them with the summary. The summary is prepended to the retained recent turns.
"""
Progressive conversation history summarization for HMS clinical consultation.
Context: A care coordinator is having an extended conversation with the
clinical AI about a complex patient's care plan. After 10 turns, early
turns are summarized to preserve context budget.
"""
import anthropic
from dataclasses import dataclass, field
@dataclass
class ConversationMessage:
    role: str  # "user" or "assistant"
    content: str
    turn_index: int


@dataclass
class ManagedConversation:
    messages: list[ConversationMessage] = field(default_factory=list)
    summary: str = ""
    max_recent_turns: int = 10
    summary_trigger_turns: int = 15  # Summarize once this many messages are stored


def should_summarize(conversation: ManagedConversation) -> bool:
    """Return True once the stored history reaches the summarization trigger."""
    return len(conversation.messages) >= conversation.summary_trigger_turns


def summarize_early_turns(
    conversation: ManagedConversation,
    client: anthropic.Anthropic,
) -> str:
    """
    Summarize the earliest turns of the conversation to free context budget.
    Retains the most recent max_recent_turns turns verbatim.

    This function only returns the new summary. The caller is expected to
    store it in conversation.summary and trim conversation.messages to the
    retained recent turns.
    """
    turns_to_summarize = conversation.messages[:-conversation.max_recent_turns]
    if not turns_to_summarize:
        return conversation.summary

    conversation_text = "\n".join(
        f"{msg.role.upper()}: {msg.content}"
        for msg in turns_to_summarize
    )

    # Include previous summary if it exists
    prior_context = ""
    if conversation.summary:
        prior_context = f"Prior conversation summary:\n{conversation.summary}\n\n"

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Small model for summarization
        max_tokens=512,
        system=(
            "You are summarizing a clinical consultation conversation. "
            "Capture: clinical questions asked, key clinical facts discussed, "
            "decisions made, and any pending items. Be concise. Clinical accuracy required."
        ),
        messages=[{
            "role": "user",
            "content": f"{prior_context}Summarize this conversation:\n\n{conversation_text}",
        }],
    )
    return response.content[0].text


def build_context_for_llm(
    conversation: ManagedConversation,
    current_user_message: str,
) -> list[dict]:
    """Build the messages array for the LLM API call."""
    messages = []

    # Inject the summary as a leading user/assistant exchange if it exists
    if conversation.summary:
        messages.append({
            "role": "user",
            "content": f"[Conversation context from earlier turns]:\n{conversation.summary}",
        })
        messages.append({
            "role": "assistant",
            "content": "Understood. I have the context from our earlier discussion.",
        })

    # Include recent turns verbatim
    recent = conversation.messages[-conversation.max_recent_turns:]
    for msg in recent:
        messages.append({"role": msg.role, "content": msg.content})

    # Add current message
    messages.append({"role": "user", "content": current_user_message})
    return messages
Strategy 2: Long Document Processing with Map-Reduce
When a document exceeds the context window, process it in chunks and combine the results:
"""
Map-reduce pattern for processing a long clinical document.
Context: HMS scenario — analyzing a 200-page Joint Commission survey
report to extract all action items and deficiency findings.
"""
import json

import anthropic

client = anthropic.Anthropic()


def chunk_document(text: str, max_tokens_per_chunk: int = 50_000) -> list[str]:
    """Split document into chunks that fit within context budget."""
    # Approximate: 1 token ≈ 4 characters for clinical text.
    # Splitting on a fixed character count can cut a finding in half.
    chars_per_chunk = max_tokens_per_chunk * 4
    chunks = []
    for i in range(0, len(text), chars_per_chunk):
        chunks.append(text[i : i + chars_per_chunk])
    return chunks


def map_extract_findings(chunk: str, chunk_index: int, total_chunks: int) -> list[dict]:
    """
    MAP phase: Extract action items from a single document chunk.
    Each chunk is processed independently, which enables parallel processing.
    """
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=(
            "Extract all action items, deficiency findings, and required corrections "
            "from this document chunk. Return as JSON array: "
            '[{"finding": "...", "category": "...", "priority": "immediate|30-day|90-day", '
            '"responsible_department": "..."}]. '
            "If no findings, return empty array []."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Document chunk {chunk_index + 1} of {total_chunks}:\n\n{chunk}"
            ),
        }],
    )
    # Raises json.JSONDecodeError if the model returns anything but bare JSON
    return json.loads(response.content[0].text)


def reduce_findings(all_findings: list[list[dict]]) -> dict:
    """
    REDUCE phase: Synthesize findings from all chunks into a consolidated report.
    """
    flat_findings = [f for chunk_findings in all_findings for f in chunk_findings]
    if not flat_findings:
        return {"total_findings": 0, "immediate_action": [], "findings": []}

    findings_json = json.dumps(flat_findings, indent=2)
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=4096,
        system=(
            "You are synthesizing quality improvement findings from a survey report. "
            "Deduplicate findings, organize by priority and department, and produce "
            "a structured action plan in JSON format."
        ),
        messages=[{
            "role": "user",
            "content": f"Synthesize these findings into a consolidated action plan:\n\n{findings_json}",
        }],
    )
    # Raises json.JSONDecodeError if the model returns anything but bare JSON
    return json.loads(response.content[0].text)


def process_long_document(document_text: str) -> dict:
    """Process a document that exceeds the context window using map-reduce."""
    chunks = chunk_document(document_text)

    # MAP phase (can be parallelized in production)
    chunk_findings = [
        map_extract_findings(chunk, i, len(chunks))
        for i, chunk in enumerate(chunks)
    ]

    # REDUCE phase
    return reduce_findings(chunk_findings)
Strategy 3: Hierarchical Summarization
For very long documents where facts need to be preserved through the hierarchy:
"""
Hierarchical summarization for HMS patient record processing.
Context: Generating a comprehensive patient summary from 5 years
of clinical notes (>500 pages) for a new specialist consultation.
"""
import anthropic


def hierarchical_summarize(
    sections: list[str],
    client: anthropic.Anthropic,
    intermediate_max_tokens: int = 1024,
    final_max_tokens: int = 2048,
    target_specialty: str = "cardiology",
) -> str:
    """
    Two-level hierarchical summarization:
    Level 1: Summarize each section independently
    Level 2: Synthesize section summaries into final clinical summary

    target_specialty focuses the summary on clinically relevant details
    for the receiving specialist.
    """
    # Level 1: one summary per section
    section_summaries = []
    for section in sections:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",  # Fast, cheap for intermediate step
            max_tokens=intermediate_max_tokens,
            system=(
                f"Summarize this clinical note section for a {target_specialty} specialist. "
                "Focus on: diagnoses, relevant findings, medications, procedures, "
                "and clinical course. Preserve dates and clinical values precisely."
            ),
            messages=[{"role": "user", "content": section}],
        )
        section_summaries.append(response.content[0].text)

    # Level 2: final synthesis
    combined = "\n\n---\n\n".join(
        f"Period {i+1} summary:\n{s}" for i, s in enumerate(section_summaries)
    )
    final_response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=final_max_tokens,
        system=(
            f"Create a comprehensive clinical summary for a {target_specialty} consultation. "
            "Synthesize the periodic summaries into a coherent clinical narrative. "
            "Highlight trends, significant events, and items most relevant to the consultation."
        ),
        messages=[{"role": "user", "content": combined}],
    )
    return final_response.content[0].text
Healthcare Context: HMS Scenario
Clinical Context Window Budget — Reference Healthcare Organization
Different clinical AI applications at HMS have different context budgets (illustrative — token counts vary by model and implementation):
| Application | System Prompt | Patient Context | Retrieved Knowledge | History | Output | Total |
|---|---|---|---|---|---|---|
| Clinical Q&A | 2K | 3K | 8K | 5K | 2K | ~20K |
| Discharge Summary | 1.5K | 15K (full encounter) | 2K | 0 | 4K | ~22.5K |
| Prior Auth Agent | 2K | 5K | 12K (payer criteria) | 3K | 3K | ~25K |
| Full Chart Review | 1.5K | 100K (5yr history) | 2K | 0 | 5K | ~108.5K |
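Full chart review (~108.5K tokens) fits comfortably only in a 200K-token window such as Claude's. It would technically fit in a 128K window such as GPT-4 Turbo's, but with under 20K tokens of headroom, so a slightly longer chart history or additional retrieved knowledge would exceed the limit. Headroom of this kind is a real architectural constraint that drives model selection for specific clinical workflows.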
Clinical Conversation Management
In a multi-turn clinical consultation over a complex patient, conversation management must preserve:
- The patient's allergy list (safety-critical, must never drop from context)
- The active problem list (clinical reference throughout the conversation)
- Medications that were discussed (prevents contradictory recommendations across turns)
- Decisions that were made (maintains clinical coherence)
These items must be explicitly preserved in the conversation summary — they cannot be treated as disposable conversation history.
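One way to make this preservation explicit is to keep the safety-critical items in structured form and render them into the context on every call, so they never depend on what a model-written summary happens to retain. The sketch below is illustrative only; `PinnedClinicalFacts`, `build_summary_block`, and the sample values are hypothetical and not part of any HMS system.

```python
from dataclasses import dataclass, field


@dataclass
class PinnedClinicalFacts:
    """Safety-critical items stored outside the summarizable history.

    Hypothetical structure for illustration; field names are assumptions.
    """
    allergies: list[str] = field(default_factory=list)
    active_problems: list[str] = field(default_factory=list)
    medications_discussed: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render every category, stating explicitly when one is empty."""
        sections = [
            ("Allergies", self.allergies),
            ("Active problems", self.active_problems),
            ("Medications discussed", self.medications_discussed),
            ("Decisions made", self.decisions),
        ]
        lines = ["[Pinned clinical facts]"]
        for label, items in sections:
            lines.append(f"{label}: {'; '.join(items) if items else 'none recorded'}")
        return "\n".join(lines)


def build_summary_block(pinned: PinnedClinicalFacts, model_summary: str) -> str:
    """Put the pinned facts ahead of the model-written summary.

    The pinned block comes from structured data, so an omission in the
    model-written summary cannot remove an allergy from the context.
    """
    return f"{pinned.render()}\n\n[Conversation summary]\n{model_summary}"


pinned = PinnedClinicalFacts(
    allergies=["penicillin"],
    medications_discussed=["metoprolol 25 mg"],
)
block = build_summary_block(pinned, "Discussed rate control options.")
print(block)
```

Placing the pinned block first also puts it at the start of the context, where recall is strongest.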
Trade-offs and Considerations
Longer Context vs. Better Retrieval
With 200K+ context windows available, the tempting design is to put everything in the context rather than invest in a retrieval system. This has hidden costs:
| Factor | Small Context + Good Retrieval | Large Context (everything in) |
|---|---|---|
| Cost per query | Low (only relevant content) | High (full context every time) |
| Latency | Lower (shorter processing) | Higher (quadratic attention cost) |
| Quality | High (focused relevant content) | Degraded by "lost in the middle" |
| Freshness | Real-time (index updates) | Stale (need to rebuild context) |
| Audit trail | Clear (retrieved sources logged) | Unclear (what did the model use?) |
Recommendation: Even with large context windows available, use RAG for knowledge retrieval. The context window is for conversation coherence and patient-specific context, not for replacing a knowledge retrieval system.
Cost Impact of Context Length
Context length drives cost directly. Token pricing applies to every input token, so a 200K context call costs roughly 100x more in input tokens than a 2K context call. The ratio is a stable architectural principle even as specific per-token rates change:
Illustrative ratio (verify current pricing in official vendor documentation):
- 2K context call: 2,000 tokens × rate = X
- 20K context call: 20,000 tokens × rate = 10X
- 200K context call: 200,000 tokens × rate = 100X
For a Reference Healthcare Organization with thousands of daily clinical AI queries, the difference between a 20K and a 200K average context is typically an order of magnitude in daily API costs. Context length is therefore a primary cost lever and should be optimized before scaling.
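The ratio can be turned into a rough daily estimate. The sketch below is arithmetic only: the rate and query volume are hypothetical placeholders, not real vendor prices, and the calculation ignores output tokens and any prompt-caching discounts.

```python
def daily_input_cost(
    avg_context_tokens: int,
    queries_per_day: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Estimated daily spend on input tokens only."""
    return avg_context_tokens * queries_per_day * usd_per_million_input_tokens / 1_000_000


# Hypothetical placeholders; verify current pricing in official vendor documentation.
RATE_USD_PER_MILLION = 3.0
QUERIES_PER_DAY = 5_000

small = daily_input_cost(20_000, QUERIES_PER_DAY, RATE_USD_PER_MILLION)
large = daily_input_cost(200_000, QUERIES_PER_DAY, RATE_USD_PER_MILLION)
print(f"20K average context:  ${small:,.2f} per day")
print(f"200K average context: ${large:,.2f} per day")
print(f"Ratio: {large / small:.0f}x")
```

Whatever rate is plugged in, the ratio between the two configurations stays at 10x because input cost scales linearly with context length.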
Interview Questions
Q1: How does the "lost in the middle" problem affect RAG system design?
Category: Technical Depth Difficulty: Senior Role: AI Architect
Answer Framework:
The "lost in the middle" problem (Liu et al., 2023) describes the empirically observed phenomenon where LLMs have degraded recall for information placed in the middle of long contexts compared to information at the beginning or end. In a 32K token context, information placed at position ~16K may be recalled substantially less reliably than information placed at position 1K or 31K (on the order of 15–20% in this illustration; the measured drop varies by model and task).
For RAG system design, this has several implications. First, retrieved chunks should be ordered with the most relevant chunks at the beginning and end of the retrieved context block, not sandwiched in the middle with less relevant chunks at the extremes. Second, critical patient safety information (allergy list, high-alert medications) should be placed at the top of the context, where attention is strongest. Third, when using large numbers of retrieved chunks (top-k > 10), the middle chunks provide diminishing returns, so a re-ranking strategy that places the two best chunks first and last, with the others in the middle, tends to outperform naive ordering.
Fourth, this phenomenon suggests that increasing context size beyond what's needed for the specific query may actually harm quality, not help it. A 20K context with well-positioned retrieved chunks may outperform a 100K context where the relevant information is diluted among irrelevant content in the middle. This is a key argument against the "just put everything in the context" design approach.
Further Reading
In This Repository:
- LLM Fundamentals — Context window size and token cost basics
- RAG — Retrieval as the primary context management strategy
- Prompt Engineering — System prompt design within context budget
Summary
Key Takeaways:
- The context window is a budget to be actively managed, not a limit to occasionally hit
- "Lost in the middle": information in the center of long contexts is retrieved less reliably — place critical content first or last
- Conversation history must be managed through progressive summarization in multi-turn applications
- Long documents require map-reduce (parallel chunk processing) or hierarchical summarization (sequential level-by-level compression)
- Even with large context windows (200K+), RAG is superior to "put everything in context" for knowledge retrieval — it is cheaper, fresher, faster, and more auditable
- Context length drives cost linearly — a 10x longer context costs 10x more in input tokens
The one thing to remember: The context window is not a storage location — it is the model's working memory for a single inference call. Manage it with the same discipline you would apply to RAM: put in only what's needed for the current task, in the positions where it will be used most effectively.
Next: Multimodal AI | Previous: Evaluation and Benchmarking