Context Window Management

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate


Executive Summary

The context window is a finite, expensive resource that must be actively managed in every production LLM system. Failing to manage it leads to two failure modes: exceeding the limit and causing errors, or filling it with low-value content that crowds out high-value content. This chapter covers context window budgeting, the "lost in the middle" problem, conversation management strategies, long-document processing, and the techniques used in production clinical systems where both context window cost and the quality of information retrieval directly affect clinical outcomes.


Prerequisites

  • LLM Fundamentals — Context window mechanics, token costs
  • RAG — Retrieval as a context management strategy

Core Concepts

The Context Window Budget

Every LLM call has a total context budget: the maximum number of tokens that can be processed in a single inference call. This budget is shared between:

text
Total Context Budget
├── System prompt (stable, cached)
├── Conversation history (grows with multi-turn interaction)
├── Retrieved context / RAG chunks
├── User message (current turn)
└── Output reservation (space for the response)

Budget Constraint:
  system_prompt + history + retrieved + user_message + output_reservation
  ≤ max_context_window

For a 200K token Claude context window and a clinical RAG application:

| Component | Tokens | Notes |
|---|---|---|
| System prompt | 2,000 | Cached after first call |
| Conversation history (last 5 turns) | 5,000 | Grows with each turn |
| Patient EHR context | 3,000 | Current encounter data |
| Retrieved clinical guidelines | 8,192 | 8 chunks × 1,024 tokens |
| User question | 100 | Average clinical query |
| Output reservation | 2,048 | Sufficient for clinical response |
| Total used | ~20,340 | |
| Context remaining | ~179,660 | Available headroom |

In this configuration there is ample headroom, so context management is straightforward. It becomes critical in two scenarios: long multi-turn conversations (history accumulates with every turn) and large-document processing (content exceeds what the window can hold).
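The budget constraint and the worked table above can be checked mechanically before each call. The following is a minimal sketch, assuming token counts are already known for each component; the `ContextBudget` class and its field names are hypothetical and simply mirror the rows of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Token allocation for one LLM call (all values are token counts)."""
    system_prompt: int
    history: int
    retrieved: int
    user_message: int
    output_reservation: int
    patient_context: int = 0
    max_context_window: int = 200_000

    @property
    def total_used(self) -> int:
        # Left-hand side of the budget constraint.
        return (
            self.system_prompt + self.history + self.patient_context
            + self.retrieved + self.user_message + self.output_reservation
        )

    @property
    def remaining(self) -> int:
        return self.max_context_window - self.total_used

    def fits(self) -> bool:
        return self.total_used <= self.max_context_window


# Values taken from the clinical RAG table above.
budget = ContextBudget(
    system_prompt=2_000,
    history=5_000,
    patient_context=3_000,
    retrieved=8 * 1_024,
    user_message=100,
    output_reservation=2_048,
)
print(budget.total_used, budget.remaining, budget.fits())  # 20340 179660 True
```

Running a check like this before the API call turns an over-budget request into an explicit application decision (trim history, retrieve fewer chunks) rather than a provider-side error.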

The "Lost in the Middle" Problem

Research (Liu et al., 2023) has shown that LLMs use information at the beginning and end of the context window most reliably, while recall of information in the middle of very long contexts degrades significantly. In a 100K-token context, a key fact placed around position 50K may be retrieved less reliably than the same fact placed near position 1K or 99K.

text
Context position quality map:
┌─────────────────────────────────────────────────────┐
│ HIGH ATTENTION    ██████████░░░░░░░░░░░████████████ │
│ (early tokens)   (degraded middle)   (late tokens)  │
└─────────────────────────────────────────────────────┘

Implications for RAG design: Place the most relevant retrieved chunks at the edges of the retrieved context block, not in the middle. With 8 retrieved chunks, put the highest-similarity chunks first and last, and the less relevant chunks in between.

Implications for conversation design: In long conversations, the most recent turns (placed last in the context) are most reliably attended to. Summarize early conversation turns rather than dropping them entirely.
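The RAG ordering advice above can be sketched as a small reordering step applied after retrieval. This is one possible interleaving, assuming the input list is already sorted from most to least relevant; `order_for_long_context` is a hypothetical helper name.

```python
def order_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Reorder ranked chunks so the strongest sit at the edges of the block.

    Input is sorted most relevant first. Ranks 1, 3, 5, ... fill the front;
    ranks 2, 4, 6, ... fill the back in reverse order, which leaves the
    weakest chunks in the middle.
    """
    front = chunks_by_relevance[0::2]
    back = chunks_by_relevance[1::2]
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]  # c1 = most relevant
print(order_for_long_context(ranked))
# ['c1', 'c3', 'c5', 'c7', 'c8', 'c6', 'c4', 'c2']
```

The best chunk lands first, the second best lands last, and the two lowest-ranked chunks (c7, c8) end up in the center, where degraded recall costs the least.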


Architecture

Context Management Strategies

Strategy 1: Conversation History Management

In a multi-turn clinical consultation, conversation history grows with each turn. Without management, a long conversation will eventually exceed the context limit or degrade quality as early context dilutes later, more relevant turns.

Progressive summarization: After N turns (typically 10–15), summarize the early turns into a compact summary and replace them with the summary. The summary is prepended to the retained recent turns.

python
"""
Progressive conversation history summarization for HMS clinical consultation.
Context: A care coordinator is having an extended conversation with the
clinical AI about a complex patient's care plan. After 10 turns, early
turns are summarized to preserve context budget.
"""
import anthropic
from dataclasses import dataclass, field

@dataclass
class ConversationMessage:
    role: str  # "user" or "assistant"
    content: str
    turn_index: int

@dataclass
class ManagedConversation:
    messages: list[ConversationMessage] = field(default_factory=list)
    summary: str = ""
    max_recent_turns: int = 10  # Messages kept verbatim (counts messages, not user/assistant pairs)
    summary_trigger_turns: int = 15  # Summarize once the message count reaches this


def should_summarize(conversation: ManagedConversation) -> bool:
    return len(conversation.messages) >= conversation.summary_trigger_turns


def summarize_early_turns(
    conversation: ManagedConversation,
    client: anthropic.Anthropic,
) -> str:
    """
    Summarize the earliest turns of the conversation to free context budget.
    Retains the most recent max_recent_turns turns verbatim.
    """
    turns_to_summarize = conversation.messages[:-conversation.max_recent_turns]
    
    if not turns_to_summarize:
        return conversation.summary

    conversation_text = "\n".join(
        f"{msg.role.upper()}: {msg.content}"
        for msg in turns_to_summarize
    )

    # Include previous summary if it exists
    prior_context = ""
    if conversation.summary:
        prior_context = f"Prior conversation summary:\n{conversation.summary}\n\n"

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Small model for summarization
        max_tokens=512,
        system=(
            "You are summarizing a clinical consultation conversation. "
            "Capture: clinical questions asked, key clinical facts discussed, "
            "decisions made, and any pending items. Be concise. Clinical accuracy required."
        ),
        messages=[{
            "role": "user",
            "content": f"{prior_context}Summarize this conversation:\n\n{conversation_text}",
        }],
    )

    return response.content[0].text


def build_context_for_llm(
    conversation: ManagedConversation,
    current_user_message: str,
) -> list[dict]:
    """Build the messages array for the LLM API call."""
    messages = []

    # Inject the summary as a leading user/assistant exchange if it exists
    if conversation.summary:
        messages.append({
            "role": "user",
            "content": f"[Conversation context from earlier turns]:\n{conversation.summary}",
        })
        messages.append({
            "role": "assistant",
            "content": "Understood. I have the context from our earlier discussion.",
        })

    # Include recent turns verbatim
    recent = conversation.messages[-conversation.max_recent_turns:]
    for msg in recent:
        messages.append({"role": msg.role, "content": msg.content})

    # Add current message
    messages.append({"role": "user", "content": current_user_message})

    return messages
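
The functions above decide when to summarize and produce the summary text, but the step that applies the result (dropping the summarized messages and storing the new summary) is not shown. A minimal sketch of that step follows. It is written against plain message dicts and an injected `summarize` callable so it runs without an API call; `compact_history` and `fake_summarize` are hypothetical names, and a real summarizer would call the model as in `summarize_early_turns`.

```python
from typing import Callable

Message = dict[str, str]  # {"role": "user" | "assistant", "content": "..."}


def compact_history(
    messages: list[Message],
    summary: str,
    summarize: Callable[[str, list[Message]], str],
    trigger: int = 15,
    keep_recent: int = 10,
) -> tuple[list[Message], str]:
    """Fold older messages into the summary once the history reaches `trigger`.

    Returns (retained_messages, updated_summary). Below the trigger the
    inputs are returned unchanged. Assumes keep_recent >= 1.
    """
    if len(messages) < trigger:
        return messages, summary
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return recent, summarize(summary, older)


# Stand-in summarizer for illustration only.
def fake_summarize(prior: str, older: list[Message]) -> str:
    return f"{prior} [+{len(older)} messages summarized]".strip()


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"m{i}"}
    for i in range(16)
]
kept, summary = compact_history(history, "", fake_summarize)
print(len(kept), summary)  # 10 [+6 messages summarized]
```

With an even `keep_recent` and a history that starts on a user message, the retained window also starts on a user message, which keeps the user/assistant alternation intact after compaction.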

Strategy 2: Long Document Processing — Map-Reduce

When a document exceeds the context window, process it in chunks and combine the results:

python
"""
Map-reduce pattern for processing a long clinical document.
Context: HMS scenario — analyzing a 200-page Joint Commission survey
report to extract all action items and deficiency findings.
"""
import anthropic

client = anthropic.Anthropic()


def chunk_document(text: str, max_tokens_per_chunk: int = 50_000) -> list[str]:
    """Split document into chunks that fit within context budget."""
    # Approximate: 1 token ≈ 4 characters for clinical text
    chars_per_chunk = max_tokens_per_chunk * 4
    chunks = []
    for i in range(0, len(text), chars_per_chunk):
        chunks.append(text[i : i + chars_per_chunk])
    return chunks


def map_extract_findings(chunk: str, chunk_index: int, total_chunks: int) -> list[dict]:
    """
    MAP phase: Extract action items from a single document chunk.
    Each chunk is processed independently — enables parallel processing.
    """
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=(
            "Extract all action items, deficiency findings, and required corrections "
            "from this document chunk. Return as JSON array: "
            '[{"finding": "...", "category": "...", "priority": "immediate|30-day|90-day", '
            '"responsible_department": "..."}]. '
            "If no findings, return empty array []."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Document chunk {chunk_index + 1} of {total_chunks}:\n\n{chunk}"
            ),
        }],
    )
    import json
    return json.loads(response.content[0].text)


def reduce_findings(all_findings: list[list[dict]]) -> dict:
    """
    REDUCE phase: Synthesize findings from all chunks into a consolidated report.
    """
    flat_findings = [f for chunk_findings in all_findings for f in chunk_findings]

    if not flat_findings:
        return {"total_findings": 0, "immediate_action": [], "findings": []}

    import json
    findings_json = json.dumps(flat_findings, indent=2)

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=4096,
        system=(
            "You are synthesizing quality improvement findings from a survey report. "
            "Deduplicate findings, organize by priority and department, and produce "
            "a structured action plan in JSON format."
        ),
        messages=[{
            "role": "user",
            "content": f"Synthesize these findings into a consolidated action plan:\n\n{findings_json}",
        }],
    )

    return json.loads(response.content[0].text)


def process_long_document(document_text: str) -> dict:
    """Process a document that exceeds the context window using map-reduce."""
    chunks = chunk_document(document_text)
    
    # MAP phase (can be parallelized in production)
    chunk_findings = [
        map_extract_findings(chunk, i, len(chunks))
        for i, chunk in enumerate(chunks)
    ]

    # REDUCE phase
    return reduce_findings(chunk_findings)
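
The MAP phase above runs sequentially, with a note that it can be parallelized. Because each chunk call is an independent, I/O-bound request, a thread pool is one straightforward way to do that. The sketch below uses a stand-in worker so it runs offline; `parallel_map` and `count_words` are hypothetical names, and `max_workers` is an assumption that should be tuned to the provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def parallel_map(
    chunks: list[str],
    worker: Callable[[str, int, int], T],
    max_workers: int = 4,
) -> list[T]:
    """Run worker(chunk, index, total) over all chunks concurrently.

    Results are returned in chunk order. An exception raised in any worker
    is re-raised when its result is collected.
    """
    total = len(chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(worker, chunk, i, total)
            for i, chunk in enumerate(chunks)
        ]
        return [f.result() for f in futures]


# Stand-in worker with the same (chunk, index, total) shape as the MAP step.
def count_words(chunk: str, index: int, total: int) -> dict:
    return {"chunk": index + 1, "of": total, "words": len(chunk.split())}


results = parallel_map(["a b c", "d e", "f"], count_words)
print(results)
```

A function with the signature of `map_extract_findings` could be passed as `worker` unchanged, since it takes the same three arguments.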

Strategy 3: Hierarchical Summarization

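For very long documents where key facts (dates, clinical values, diagnoses) must survive each level of compression, summarize each section independently and then synthesize the section summaries: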

python
"""
Hierarchical summarization for HMS patient record processing.
Context: Generating a comprehensive patient summary from 5 years
of clinical notes (>500 pages) for a new specialist consultation.
"""

import anthropic


def hierarchical_summarize(
    sections: list[str],
    client: anthropic.Anthropic,
    intermediate_max_tokens: int = 1024,
    final_max_tokens: int = 2048,
    target_specialty: str = "cardiology",
) -> str:
    """
    Two-level hierarchical summarization:
    Level 1: Summarize each section independently
    Level 2: Synthesize section summaries into final clinical summary

    target_specialty focuses the summary on clinically relevant details
    for the receiving specialist.
    """
    section_summaries = []
    for section in sections:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",  # Fast, cheap for intermediate step
            max_tokens=intermediate_max_tokens,
            system=(
                f"Summarize this clinical note section for a {target_specialty} specialist. "
                "Focus on: diagnoses, relevant findings, medications, procedures, "
                "and clinical course. Preserve dates and clinical values precisely."
            ),
            messages=[{"role": "user", "content": section}],
        )
        section_summaries.append(response.content[0].text)

    # Final synthesis
    combined = "\n\n---\n\n".join(
        f"Period {i+1} summary:\n{s}" for i, s in enumerate(section_summaries)
    )
    final_response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=final_max_tokens,
        system=(
            f"Create a comprehensive clinical summary for a {target_specialty} consultation. "
            "Synthesize the periodic summaries into a coherent clinical narrative. "
            "Highlight trends, significant events, and items most relevant to the consultation."
        ),
        messages=[{"role": "user", "content": combined}],
    )
    return final_response.content[0].text
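
The two-level scheme above assumes the joined section summaries fit in a single final call. For a record with many sections that may not hold. One option, offered here as an extension rather than something the scenario specifies, is to group the Level 1 summaries into batches under a token budget and summarize each batch before the final synthesis. The sketch shows only the batching step; `batch_by_token_budget` is a hypothetical helper, and the 4-characters-per-token estimate is the same rough approximation used in the map-reduce example.

```python
def batch_by_token_budget(
    texts: list[str],
    max_tokens: int,
    chars_per_token: int = 4,
) -> list[list[str]]:
    """Group texts, in order, into batches whose estimated size fits max_tokens.

    Token counts are estimated from character length. A single text larger
    than max_tokens still gets its own batch rather than being dropped.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for text in texts:
        cost = len(text) // chars_per_token + 1  # rough estimate, rounded up
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches


summaries = ["x" * 40 for _ in range(5)]  # about 11 estimated tokens each
batches = batch_by_token_budget(summaries, max_tokens=25)
print([len(b) for b in batches])  # [2, 2, 1]
```

Each extra level of summarization is another lossy step, so this trades some fidelity for the ability to handle arbitrarily long records.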

Healthcare Context — HMS Scenario

Clinical Context Window Budget — Reference Healthcare Organization

Different clinical AI applications at HMS have different context budgets (illustrative — token counts vary by model and implementation):

| Application | System Prompt | Patient Context | Retrieved Knowledge | History | Output | Total |
|---|---|---|---|---|---|---|
| Clinical Q&A | 2K | 3K | 8K | 5K | 2K | ~20K |
| Discharge Summary | 1.5K | 15K (full encounter) | 2K | 0 | 4K | ~22.5K |
| Prior Auth Agent | 2K | 5K | 12K (payer criteria) | 3K | 3K | ~25K |
| Full Chart Review | 1.5K | 100K (5yr history) | 2K | 0 | 5K | ~108.5K |

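At roughly 108.5K tokens, full chart review fits comfortably in Claude's 200K context window. It nominally fits in a 128K window (such as GPT-4 Turbo's) as well, but with only about 19.5K tokens of headroom, so a longer chart history or a larger output reservation would exceed it. Context window size is a real architectural constraint that drives model selection for specific clinical workflows.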

Clinical Conversation Management

In a multi-turn clinical consultation over a complex patient, conversation management must preserve:

  • The patient's allergy list (safety-critical, must never drop from context)
  • The active problem list (clinical reference throughout the conversation)
  • Medications that were discussed (prevents contradictory recommendations across turns)
  • Decisions that were made (maintains clinical coherence)

These items must be explicitly preserved in the conversation summary — they cannot be treated as disposable conversation history.
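
One way to make this preservation explicit is to carry the safety-critical items as structured data alongside the free-text summary, so they are never subject to the summarizer's compression. The following is a minimal sketch under that assumption; `PinnedClinicalFacts` and `build_summary_block` are hypothetical names, and the example values are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class PinnedClinicalFacts:
    """Safety-critical items carried verbatim, outside the free-text summary."""
    allergies: list[str] = field(default_factory=list)
    active_problems: list[str] = field(default_factory=list)
    medications_discussed: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the pinned facts as a labeled block for the prompt."""
        sections = [
            ("ALLERGIES", self.allergies),
            ("ACTIVE PROBLEMS", self.active_problems),
            ("MEDICATIONS DISCUSSED", self.medications_discussed),
            ("DECISIONS MADE", self.decisions),
        ]
        lines = []
        for label, items in sections:
            # Empty lists are stated explicitly so absence is never ambiguous.
            lines.append(f"{label}: {'; '.join(items) if items else 'none recorded'}")
        return "\n".join(lines)


def build_summary_block(pinned: PinnedClinicalFacts, rolling_summary: str) -> str:
    """Place pinned facts ahead of the model-written summary."""
    return f"{pinned.render()}\n\nEarlier discussion (summarized):\n{rolling_summary}"


pinned = PinnedClinicalFacts(
    allergies=["penicillin"],
    decisions=["hold metformin before contrast"],
)
print(build_summary_block(pinned, "Reviewed renal function trends."))
```

Placing the pinned block first also puts the allergy list at the top of the injected context, consistent with the positioning guidance earlier in this chapter.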


Trade-offs and Considerations

Longer Context vs. Better Retrieval

With 200K+ context windows available, the tempting design is to put everything in the context rather than invest in a retrieval system. This has hidden costs:

| Factor | Small Context + Good Retrieval | Large Context (everything in) |
|---|---|---|
| Cost per query | Low (only relevant content) | High (full context every time) |
| Latency | Lower (shorter processing) | Higher (quadratic attention cost) |
| Quality | High (focused relevant content) | Degraded by "lost in the middle" |
| Freshness | Real-time (index updates) | Stale (need to rebuild context) |
| Audit trail | Clear (retrieved sources logged) | Unclear (what did the model use?) |

Recommendation: Even with large context windows available, use RAG for knowledge retrieval. The context window is for conversation coherence and patient-specific context, not for replacing a knowledge retrieval system.

Cost Impact of Context Length

Context length drives cost directly. Token pricing applies to every input token, so a 200K context call costs roughly 100x more in input tokens than a 2K context call. The ratio is a stable architectural principle even as specific per-token rates change:

text
Illustrative ratio (verify current pricing in official vendor documentation):
- 2K context call:   2,000 tokens × rate = X
- 20K context call: 20,000 tokens × rate = 10X
- 200K context call: 200,000 tokens × rate = 100X

For a Reference Healthcare Organization with thousands of daily clinical AI queries, the difference between a 20K and a 200K average context is roughly an order of magnitude in daily input-token cost. Context length is therefore a primary cost lever; optimize it before scaling.
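
The scaling above can be worked through with a few lines of arithmetic. The sketch below assumes a flat per-token input rate and ignores prompt caching and output tokens; the rate and query volume are hypothetical placeholders, not quoted prices.

```python
def daily_input_cost(
    avg_context_tokens: int,
    queries_per_day: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Estimated daily spend on input tokens at a flat per-token rate."""
    total_tokens = avg_context_tokens * queries_per_day
    return total_tokens * usd_per_million_input_tokens / 1_000_000


RATE = 3.00      # hypothetical USD per million input tokens, not a real price
QUERIES = 5_000  # hypothetical daily query volume

for ctx in (2_000, 20_000, 200_000):
    print(f"{ctx:>7} tokens/query -> ${daily_input_cost(ctx, QUERIES, RATE):,.2f}/day")
```

Whatever rate is substituted, the three results keep the same 1 : 10 : 100 ratio, which is the point of the comparison.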


Interview Questions

Q1: How does the "lost in the middle" problem affect RAG system design?

Category: Technical Depth | Difficulty: Senior | Role: AI Architect

Answer Framework:

The "lost in the middle" problem (Liu et al., 2023) describes the empirically observed phenomenon where LLMs have degraded recall for information placed in the middle of long contexts compared to information at the beginning or end. As an illustration, in a 32K-token context, information placed around position 16K may be recalled noticeably less reliably than information placed near position 1K or 31K; the size of the drop varies by model and task.

For RAG system design, this has several implications. First, retrieved chunks should be ordered with the most relevant chunks at the beginning and end of the retrieved context block, rather than buried in the middle. Second, critical patient safety information (allergy list, high-alert medications) should be placed at the top of the context, where attention is strongest. Third, when using large numbers of retrieved chunks (top-k > 10), the middle chunks provide diminishing returns; a re-ranking strategy that places the two best chunks first and last, with the others in the middle, can outperform naive ordering.

Fourth, this phenomenon suggests that increasing context size beyond what's needed for the specific query may actually harm quality, not help it. A 20K context with well-positioned retrieved chunks may outperform a 100K context where the relevant information is diluted among irrelevant content in the middle. This is a key argument against the "just put everything in the context" design approach.


Further Reading

In This Repository:

  • LLM Fundamentals — Context window size and token cost basics
  • RAG — Retrieval as the primary context management strategy
  • Prompt Engineering — System prompt design within context budget

Summary

Key Takeaways:

  • The context window is a budget to be actively managed, not a limit to occasionally hit
  • "Lost in the middle": information in the center of long contexts is retrieved less reliably — place critical content first or last
  • Conversation history must be managed through progressive summarization in multi-turn applications
  • Long documents require map-reduce (parallel chunk processing) or hierarchical summarization (sequential level-by-level compression)
  • Even with large context windows (200K+), RAG is superior to "put everything in context" for knowledge retrieval — it is cheaper, fresher, faster, and more auditable
  • Context length drives cost linearly — a 10x longer context costs 10x more in input tokens

The one thing to remember: The context window is not a storage location — it is the model's working memory for a single inference call. Manage it with the same discipline you would apply to RAM: put in only what's needed for the current task, in the positions where it will be used most effectively.


Next: Multimodal AI | Previous: Evaluation and Benchmarking