Context Window Management

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate

Executive Summary

The context window is a finite, expensive resource that must be actively managed in every production LLM system. Poor management produces one of two failure modes: the request exceeds the limit and fails, or the window fills with low-value content that crowds out high-value content. This chapter covers context window budgeting, the "lost in the middle" problem, conversation management strategies, long-document processing, and the techniques used in production clinical systems, where context cost affects what the system can afford to run and retrieval quality directly affects clinical outcomes.

Trade-offs and Considerations

Longer Context vs. Better Retrieval

With 200K+ context windows available, the tempting design is to put everything in the context rather than invest in a retrieval system. This has hidden costs:

Factor	Small Context + Good Retrieval	Large Context (everything in)
Cost per query	Low (only relevant content)	High (full context every time)
Latency	Lower (shorter processing)	Higher (quadratic attention cost)
Quality	High (focused relevant content)	Degraded by "lost in the middle"
Freshness	Real-time (index updates)	Stale (need to rebuild context)
Audit trail	Clear (retrieved sources logged)	Unclear (what did the model use?)

Recommendation: Even with large context windows available, use RAG for knowledge retrieval. The context window is for conversation coherence and patient-specific context, not for replacing a knowledge retrieval system.
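One way to act on this recommendation is to reserve part of the window for conversation and patient-specific context and fill only the remainder with retrieved chunks. The sketch below is a minimal illustration under stated assumptions: `estimate_tokens`, `pack_chunks`, and the `(score, text)` chunk shape are hypothetical names, and the whitespace word count is a crude stand-in for the model's real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough proxy.
    # Replace with the tokenizer of the model actually in use.
    return len(text.split())


def pack_chunks(chunks, budget_tokens, reserved_tokens):
    """Greedily select retrieved chunks that fit the remaining budget.

    chunks: list of (relevance_score, text) pairs (hypothetical shape).
    budget_tokens: total context budget for the call.
    reserved_tokens: tokens held back for conversation and patient context.
    Returns the selected texts, highest relevance first.
    """
    remaining = budget_tokens - reserved_tokens
    selected = []
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if cost <= remaining:
            selected.append(text)
            remaining -= cost
    return selected


retrieved = [(0.9, "a b c"), (0.5, "d e f g"), (0.8, "h i")]
packed = pack_chunks(retrieved, budget_tokens=10, reserved_tokens=4)
```

Lower-scoring chunks that do not fit are skipped rather than truncated, which keeps every included chunk intact and easy to log for the audit trail.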

Cost Impact of Context Length

Context length drives cost directly. Token pricing applies to every input token, so a 200K context call costs roughly 100x more in input tokens than a 2K context call. The ratio is a stable architectural principle even as specific per-token rates change:

Illustrative ratio (verify current pricing in official vendor documentation):
- 2K context call:   2,000 tokens × rate = X
- 20K context call: 20,000 tokens × rate = 10X
- 200K context call: 200,000 tokens × rate = 100X

For a Reference Healthcare Organization with thousands of daily clinical AI queries, the difference between a 20K and a 200K average context is roughly an order of magnitude in daily input-token cost. Context length is therefore a primary cost lever: optimize it before scaling.
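The arithmetic behind these ratios can be checked with a short script. This is an illustrative sketch only: the rate and the daily query volume are placeholder assumptions, not real vendor prices or measured traffic, and `input_cost` is a hypothetical helper.

```python
def input_cost(tokens: int, rate_per_million: float) -> float:
    """Input-token cost of one call at a given per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million


# Placeholder values (assumptions): substitute current vendor pricing
# and your own measured query volume.
RATE_PER_MILLION = 1.0
QUERIES_PER_DAY = 5_000

baseline = input_cost(2_000, RATE_PER_MILLION)
for context_tokens in (2_000, 20_000, 200_000):
    per_call = input_cost(context_tokens, RATE_PER_MILLION)
    daily = per_call * QUERIES_PER_DAY
    ratio = per_call / baseline
    print(f"{context_tokens:>7} tokens: {ratio:.0f}x per call, {daily:.2f} per day")
```

Because input cost is linear in token count, the ratios stay the same whatever rate is plugged in; only the absolute daily figure changes.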

Interview Questions

Q1: How does the "lost in the middle" problem affect RAG system design?

Category: Technical Depth | Difficulty: Senior | Role: AI Architect

Answer Framework:

The "lost in the middle" problem (Liu et al., 2023) is the empirically observed tendency of LLMs to recall information less reliably when it sits in the middle of a long context than when it sits near the beginning or end. As an illustration, in a 32K token context, information at position ~16K may be recalled 15–20% less reliably than information at position 1K or 31K; the size of the gap varies by model and task.

For RAG system design, this has several implications. First, order retrieved chunks so the most relevant ones sit at the beginning and end of the retrieved context block, rather than sandwiched in the middle with less relevant chunks at the extremes. Second, place critical patient safety information (allergy list, high-alert medications) at the top of the context, where attention is strongest. Third, when using large numbers of retrieved chunks (top-k > 10), the middle chunks provide diminishing returns, so a re-ranking strategy that places the two best chunks first and last, with the others in the middle, tends to outperform naive ordering.

Fourth, this phenomenon suggests that increasing context size beyond what's needed for the specific query may actually harm quality, not help it. A 20K context with well-positioned retrieved chunks may outperform a 100K context where the relevant information is diluted among irrelevant content in the middle. This is a key argument against the "just put everything in the context" design approach.
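The edge-first ordering described in this answer can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `order_for_edges` and `build_context` are hypothetical names, and the input is assumed to be already sorted from most to least relevant by an upstream re-ranker.

```python
def order_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Put the best chunk first, the second best last, and work inward.

    Input must be sorted from most to least relevant. The least
    relevant chunks end up in the middle of the returned list.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


def build_context(safety_items: list[str], chunks_by_relevance: list[str]) -> list[str]:
    # Safety-critical items (e.g. allergy list) are pinned to the top,
    # followed by the edge-ordered retrieved chunks.
    return safety_items + order_for_edges(chunks_by_relevance)


ranked = ["A", "B", "C", "D", "E"]  # A is the most relevant
ordered = order_for_edges(ranked)
context = build_context(["ALLERGIES"], ranked)
```

With five ranked chunks, the most relevant lands first, the second most relevant lands last, and the lowest-ranked chunk sits in the middle, where degraded recall costs the least.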
