Memory Systems

Conceptual Explanation

Memory in agent systems is not a single system — it is a stack of complementary stores, each operating at a different time horizon:

text

Time Horizon    Memory Type       Store Type           Access Pattern
─────────────   ───────────────   ──────────────────   ──────────────────
Current turn    Working memory    Context window        Direct (always in context)
Current session Episodic memory   Summarized context    Rolling window + summary
Multi-session   Episodic (LT)    Database + retrieval  Explicit retrieval
Factual         Semantic memory   Vector store          Similarity search
Behavioral      Procedural memory System prompt / code  Hard-coded or retrieved

No agent needs all five simultaneously. The appropriate memory architecture depends on the session length, the need for cross-session persistence, and whether the agent needs to recall facts vs. events vs. procedures.

Core Architecture

Working Memory

The context window is the agent's working memory. Everything the agent can "think about" in a single reasoning step must fit within the context window. Working memory management is the discipline of deciding what to include, what to summarize, and what to discard.

Key constraint: Each LLM call processes the entire context window. Memory that is in context is available; memory not in context does not exist from the model's perspective.

Episodic Memory

Episodic memory stores what happened: previous conversation turns, prior decisions, tool call results from earlier in the session. The challenge: conversation history grows without bound, eventually overflowing the context window.

Solutions:

Sliding window: Keep only the last N turns in context; discard older turns
Progressive summarization: Compress older turns into a summary; keep the summary + recent turns
Full persistence: Store all turns in a database; retrieve relevant ones by recency or semantic similarity

Progressive summarization is the production standard: it preserves the semantic content of older turns without consuming unbounded context space.

Semantic Memory

Semantic memory stores what is true: facts, domain knowledge, institutional policies, reference data. In agent systems, this is implemented as a vector store that the agent can query via a search_knowledge tool. The architecture is identical to RAG (see Chapter 3: RAG) with the difference that the agent decides when to query, rather than a fixed retrieval step.

Procedural Memory

Procedural memory stores how to do things: workflows, decision trees, recurring patterns. In agent systems, this is typically encoded in the system prompt or as few-shot examples. For complex, evolving procedures, it can be stored in a database and retrieved dynamically, allowing procedure updates without code changes.

Architecture Diagram

graph TD subgraph "Agent Context Window (Working Memory)" SP["System Prompt\n(Procedural Memory)"] Summary["Episodic Summary\n(compressed older turns)"] RecentTurns["Recent Turns\n(last N interactions)"] RetrievedKnowledge["Retrieved Knowledge\n(semantic memory, retrieved on demand)"] CurrentInput["Current Input\n+ Tool Results"] end subgraph "External Memory Stores" ConvDB["Conversation Store\n(PostgreSQL / Redis)"] VectorDB["Vector Store\n(Pinecone / Weaviate)"] ProcStore["Procedure Store\n(Config / Database)"] end subgraph "Memory Operations" Summarizer["Summarization Service\n(compresses old turns)"] Retriever["Retrieval Service\n(semantic search)"] Writer["Memory Writer\n(persists after session)"] end ConvDB -->|"Retrieve last N turns"| RecentTurns ConvDB -->|"Retrieve full history"| Summarizer Summarizer --> Summary VectorDB -->|"Semantic search"| RetrievedKnowledge ProcStore -->|"Load procedure"| SP CurrentInput --> Writer Writer --> ConvDB Writer --> VectorDB

Common Mistakes

Putting everything in context without eviction. Appending every tool result and message turn to context without limit exhausts the context window and produces truncation errors or degraded performance.

Summarizing too aggressively. A summary that compresses 20 turns into 2 sentences loses the specific decisions and data points the agent needs to continue coherently. Preserve key entities (patient IDs, decision outcomes, tool results) in summaries even at the cost of verbosity.

No cross-session memory for long-horizon workflows. A prior auth workflow that spans 3 days of evidence gathering, with each session starting fresh, loses all accumulated context. Implement episodic persistence for workflows with multi-day time horizons.

Sharing semantic memory without access control. A vector store that returns results from any patient's records without checking the calling agent's authorization is a serious PHI leakage risk.

Best Practices

Implement working memory management with a token budget tracker; trigger summarization at ~60–70% of context capacity
Always preserve key entities (IDs, decisions, data values) when summarizing; do not compress them away
Use progressive summarization (compress old → keep recent verbatim), not sliding window alone
Implement session-level episodic persistence for any workflow that may span multiple interactions
Partition all memory stores by tenant/patient; enforce namespace isolation at the store level
Define TTL policies for all memory tiers; align with HIPAA minimum necessary standards
Instrument memory operations: log what is written, retrieved, and evicted for debuggability

Alternatives

Memory Approach	When to Use	Trade-off
Context window only	Short, single-session workflows	Simple; fails at scale or long sessions
Sliding window	When recency is the only relevant signal	Fast to implement; loses earlier context
Progressive summarization	Long sessions where all history matters	Preserves meaning; compression quality varies
Full persistence + retrieval	Multi-session workflows; complex history	Complete record; retrieval latency + complexity
External graph memory	Complex relationship tracking (knowledge graphs)	Richest structure; highest operational complexity

Trade-offs

Dimension	More Memory	Less Memory
Coherence	Higher — agent remembers prior decisions	Lower — agent may contradict itself
Cost	Higher — more tokens per call	Lower
Latency	Higher — memory retrieval time	Lower
Infrastructure	More complex — stores, retrieval, eviction	Simpler
PHI risk	Higher surface area	Lower

Interview Questions

Q1: Explain the four memory types in agent systems and when each is appropriate.

Category: Architecture Difficulty: Senior Role: AI Architect / ML Engineer

Answer Framework:

Working memory is the context window — everything currently in context. It is always used; the question is what to put in it and what to evict.

Episodic memory stores what happened: conversation history, prior tool call results, decisions made. It is needed for sessions lasting more than a few turns and essential for multi-session workflows. Implementation options range from sliding window (simple) to progressive summarization (preferred) to full persistence with semantic retrieval (for complex long-horizon tasks).

Semantic memory stores what is true: domain knowledge, institutional policies, reference data. It is implemented as a vector store the agent queries on demand via a tool. It is the right choice when the knowledge base is too large for the context window and updated frequently.

Procedural memory stores how to do things: workflows, decision processes, behavioral patterns. It typically lives in the system prompt for fixed procedures, or in a retrievable procedure store for procedures that evolve.

Q2: How do you prevent context window overflow in a long-running agent workflow?

Category: System Design Difficulty: Senior Role: AI Architect

Answer Framework:

The key insight is that the agent does not need the verbatim history of every past turn — it needs the semantic content of past turns. Progressive summarization preserves the meaning while compressing the token count.

Implementation: track token usage at each step. When total context approaches a trigger threshold (70–80% of max context), run a summarization step on the oldest N turns: call a small, fast model (not the frontier model — this is an internal step) to produce a compressed summary, replace the old turns with the summary, and continue. Keep the last 6–10 turns verbatim because they contain the most relevant immediate context.

Critical: the summarization prompt must be instructed to preserve key entities — patient IDs, decision outcomes, data values — because compressed sentences without specific facts are not useful to the agent.

Key Takeaways

Agent memory is a stack: working (context), episodic (history), semantic (knowledge), procedural (behavior)
The context window is working memory; everything beyond it requires explicit memory architecture
Progressive summarization is the production standard for episodic memory management
Semantic memory (vector store) allows agents to access knowledge too large for the context window on demand
All memory stores containing PHI require the same protection as the primary EHR
Define TTL policies for all memory tiers; memory stores that grow indefinitely become operational and security liabilities
Instrument memory operations for debuggability — silent memory failures produce subtle, hard-to-diagnose agent errors

Memory Systems#

Conceptual Explanation#

Core Architecture#

Working Memory#

Episodic Memory#

Semantic Memory#

Procedural Memory#

Architecture Diagram#

Common Mistakes#

Best Practices#

Alternatives#

Trade-offs#

Interview Questions#

Q1: Explain the four memory types in agent systems and when each is appropriate.#

Q2: How do you prevent context window overflow in a long-running agent workflow?#

Key Takeaways#