Memory Systems
Executive Summary
Memory is what transforms a stateless LLM call into a persistent, context-aware agent. Without memory, every agent invocation starts with no knowledge of previous interactions, prior decisions, or accumulated context — making long-horizon tasks and multi-session workflows impossible. This chapter covers the four memory types used in production agentic systems (working, episodic, semantic, and procedural), the architectural patterns for implementing each, and the trade-offs that determine which combination is appropriate for a given enterprise use case. AI architects and senior engineers building long-running or multi-session agent systems should read this chapter.
Learning Objectives
- Distinguish the four memory types and describe when each is appropriate
- Implement working memory management for agents operating near context window limits
- Design an episodic memory system using conversation summarization
- Integrate semantic memory (vector store) into an agent's retrieval loop
- Identify failure modes specific to each memory type
Business Problem
Enterprise agentic workflows are not instantaneous. A prior authorization agent may spend hours gathering clinical evidence across multiple tool calls. A care coordination agent may maintain context about a patient across days of interactions. A research agent may need to reference conclusions from previous sessions.
Without memory architecture, three categories of failures occur:
- Context overflow: The agent accumulates so much history that the context window fills and the workflow fails
- Amnesia: The agent forgets decisions made earlier in the same session and contradicts itself
- Isolation: The agent cannot benefit from knowledge accumulated in prior sessions with the same patient or topic
Memory systems solve these problems at the cost of additional infrastructure and architectural complexity.
Why This Technology Exists
The context window is the LLM's only "working memory" by default. Early agent systems (2023) hit practical limits quickly: a frontier model with 8K–32K context could hold only a few dozen back-and-forth turns before the window filled. This forced developers to implement their own memory management — which led to the modern taxonomy of agent memory types borrowed from cognitive science (Tulving, 1972; Squire, 1987).
The parallel to human cognition is useful:
- Working memory (7±2 items) → context window
- Episodic memory (autobiographical events) → conversation history + summarization
- Semantic memory (general knowledge, facts) → vector database knowledge retrieval
- Procedural memory (how to do things) → few-shot examples, system prompt patterns
With 200K+ context windows available today, pure context overflow is less common than it once was, but the other three failure modes (amnesia, isolation, cost) still drive the need for explicit memory architecture.
Conceptual Explanation
Memory in agent systems is not a single system — it is a stack of complementary stores, each operating at a different time horizon:
Time Horizon Memory Type Store Type Access Pattern
───────────── ─────────────── ────────────────── ──────────────────
Current turn Working memory Context window Direct (always in context)
Current session Episodic memory Summarized context Rolling window + summary
Multi-session Episodic (LT) Database + retrieval Explicit retrieval
Factual Semantic memory Vector store Similarity search
Behavioral Procedural memory System prompt / code Hard-coded or retrievedNo agent needs all five simultaneously. The appropriate memory architecture depends on the session length, the need for cross-session persistence, and whether the agent needs to recall facts vs. events vs. procedures.
Core Architecture
Working Memory
The context window is the agent's working memory. Everything the agent can "think about" in a single reasoning step must fit within the context window. Working memory management is the discipline of deciding what to include, what to summarize, and what to discard.
Key constraint: Each LLM call processes the entire context window. Memory that is in context is available; memory not in context does not exist from the model's perspective.
Episodic Memory
Episodic memory stores what happened: previous conversation turns, prior decisions, tool call results from earlier in the session. The challenge: conversation history grows without bound, eventually overflowing the context window.
Solutions:
- Sliding window: Keep only the last N turns in context; discard older turns
- Progressive summarization: Compress older turns into a summary; keep the summary + recent turns
- Full persistence: Store all turns in a database; retrieve relevant ones by recency or semantic similarity
Progressive summarization is the production standard: it preserves the semantic content of older turns without consuming unbounded context space.
Semantic Memory
Semantic memory stores what is true: facts, domain knowledge, institutional policies, reference data. In agent systems, this is implemented as a vector store that the agent can query via a search_knowledge tool. The architecture is identical to RAG (see Chapter 3: RAG) with the difference that the agent decides when to query, rather than a fixed retrieval step.
Procedural Memory
Procedural memory stores how to do things: workflows, decision trees, recurring patterns. In agent systems, this is typically encoded in the system prompt or as few-shot examples. For complex, evolving procedures, it can be stored in a database and retrieved dynamically, allowing procedure updates without code changes.
Architecture Diagram
Components
Working Memory Manager
Responsible for building the context payload for each LLM call. Tracks current token usage and applies eviction/summarization when approaching the context limit.
from dataclasses import dataclass, field
@dataclass
class ConversationMessage:
role: str # "user", "assistant", "tool_result"
content: str
token_estimate: int = 0
is_summarized: bool = False
@dataclass
class WorkingMemory:
"""
Manages agent working memory within context window budget.
Educational Example — Reference Implementation.
"""
system_prompt: str
max_tokens: int = 180_000 # Reserve 20K for output
summary_trigger: int = 120_000 # Summarize when context exceeds this
keep_recent_n: int = 6 # Always keep last N turns verbatim
messages: list[ConversationMessage] = field(default_factory=list)
episodic_summary: str = ""
def add_message(self, role: str, content: str) -> None:
msg = ConversationMessage(
role=role,
content=content,
token_estimate=len(content.split()) * 4 // 3,
)
self.messages.append(msg)
# Check if summarization is needed
if self._total_token_estimate() > self.summary_trigger:
self._compress_history()
def _total_token_estimate(self) -> int:
base = len(self.system_prompt.split()) * 4 // 3
summary = len(self.episodic_summary.split()) * 4 // 3
recent = sum(m.token_estimate for m in self.messages[-self.keep_recent_n:])
return base + summary + recent
def _compress_history(self) -> None:
"""Move older messages into the episodic summary."""
if len(self.messages) <= self.keep_recent_n:
return
to_compress = self.messages[:-self.keep_recent_n]
new_content = "\n".join(
f"{m.role}: {m.content[:200]}..." for m in to_compress
)
# In production: call a summarization LLM here
self.episodic_summary += f"\n[SUMMARY OF EARLIER TURNS]:\n{new_content[:500]}..."
self.messages = self.messages[-self.keep_recent_n:]
def build_context(self) -> tuple[str, list[dict]]:
"""Build system prompt and messages list for LLM call."""
system = self.system_prompt
if self.episodic_summary:
system += f"\n\n[PRIOR CONTEXT SUMMARY]:\n{self.episodic_summary}"
messages = [
{"role": m.role, "content": m.content}
for m in self.messages
]
return system, messagesSemantic Memory Integration
"""
Semantic memory via vector store — agent decides when to query.
Educational Example — Illustrative Pattern.
"""
from anthropic import Anthropic
# Semantic memory tool exposed to the agent
SEMANTIC_MEMORY_TOOL = {
"name": "search_knowledge_base",
"description": (
"Search the institutional knowledge base for relevant policies, guidelines, "
"and reference information. Use this when you need factual information that "
"is not in the current conversation context. Returns up to 5 relevant passages "
"with source citations. Does NOT search patient records — use get_patient_summary() "
"for patient-specific information."
),
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query in natural language."
},
"domain": {
"type": "string",
"enum": ["clinical_guidelines", "payer_policies", "drug_formulary", "compliance"],
"description": "The knowledge domain to search within."
}
},
"required": ["query"]
}
}
def search_knowledge_base(
query: str,
domain: str = "clinical_guidelines",
vector_store_client=None,
) -> dict:
"""
Execute semantic search against the vector store.
In production: query Pinecone/Weaviate with domain-specific namespace filtering.
"""
# Stub: in production, call vector store with embedded query
return {
"success": True,
"query": query,
"domain": domain,
"results": [
{
"content": f"Relevant passage about '{query}' from {domain}...",
"source": "Institutional Guidelines v2.1",
"relevance_score": 0.87,
}
]
}Episodic Memory with Persistence
"""
Long-term episodic memory — persists agent context across sessions.
Educational Example — Illustrative Pattern.
"""
import json
from datetime import datetime
class EpisodicMemoryStore:
"""
Stores and retrieves agent session memories across runs.
Production implementation: PostgreSQL or Redis with session indexing.
"""
def __init__(self) -> None:
# In production: replace with database connection
self._store: dict[str, list[dict]] = {}
def save_session(self, agent_id: str, session_id: str, summary: str, key_facts: list[str]) -> None:
"""Persist a session summary and extracted key facts after workflow completion."""
if agent_id not in self._store:
self._store[agent_id] = []
self._store[agent_id].append({
"session_id": session_id,
"completed_at": datetime.utcnow().isoformat(),
"summary": summary,
"key_facts": key_facts,
})
def retrieve_relevant_sessions(self, agent_id: str, context_hint: str, top_k: int = 3) -> list[dict]:
"""
Retrieve the most relevant past sessions for the current context.
Production: embed context_hint and semantic-search stored summaries.
Stub: return most recent sessions.
"""
sessions = self._store.get(agent_id, [])
return sorted(sessions, key=lambda s: s["completed_at"], reverse=True)[:top_k]Enterprise Considerations
Context window economics. Everything in the context window costs tokens on every call. A 10K token episodic summary included in every agent step costs 10K × (number of steps) × (token rate) across the workflow. Evaluate whether the episodic summary is needed on every step, or only on steps where it is likely to be relevant.
Memory consistency. When multiple agents share a memory store (e.g., in multi-agent systems), write ordering and cache invalidation become concerns. Two agents writing to the same patient record concurrently can produce inconsistent state. Implement optimistic locking or event-sourcing patterns for shared memory stores.
PHI in memory. Episodic and semantic memory stores that contain patient information are HIPAA-covered. Apply the same encryption, access control, and audit logging to memory stores as to the primary EHR. De-identify memories before storing them in shared systems.
Memory decay strategy. Memory stores grow indefinitely without a retention policy. Implement TTL (time-to-live) on episodic memories: short-term sessions expire after 24 hours; task-specific memories expire when the task is closed; long-term patient context lives as long as the patient relationship. Align retention policies with HIPAA minimum necessary standards.
Security Considerations
Memory injection. External data written into long-term memory (conversation summaries, retrieved documents) becomes part of future prompts. If an attacker can influence what gets stored in memory, they can inject instructions that affect future agent behavior. Validate and sanitize content before persisting it to memory stores.
PHI leakage through shared memory. In multi-tenant systems, memory stores must be strictly partitioned by tenant/patient ID. A query to semantic memory must not return results from another tenant's data. Implement namespace isolation at the vector store level.
Healthcare Example
Educational Example — Illustrative Workflow. Not intended for clinical decision making.
A Reference Healthcare Organization's care coordination agent maintains three memory tiers for each patient engagement:
Working memory (per interaction):
- Current encounter context + active clinical questions
- Last 6 turns of the care coordinator conversation
- Retrieved clinical guidelines (loaded on demand)
Episodic memory (per patient, per care episode):
- Compressed summary of all prior interactions with this patient in the current episode
- Key decisions made: medications discussed, referrals initiated, concerns flagged
- Persisted to PostgreSQL; retrieved at session start
Semantic memory (institutional, shared):
- Clinical guidelines library (read by all agents)
- Drug formulary (read by all agents)
- Payer policy database (read by agents with appropriate authorization)
The episodic memory allows the care coordinator agent to say, in a follow-up session 3 days later: "In our last interaction, you mentioned the patient expressed concern about the medication side effects. Has that been addressed?" — without re-reading the full prior conversation.
Common Mistakes
Putting everything in context without eviction. Appending every tool result and message turn to context without limit exhausts the context window and produces truncation errors or degraded performance.
Summarizing too aggressively. A summary that compresses 20 turns into 2 sentences loses the specific decisions and data points the agent needs to continue coherently. Preserve key entities (patient IDs, decision outcomes, tool results) in summaries even at the cost of verbosity.
No cross-session memory for long-horizon workflows. A prior auth workflow that spans 3 days of evidence gathering, with each session starting fresh, loses all accumulated context. Implement episodic persistence for workflows with multi-day time horizons.
Sharing semantic memory without access control. A vector store that returns results from any patient's records without checking the calling agent's authorization is a serious PHI leakage risk.
Best Practices
- Implement working memory management with a token budget tracker; trigger summarization at ~60–70% of context capacity
- Always preserve key entities (IDs, decisions, data values) when summarizing; do not compress them away
- Use progressive summarization (compress old → keep recent verbatim), not sliding window alone
- Implement session-level episodic persistence for any workflow that may span multiple interactions
- Partition all memory stores by tenant/patient; enforce namespace isolation at the store level
- Define TTL policies for all memory tiers; align with HIPAA minimum necessary standards
- Instrument memory operations: log what is written, retrieved, and evicted for debuggability
Alternatives
| Memory Approach | When to Use | Trade-off |
|---|---|---|
| Context window only | Short, single-session workflows | Simple; fails at scale or long sessions |
| Sliding window | When recency is the only relevant signal | Fast to implement; loses earlier context |
| Progressive summarization | Long sessions where all history matters | Preserves meaning; compression quality varies |
| Full persistence + retrieval | Multi-session workflows; complex history | Complete record; retrieval latency + complexity |
| External graph memory | Complex relationship tracking (knowledge graphs) | Richest structure; highest operational complexity |
Trade-offs
| Dimension | More Memory | Less Memory |
|---|---|---|
| Coherence | Higher — agent remembers prior decisions | Lower — agent may contradict itself |
| Cost | Higher — more tokens per call | Lower |
| Latency | Higher — memory retrieval time | Lower |
| Infrastructure | More complex — stores, retrieval, eviction | Simpler |
| PHI risk | Higher surface area | Lower |
Interview Questions
Q1: Explain the four memory types in agent systems and when each is appropriate.
Category: Architecture Difficulty: Senior Role: AI Architect / ML Engineer
Answer Framework:
Working memory is the context window — everything currently in context. It is always used; the question is what to put in it and what to evict.
Episodic memory stores what happened: conversation history, prior tool call results, decisions made. It is needed for sessions lasting more than a few turns and essential for multi-session workflows. Implementation options range from sliding window (simple) to progressive summarization (preferred) to full persistence with semantic retrieval (for complex long-horizon tasks).
Semantic memory stores what is true: domain knowledge, institutional policies, reference data. It is implemented as a vector store the agent queries on demand via a tool. It is the right choice when the knowledge base is too large for the context window and updated frequently.
Procedural memory stores how to do things: workflows, decision processes, behavioral patterns. It typically lives in the system prompt for fixed procedures, or in a retrievable procedure store for procedures that evolve.
Q2: How do you prevent context window overflow in a long-running agent workflow?
Category: System Design Difficulty: Senior Role: AI Architect
Answer Framework:
The key insight is that the agent does not need the verbatim history of every past turn — it needs the semantic content of past turns. Progressive summarization preserves the meaning while compressing the token count.
Implementation: track token usage at each step. When total context approaches a trigger threshold (70–80% of max context), run a summarization step on the oldest N turns: call a small, fast model (not the frontier model — this is an internal step) to produce a compressed summary, replace the old turns with the summary, and continue. Keep the last 6–10 turns verbatim because they contain the most relevant immediate context.
Critical: the summarization prompt must be instructed to preserve key entities — patient IDs, decision outcomes, data values — because compressed sentences without specific facts are not useful to the agent.
Key Takeaways
- Agent memory is a stack: working (context), episodic (history), semantic (knowledge), procedural (behavior)
- The context window is working memory; everything beyond it requires explicit memory architecture
- Progressive summarization is the production standard for episodic memory management
- Semantic memory (vector store) allows agents to access knowledge too large for the context window on demand
- All memory stores containing PHI require the same protection as the primary EHR
- Define TTL policies for all memory tiers; memory stores that grow indefinitely become operational and security liabilities
- Instrument memory operations for debuggability — silent memory failures produce subtle, hard-to-diagnose agent errors
Glossary
| Term | Definition |
|---|---|
| Working memory | The agent's context window — what it can actively reason about in a single step |
| Episodic memory | Storage of what happened: conversation history, events, decisions |
| Semantic memory | Storage of what is true: facts, domain knowledge, reference data |
| Procedural memory | Storage of how to do things: workflows, behavioral patterns |
| Progressive summarization | Compressing older turns into a summary while keeping recent turns verbatim |
| Memory decay | Time-based expiration of stored memories according to a retention policy |
| Context window overflow | When accumulated context exceeds the model's maximum token limit |
| Episodic persistence | Saving session memories to a database for retrieval in future sessions |
Further Reading
In This Repository:
- RAG Architecture — Foundation for semantic memory implementation
- Context Window Management — Working memory strategies in detail
- LangGraph Deep Dive — LangGraph's built-in checkpointing for episodic persistence
External References:
- Tulving (1972), "Episodic and Semantic Memory" — the cognitive science origin of memory type taxonomy
- LangGraph Memory documentation — official reference for memory integration with LangGraph
Previous: Tool Design Patterns | Next: Multi-Agent Systems