Multi-Agent Systems
Conceptual Explanation
When Multi-Agent Architecture is Warranted
Three conditions justify the coordination overhead of multi-agent systems:
- Work can be parallelized: Independent subtasks that could proceed simultaneously are being bottlenecked in a single-agent sequential loop
- Tool count exceeds ~15 per agent: Tool selection accuracy degrades significantly above this threshold; specialization restores precision
- Domain specialization produces meaningful quality gains: A clinical agent configured with a clinical system prompt and clinical tools outperforms a general agent given all tools
The Three Topologies
Orchestrator-Worker: A central orchestrator agent decomposes the goal, delegates subtasks to specialized worker agents, and aggregates results. Workers report back to the orchestrator; they do not communicate with each other directly. Best for: tasks with clear decomposition, moderate parallelism, and sequential dependency between phases.
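A minimal sketch of the orchestrator-worker flow described above, assuming asynchronous workers. The worker names, the `decompose` planner, and the task strings are hypothetical placeholders; in practice the planning step would be a model call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical workers: each one handles exactly one kind of subtask.
async def lookup_worker(task: str) -> str:
    return f"lookup done for {task}"

async def summary_worker(task: str) -> str:
    return f"summary done for {task}"

Worker = Callable[[str], Awaitable[str]]
WORKERS: Dict[str, Worker] = {
    "lookup": lookup_worker,
    "summary": summary_worker,
}

def decompose(goal: str) -> List[Tuple[str, str]]:
    """Stand-in for the orchestrator's planning step (assumed, not a real planner)."""
    return [(name, goal) for name in WORKERS]

async def orchestrate(goal: str) -> Dict[str, str]:
    subtasks = decompose(goal)
    # Workers run concurrently and report only to the orchestrator;
    # they never call each other.
    results = await asyncio.gather(
        *(WORKERS[name](task) for name, task in subtasks)
    )
    # Aggregate the results keyed by worker name.
    return {name: result for (name, _), result in zip(subtasks, results)}
```

Calling `asyncio.run(orchestrate("case-1"))` returns one result per worker. A hierarchical topology could reuse the same shape by registering another `orchestrate`-style coroutine as a worker.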
Hierarchical: Orchestrators can themselves be orchestrated. A top-level coordinator delegates to sub-orchestrators, which delegate to workers. Best for: complex enterprise workflows where a single orchestrator would have too many responsibilities.
Peer-to-Peer (Specialist Handoff): Agents pass tasks to each other without a central coordinator. Agent A determines that a task is outside its domain and routes it to Agent B. Best for: specialist consultation workflows where the routing logic is embedded in each agent's expertise.
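One way to sketch specialist handoff is shown here, with a handoff trail and hop limit added as loop prevention. The `Agent` and `Task` classes, the referral table, and the `max_hops` default are illustrative assumptions rather than part of any framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    domain: str
    payload: str
    visited: List[str] = field(default_factory=list)  # handoff trail

class Agent:
    def __init__(self, name: str, domain: str, referrals: Dict[str, str]):
        self.name = name
        self.domain = domain
        self.referrals = referrals  # task domain -> name of the peer to route to

    def handle(self, task: Task, registry: Dict[str, "Agent"], max_hops: int = 5) -> str:
        # Loop prevention: refuse a task this agent has already seen,
        # or one that has been passed around too many times.
        if self.name in task.visited or len(task.visited) >= max_hops:
            raise RuntimeError(f"handoff loop or hop limit: {task.visited + [self.name]}")
        task.visited.append(self.name)
        if task.domain == self.domain:
            return f"{self.name} handled: {task.payload}"
        peer = self.referrals.get(task.domain)
        if peer is None:
            raise LookupError(f"{self.name} has no referral for {task.domain!r}")
        # No central coordinator: the agent itself routes the task onward.
        return registry[peer].handle(task, registry, max_hops)
```

Carrying the trail on the task itself means any agent can detect a loop without shared state.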
Architecture Diagram
Standalone diagram: architecture/mermaid/02-multi-agent-topology.mmd
Common Mistakes
Creating too many agents too early. Start with a single agent and extract to multi-agent only when a specific, measurable limitation is encountered. Premature decomposition adds coordination overhead without benefit.
Workers that are too general. A "Research Worker" that can do anything defeats the purpose of specialization. Workers should be narrowly focused: one domain, one set of tools, one type of task.
No explicit failure handling at the orchestrator level. When a worker returns an error, the orchestrator must be designed to handle it (retry, fallback, skip, escalate) — not just forward the error to the user.
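The four responses named above (retry, fallback, skip, escalate) can be sketched as a single policy wrapper around a worker call. The exception classes and the `run_worker` signature are hypothetical; a real system would map its own error types onto them.

```python
from typing import Callable, Optional

class TransientWorkerError(Exception):
    """Hypothetical: worth retrying (timeouts, rate limits)."""

class PermanentWorkerError(Exception):
    """Hypothetical: retrying will not help (bad input, missing data)."""

class EscalationRequired(Exception):
    """Raised when no automated recovery applies and a human must decide."""

def run_worker(
    worker: Callable[[str], str],
    task: str,
    *,
    retries: int = 2,
    fallback: Optional[Callable[[str], str]] = None,
    optional: bool = False,
) -> Optional[str]:
    last_error: Optional[Exception] = None
    for _ in range(retries + 1):
        try:
            return worker(task)
        except TransientWorkerError as exc:
            last_error = exc  # retry on the next loop iteration
        except PermanentWorkerError as exc:
            last_error = exc
            break  # retrying a permanent failure is pointless
    if fallback is not None:
        return fallback(task)  # fallback: try an alternative worker
    if optional:
        return None  # skip: the orchestrator proceeds without this result
    # escalate: surface a structured failure instead of forwarding the raw error
    raise EscalationRequired(f"worker failed on {task!r}") from last_error
```

The orchestrator chooses `retries`, `fallback`, and `optional` per worker, which keeps the failure policy in one place rather than scattered across workers.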
Circular dependencies. Worker A calls Worker B, which calls Worker C, which calls Worker A. Without careful design, multi-agent systems can deadlock or loop indefinitely. Map the dependency graph before implementation.
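Mapping the dependency graph can be automated with a cycle check at build or deploy time. The sketch below uses `graphlib` from the Python standard library (3.9+); the `calls` mapping of agent name to the agents it invokes is an assumed representation.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional, Set

def find_call_cycle(calls: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle in the agent call graph, or None if it is acyclic."""
    try:
        # static_order() raises CycleError if the graph contains a cycle.
        tuple(TopologicalSorter(calls).static_order())
    except CycleError as exc:
        # The second element of args holds the cycle as a list of nodes,
        # with the first and last node being the same.
        return exc.args[1]
    return None
```

For the example in the text, `find_call_cycle({"A": {"B"}, "B": {"C"}, "C": {"A"}})` returns a list covering A, B, and C; dropping the C-to-A edge makes it return `None`.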
Best Practices
- Start with a single agent; extract to multi-agent only when a specific limitation is measured
- Workers should have a single responsibility: one domain, ≤10 tools, one output type
- Use small models for workers on focused tasks; use frontier models for orchestrator planning
- Carry a correlation ID through all agent invocations for distributed tracing
- Design explicit failure handling at the orchestrator for each class of worker failure
- Version worker agents and test orchestrator-worker interface compatibility before deployment
- Gate all External-class tool calls behind human-in-the-loop (HITL) approval, regardless of which agent makes the call
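Two of the practices above, correlation IDs and HITL gating, can share one tool-invocation wrapper. This is a sketch under assumptions: the `Tool` dataclass, the `"Internal"` class label, the `approve` callback, and the in-memory `trace` list are hypothetical stand-ins for a real tool registry, approval UI, and tracing backend.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Tool:
    name: str
    tool_class: str  # "External" per the text; other labels are assumed
    run: Callable[[str], str]

def new_correlation_id() -> str:
    """Create one ID per user request; pass it to every agent invocation."""
    return uuid.uuid4().hex

def invoke_tool(
    tool: Tool,
    arg: str,
    *,
    agent: str,
    correlation_id: str,
    approve: Callable[[Tool, str], bool],
    trace: List[Dict[str, str]],
) -> str:
    def record(event: str) -> None:
        trace.append({"correlation_id": correlation_id, "agent": agent,
                      "tool": tool.name, "event": event})

    record("requested")
    # The gate lives in the wrapper, so it applies no matter which agent calls.
    if tool.tool_class == "External" and not approve(tool, arg):
        record("denied")
        raise PermissionError(f"{tool.name} requires human approval")
    result = tool.run(arg)
    record("completed")
    return result
```

Because every trace record carries the same `correlation_id`, filtering on it reconstructs the path of one request across all agents.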
Alternatives
| Approach | When to Choose | Trade-off |
|---|---|---|
| Single agent | Task fits in one context window; <15 tools needed | Simpler; no coordination overhead, but no parallelism or specialization |
| Sequential chain | Task has no parallel work; steps are known upfront | Predictable; no dynamic decomposition |
| Orchestrator-worker | Parallelizable subtasks; clear role separation | Coordination overhead; requires failure handling |
| Hierarchical multi-agent | Complex workflows with multiple layers of decomposition | Maximum scalability; highest operational complexity |
| Peer-to-peer handoff | Specialist routing; each agent decides when to escalate | Flexible; requires careful loop prevention |
Trade-offs
| Dimension | Advantage | Cost |
|---|---|---|
| Specialization | Agents excel in their domain | Coordination protocol required |
| Parallelism | Independent tasks proceed simultaneously | Shared state management complexity |
| Scale | Context saturation avoided | Inter-agent latency overhead |
| Resilience | Worker failures are local | Failure propagation design required |
| Observability | Each agent's behavior is auditable | Distributed tracing infrastructure required |
Interview Questions
Q1: When does a single agent become a multi-agent system, and how do you make that decision?
Category: Architecture / System Design | Difficulty: Principal | Role: AI Architect
Answer Framework:
Three specific conditions justify the transition: (1) tool count exceeds ~15, degrading selection accuracy; (2) context saturation occurs frequently in production — the single agent's context window fills before the task completes; (3) there is parallel work that is being serialized unnecessarily.
The decision process is empirical, not intuitive. Measure: what is the agent's tool selection error rate? What is the frequency of context overflow? Is there measurable latency from sequential execution of independent tasks? If no specific, measured problem exists, the agent is not ready for multi-agent decomposition.
The transition adds coordination overhead (latency, cost, failure handling complexity). If you cannot articulate which specific limitation you are solving and how the multi-agent architecture addresses it, you are adding complexity without benefit.
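The empirical decision process above could be encoded as a check over measured metrics. In this sketch only the ~15-tool figure comes from the text; the metric names and the other thresholds are placeholder assumptions to be replaced with values from your own measurements.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AgentMetrics:
    tool_count: int
    tool_selection_error_rate: float  # fraction of calls that pick the wrong tool
    context_overflow_rate: float      # fraction of runs that fill the context window
    serialized_wait_seconds: float    # latency from running independent subtasks in sequence

def reasons_to_decompose(
    m: AgentMetrics,
    *,
    max_tools: int = 15,               # from the text
    max_error_rate: float = 0.05,      # placeholder threshold
    max_overflow_rate: float = 0.02,   # placeholder threshold
    max_serialized_wait: float = 5.0,  # placeholder threshold
) -> List[str]:
    """Return the measured limitations that justify a multi-agent design."""
    reasons = []
    if m.tool_count > max_tools and m.tool_selection_error_rate > max_error_rate:
        reasons.append("tool selection")
    if m.context_overflow_rate > max_overflow_rate:
        reasons.append("context saturation")
    if m.serialized_wait_seconds > max_serialized_wait:
        reasons.append("serialized parallel work")
    return reasons  # an empty list means: stay with a single agent
```

An empty result corresponds to the case where no specific, measured problem exists.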
Red Flags: "Multi-agent is just better" — not true. "We're planning for scale we don't have yet" — premature optimization.
Q2: How do you establish trust boundaries between agents in a multi-agent system?
Category: Security / Architecture | Difficulty: Principal | Role: AI Architect
Answer Framework:
Trust between agents is not automatic — it must be designed explicitly. The threat model has two components: (1) a compromised or hallucinating orchestrator could send malicious instructions to workers; (2) a malicious agent in the system could exceed its intended authorization.
The defense is scope validation at each agent boundary: every worker validates that the task it receives is within its defined scope before executing any tool. If the orchestrator tells the EHR Worker to "also submit the prior auth to the payer," the EHR Worker should refuse — submittopayer is not in its tool registry.
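Scope validation at the worker boundary might look like the sketch below. The `ScopedWorker` class and the `read_patient_record` tool are hypothetical; `submittopayer` is the out-of-scope tool name from the example above.

```python
from typing import Callable, Dict

class ScopeError(Exception):
    """Raised when a worker is asked to use a tool outside its registry."""

class ScopedWorker:
    def __init__(self, name: str, tools: Dict[str, Callable[..., str]]):
        self.name = name
        self.tools = tools  # the worker's entire authorization surface

    def execute(self, tool_name: str, **kwargs: str) -> str:
        # Validate before executing anything, regardless of who sent the task.
        if tool_name not in self.tools:
            raise ScopeError(f"{self.name} refuses {tool_name!r}: not in tool registry")
        return self.tools[tool_name](**kwargs)

# Hypothetical EHR worker with a single read-only tool.
ehr_worker = ScopedWorker(
    "EHR Worker",
    {"read_patient_record": lambda patient_id: f"record for {patient_id}"},
)
```

With this setup, `ehr_worker.execute("read_patient_record", patient_id="p1")` succeeds, while `ehr_worker.execute("submittopayer", auth_id="a1")` raises `ScopeError` even if the request came from the orchestrator.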
In addition, agents should not trust each other's identity without authentication. In distributed deployments, use mTLS or signed messages between agents. The system that spawns agents (LangGraph, CrewAI, or a custom orchestration layer) should be the trust anchor, not the agents themselves.
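As an illustration of signed messages between agents, the sketch below uses HMAC-SHA256 from the Python standard library. A shared secret is a simplifying assumption here; a distributed deployment would more likely use per-agent keys issued by the trust anchor, asymmetric signatures, or mTLS as noted above. The message layout is hypothetical.

```python
import hashlib
import hmac
import json
from typing import Any, Dict

def _canonical(payload: Dict[str, Any]) -> bytes:
    # Stable serialization so sender and receiver hash identical bytes.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()

def sign_message(payload: Dict[str, Any], key: bytes) -> Dict[str, Any]:
    signature = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

def verify_message(message: Dict[str, Any], key: bytes) -> Dict[str, Any]:
    expected = hmac.new(key, _canonical(message["payload"]), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, str(message.get("signature", ""))):
        raise ValueError("signature mismatch: message rejected")
    return message["payload"]
```

A worker would call `verify_message` before scope validation, so a task is rejected if either its origin or its content cannot be trusted.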
Key Takeaways
- Multi-agent systems are warranted when single-agent limitations are specifically measured: tool count, context saturation, or parallelism bottlenecks
- Three topologies: orchestrator-worker (central coordinator), hierarchical (nested coordination), peer-to-peer (specialist handoff)
- Workers should have a single responsibility, a focused tool set (≤10 tools), and a well-defined interface
- Coordination overhead is real — use small models for workers, frontier models for orchestrator reasoning
- Trust between agents is not automatic — validate task scope at each worker boundary
- Distributed tracing with correlation IDs is required to debug multi-agent workflows
- External-class tools always require HITL regardless of which agent in the system calls them