Multi-Agent Systems

Conceptual Explanation

When Multi-Agent Architecture is Warranted

Three conditions justify the coordination overhead of multi-agent systems:

Work can be parallelized: Independent subtasks that could proceed simultaneously are being bottlenecked in a single-agent sequential loop
Tool count exceeds ~15 per agent: Tool selection accuracy degrades significantly above this threshold; specialization restores precision
Domain specialization produces meaningful quality gains: A clinical agent configured with a clinical system prompt and only clinical tools outperforms a general agent given all tools
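The three conditions can be written down as an explicit check, so the decision is recorded rather than argued from intuition. The sketch below is illustrative only: `AgentProfile`, `reasons_to_split`, and the `min_gain` default are hypothetical names and values, and only the ~15-tool figure comes from the text above.

```python
from dataclasses import dataclass

# The ~15-tool figure is taken from the text; treat it as a rough guide.
TOOL_COUNT_THRESHOLD = 15


@dataclass
class AgentProfile:
    tool_count: int
    has_independent_subtasks: bool  # could some subtasks run at the same time?
    specialization_gain: float      # measured quality delta of a specialist vs. the general agent


def reasons_to_split(profile: AgentProfile, min_gain: float = 0.05) -> list[str]:
    """Return the conditions that currently justify a multi-agent design."""
    reasons = []
    if profile.has_independent_subtasks:
        reasons.append("parallelizable work")
    if profile.tool_count > TOOL_COUNT_THRESHOLD:
        reasons.append("tool count above threshold")
    if profile.specialization_gain >= min_gain:  # min_gain is an assumed cutoff
        reasons.append("measured specialization gain")
    return reasons
```

An empty list means none of the conditions holds and a single agent is still the simpler choice.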

The Three Topologies

Orchestrator-Worker: A central orchestrator agent decomposes the goal, delegates subtasks to specialized worker agents, and aggregates results. Workers report back to the orchestrator; they do not communicate with each other directly. Best for: tasks with clear decomposition, moderate parallelism, and sequential dependency between phases.
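A minimal sketch of the orchestrator-worker shape, assuming workers are plain async functions and that `decompose` stands in for the orchestrator's LLM planning step. All names here (`research_worker`, `summary_worker`, `decompose`, `orchestrate`) are hypothetical and not tied to any framework.

```python
import asyncio
from typing import Awaitable, Callable

Worker = Callable[[str], Awaitable[str]]


async def research_worker(task: str) -> str:
    return f"research:{task}"


async def summary_worker(task: str) -> str:
    return f"summary:{task}"


# Workers are only reachable through the orchestrator's registry;
# they hold no references to each other.
WORKERS: dict[str, Worker] = {"research": research_worker, "summary": summary_worker}


def decompose(goal: str) -> list[tuple[str, str]]:
    """Stand-in for LLM planning: returns (worker_name, subtask) pairs."""
    return [("research", goal), ("summary", goal)]


async def orchestrate(goal: str) -> dict[str, str]:
    subtasks = decompose(goal)
    # Independent subtasks run concurrently; results come back in subtask order.
    results = await asyncio.gather(*(WORKERS[name](task) for name, task in subtasks))
    return {name: result for (name, _), result in zip(subtasks, results)}
```

Calling `asyncio.run(orchestrate("prior auth status"))` returns one aggregated result per worker.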

Hierarchical: Orchestrators can themselves be orchestrated. A top-level coordinator delegates to sub-orchestrators, which delegate to workers. Best for: complex enterprise workflows where a single orchestrator would have too many responsibilities.
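The nesting can be pictured as a tree in which leaf nodes are workers and every other node is an orchestrator that only talks to its direct children. This is a rough sketch with a hypothetical `Node` type; real sub-orchestrators would also plan and filter subtasks rather than pass the same task to every child.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    name: str
    handle: Optional[Callable[[str], str]] = None   # set only on workers (leaves)
    children: list["Node"] = field(default_factory=list)

    def run(self, task: str) -> dict:
        if self.handle is not None:
            return {self.name: self.handle(task)}
        # An orchestrator delegates to its direct children and aggregates.
        merged: dict = {}
        for child in self.children:
            merged.update(child.run(task))
        return {self.name: merged}


# Hypothetical two-level hierarchy.
top = Node("coordinator", children=[
    Node("intake", children=[Node("ehr", handle=lambda t: f"ehr:{t}")]),
    Node("billing", children=[Node("claims", handle=lambda t: f"claims:{t}")]),
])
```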

Peer-to-Peer (Specialist Handoff): Agents pass tasks to each other without a central coordinator. Agent A determines that a task is outside its domain and routes it to Agent B. Best for: specialist consultation workflows where the routing logic is embedded in each agent's expertise.
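A sketch of specialist handoff with basic loop prevention, assuming each agent returns either an answer or the name of the agent to hand off to. The agent names, the `max_hops` default, and the keyword-based routing are hypothetical placeholders for routing logic that would normally live in each agent's prompt.

```python
from typing import Callable, Optional

# Each agent returns (answer, next_agent); exactly one is expected to be set.
Agent = Callable[[str], tuple[Optional[str], Optional[str]]]


def triage_agent(task: str):
    if "billing" in task:
        return None, "billing"          # outside this agent's domain: hand off
    return f"triage handled: {task}", None


def billing_agent(task: str):
    return f"billing handled: {task}", None


AGENTS: dict[str, Agent] = {"triage": triage_agent, "billing": billing_agent}


def run_handoff(task: str, start: str = "triage", max_hops: int = 5):
    current, visited = start, []
    for _ in range(max_hops):
        visited.append(current)
        answer, next_agent = AGENTS[current](task)
        if answer is not None:
            return answer, visited
        if next_agent in visited:       # the task would revisit an agent
            raise RuntimeError(f"handoff loop: {visited + [next_agent]}")
        current = next_agent
    raise RuntimeError("hop limit exceeded")
```

Returning the visited path alongside the answer keeps the routing decisions auditable even without a central coordinator.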

Architecture Diagram

Standalone diagram: architecture/mermaid/02-multi-agent-topology.mmd

Common Mistakes

Creating too many agents too early. Start with a single agent and extract to multi-agent only when a specific, measurable limitation is encountered. Premature decomposition adds coordination overhead without benefit.

Workers that are too general. A "Research Worker" that can do anything defeats the purpose of specialization. Workers should be narrowly focused: one domain, one set of tools, one type of task.

No explicit failure handling at the orchestrator level. When a worker returns an error, the orchestrator must be designed to handle it (retry, fallback, skip, escalate) — not just forward the error to the user.
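One way to make the four responses explicit is a policy table keyed by failure class. The sketch below is an assumption-laden illustration: `WorkerError`, the failure class names, the `POLICY` mapping, and `make_flaky_worker` are all hypothetical, and a real orchestrator would likely add backoff and logging.

```python
from enum import Enum


class WorkerError(Exception):
    """Raised by a worker; `kind` is a coarse, hypothetical failure class."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


class Action(Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    SKIP = "skip"
    ESCALATE = "escalate"


# Assumed mapping from failure class to response; unknown classes escalate.
POLICY = {
    "timeout": Action.RETRY,
    "tool_unavailable": Action.FALLBACK,
    "no_data": Action.SKIP,
}


def call_with_policy(worker, task, fallback=None, max_retries=2) -> dict:
    retries = 0
    while True:
        try:
            return {"status": "ok", "result": worker(task)}
        except WorkerError as err:
            action = POLICY.get(err.kind, Action.ESCALATE)
            if action is Action.RETRY and retries < max_retries:
                retries += 1
                continue
            if action is Action.FALLBACK and fallback is not None:
                worker, fallback = fallback, None   # use the fallback once
                continue
            if action is Action.SKIP:
                return {"status": "skipped", "reason": err.kind}
            # Exhausted retries, missing fallback, or unknown class.
            return {"status": "escalated", "reason": err.kind}


def make_flaky_worker(failures: int, kind: str = "timeout"):
    """Demo worker that raises `failures` times, then succeeds."""
    remaining = {"n": failures}

    def worker(task):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise WorkerError(kind)
        return f"done:{task}"

    return worker
```

The orchestrator always receives a structured status, so the raw worker error never reaches the user unprocessed.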

Circular dependencies. Worker A calls Worker B which calls Worker C which calls Worker A. Without careful design, multi-agent systems can deadlock or loop indefinitely. Map the dependency graph before implementation.
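If the call graph is written down as data, cycles can be caught before deployment. The sketch below uses the standard library's `graphlib` (Python 3.9+); the function name `find_call_order` and the example agent names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_call_order(calls: dict[str, set[str]]):
    """`calls` maps each agent to the agents it invokes.

    Returns (order, None) with callees listed before their callers,
    or (None, cycle) when the graph contains a circular dependency.
    """
    try:
        return list(TopologicalSorter(calls).static_order()), None
    except CycleError as err:
        # graphlib reports the offending cycle as the second exception argument.
        return None, err.args[1]


acyclic = {"orchestrator": {"ehr", "billing"}, "ehr": set(), "billing": set()}
cyclic = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
```

Running this check in CI against the declared agent graph is one cheap way to enforce the mapping step.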

Best Practices

Start with a single agent; extract to multi-agent only when a specific limitation is measured
Workers should have a single responsibility: one domain, ≤10 tools, one output type
Use small models for workers on focused tasks; use frontier models for orchestrator planning
Carry a correlation ID through all agent invocations for distributed tracing
Design explicit failure handling at the orchestrator for each class of worker failure
Version worker agents and test orchestrator-worker interface compatibility before deployment
Gate all External-class tool calls behind HITL, regardless of which agent makes the call
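Two of these practices, correlation IDs and HITL gating, can share one invocation envelope. This is a sketch under assumptions: `Invocation`, `new_invocation`, `execute`, and the `approve` callback are hypothetical, and the `EXTERNAL_TOOLS` set is an assumed stand-in for however the system classifies External-class tools.

```python
import uuid
from dataclasses import dataclass
from typing import Callable

# Assumed classification; the real External-class list is defined by the system.
EXTERNAL_TOOLS = {"submittopayer"}


@dataclass(frozen=True)
class Invocation:
    agent: str
    tool: str
    correlation_id: str

    def child(self, agent: str, tool: str) -> "Invocation":
        # Downstream calls inherit the ID so one trace spans all agents.
        return Invocation(agent, tool, self.correlation_id)


def new_invocation(agent: str, tool: str) -> Invocation:
    return Invocation(agent, tool, correlation_id=uuid.uuid4().hex)


def execute(inv: Invocation, approve: Callable[[Invocation], bool]) -> str:
    # The gate keys on the tool, not on which agent is calling it.
    if inv.tool in EXTERNAL_TOOLS and not approve(inv):
        return f"[{inv.correlation_id}] {inv.agent}.{inv.tool}: blocked pending human approval"
    return f"[{inv.correlation_id}] {inv.agent}.{inv.tool}: executed"
```

Because every log line carries the same `correlation_id`, a blocked external call can be traced back to the orchestrator request that triggered it.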

Alternatives

Approach	When to Choose	Trade-off
Single agent	Task fits in one context window; <15 tools needed	Simpler; no coordination overhead
Sequential chain	Task has no parallel work; steps are known upfront	Predictable; no dynamic decomposition
Orchestrator-worker	Parallelizable subtasks; clear role separation	Coordination overhead; requires failure handling
Hierarchical multi-agent	Complex workflows with multiple layers of decomposition	Maximum scalability; highest operational complexity
Peer-to-peer handoff	Specialist routing; each agent decides when to escalate	Flexible; requires careful loop prevention

Trade-offs

Dimension	Advantage	Cost
Specialization	Agents excel in their domain	Coordination protocol required
Parallelism	Independent tasks proceed simultaneously	Shared state management complexity
Scale	Context saturation avoided	Inter-agent latency overhead
Resilience	Worker failures are local	Failure propagation design required
Observability	Each agent's behavior is auditable	Distributed tracing infrastructure required

Interview Questions

Q1: When does a single agent become a multi-agent system, and how do you make that decision?

Category: Architecture / System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

Three specific conditions justify the transition: (1) tool count exceeds ~15, degrading selection accuracy; (2) context saturation occurs frequently in production — the single agent's context window fills before the task completes; (3) there is parallel work that is being serialized unnecessarily.

The decision process is empirical, not intuitive. Measure: what is the agent's tool selection error rate? What is the frequency of context overflow? Is there measurable latency from sequential execution of independent tasks? If no specific, measured problem exists, the agent is not ready for multi-agent decomposition.
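The three measurements can be computed from per-run records. The sketch below assumes such records exist; `RunRecord`, its fields, and `summarize` are hypothetical, and the critical-path latency in particular is an estimate that would have to come from analysing which steps were actually independent.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    tool_calls: int
    wrong_tool_calls: int            # calls judged to have picked the wrong tool
    context_overflowed: bool
    sequential_latency_s: float      # wall-clock time as actually executed
    critical_path_latency_s: float   # estimated time if independent steps ran concurrently


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    if not runs:
        raise ValueError("no runs to summarize")
    total_calls = sum(r.tool_calls for r in runs)
    wrong_calls = sum(r.wrong_tool_calls for r in runs)
    return {
        "tool_selection_error_rate": wrong_calls / total_calls if total_calls else 0.0,
        "context_overflow_rate": sum(r.context_overflowed for r in runs) / len(runs),
        # Average seconds per run lost to running independent steps in sequence.
        "serialization_overhead_s": sum(
            r.sequential_latency_s - r.critical_path_latency_s for r in runs
        ) / len(runs),
    }
```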

The transition adds coordination overhead (latency, cost, failure handling complexity). If you cannot articulate which specific limitation you are solving and how the multi-agent architecture addresses it, you are adding complexity without benefit.

Red Flags: "Multi-agent is just better" — not true. "We're planning for scale we don't have yet" — premature optimization.

Q2: How do you establish trust boundaries between agents in a multi-agent system?

Category: Security / Architecture | Difficulty: Principal | Role: AI Architect

Answer Framework:

Trust between agents is not automatic — it must be designed explicitly. The threat model has two components: (1) a compromised or hallucinating orchestrator could send malicious instructions to workers; (2) a malicious agent in the system could exceed its intended authorization.

The defense is scope validation at each agent boundary: every worker validates that the task it receives is within its defined scope before executing any tool. If the orchestrator tells the EHR Worker to "also submit the prior auth to the payer," the EHR Worker should refuse — submittopayer is not in its tool registry.
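A sketch of that boundary check, assuming each worker owns a registry of callable tools and rejects anything outside it. `Worker`, `OutOfScopeError`, and `read_patient_record` are hypothetical; the tool name `submittopayer` is used as written in the text.

```python
class OutOfScopeError(Exception):
    """Raised when a worker is asked to run a tool it does not own."""


class Worker:
    def __init__(self, name: str, tools: dict):
        self.name = name
        self.tools = dict(tools)  # the worker's own registry, fixed at construction

    def execute(self, tool_name: str, **kwargs):
        # Validate scope before any tool runs, whoever sent the request.
        if tool_name not in self.tools:
            raise OutOfScopeError(f"{self.name} has no tool named {tool_name!r}")
        return self.tools[tool_name](**kwargs)


ehr_worker = Worker(
    "ehr_worker",
    {"read_patient_record": lambda patient_id: {"patient_id": patient_id}},
)
```

With this in place, `ehr_worker.execute("submittopayer", claim_id="c1")` raises `OutOfScopeError` instead of acting on the orchestrator's instruction.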

In addition, agents should not trust each other's identity without authentication. In distributed deployments, use mTLS or signed messages between agents. The system that spawns agents (LangGraph, CrewAI, or a custom orchestration layer) should be the trust anchor, not the agents themselves.
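As a simplified illustration of signed messages, the sketch below uses a shared-secret HMAC from the standard library. This is a swapped-in, simpler technique than mTLS or asymmetric signatures, and it assumes the spawning system distributes the key to each agent; `sign_message` and `verify_message` are hypothetical helper names.

```python
import hashlib
import hmac
import json


def _canonical(payload: dict) -> bytes:
    # Stable serialization so sender and receiver hash identical bytes.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()


def sign_message(payload: dict, key: bytes) -> dict:
    signature = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_message(message: dict, key: bytes) -> bool:
    expected = hmac.new(key, _canonical(message["payload"]), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, message["signature"])
```

A worker would call `verify_message` before its scope check and drop any message that fails. A shared key proves only that the sender holds the key, so per-agent keys or asymmetric signatures are needed if agents must be told apart.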

Key Takeaways

Multi-agent systems are warranted when single-agent limitations are specifically measured: tool count, context saturation, or parallelism bottlenecks
Three topologies: orchestrator-worker (central coordinator), hierarchical (nested coordination), peer-to-peer (specialist handoff)
Workers should have a single responsibility, a focused tool set (≤10 tools), and a well-defined interface
Coordination overhead is real — use small models for workers, frontier models for orchestrator reasoning
Trust between agents is not automatic — validate task scope at each worker boundary
Distributed tracing with correlation IDs is required to debug multi-agent workflows
External-class tools always require HITL regardless of which agent in the system calls them
