Agent Observability
Conceptual Explanation
Three Observability Pillars for Agents
Traces: End-to-end records of a single agent execution. A trace captures the ordered sequence of LLM calls, tool calls, and routing decisions for one workflow invocation, linked by a shared trace ID. The trace is the primary debugging artifact — it answers "what happened in this specific run?"
Metrics: Aggregated statistics across many runs. Relevant agent metrics include: success rate (did the workflow complete without error?), quality rate (did the output meet quality thresholds?), latency (end-to-end and per-node), token consumption, tool call frequency by tool, HITL trigger rate, and evaluation scores (faithfulness, relevancy, accuracy).
Logs: Structured, timestamped records of discrete events within an agent run. Logs complement traces (which are hierarchical) with flat, searchable, filterable event streams. Critical log events: workflow initiation, tool call attempts, HITL triggers, errors, and completions. For clinical systems: every decision that could affect patient care is a compliance log event.
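As a rough illustration of the flat, filterable log records described above, the sketch below serializes one event per line as JSON using only the Python standard library. The function names `build_log_event` and `log_event`, the logger name, and event-type strings such as `hitl.triggered` are assumptions made for this example, not part of any particular platform.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("agent.events")  # hypothetical logger name


def build_log_event(event_type: str, trace_id: str, **fields) -> str:
    """Serialize one discrete agent event as a flat JSON record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,  # e.g. "workflow.initiated", "tool.call_attempt"
        "trace_id": trace_id,      # links the flat log line back to its trace
        **fields,                  # extra metadata only, never the full LLM response
    }
    return json.dumps(record, sort_keys=True)


def log_event(event_type: str, trace_id: str, **fields) -> None:
    """Emit the event through the standard logging module."""
    logger.info(build_log_event(event_type, trace_id, **fields))


log_event("hitl.triggered", "prior-auth-P12345-95810", node="evaluate_criteria")
```

Carrying the trace ID on every record is what lets a log search result be joined back to the hierarchical trace for the same run.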
Span Hierarchy
A trace is composed of spans arranged in a parent-child hierarchy:
Trace: prior-auth-P12345-95810
├── Span: workflow.run (root)
│   ├── Span: node.fetch_patient_data
│   │   └── Span: tool.get_patient_summary
│   ├── Span: node.retrieve_guidelines
│   │   └── Span: tool.search_clinical_guidelines
│   ├── Span: node.evaluate_criteria
│   │   └── Span: llm.call (claude-opus-4-8)
│   ├── Span: hitl.physician_review [INTERRUPT]
│   └── Span: node.finalize_determination
│       └── Span: tool.submit_determination
Each span carries: span ID, parent span ID, trace ID, start/end timestamps, status (OK/ERROR), and relevant attributes (model, tokens, tool name, tool input/output, error message).
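The span fields listed above can be modeled with a small data structure. The following is a minimal sketch, not the API of LangSmith or any tracing library: the `Span` and `Tracer` classes are hypothetical, and the parent-child link is derived from a simple stack of currently open spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_span_id: Optional[str]
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_time: float = 0.0
    end_time: float = 0.0
    status: str = "OK"
    attributes: dict = field(default_factory=dict)


class Tracer:
    """Hypothetical in-memory tracer: collects the spans of one trace."""

    def __init__(self, trace_id: str):
        self.trace_id = trace_id
        self.spans = []    # every span, in start order
        self._stack = []   # spans that are currently open

    @contextmanager
    def span(self, name: str, **attributes):
        # The innermost open span, if any, becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, self.trace_id, parent_id, attributes=dict(attributes))
        current.start_time = time.time()
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        except Exception as exc:
            current.status = "ERROR"
            current.attributes["error_message"] = str(exc)
            raise
        finally:
            current.end_time = time.time()
            self._stack.pop()


tracer = Tracer("prior-auth-P12345-95810")
with tracer.span("workflow.run") as root:
    with tracer.span("node.fetch_patient_data") as node:
        with tracer.span("tool.get_patient_summary", tool_name="get_patient_summary") as tool:
            pass  # the real tool call would run here
```

Nesting the `with` blocks mirrors the tree: the root span has no parent, and each inner span records the ID of the span that was open when it started.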
Core Architecture
Common Mistakes
Logging the LLM response text in application logs. Application logs should capture structured event metadata, not the full LLM response body. Full response content belongs in the trace store (LangSmith), not CloudWatch or Datadog. This keeps log storage costs manageable and PHI exposure scoped.
No per-node latency. Logging only end-to-end workflow latency makes it impossible to identify which node is causing slow P99 responses. Span-based tracing with per-node timing is necessary.
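To make the per-node point concrete, here is a small sketch that groups span durations by node and reports a P99 for each. It assumes spans have already been exported as `(node_name, duration_ms)` pairs, and it uses the nearest-rank percentile definition; both the input shape and the function names are assumptions for illustration.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_node(span_records):
    """span_records: iterable of (node_name, duration_ms) pairs from completed spans."""
    durations = defaultdict(list)
    for node_name, duration_ms in span_records:
        durations[node_name].append(duration_ms)
    return {node: percentile(values, 99) for node, values in durations.items()}


records = [("node.evaluate_criteria", ms) for ms in range(1, 101)]
records += [("node.fetch_patient_data", 20), ("node.fetch_patient_data", 40)]
slowest_first = sorted(p99_by_node(records).items(), key=lambda kv: kv[1], reverse=True)
```

Sorting nodes by their P99 points directly at the node to investigate, which an end-to-end number alone cannot do.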
Evaluation only on pass/fail. Binary pass/fail metrics do not reveal which quality dimension is degrading. Track sub-scores (citation accuracy, format compliance, criteria accuracy) separately to enable targeted debugging.
No baseline quality metric. Without a pre-deployment quality baseline, you cannot detect regression. Establish baseline metrics before go-live and alert when they drop by more than a defined threshold (e.g., 5 percentage points).
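The two mistakes above (pass/fail only, and no baseline) can be addressed together by comparing each sub-score against its pre-deployment value. The sketch below assumes every metric is stored as a percentage between 0 and 100; the metric names and numbers are made-up illustration data, and `find_regressions` is a hypothetical helper.

```python
def find_regressions(baseline, current, max_drop_pp=5.0):
    """Return {metric: drop} for metrics that fell by more than max_drop_pp percentage points.

    Both arguments map a metric name to a rate expressed in percent (0-100).
    Metrics missing from `current` are skipped rather than treated as regressions.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        if metric not in current:
            continue
        drop = base_value - current[metric]
        if drop > max_drop_pp:
            regressions[metric] = round(drop, 2)
    return regressions


# Illustration data only, not measured results.
baseline = {"citation_accuracy": 96.0, "format_compliance": 99.0, "criteria_accuracy": 93.0}
current = {"citation_accuracy": 89.5, "format_compliance": 98.0, "criteria_accuracy": 90.0}

regressed = find_regressions(baseline, current)
print(regressed)  # {'citation_accuracy': 6.5}
```

Because the result names the specific sub-score that dropped, an alert built on it tells the on-call engineer where to look first.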
Best Practices
- Structure traces as parent-child span hierarchies — one root span per workflow, child spans per node and tool call
- Instrument every LLM call with token counts, model version, and latency
- Separate compliance audit logs (immutable, clinical events) from operational logs (application events)
- Mask or hash PHI in all observability data before transmission to third-party platforms
- Establish quality gate thresholds before go-live — evaluate every batch and alert on threshold violations
- Run evaluation on sampled production traces, not only on pre-deployment golden datasets
- Track HITL trigger rate as a proxy for agent confidence drift: a rising rate can signal model or data regression
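One way to apply the PHI-masking practice above is to replace identifying attribute values with a keyed hash before a span or log record leaves your infrastructure. This is a minimal sketch under stated assumptions: the `PHI_FIELDS` set and the `mask_phi` helper are hypothetical, the field names are examples only, and free-text values (such as tool output) would need separate redaction that is not shown here.

```python
import hashlib
import hmac

# Hypothetical list of attribute keys that hold PHI in this system.
PHI_FIELDS = {"patient_id", "patient_name", "date_of_birth", "mrn"}


def mask_phi(attributes: dict, secret_key: bytes) -> dict:
    """Return a copy of `attributes` with PHI values replaced by keyed hashes."""
    masked = {}
    for key, value in attributes.items():
        if key in PHI_FIELDS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
            masked[key] = f"hmac-sha256:{digest[:16]}"
        else:
            masked[key] = value
    return masked


safe_attributes = mask_phi(
    {"patient_id": "P12345", "tool_name": "get_patient_summary"},
    secret_key=b"replace-with-a-managed-secret",  # placeholder, load from a secret store
)
```

A keyed hash (HMAC) is used instead of a plain hash because short identifiers are easy to brute-force without a secret. The same input still maps to the same token, so runs for one patient can be correlated without exposing the identifier.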
Interview Questions
Q1: How do you detect a quality regression in a production agentic clinical workflow without manual case review?
Category: System Design | Difficulty: Senior | Role: AI Architect / ML Engineer
Answer Framework:
Quality regression detection requires three layers working together.
First, structured tracing with span-level instrumentation captures every workflow run's inputs, outputs, and sub-component results. This is the raw data layer — without it, regression detection is impossible.
Second, an automated evaluation pipeline runs on sampled production traces. It computes quality metrics (format compliance, citation accuracy, determination consistency relative to ground truth) for each sampled run and aggregates them by time window. A sliding-window quality score gives a near-real-time quality signal without requiring manual review of every case.
Third, alerting is configured on the aggregate quality score: if the rolling pass rate drops more than a threshold (e.g., 5 percentage points) below the established baseline, an alert fires. The alert includes which sub-scores dropped, enabling targeted investigation.
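The second and third layers can be sketched as a fixed-size window over the most recent evaluation results. The class below is a hypothetical illustration, not a LangSmith feature: `RollingQualityMonitor`, its parameters, and the default values are assumptions, and a real pipeline would feed it from the evaluation job and route the alert to a paging or chat system.

```python
from collections import deque


class RollingQualityMonitor:
    """Tracks the pass rate over the most recent `window_size` evaluated runs."""

    def __init__(self, baseline_pass_rate, window_size=200, max_drop_pp=5.0, min_samples=50):
        self.baseline_pass_rate = baseline_pass_rate  # percent, measured before go-live
        self.max_drop_pp = max_drop_pp                # allowed drop in percentage points
        self.min_samples = min_samples                # avoid alerting on a near-empty window
        self.results = deque(maxlen=window_size)      # oldest results fall off automatically

    def record(self, passed: bool) -> None:
        self.results.append(bool(passed))

    def pass_rate(self):
        if not self.results:
            return None
        return 100.0 * sum(self.results) / len(self.results)

    def should_alert(self) -> bool:
        if len(self.results) < self.min_samples:
            return False
        return self.baseline_pass_rate - self.pass_rate() > self.max_drop_pp


monitor = RollingQualityMonitor(baseline_pass_rate=95.0, window_size=10, min_samples=5)
for passed in [True] * 8 + [False] * 2:
    monitor.record(passed)

print(monitor.pass_rate())     # 80.0
print(monitor.should_alert())  # True
```

The `min_samples` guard is a design choice: without it, a single failed run right after deployment would put the window at 0% and fire a misleading alert.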
Key Points to Hit:
- Automated evaluation on sampled production traces is the scalable path
- Sub-dimension metrics (not just overall score) are necessary for targeted debugging
- A pre-deployment baseline is required — without it, there is no regression reference
- LangSmith's evaluation datasets and automatic evaluation runs address this in the LangChain ecosystem
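A follow-up question often asked here is how the production sample is chosen. One common approach, sketched below as an assumption rather than a description of any specific platform, is to hash the trace ID into a number in [0, 1) and evaluate the run only if it falls below the sample rate. The function name `is_sampled` is hypothetical.

```python
import hashlib


def is_sampled(trace_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a trace is part of the evaluation sample."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < sample_rate


# The same trace ID always gets the same decision, so retries and
# re-processing do not change which runs are evaluated.
decision = is_sampled("prior-auth-P12345-95810", sample_rate=0.10)
```

Hash-based sampling needs no shared state between workers, which makes the evaluated subset reproducible across services.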
Key Takeaways
- The three observability pillars for agents are: traces (per-run), metrics (aggregated), and logs (structured events)
- Trace spans should form a parent-child hierarchy: one root span per workflow, child spans per node and tool call
- LangSmith integrates automatically with LangGraph when LANGCHAIN_TRACING_V2=true is set
- Compliance audit logs in clinical systems must be immutable and separate from operational logs
- Automated evaluation pipelines running on sampled production traces provide scalable quality monitoring
- PHI must be masked or hashed before transmission to third-party observability platforms