Agent Observability

Conceptual Explanation

Three Observability Pillars for Agents

Traces: End-to-end records of a single agent execution. A trace captures the ordered sequence of LLM calls, tool calls, and routing decisions for one workflow invocation, linked by a shared trace ID. The trace is the primary debugging artifact — it answers "what happened in this specific run?"

Metrics: Aggregated statistics across many runs. Relevant agent metrics include: success rate (did the workflow complete without error?), quality rate (did the output meet quality thresholds?), latency (end-to-end and per-node), token consumption, tool call frequency by tool, human-in-the-loop (HITL) trigger rate, and evaluation scores (faithfulness, relevancy, accuracy).
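As a rough illustration of how several of these metrics fall out of per-run records, the sketch below aggregates a small batch. The `RunRecord` schema and its field names are hypothetical, chosen only for this example; a real system would derive them from its trace store.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One completed workflow run (hypothetical schema for illustration)."""
    succeeded: bool        # workflow completed without error
    passed_quality: bool   # output met the quality thresholds
    latency_s: float       # end-to-end latency in seconds
    total_tokens: int      # prompt + completion tokens for the run
    hitl_triggered: bool   # a human review was requested


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-run records into batch-level metrics."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "quality_rate": sum(r.passed_quality for r in runs) / n,
        "hitl_trigger_rate": sum(r.hitl_triggered for r in runs) / n,
        "mean_latency_s": mean(r.latency_s for r in runs),
        "mean_tokens": mean(r.total_tokens for r in runs),
    }


runs = [
    RunRecord(True, True, 2.0, 1000, False),
    RunRecord(True, False, 4.0, 3000, True),
    RunRecord(False, False, 6.0, 2000, False),
    RunRecord(True, True, 4.0, 2000, False),
]
summary = summarize(runs)
```

Note that success rate and quality rate are computed separately: a run can complete without error and still fail the quality threshold.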

Logs: Structured, timestamped records of discrete events within an agent run. Logs complement traces (which are hierarchical) with flat, searchable, filterable event streams. Critical log events: workflow initiation, tool call attempts, HITL triggers, errors, and completions. For clinical systems: every decision that could affect patient care is a compliance log event.
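A minimal sketch of a structured event logger, using only the Python standard library, might look like the following. The event names, the `fields` attribute, and the helper names `build_logger` and `log_event` are assumptions made for this example, not part of any particular platform's API.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
            # Structured fields passed via `extra={"fields": {...}}`.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("agent.events")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_event(logger: logging.Logger, event: str, trace_id: str, **fields) -> None:
    """Emit one flat, filterable event tied to a trace ID."""
    logger.info(event, extra={"fields": {"trace_id": trace_id, **fields}})


buffer = io.StringIO()
logger = build_logger(buffer)
log_event(
    logger,
    "hitl.triggered",
    trace_id="prior-auth-P12345-95810",
    node="evaluate_criteria",
    reason="low_confidence",
)
line = json.loads(buffer.getvalue())
```

Carrying the trace ID on every event is what lets a flat log search lead back to the hierarchical trace for the same run.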

Span Hierarchy

A trace is composed of spans arranged in a parent-child hierarchy:

Trace: prior-auth-P12345-95810
├── Span: workflow.run (root)
│   ├── Span: node.fetch_patient_data
│   │   └── Span: tool.get_patient_summary
│   ├── Span: node.retrieve_guidelines
│   │   └── Span: tool.search_clinical_guidelines
│   ├── Span: node.evaluate_criteria
│   │   └── Span: llm.call (claude-opus-4-8)
│   ├── Span: hitl.physician_review [INTERRUPT]
│   └── Span: node.finalize_determination
│       └── Span: tool.submit_determination

Each span carries: span ID, parent span ID, trace ID, start/end timestamps, status (OK/ERROR), and relevant attributes (model, tokens, tool name, tool input/output, error message).
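The span fields listed above can be modeled with a small data structure. The sketch below is illustrative only: the `Span` class and `render_tree` helper are hypothetical names, not the LangSmith or OpenTelemetry data model, and the timings are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    trace_id: str
    name: str
    parent_id: Optional[str]   # None marks the root span
    start: float               # seconds since trace start
    end: float
    status: str = "OK"         # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end - self.start


def render_tree(spans: list[Span]) -> list[str]:
    """Rebuild the parent-child hierarchy from a flat list of spans."""
    children: dict[Optional[str], list[Span]] = {}
    for span in sorted(spans, key=lambda s: s.start):
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} [{span.status}] {span.duration:.1f}s")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


TRACE = "prior-auth-P12345-95810"
spans = [
    Span("s1", TRACE, "workflow.run", None, 0.0, 5.0),
    Span("s2", TRACE, "node.fetch_patient_data", "s1", 0.0, 1.0),
    Span("s3", TRACE, "tool.get_patient_summary", "s2", 0.1, 0.9),
    Span("s4", TRACE, "node.evaluate_criteria", "s1", 1.0, 4.5),
    Span("s5", TRACE, "llm.call", "s4", 1.2, 4.2, attributes={"total_tokens": 1800}),
]
tree = render_tree(spans)
```

Because each span stores only its parent's ID, spans can be emitted independently and the hierarchy reassembled later by the trace store.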

Core Architecture

graph TD
    subgraph "Agent Runtime"
        A["Agent Loop"] -->|"instrumented calls"| T["Tracing SDK\n(LangSmith / OpenTelemetry)"]
        A --> L["Structured Logger"]
        A --> M["Metrics Emitter"]
    end
    subgraph "Observability Stack"
        T -->|"traces"| TS["Trace Store\n(LangSmith Cloud / Jaeger)"]
        L -->|"log events"| LS["Log Aggregator\n(CloudWatch / Datadog)"]
        M -->|"metrics"| MS["Metrics Store\n(Prometheus / Datadog)"]
    end
    subgraph "Analysis Layer"
        TS --> EV["Evaluation Pipeline\n(RAGAS / LangSmith Evals)"]
        TS --> DB["Debug Interface\n(LangSmith UI)"]
        LS --> AL["Audit Log\n(Compliance)"]
        MS --> DA["Dashboard / Alerts\n(Grafana / Datadog)"]
    end

Common Mistakes

Logging the LLM response text in application logs. Application logs should capture structured event metadata, not the full LLM response body. Full response content belongs in the trace store (LangSmith), not CloudWatch or Datadog. This keeps log storage costs manageable and PHI exposure scoped.
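One way to follow this, sketched under the assumption that the full text is already stored in the trace store, is to log only metadata about the response. The function name and field names below are hypothetical.

```python
import hashlib


def response_log_fields(response_text: str, model: str, output_tokens: int) -> dict:
    """Build application-log fields that describe a response without containing it."""
    return {
        "model": model,
        "output_tokens": output_tokens,
        "response_chars": len(response_text),
        # A digest lets an operator match this log line to the stored trace
        # content without copying the text itself into the log aggregator.
        "response_sha256": hashlib.sha256(response_text.encode("utf-8")).hexdigest(),
    }


response_text = "Approved per guideline 4.2"
fields = response_log_fields(response_text, model="example-model", output_tokens=7)
```

The response body never appears in `fields`, so the log line stays small and carries no free-text clinical content.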

No per-node latency. Logging only end-to-end workflow latency makes it impossible to identify which node is causing slow P99 responses. Span-based tracing with per-node timing is necessary.
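A tracing SDK normally records per-node timing for you. To show the idea in isolation, here is a minimal standard-library sketch; `NodeTimer` and `percentile` are hypothetical helpers, and the injected fake clock exists only to make the example deterministic.

```python
import math
import time
from contextlib import contextmanager


class NodeTimer:
    """Collect wall-clock durations per node name."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations: dict[str, list[float]] = {}

    @contextmanager
    def span(self, node: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the node raised.
            self.durations.setdefault(node, []).append(self._clock() - start)


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Deterministic fake clock for the example; production code would use the default.
ticks = iter([0.0, 1.5, 2.0, 2.25])
timer = NodeTimer(clock=lambda: next(ticks))

with timer.span("node.fetch_patient_data"):
    pass  # node work would run here
with timer.span("node.evaluate_criteria"):
    pass
```

With durations grouped by node, `percentile(timer.durations[node], 99)` over many runs shows which node dominates the slow tail.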

Evaluation only on pass/fail. Binary pass/fail metrics do not reveal which quality dimension is degrading. Track sub-scores (citation accuracy, format compliance, criteria accuracy) separately to enable targeted debugging.
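To make the point concrete, the sketch below averages each sub-score separately across a batch of evaluations and picks out the weakest dimension. The dimension names come from the paragraph above; the scores and the `mean_sub_scores` helper are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_sub_scores(evaluations: list[dict[str, float]]) -> dict[str, float]:
    """Average each quality dimension independently across evaluated runs."""
    by_dimension: dict[str, list[float]] = defaultdict(list)
    for evaluation in evaluations:
        for dimension, score in evaluation.items():
            by_dimension[dimension].append(score)
    return {dimension: mean(scores) for dimension, scores in by_dimension.items()}


evaluations = [
    {"citation_accuracy": 1.0, "format_compliance": 1.0, "criteria_accuracy": 0.5},
    {"citation_accuracy": 1.0, "format_compliance": 1.0, "criteria_accuracy": 0.0},
]
scores = mean_sub_scores(evaluations)
weakest = min(scores, key=scores.get)
```

A binary pass/fail metric would report both runs as failures; the per-dimension view shows that only `criteria_accuracy` is degrading.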

No baseline quality metric. Without a pre-deployment quality baseline, you cannot detect regression. Establish baseline metrics before go-live and alert when they drop by more than a defined threshold (e.g., 5 percentage points).
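A baseline comparison can be as simple as the following sketch. The metric names, baseline values, and the `regressions` helper are hypothetical; the 5-percentage-point threshold mirrors the example above.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop_pp: float = 5.0,
) -> dict[str, float]:
    """Return metrics whose drop from baseline exceeds max_drop_pp percentage points."""
    drops: dict[str, float] = {}
    for metric, base_value in baseline.items():
        drop_pp = (base_value - current[metric]) * 100
        if drop_pp > max_drop_pp:
            drops[metric] = round(drop_pp, 2)
    return drops


baseline = {"pass_rate": 0.92, "citation_accuracy": 0.95}  # measured before go-live
current = {"pass_rate": 0.85, "citation_accuracy": 0.94}   # measured in production
alerts = regressions(baseline, current)
```

Here only `pass_rate` is flagged: it fell 7 percentage points, while `citation_accuracy` fell 1 point and stays under the threshold.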

Best Practices

Structure traces as parent-child span hierarchies — one root span per workflow, child spans per node and tool call
Instrument every LLM call with token counts, model version, and latency
Separate compliance audit logs (immutable, clinical events) from operational logs (application events)
Mask or hash PHI in all observability data before transmission to third-party platforms
Establish quality gate thresholds before go-live — evaluate every batch and alert on threshold violations
Run evaluation on sampled production traces, not only on pre-deployment golden datasets
Track HITL trigger rate as a proxy for agent confidence drift; a rising rate can indicate model or data regression
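As one possible approach to the PHI masking practice above, span attributes can be passed through a keyed hash before export. This is a sketch only: the `PHI_FIELDS` set and `mask_phi` helper are hypothetical, the key would come from a secrets manager in practice, and whether keyed hashing satisfies a given de-identification requirement is a compliance question this example does not settle.

```python
import hashlib
import hmac

# Hypothetical list of attribute names treated as PHI in this example.
PHI_FIELDS = {"patient_id", "patient_name", "date_of_birth"}


def mask_phi(attributes: dict, key: bytes) -> dict:
    """Replace PHI values with a keyed hash; leave other attributes untouched."""
    masked = {}
    for name, value in attributes.items():
        if name in PHI_FIELDS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
            # Truncated for readability; the same input always maps to the same token.
            masked[name] = f"hmac:{digest[:16]}"
        else:
            masked[name] = value
    return masked


key = b"example-key-do-not-use"  # placeholder; load from a secrets manager
attributes = {"patient_id": "P12345", "tool_name": "get_patient_summary"}
masked = mask_phi(attributes, key)
```

A keyed hash keeps tokens stable, so runs for the same patient can still be correlated, while an unkeyed hash of a short identifier could be reversed by enumeration.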

Interview Questions

Q1: How do you detect a quality regression in a production agentic clinical workflow without manual case review?

Category: System Design | Difficulty: Senior | Role: AI Architect / ML Engineer

Answer Framework:

Quality regression detection requires three layers working together.

First, structured tracing with span-level instrumentation captures every workflow run's inputs, outputs, and sub-component results. This is the raw data layer — without it, regression detection is impossible.

Second, an automated evaluation pipeline runs on sampled production traces. It computes quality metrics (format compliance, citation accuracy, determination consistency relative to ground truth) for each sampled run and aggregates them by time window. A sliding-window quality score gives a near-real-time quality signal without requiring manual review of every case.

Third, alerting is configured on the aggregate quality score: if the rolling pass rate drops more than a threshold (e.g., 5 percentage points) below the established baseline, an alert fires. The alert includes which sub-scores dropped, enabling targeted investigation.
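The second and third layers can be sketched together as a rolling-window monitor. The class below is a hypothetical, simplified illustration that tracks only the overall pass rate; a fuller version would keep one window per sub-score so the alert can name which dimension dropped.

```python
from collections import deque


class RollingQualityMonitor:
    """Track pass/fail results over a sliding window and compare to a baseline."""

    def __init__(self, baseline_pass_rate: float, window: int = 100, max_drop_pp: float = 5.0):
        self.baseline = baseline_pass_rate
        self.max_drop_pp = max_drop_pp
        self.results: deque = deque(maxlen=window)  # oldest result falls off automatically

    @property
    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results)

    def should_alert(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # wait for a full window before judging
        return (self.baseline - self.pass_rate) * 100 > self.max_drop_pp

    def record(self, passed: bool) -> bool:
        """Add one evaluated run and report whether the alert condition holds."""
        self.results.append(passed)
        return self.should_alert()


monitor = RollingQualityMonitor(baseline_pass_rate=0.9, window=10)
healthy = [monitor.record(passed) for passed in [True] * 9 + [False]]
degraded = monitor.record(False)  # window is now 8 passes out of 10
```

In this run no alert fires while the window matches the 90% baseline; the next failure pushes the rolling rate to 80%, a 10-point drop, and `degraded` is `True`.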

Key Points to Hit:

Automated evaluation on sampled production traces is the scalable path
Sub-dimension metrics (not just overall score) are necessary for targeted debugging
A pre-deployment baseline is required — without it, there is no regression reference
LangSmith's evaluation datasets and automatic evaluation runs address this in the LangChain ecosystem

Key Takeaways

The three observability pillars for agents are: traces (per-run), metrics (aggregated), and logs (structured events)
Trace spans should form a parent-child hierarchy: one root span per workflow, child spans per node and tool call
LangSmith integrates automatically with LangGraph when LANGCHAIN_TRACING_V2=true is set
Compliance audit logs in clinical systems must be immutable and separate from operational logs
Automated evaluation pipelines running on sampled production traces provide scalable quality monitoring
PHI must be masked or hashed before transmission to third-party observability platforms
