Agent Observability
Executive Summary
Observability in agentic systems extends beyond traditional application monitoring: an agent's "behavior" spans multiple LLM calls, tool executions, branching decisions, and state transitions — each of which is a potential failure point and a source of both debugging signal and compliance evidence. Without structured observability, agentic systems are black boxes that fail silently, degrade unpredictably, and cannot be audited for correctness. This chapter covers tracing architecture, structured logging for agents, LangSmith integration, evaluation pipelines, and the observability requirements specific to regulated healthcare environments. Engineers deploying agents in production and architects designing agentic infrastructure should read this chapter.
Learning Objectives
- Define the three observability pillars as they apply to agentic systems
- Implement structured span-based tracing for an agent loop
- Configure LangSmith for LangGraph workflow tracing
- Design an evaluation pipeline with automated quality gates
- Identify the compliance logging requirements for clinical AI in regulated environments
Business Problem
An agentic prior authorization system processes 500 requests daily. Over two weeks, denial rates increase by 12%. Clinical leadership wants to know: is this a change in patient population, a regression in the clinical evaluation agent, a data quality issue in the EHR integration, or a change in guideline content? Without structured observability (traces that capture each agent decision with its inputs, reasoning, and outputs), answering this question requires manually replaying production traffic, which can take days. With structured tracing and evaluation pipelines, the regression can be isolated in hours.
Why This Technology Exists
Traditional APM (Application Performance Monitoring) tools monitor request latency, error rates, and resource utilization. These metrics are necessary but insufficient for agentic systems, where the most important failures are semantic: the agent retrieved the wrong guideline, applied the correct guideline to the wrong clinical criterion, or was misled by a poorly formatted tool response.
LangSmith (LangChain's observability platform), Weights & Biases, Arize Phoenix, and similar tools emerged to address this gap. They capture the full reasoning trace — every LLM call with its input, output, and token usage; every tool call with its inputs and results; every routing decision with the state that triggered it — making semantic debugging and regression detection tractable.
Conceptual Explanation
Three Observability Pillars for Agents
Traces: End-to-end records of a single agent execution. A trace captures the ordered sequence of LLM calls, tool calls, and routing decisions for one workflow invocation, linked by a shared trace ID. The trace is the primary debugging artifact — it answers "what happened in this specific run?"
Metrics: Aggregated statistics across many runs. Relevant agent metrics include: success rate (did the workflow complete without error?), quality rate (did the output meet quality thresholds?), latency (end-to-end and per-node), token consumption, tool call frequency by tool, HITL trigger rate, and evaluation scores (faithfulness, relevancy, accuracy).
Logs: Structured, timestamped records of discrete events within an agent run. Logs complement traces (which are hierarchical) with flat, searchable, filterable event streams. Critical log events: workflow initiation, tool call attempts, HITL triggers, errors, and completions. For clinical systems: every decision that could affect patient care is a compliance log event.
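As a minimal sketch of how the metrics pillar can be derived from the logs pillar, the following aggregates a flat list of structured events into run-level metrics. The event field names (`trace_id`, `event_type`) and the event type strings are assumptions chosen to match the logging pattern later in this chapter, not a fixed schema.

```python
from collections import defaultdict


def summarize_runs(events: list[dict]) -> dict:
    """Aggregate flat log events into run-level metrics (illustrative only)."""
    runs: dict[str, dict] = defaultdict(
        lambda: {"done": False, "ok": False, "hitl": False}
    )
    for ev in events:
        run = runs[ev["trace_id"]]
        if ev["event_type"] == "workflow.complete":
            run["done"], run["ok"] = True, True
        elif ev["event_type"] == "workflow.error":
            run["done"] = True
        elif ev["event_type"] == "hitl.triggered":
            run["hitl"] = True

    # Only runs that reached a terminal event are counted.
    finished = [r for r in runs.values() if r["done"]]
    total = max(len(finished), 1)  # avoid division by zero on an empty batch
    return {
        "runs": len(finished),
        "success_rate": sum(r["ok"] for r in finished) / total,
        "hitl_trigger_rate": sum(r["hitl"] for r in finished) / total,
    }


sample_events = [
    {"trace_id": "t1", "event_type": "workflow.start"},
    {"trace_id": "t1", "event_type": "hitl.triggered"},
    {"trace_id": "t1", "event_type": "workflow.complete"},
    {"trace_id": "t2", "event_type": "workflow.start"},
    {"trace_id": "t2", "event_type": "workflow.error"},
]
summary = summarize_runs(sample_events)
```

Grouping by trace ID is what ties the three pillars together: the same identifier that links spans in a trace also lets flat log events be rolled up into metrics.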
Span Hierarchy
A trace is composed of spans arranged in a parent-child hierarchy:
Trace: prior-auth-P12345-95810
├── Span: workflow.run (root)
│   ├── Span: node.fetch_patient_data
│   │   └── Span: tool.get_patient_summary
│   ├── Span: node.retrieve_guidelines
│   │   └── Span: tool.search_clinical_guidelines
│   ├── Span: node.evaluate_criteria
│   │   └── Span: llm.call (claude-opus-4-8)
│   ├── Span: hitl.physician_review [INTERRUPT]
│   └── Span: node.finalize_determination
│       └── Span: tool.submit_determination

Each span carries: span ID, parent span ID, trace ID, start/end timestamps, status (OK/ERROR), and relevant attributes (model, tokens, tool name, tool input/output, error message).
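The parent-child relationship can be sketched with a small data structure. This is an illustration under stated assumptions, not a tracing library: the `Span` class, its `child` helper, and `render_tree` are hypothetical names, and timestamps are omitted for brevity.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "OK"
    attributes: dict = field(default_factory=dict)

    def child(self, name: str, **attributes) -> Span:
        """Create a child span that shares this span's trace ID."""
        return Span(name=name, trace_id=self.trace_id,
                    parent_span_id=self.span_id, attributes=attributes)


def render_tree(spans: list[Span]) -> list[str]:
    """Return indented span names, depth-first, in creation order."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_span_id, []).append(s)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in children.get(parent_id, []):
            lines.append("  " * depth + s.name)
            walk(s.span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


root = Span(name="workflow.run", trace_id="prior-auth-P12345-95810")
fetch = root.child("node.fetch_patient_data")
tool = fetch.child("tool.get_patient_summary", tool_name="get_patient_summary")
evaluate = root.child("node.evaluate_criteria")
tree = render_tree([root, fetch, tool, evaluate])
```

Because every span stores only its own ID and its parent's ID, spans can be emitted independently and the tree reassembled later by whichever backend stores them.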
Core Architecture
Implementation Patterns
Pattern 1: Structured Agent Logging
"""
Structured observability for an agentic workflow.
Educational Example — Illustrative observability patterns.
"""
from __future__ import annotations
import json
import logging
import time
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Optional


class EventType(str, Enum):
    WORKFLOW_START = "workflow.start"
    WORKFLOW_COMPLETE = "workflow.complete"
    WORKFLOW_ERROR = "workflow.error"
    NODE_START = "node.start"
    NODE_COMPLETE = "node.complete"
    NODE_ERROR = "node.error"
    TOOL_CALL = "tool.call"
    TOOL_RESULT = "tool.result"
    TOOL_ERROR = "tool.error"
    LLM_CALL = "llm.call"
    LLM_RESPONSE = "llm.response"
    HITL_TRIGGERED = "hitl.triggered"
    HITL_RESOLVED = "hitl.resolved"


@dataclass
class AgentEvent:
    """Structured event for agent observability."""
    event_type: EventType
    trace_id: str
    span_id: str
    workflow_name: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    duration_ms: Optional[float] = None
    node_name: Optional[str] = None
    tool_name: Optional[str] = None
    model: Optional[str] = None
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    success: Optional[bool] = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    # Clinical compliance fields
    patient_id: Optional[str] = None  # For compliance logging
    reviewer_id: Optional[str] = None  # For HITL audit
    decision: Optional[str] = None  # For outcome logging
    metadata: dict = field(default_factory=dict)

    def to_log_dict(self) -> dict:
        return {k: v for k, v in self.__dict__.items() if v is not None}


class AgentObservability:
    """
    Structured observability wrapper for agentic workflows.
    Emits structured JSON log events compatible with CloudWatch, Datadog, etc.
    """

    def __init__(self, workflow_name: str, logger: Optional[logging.Logger] = None):
        self.workflow_name = workflow_name
        self.logger = logger or logging.getLogger(workflow_name)
        self._configure_json_logging()

    def _configure_json_logging(self):
        if not self.logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(logging.Formatter("%(message)s"))
            self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def _emit(self, event: AgentEvent):
        self.logger.info(json.dumps(event.to_log_dict()))

    def trace_workflow(self, patient_id: str):
        """Context manager for tracing a complete workflow."""
        import contextlib

        @contextlib.contextmanager
        def _trace():
            trace_id = str(uuid.uuid4())
            span_id = str(uuid.uuid4())
            start = time.monotonic()
            self._emit(AgentEvent(
                event_type=EventType.WORKFLOW_START,
                trace_id=trace_id,
                span_id=span_id,
                workflow_name=self.workflow_name,
                patient_id=patient_id,
            ))
            try:
                yield trace_id
                duration = (time.monotonic() - start) * 1000
                self._emit(AgentEvent(
                    event_type=EventType.WORKFLOW_COMPLETE,
                    trace_id=trace_id,
                    span_id=span_id,
                    workflow_name=self.workflow_name,
                    patient_id=patient_id,
                    duration_ms=duration,
                    success=True,
                ))
            except Exception as e:
                duration = (time.monotonic() - start) * 1000
                self._emit(AgentEvent(
                    event_type=EventType.WORKFLOW_ERROR,
                    trace_id=trace_id,
                    span_id=span_id,
                    workflow_name=self.workflow_name,
                    patient_id=patient_id,
                    duration_ms=duration,
                    success=False,
                    error_code=type(e).__name__,
                    error_message=str(e),
                ))
                raise

        return _trace()

    def trace_tool_call(
        self,
        trace_id: str,
        tool_name: str,
        tool_input: dict,
        patient_id: Optional[str] = None,
    ) -> dict:
        """Execute a tool call with tracing. Returns tool result."""
        span_id = str(uuid.uuid4())
        start = time.monotonic()
        self._emit(AgentEvent(
            event_type=EventType.TOOL_CALL,
            trace_id=trace_id,
            span_id=span_id,
            workflow_name=self.workflow_name,
            tool_name=tool_name,
            patient_id=patient_id,
            metadata={"input_keys": list(tool_input.keys())},
        ))
        try:
            # Actual tool dispatch would happen here
            result = {"status": "stub", "tool": tool_name}
            duration = (time.monotonic() - start) * 1000
            self._emit(AgentEvent(
                event_type=EventType.TOOL_RESULT,
                trace_id=trace_id,
                span_id=span_id,
                workflow_name=self.workflow_name,
                tool_name=tool_name,
                patient_id=patient_id,
                duration_ms=duration,
                success=True,
            ))
            return result
        except Exception as e:
            duration = (time.monotonic() - start) * 1000
            self._emit(AgentEvent(
                event_type=EventType.TOOL_ERROR,
                trace_id=trace_id,
                span_id=span_id,
                workflow_name=self.workflow_name,
                tool_name=tool_name,
                patient_id=patient_id,
                duration_ms=duration,
                success=False,
                error_code=type(e).__name__,
                error_message=str(e),
            ))
            raise

    def log_llm_call(
        self,
        trace_id: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        duration_ms: float,
        success: bool,
    ):
        """Log an LLM call's token consumption and latency."""
        self._emit(AgentEvent(
            event_type=EventType.LLM_RESPONSE,
            trace_id=trace_id,
            span_id=str(uuid.uuid4()),
            workflow_name=self.workflow_name,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            duration_ms=duration_ms,
            success=success,
        ))

    def log_hitl_triggered(
        self,
        trace_id: str,
        trigger_reason: str,
        patient_id: str,
        metadata: dict,
    ):
        """Log a HITL trigger event for compliance and monitoring."""
        self._emit(AgentEvent(
            event_type=EventType.HITL_TRIGGERED,
            trace_id=trace_id,
            span_id=str(uuid.uuid4()),
            workflow_name=self.workflow_name,
            patient_id=patient_id,
            metadata={"trigger_reason": trigger_reason, **metadata},
        ))

Pattern 2: LangSmith Integration
"""
LangSmith integration for LangGraph workflow tracing.
Educational Example — Illustrative integration pattern.
Verify current LangSmith API at smith.langchain.com/docs
"""
import os
from typing import Any


def configure_langsmith(project_name: str, environment: str = "development"):
    """
    Configure LangSmith tracing for a LangGraph application.
    LangSmith traces LangGraph workflows automatically when LANGCHAIN_TRACING_V2=true.
    """
    # Set these in environment configuration, not hardcoded
    # Recent LangSmith documentation also lists LANGSMITH_-prefixed names
    # (e.g. LANGSMITH_TRACING); confirm which ones your SDK version reads.
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = f"{project_name}-{environment}"
    # os.environ["LANGCHAIN_API_KEY"] = "..."  # Set via secrets manager in production
    # os.environ["LANGCHAIN_ENDPOINT"] = "..."  # Required for self-hosted LangSmith


def add_run_metadata(config: dict, metadata: dict) -> dict:
    """
    Add run-level metadata to a LangGraph invocation config.
    Metadata appears in LangSmith UI for filtering and searching.
    Note: updates the passed-in config in place and returns it.
    """
    if "metadata" not in config:
        config["metadata"] = {}
    config["metadata"].update(metadata)
    return config


def run_with_langsmith_metadata(
    graph: Any,
    initial_state: dict,
    thread_id: str,
    patient_id: str,
    procedure_code: str,
) -> dict:
    """
    Run a LangGraph workflow with LangSmith metadata for filtering in the UI.
    """
    config = {
        "configurable": {"thread_id": thread_id},
        "metadata": {
            # For filtering in LangSmith UI. Pass a hashed reference rather
            # than a raw identifier if traces leave your PHI boundary.
            "patient_id": patient_id,
            "procedure_code": procedure_code,
            "workflow_version": "2.0",
            "environment": os.environ.get("APP_ENV", "development"),
        },
        "tags": ["prior-auth", procedure_code, "hms-workflow"],
    }
    return graph.invoke(initial_state, config)

Pattern 3: Automated Evaluation Pipeline
"""
Automated quality evaluation pipeline for agentic outputs.
Educational Example — Illustrative evaluation patterns.
"""
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationResult:
    workflow_id: str
    patient_id: str
    procedure_code: str
    # Quality dimensions
    criteria_accuracy: Optional[float] = None  # 0.0–1.0: were criteria applied correctly?
    guideline_citation_accuracy: float = 0.0  # 0.0–1.0: were citations correct and relevant?
    determination_consistency: float = 0.0  # 0.0–1.0: consistent with similar cases?
    output_format_compliance: float = 0.0  # 0.0–1.0: did output match expected schema?
    # Aggregate
    overall_quality_score: float = 0.0
    passed_quality_gate: bool = False
    evaluation_notes: str = ""


def evaluate_prior_auth_output(
    workflow_output: dict,
    ground_truth: Optional[dict] = None,
) -> EvaluationResult:
    """
    Evaluate a prior authorization workflow output.
    For production: integrate with RAGAS or LangSmith evaluation datasets.
    ground_truth: if provided, compare against known-correct evaluation.
    If not provided, apply heuristic quality checks only.
    """
    result = EvaluationResult(
        workflow_id=workflow_output.get("workflow_id", "unknown"),
        patient_id=workflow_output.get("patient_id", "unknown"),
        procedure_code=workflow_output.get("procedure_code", "unknown"),
    )
    determination = workflow_output.get("final_determination", {})

    # 1. Output format compliance — does the output have required fields?
    required_fields = ["criteria_met", "rationale", "supporting_guidelines", "determination"]
    present_fields = sum(1 for f in required_fields if f in determination)
    result.output_format_compliance = present_fields / len(required_fields)

    # 2. Guideline citation presence — are citations non-empty?
    citations = determination.get("supporting_guidelines", [])
    result.guideline_citation_accuracy = min(len(citations) / 3, 1.0)  # expect ≥ 3 citations

    # 3. If ground truth is available, compare determination
    if ground_truth:
        expected = ground_truth.get("criteria_met")
        actual = determination.get("criteria_met")
        result.criteria_accuracy = 1.0 if expected == actual else 0.0

    # 4. Overall score
    scores = [
        result.output_format_compliance,
        result.guideline_citation_accuracy,
    ]
    if result.criteria_accuracy is not None:
        scores.append(result.criteria_accuracy)
    result.overall_quality_score = sum(scores) / len(scores)

    # Quality gate: all required fields present AND citations present
    result.passed_quality_gate = (
        result.output_format_compliance >= 1.0
        and result.guideline_citation_accuracy >= 0.33  # At least 1 citation
    )
    return result


class EvaluationPipeline:
    """
    Continuous evaluation pipeline for production agent monitoring.
    In production: integrate with LangSmith Evaluation or RAGAS.
    """

    def __init__(self, quality_gate_threshold: float = 0.80):
        self.quality_gate_threshold = quality_gate_threshold
        self.evaluation_history: list[EvaluationResult] = []

    def evaluate_batch(
        self,
        workflow_outputs: list[dict],
        ground_truths: Optional[list[dict]] = None,
    ) -> dict:
        """Evaluate a batch of workflow outputs and return aggregate metrics."""
        results = []
        for i, output in enumerate(workflow_outputs):
            gt = ground_truths[i] if ground_truths else None
            result = evaluate_prior_auth_output(output, gt)
            results.append(result)
            self.evaluation_history.append(result)

        pass_rate = sum(1 for r in results if r.passed_quality_gate) / max(len(results), 1)
        avg_quality = sum(r.overall_quality_score for r in results) / max(len(results), 1)
        return {
            "batch_size": len(results),
            "pass_rate": pass_rate,
            "avg_quality_score": avg_quality,
            "below_gate_count": sum(1 for r in results if not r.passed_quality_gate),
            "gate_threshold": self.quality_gate_threshold,
            "alert_required": avg_quality < self.quality_gate_threshold,
        }

Enterprise Considerations
Token cost observability. Token consumption drives LLM API costs at scale. Instrument every LLM call to capture input and output token counts. Aggregate by model, workflow, node, and time period. Alert when token usage per workflow exceeds expected ranges: unexpectedly high counts can indicate prompt injection, reasoning loops, or context window mismanagement.
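A minimal sketch of per-workflow token aggregation follows. It assumes LLM events shaped like the structured log events in Pattern 1 (`trace_id`, `input_tokens`, `output_tokens`). The function names are hypothetical, and the 18,000-token alert value is borrowed from the illustrative healthcare example in this chapter rather than being a recommendation.

```python
from collections import defaultdict

# Illustrative alert level; tune it from your own observed baseline.
TOKEN_ALERT_THRESHOLD = 18_000


def tokens_by_trace(llm_events: list[dict]) -> dict[str, int]:
    """Sum input and output tokens per trace ID."""
    totals: dict[str, int] = defaultdict(int)
    for ev in llm_events:
        totals[ev["trace_id"]] += ev.get("input_tokens", 0) + ev.get("output_tokens", 0)
    return dict(totals)


def over_budget(totals: dict[str, int], threshold: int = TOKEN_ALERT_THRESHOLD) -> list[str]:
    """Return trace IDs whose total token usage exceeds the threshold."""
    return sorted(t for t, n in totals.items() if n > threshold)


llm_events = [
    {"trace_id": "t1", "model": "model-a", "input_tokens": 4_000, "output_tokens": 900},
    {"trace_id": "t1", "model": "model-a", "input_tokens": 5_200, "output_tokens": 1_100},
    {"trace_id": "t2", "model": "model-a", "input_tokens": 15_000, "output_tokens": 3_500},
]
totals = tokens_by_trace(llm_events)
flagged = over_budget(totals)
```

The same grouping can be repeated with `model` or `node_name` as the key to attribute cost to a specific model or workflow step.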
Latency percentiles. Monitor p50, p95, and p99 latency for each node in the workflow. Long-tail latency (p99) often reveals specific code paths (large tool responses, complex re-ranking, high context length LLM calls) that degrade user experience under load. Per-node latency attribution is possible only when spans are properly structured.
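Per-node percentiles can be computed from span or event durations with the standard library alone. The sketch below assumes events carry `node_name` and `duration_ms` fields, as in Pattern 1. The function name is hypothetical, and the durations are synthetic.

```python
import statistics
from collections import defaultdict


def latency_percentiles(node_events: list[dict]) -> dict[str, dict[str, float]]:
    """Compute p50/p95/p99 of duration_ms per node; nodes with < 2 samples are skipped."""
    by_node: dict[str, list[float]] = defaultdict(list)
    for ev in node_events:
        by_node[ev["node_name"]].append(ev["duration_ms"])

    report = {}
    for node, durations in by_node.items():
        if len(durations) < 2:
            continue  # not enough data to interpolate percentiles
        # n=100 yields 99 cut points; index k-1 is the k-th percentile.
        cuts = statistics.quantiles(durations, n=100, method="inclusive")
        report[node] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return report


# Synthetic data: one node is uniformly fast, the other has a slow tail.
node_events = (
    [{"node_name": "node.fetch_patient_data", "duration_ms": 100.0 + i} for i in range(100)]
    + [{"node_name": "node.evaluate_criteria", "duration_ms": 2_000.0} for _ in range(98)]
    + [{"node_name": "node.evaluate_criteria", "duration_ms": 30_000.0} for _ in range(2)]
)
report = latency_percentiles(node_events)
```

In this synthetic data the median and p95 for `node.evaluate_criteria` look healthy while p99 exposes the slow tail, which is the pattern an end-to-end average would hide.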
Evaluation dataset drift. The golden evaluation dataset used to validate the pipeline should reflect current patient population and guideline content. If the dataset is not updated when guidelines change, evaluation scores will diverge from real-world accuracy. Schedule quarterly dataset reviews.
Privacy in traces. Traces capture tool inputs and LLM inputs, which in clinical contexts contain PHI. Configure trace collectors to redact or mask PHI fields before transmission to observability platforms. Verify that your observability vendor has a signed BAA if traces will contain PHI.
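One way to approach field-level redaction is a recursive filter applied to tool and LLM inputs before they are attached to a span. The sketch below is illustrative only: the `PHI_KEYS` set and the `redact_phi` function are hypothetical, and key-based masking does not catch PHI embedded in free text, which needs separate handling.

```python
import copy

# Hypothetical PHI field names; a real deployment would derive this
# list from its own data dictionary.
PHI_KEYS = {"patient_name", "dob", "mrn", "address", "ssn"}
REDACTED = "[REDACTED]"


def redact_phi(payload, phi_keys=PHI_KEYS):
    """Return a copy of payload with the values of PHI keys masked at any depth."""
    if isinstance(payload, dict):
        return {
            k: REDACTED if k in phi_keys else redact_phi(v, phi_keys)
            for k, v in payload.items()
        }
    if isinstance(payload, list):
        return [redact_phi(item, phi_keys) for item in payload]
    return copy.deepcopy(payload)


tool_input = {
    "mrn": "12345",
    "procedure_code": "95810",
    "demographics": {"patient_name": "Jane Doe", "age_band": "40-49"},
    "notes": [{"dob": "1980-01-01"}],
}
safe_input = redact_phi(tool_input)
```

Returning a copy rather than mutating in place keeps the original payload available to the tool itself while only the masked version reaches the trace collector.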
Security Considerations
Trace data classification. In clinical workflows, traces contain PHI. Treat traces with the same access controls as clinical records: role-based access, audit logging of trace access, retention limits aligned with HIPAA requirements.
Observability as attack surface. An observability platform with weak access controls exposes complete agent reasoning traces, including prompts, tool inputs, and tool outputs. A compromised trace store is equivalent to a compromised EHR query log. Apply principle of least privilege: developers access anonymized or de-identified traces in non-production; only compliance officers access production traces with PHI.
Healthcare Example
Educational Example — Illustrative Observability Configuration. Not intended for clinical decision making.
A Reference Healthcare Organization's agentic prior authorization system implements the following observability configuration:
| Dimension | Implementation | Alert Threshold |
|---|---|---|
| Trace store | LangSmith (BAA executed) | N/A |
| Compliance audit log | Immutable CloudWatch Logs with S3 export | Any HITL trigger |
| Quality gate | Automated eval pipeline; criteria accuracy ≥ 0.85 | Pass rate < 0.80 |
| Token budget | Per-workflow token cap = 20,000 | Per-workflow > 18,000 |
| HITL trigger rate | Target: 15–25% of requests | > 35% (regression alert) |
| Latency p95 | Target: 45 seconds end-to-end | > 90 seconds |
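The numeric alert thresholds in the table above could be encoded as data and checked against each monitoring window. This is a sketch under assumptions: the metric names and the `evaluate_alerts` function are hypothetical, and the "any HITL trigger" audit rule is event-driven rather than threshold-based, so it is not modeled here.

```python
# Alert rules mirroring the illustrative table: (metric, direction, limit).
ALERT_RULES = [
    ("quality_pass_rate", "below", 0.80),
    ("tokens_per_workflow", "above", 18_000),
    ("hitl_trigger_rate", "above", 0.35),
    ("latency_p95_seconds", "above", 90),
]


def evaluate_alerts(metrics: dict, rules=ALERT_RULES) -> list[str]:
    """Return the names of metrics that breach their alert threshold."""
    breached = []
    for name, direction, limit in rules:
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this window
        if (direction == "below" and value < limit) or (
            direction == "above" and value > limit
        ):
            breached.append(name)
    return breached


window_metrics = {
    "quality_pass_rate": 0.91,
    "tokens_per_workflow": 12_400,
    "hitl_trigger_rate": 0.38,
    "latency_p95_seconds": 52,
}
alerts = evaluate_alerts(window_metrics)
```

Keeping thresholds in a data structure rather than scattered through code makes them reviewable alongside the table they come from.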
PHI is redacted from trace metadata before transmission to LangSmith. Patient ID is stored as a hashed reference — lookups require access to the hash mapping, which is in the compliance store, not the trace store.
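The text does not specify how the hashed reference is produced. One common approach, shown here as an assumption rather than the organization's actual method, is a keyed HMAC-SHA256 so that the reference cannot be recomputed by anyone who lacks the key. The function name and placeholder key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_patient_id(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable, non-reversible reference using keyed HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Placeholder key for illustration only; load the real key from a secrets manager.
demo_key = b"replace-with-secret-from-vault"
ref_a = pseudonymize_patient_id("P12345", demo_key)
ref_b = pseudonymize_patient_id("P12345", demo_key)
ref_c = pseudonymize_patient_id("P67890", demo_key)
```

The same patient always maps to the same reference, so traces remain filterable per patient, while the mapping back to the real identifier stays with whoever holds the key or lookup table.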
Common Mistakes
Logging the LLM response text in application logs. Application logs should capture structured event metadata, not the full LLM response body. Full response content belongs in the trace store (LangSmith), not CloudWatch or Datadog. This keeps log storage costs manageable and PHI exposure scoped.
No per-node latency. Logging only end-to-end workflow latency makes it impossible to identify which node is causing slow p99 responses. Span-based tracing with per-node timing is necessary.
Evaluation only on pass/fail. Binary pass/fail metrics do not reveal which quality dimension is degrading. Track sub-scores (citation accuracy, format compliance, criteria accuracy) separately to enable targeted debugging.
No baseline quality metric. Without a pre-deployment quality baseline, you cannot detect regression. Establish baseline metrics before go-live and alert when they drop by more than a defined threshold (e.g., 5 percentage points).
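A baseline comparison over sub-scores can be sketched in a few lines. The dimension names, the example values, and the `find_regressions` function are hypothetical; the 0.05 default corresponds to the 5 percentage point example above.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.05,
) -> dict[str, float]:
    """Return sub-scores that fell by more than max_drop versus the baseline."""
    regressions = {}
    for dimension, base_value in baseline.items():
        # A dimension missing from the current window is treated as 0.0.
        drop = base_value - current.get(dimension, 0.0)
        if drop > max_drop:
            regressions[dimension] = round(drop, 4)
    return regressions


baseline_scores = {"format_compliance": 0.99, "citation_accuracy": 0.92, "criteria_accuracy": 0.90}
current_scores = {"format_compliance": 0.98, "citation_accuracy": 0.81, "criteria_accuracy": 0.88}
regressed = find_regressions(baseline_scores, current_scores)
```

Here only citation accuracy is reported, which points the investigation at guideline retrieval rather than at output formatting or criteria evaluation.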
Best Practices
- Structure traces as parent-child span hierarchies — one root span per workflow, child spans per node and tool call
- Instrument every LLM call with token counts, model version, and latency
- Separate compliance audit logs (immutable, clinical events) from operational logs (application events)
- Mask or hash PHI in all observability data before transmission to third-party platforms
- Establish quality gate thresholds before go-live — evaluate every batch and alert on threshold violations
- Run evaluation on sampled production traces, not only on pre-deployment golden datasets
- Track HITL trigger rate as a proxy for agent confidence drift — rising rates indicate model or data regression
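For the practice of evaluating sampled production traces, one option is deterministic sampling keyed on the trace ID, so that every component makes the same keep-or-skip decision for a given run. The sketch below is an assumption about how sampling might be done, not a feature of any particular platform. `should_sample` is a hypothetical name.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_sample(t, 0.10)]
```

Hash-based sampling avoids shared state and random seeds: re-running the evaluation job over the same traces selects the same subset.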
Interview Questions
Q1: How do you detect a quality regression in a production agentic clinical workflow without manual case review?
Category: System Design | Difficulty: Senior | Role: AI Architect / ML Engineer
Answer Framework:
Quality regression detection requires three layers working together.
First, structured tracing with span-level instrumentation captures every workflow run's inputs, outputs, and sub-component results. This is the raw data layer — without it, regression detection is impossible.
Second, an automated evaluation pipeline runs on sampled production traces. It computes quality metrics (format compliance, citation accuracy, determination consistency relative to ground truth) for each sampled run and aggregates them by time window. A sliding-window quality score gives a near-real-time quality signal without requiring manual review of every case.
Third, alerting is configured on the aggregate quality score: if the rolling pass rate drops more than a threshold (e.g., 5 percentage points) below the established baseline, an alert fires. The alert includes which sub-scores dropped, enabling targeted investigation.
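The sliding-window signal and baseline alert described above can be sketched as follows. The class name, the window size, and the example values are illustrative assumptions. A production version would also track sub-scores per window.

```python
from collections import deque


class RollingPassRate:
    """Track the pass rate over the most recent `window` evaluated runs."""

    def __init__(self, baseline: float, window: int = 200, max_drop: float = 0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        self.results = deque(maxlen=window)  # oldest results fall off automatically

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    @property
    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_alert(self) -> bool:
        # Wait for a full window so a few early failures do not trigger an alert.
        window_full = len(self.results) == self.results.maxlen
        return window_full and (self.baseline - self.pass_rate) > self.max_drop


monitor = RollingPassRate(baseline=0.90, window=10)
for passed in [True] * 9 + [False]:
    monitor.record(passed)
healthy = monitor.should_alert()  # window holds 9 passes out of 10

for passed in [False, False, False]:
    monitor.record(passed)
degraded = monitor.should_alert()  # window now holds 6 passes out of 10
```

A bounded deque keeps memory constant and makes the signal responsive: three consecutive failures are enough to move a ten-run window from the baseline to an alerting state.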
Key Points to Hit:
- Automated evaluation on sampled production traces is the scalable path
- Sub-dimension metrics (not just overall score) are necessary for targeted debugging
- A pre-deployment baseline is required — without it, there is no regression reference
- LangSmith's evaluation datasets and automatic evaluation runs address this in the LangChain ecosystem
Key Takeaways
- The three observability pillars for agents are: traces (per-run), metrics (aggregated), and logs (structured events)
- Trace spans should form a parent-child hierarchy: one root span per workflow, child spans per node and tool call
- LangSmith integrates automatically with LangGraph when LANGCHAIN_TRACING_V2=true is set
- Compliance audit logs in clinical systems must be immutable and separate from operational logs
- Automated evaluation pipelines running on sampled production traces provide scalable quality monitoring
- PHI must be masked or hashed before transmission to third-party observability platforms
Glossary
| Term | Definition |
|---|---|
| Trace | An end-to-end record of a single agent workflow execution, composed of nested spans |
| Span | A single unit of work within a trace (one LLM call, one tool call, one node execution) |
| Trace ID | Unique identifier linking all spans belonging to the same workflow execution |
| Evaluation pipeline | Automated system that applies quality metrics to agent outputs and aggregates results |
| Quality gate | A minimum quality score threshold; below it, alerts fire or deployments are blocked |
| Golden dataset | A curated set of inputs with known-correct expected outputs, used for offline evaluation |
| RAGAS | Retrieval-Augmented Generation Assessment — a framework for evaluating RAG and agent quality |
| LangSmith | LangChain's observability and evaluation platform for LLM applications |
Further Reading
In This Repository:
- LangGraph Deep Dive — LangSmith integration within LangGraph workflows
- Evaluation and Benchmarking — RAGAS and golden dataset design
- Human-in-the-Loop — HITL audit log requirements
External References:
- LangSmith documentation — official guide to tracing, evaluation, and datasets
- OpenTelemetry specification — standard for vendor-neutral distributed tracing
- RAGAS documentation — retrieval and generation evaluation metrics