Agent Observability

Executive Summary

Observability in agentic systems extends beyond traditional application monitoring: an agent's "behavior" spans multiple LLM calls, tool executions, branching decisions, and state transitions — each of which is a potential failure point and a source of both debugging signal and compliance evidence. Without structured observability, agentic systems are black boxes that fail silently, degrade unpredictably, and cannot be audited for correctness. This chapter covers tracing architecture, structured logging for agents, LangSmith integration, evaluation pipelines, and the observability requirements specific to regulated healthcare environments. Engineers deploying agents in production and architects designing agentic infrastructure should read this chapter.

Learning Objectives

  • Define the three observability pillars as they apply to agentic systems
  • Implement structured span-based tracing for an agent loop
  • Configure LangSmith for LangGraph workflow tracing
  • Design an evaluation pipeline with automated quality gates
  • Identify the compliance logging requirements for clinical AI in regulated environments

Business Problem

An agentic prior authorization system processes 500 requests daily. Over two weeks, denial rates increase by 12%. Clinical leadership wants to know: is this a change in patient population, a regression in the clinical evaluation agent, a data quality issue in the EHR integration, or a change in guideline content? Without structured observability (traces that capture each agent decision with its inputs, reasoning, and outputs), answering this question requires manually replaying production traffic, which can take days. With structured tracing and evaluation pipelines, the cause can be isolated in hours.

Why This Technology Exists

Traditional APM (Application Performance Monitoring) tools monitor request latency, error rates, and resource utilization. These metrics are necessary but insufficient for agentic systems, where the most important failures are semantic: the agent retrieved the wrong guideline, applied the correct guideline to the wrong clinical criterion, or was misled by a poorly formatted tool response.

LangSmith (LangChain's observability platform), Weights & Biases, Arize Phoenix, and similar tools emerged to address this gap. They capture the full reasoning trace — every LLM call with its input, output, and token usage; every tool call with its inputs and results; every routing decision with the state that triggered it — making semantic debugging and regression detection tractable.

Conceptual Explanation

Three Observability Pillars for Agents

Traces: End-to-end records of a single agent execution. A trace captures the ordered sequence of LLM calls, tool calls, and routing decisions for one workflow invocation, linked by a shared trace ID. The trace is the primary debugging artifact — it answers "what happened in this specific run?"

Metrics: Aggregated statistics across many runs. Relevant agent metrics include: success rate (did the workflow complete without error?), quality rate (did the output meet quality thresholds?), latency (end-to-end and per-node), token consumption, tool call frequency by tool, HITL trigger rate, and evaluation scores (faithfulness, relevancy, accuracy).
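
As a rough illustration of how such metrics could be derived, the sketch below aggregates a flat list of event dictionaries into run-level statistics. The event keys (event_type, duration_ms, input_tokens, output_tokens, tool_name) and the function name summarize_runs are assumptions made for this example. The HITL rate here is simply triggers divided by runs, which overstates the rate if a single run can trigger review more than once.

```python
from collections import Counter
from typing import Iterable


def summarize_runs(events: Iterable[dict]) -> dict:
    """Aggregate flat event dicts into run-level metrics (illustrative keys)."""
    completed = errored = hitl = 0
    total_tokens = 0
    durations = []
    tool_calls = Counter()

    for event in events:
        kind = event.get("event_type")
        if kind == "workflow.complete":
            completed += 1
            durations.append(event.get("duration_ms", 0.0))
        elif kind == "workflow.error":
            errored += 1
        elif kind == "hitl.triggered":
            hitl += 1
        elif kind == "tool.call":
            tool_calls[event.get("tool_name", "unknown")] += 1
        elif kind == "llm.response":
            total_tokens += event.get("input_tokens", 0) + event.get("output_tokens", 0)

    runs = completed + errored
    return {
        "runs": runs,
        "success_rate": completed / runs if runs else None,
        "hitl_trigger_rate": hitl / runs if runs else None,
        "avg_latency_ms": sum(durations) / len(durations) if durations else None,
        "tokens_per_run": total_tokens / runs if runs else None,
        "tool_call_counts": dict(tool_calls),
    }


sample_events = [
    {"event_type": "workflow.complete", "duration_ms": 1200.0},
    {"event_type": "workflow.complete", "duration_ms": 800.0},
    {"event_type": "workflow.error"},
    {"event_type": "workflow.error"},
    {"event_type": "hitl.triggered"},
    {"event_type": "tool.call", "tool_name": "get_patient_summary"},
    {"event_type": "llm.response", "input_tokens": 900, "output_tokens": 100},
]
summary = summarize_runs(sample_events)
```

In practice these aggregates would usually be computed by the metrics backend rather than in application code; the point is only that every metric listed above falls out of well-structured events.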

Logs: Structured, timestamped records of discrete events within an agent run. Logs complement traces (which are hierarchical) with flat, searchable, filterable event streams. Critical log events: workflow initiation, tool call attempts, HITL triggers, errors, and completions. For clinical systems: every decision that could affect patient care is a compliance log event.
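
To show why flat, structured events are easy to search, here is a minimal sketch that filters JSON log lines by trace ID and, optionally, by event type. The function name events_for_trace and the sample log lines are hypothetical. It assumes one JSON object per line and ISO 8601 timestamps in a single timezone, so that string sorting matches time order.

```python
import json
from typing import Iterable, Optional


def events_for_trace(
    log_lines: Iterable[str],
    trace_id: str,
    event_types: Optional[set] = None,
) -> list:
    """Return parsed events for one trace, ordered by timestamp."""
    matched = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as library output
        if not isinstance(event, dict) or event.get("trace_id") != trace_id:
            continue
        if event_types and event.get("event_type") not in event_types:
            continue
        matched.append(event)
    return sorted(matched, key=lambda e: e.get("timestamp", ""))


lines = [
    '{"trace_id": "t-1", "event_type": "tool.error", "timestamp": "2024-01-01T10:00:05+00:00"}',
    "plain text line that is not JSON",
    '{"trace_id": "t-2", "event_type": "workflow.start", "timestamp": "2024-01-01T10:00:01+00:00"}',
    '{"trace_id": "t-1", "event_type": "workflow.start", "timestamp": "2024-01-01T10:00:00+00:00"}',
]
all_events = events_for_trace(lines, "t-1")
errors_only = events_for_trace(lines, "t-1", {"tool.error"})
```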

Span Hierarchy

A trace is composed of spans arranged in a parent-child hierarchy:

text
Trace: prior-auth-P12345-95810
├── Span: workflow.run (root)
│   ├── Span: node.fetch_patient_data
│   │   └── Span: tool.get_patient_summary
│   ├── Span: node.retrieve_guidelines
│   │   └── Span: tool.search_clinical_guidelines
│   ├── Span: node.evaluate_criteria
│   │   └── Span: llm.call (claude-opus-4-8)
│   ├── Span: hitl.physician_review [INTERRUPT]
│   └── Span: node.finalize_determination
│       └── Span: tool.submit_determination

Each span carries: span ID, parent span ID, trace ID, start/end timestamps, status (OK/ERROR), and relevant attributes (model, tokens, tool name, tool input/output, error message).
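
The span fields listed above can be sketched as a small data structure. This is an illustrative model only, not the schema of LangSmith, OpenTelemetry, or any other tool; the Span class, its child and finish methods, and the render helper are names invented for this example.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Span:
    """One unit of work in a trace, linked to its parent by parent_span_id."""
    name: str
    trace_id: str
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    end_time: Optional[datetime] = None
    status: str = "OK"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child(self, name: str, **attributes) -> "Span":
        """Create a child span that shares this span's trace ID."""
        span = Span(
            name=name,
            trace_id=self.trace_id,
            parent_span_id=self.span_id,
            attributes=attributes,
        )
        self.children.append(span)
        return span

    def finish(self, status: str = "OK") -> None:
        """Record the end timestamp and final status."""
        self.end_time = datetime.now(timezone.utc)
        self.status = status


def render(span: Span, depth: int = 0) -> list:
    """Flatten the hierarchy into indented lines for quick inspection."""
    lines = ["  " * depth + f"{span.name} [{span.status}]"]
    for child_span in span.children:
        lines.extend(render(child_span, depth + 1))
    return lines


root = Span(name="workflow.run", trace_id="prior-auth-P12345-95810")
node = root.child("node.fetch_patient_data")
tool = node.child("tool.get_patient_summary", tool_name="get_patient_summary")
for span in (tool, node, root):
    span.finish()
print("\n".join(render(root)))
```

Because every child records its parent's span ID, per-node timing and error attribution can be reconstructed from the stored spans even if they arrive out of order.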

Core Architecture

Implementation Patterns

Pattern 1: Structured Agent Logging

python
"""
Structured observability for an agentic workflow.
Educational Example — Illustrative observability patterns.
"""
from __future__ import annotations

import json
import logging
import time
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Optional


class EventType(str, Enum):
    WORKFLOW_START = "workflow.start"
    WORKFLOW_COMPLETE = "workflow.complete"
    WORKFLOW_ERROR = "workflow.error"
    NODE_START = "node.start"
    NODE_COMPLETE = "node.complete"
    NODE_ERROR = "node.error"
    TOOL_CALL = "tool.call"
    TOOL_RESULT = "tool.result"
    TOOL_ERROR = "tool.error"
    LLM_CALL = "llm.call"
    LLM_RESPONSE = "llm.response"
    HITL_TRIGGERED = "hitl.triggered"
    HITL_RESOLVED = "hitl.resolved"


@dataclass
class AgentEvent:
    """Structured event for agent observability."""
    event_type: EventType
    trace_id: str
    span_id: str
    workflow_name: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    duration_ms: Optional[float] = None
    node_name: Optional[str] = None
    tool_name: Optional[str] = None
    model: Optional[str] = None
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    success: Optional[bool] = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    # Clinical compliance fields
    patient_id: Optional[str] = None       # For compliance logging
    reviewer_id: Optional[str] = None      # For HITL audit
    decision: Optional[str] = None         # For outcome logging
    metadata: dict = field(default_factory=dict)

    def to_log_dict(self) -> dict:
        return {k: v for k, v in self.__dict__.items() if v is not None}


class AgentObservability:
    """
    Structured observability wrapper for agentic workflows.
    Emits structured JSON log events compatible with CloudWatch, Datadog, etc.
    """

    def __init__(self, workflow_name: str, logger: Optional[logging.Logger] = None):
        self.workflow_name = workflow_name
        self.logger = logger or logging.getLogger(workflow_name)
        self._configure_json_logging()

    def _configure_json_logging(self):
        if not self.logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(logging.Formatter("%(message)s"))
            self.logger.addHandler(handler)
            self.logger.setLevel(logging.INFO)

    def _emit(self, event: AgentEvent):
        self.logger.info(json.dumps(event.to_log_dict()))

    def trace_workflow(self, patient_id: str):
        """Context manager for tracing a complete workflow."""
        import contextlib

        @contextlib.contextmanager
        def _trace():
            trace_id = str(uuid.uuid4())
            span_id = str(uuid.uuid4())
            start = time.monotonic()

            self._emit(AgentEvent(
                event_type=EventType.WORKFLOW_START,
                trace_id=trace_id,
                span_id=span_id,
                workflow_name=self.workflow_name,
                patient_id=patient_id,
            ))

            try:
                yield trace_id
            except Exception as e:
                duration = (time.monotonic() - start) * 1000
                self._emit(AgentEvent(
                    event_type=EventType.WORKFLOW_ERROR,
                    trace_id=trace_id,
                    span_id=span_id,
                    workflow_name=self.workflow_name,
                    patient_id=patient_id,
                    duration_ms=duration,
                    success=False,
                    error_code=type(e).__name__,
                    error_message=str(e),  # may echo PHI from upstream errors; scrub if needed
                ))
                raise
            else:
                # Emit completion outside the try body so that a failure while
                # logging is not misreported as a workflow error.
                duration = (time.monotonic() - start) * 1000
                self._emit(AgentEvent(
                    event_type=EventType.WORKFLOW_COMPLETE,
                    trace_id=trace_id,
                    span_id=span_id,
                    workflow_name=self.workflow_name,
                    patient_id=patient_id,
                    duration_ms=duration,
                    success=True,
                ))

        return _trace()

    def trace_tool_call(
        self,
        trace_id: str,
        tool_name: str,
        tool_input: dict,
        patient_id: Optional[str] = None,
    ) -> dict:
        """Execute a tool call with tracing. Returns tool result."""
        span_id = str(uuid.uuid4())
        start = time.monotonic()

        self._emit(AgentEvent(
            event_type=EventType.TOOL_CALL,
            trace_id=trace_id,
            span_id=span_id,
            workflow_name=self.workflow_name,
            tool_name=tool_name,
            patient_id=patient_id,
            metadata={"input_keys": list(tool_input.keys())},
        ))

        try:
            # Actual tool dispatch would happen here
            result = {"status": "stub", "tool": tool_name}
            duration = (time.monotonic() - start) * 1000
            self._emit(AgentEvent(
                event_type=EventType.TOOL_RESULT,
                trace_id=trace_id,
                span_id=span_id,
                workflow_name=self.workflow_name,
                tool_name=tool_name,
                patient_id=patient_id,
                duration_ms=duration,
                success=True,
            ))
            return result

        except Exception as e:
            duration = (time.monotonic() - start) * 1000
            self._emit(AgentEvent(
                event_type=EventType.TOOL_ERROR,
                trace_id=trace_id,
                span_id=span_id,
                workflow_name=self.workflow_name,
                tool_name=tool_name,
                patient_id=patient_id,
                duration_ms=duration,
                success=False,
                error_code=type(e).__name__,
                error_message=str(e),
            ))
            raise

    def log_llm_call(
        self,
        trace_id: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        duration_ms: float,
        success: bool,
    ):
        """Log an LLM call's token consumption and latency."""
        self._emit(AgentEvent(
            event_type=EventType.LLM_RESPONSE,
            trace_id=trace_id,
            span_id=str(uuid.uuid4()),
            workflow_name=self.workflow_name,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            duration_ms=duration_ms,
            success=success,
        ))

    def log_hitl_triggered(
        self,
        trace_id: str,
        trigger_reason: str,
        patient_id: str,
        metadata: dict,
    ):
        """Log a HITL trigger event for compliance and monitoring."""
        self._emit(AgentEvent(
            event_type=EventType.HITL_TRIGGERED,
            trace_id=trace_id,
            span_id=str(uuid.uuid4()),
            workflow_name=self.workflow_name,
            patient_id=patient_id,
            metadata={"trigger_reason": trigger_reason, **metadata},
        ))

Pattern 2: LangSmith Integration

python
"""
LangSmith integration for LangGraph workflow tracing.
Educational Example — Illustrative integration pattern.
Verify current LangSmith API at smith.langchain.com/docs
"""
import os
from typing import Any


def configure_langsmith(project_name: str, environment: str = "development"):
    """
    Configure LangSmith tracing for a LangGraph application.
    LangSmith traces LangGraph workflows automatically when LANGCHAIN_TRACING_V2=true.
    Newer LangSmith documentation uses LANGSMITH_-prefixed names (for example
    LANGSMITH_TRACING); confirm which names your SDK version reads.
    """
    # Illustrative only: in a real deployment, set these in environment
    # configuration rather than assigning them in code.
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = f"{project_name}-{environment}"
    # os.environ["LANGCHAIN_API_KEY"] = "..."  # Set via secrets manager in production
    # os.environ["LANGCHAIN_ENDPOINT"] = "..."  # Required for self-hosted LangSmith


def add_run_metadata(config: dict, metadata: dict) -> dict:
    """
    Add run-level metadata to a LangGraph invocation config.
    Metadata appears in LangSmith UI for filtering and searching.
    """
    if "metadata" not in config:
        config["metadata"] = {}

    config["metadata"].update(metadata)
    return config


def run_with_langsmith_metadata(
    graph: Any,
    initial_state: dict,
    thread_id: str,
    patient_id: str,
    procedure_code: str,
) -> dict:
    """
    Run a LangGraph workflow with LangSmith metadata for filtering in the UI.
    """
    config = {
        "configurable": {"thread_id": thread_id},
        "metadata": {
            # PHI caution: pass a hashed or tokenized reference here, not a raw
            # identifier, before metadata leaves your environment.
            "patient_id": patient_id,        # For filtering in LangSmith UI
            "procedure_code": procedure_code,
            "workflow_version": "2.0",
            "environment": os.environ.get("APP_ENV", "development"),
        },
        "tags": ["prior-auth", procedure_code, "hms-workflow"],
    }

    return graph.invoke(initial_state, config)

Pattern 3: Automated Evaluation Pipeline

python
"""
Automated quality evaluation pipeline for agentic outputs.
Educational Example — Illustrative evaluation patterns.
"""
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationResult:
    workflow_id: str
    patient_id: str
    procedure_code: str
    # Quality dimensions
    criteria_accuracy: Optional[float] = None    # 0.0–1.0: were criteria applied correctly?
    guideline_citation_accuracy: float = 0.0     # 0.0–1.0: were citations correct and relevant?
    determination_consistency: float = 0.0       # 0.0–1.0: consistent with similar cases?
    output_format_compliance: float = 0.0        # 0.0–1.0: did output match expected schema?
    # Aggregate
    overall_quality_score: float = 0.0
    passed_quality_gate: bool = False
    evaluation_notes: str = ""


def evaluate_prior_auth_output(
    workflow_output: dict,
    ground_truth: Optional[dict] = None,
) -> EvaluationResult:
    """
    Evaluate a prior authorization workflow output.
    For production: integrate with RAGAS or LangSmith evaluation datasets.

    ground_truth: if provided, compare against known-correct evaluation.
    If not provided, apply heuristic quality checks only.
    """
    result = EvaluationResult(
        workflow_id=workflow_output.get("workflow_id", "unknown"),
        patient_id=workflow_output.get("patient_id", "unknown"),
        procedure_code=workflow_output.get("procedure_code", "unknown"),
    )

    determination = workflow_output.get("final_determination", {})

    # 1. Output format compliance — does the output have required fields?
    required_fields = ["criteria_met", "rationale", "supporting_guidelines", "determination"]
    present_fields = sum(1 for f in required_fields if f in determination)
    result.output_format_compliance = present_fields / len(required_fields)

    # 2. Guideline citation presence: a proxy only. This counts citations; it
    #    does not verify that they are correct or relevant.
    citations = determination.get("supporting_guidelines", [])
    result.guideline_citation_accuracy = min(len(citations) / 3, 1.0)  # expect ≥ 3 citations

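    # 3. If ground truth is available, compare determination.
    #    Skip scoring when the ground truth has no "criteria_met" label, so a
    #    missing label is not counted as a match.
    if ground_truth and "criteria_met" in ground_truth:
        expected = ground_truth["criteria_met"]
        actual = determination.get("criteria_met")
        result.criteria_accuracy = 1.0 if expected == actual else 0.0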

    # 4. Overall score
    scores = [
        result.output_format_compliance,
        result.guideline_citation_accuracy,
    ]
    if result.criteria_accuracy is not None:
        scores.append(result.criteria_accuracy)

    result.overall_quality_score = sum(scores) / len(scores)

    # Quality gate: all required fields present AND citations present
    result.passed_quality_gate = (
        result.output_format_compliance >= 1.0
        and result.guideline_citation_accuracy >= 0.33  # At least 1 citation
    )

    return result


class EvaluationPipeline:
    """
    Continuous evaluation pipeline for production agent monitoring.
    In production: integrate with LangSmith Evaluation or RAGAS.
    """

    def __init__(self, quality_gate_threshold: float = 0.80):
        self.quality_gate_threshold = quality_gate_threshold
        self.evaluation_history: list[EvaluationResult] = []

    def evaluate_batch(
        self,
        workflow_outputs: list[dict],
        ground_truths: Optional[list[dict]] = None,
    ) -> dict:
        """Evaluate a batch of workflow outputs and return aggregate metrics."""
        if ground_truths is not None and len(ground_truths) != len(workflow_outputs):
            raise ValueError("ground_truths must align one-to-one with workflow_outputs")

        results = []
        for i, output in enumerate(workflow_outputs):
            gt = ground_truths[i] if ground_truths else None
            result = evaluate_prior_auth_output(output, gt)
            results.append(result)
            self.evaluation_history.append(result)

        pass_rate = sum(1 for r in results if r.passed_quality_gate) / max(len(results), 1)
        avg_quality = sum(r.overall_quality_score for r in results) / max(len(results), 1)

        return {
            "batch_size": len(results),
            "pass_rate": pass_rate,
            "avg_quality_score": avg_quality,
            "below_gate_count": sum(1 for r in results if not r.passed_quality_gate),
            "gate_threshold": self.quality_gate_threshold,
            # An empty batch carries no quality signal, so it does not raise an alert.
            "alert_required": bool(results) and avg_quality < self.quality_gate_threshold,
        }

Enterprise Considerations

Token cost observability. Token consumption drives LLM API costs at scale. Instrument every LLM call to capture input and output token counts. Aggregate by model, workflow, node, and time period. Alert when token usage per workflow exceeds expected ranges: unexpectedly high counts can indicate prompt injection, reasoning loops, or context window mismanagement.
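
A per-workflow token check might look like the following sketch. It assumes each LLM call is logged as an event with event_type "llm.response", a trace_id, and token counts; the function names and the threshold value used in the example are illustrative, not recommendations.

```python
from collections import defaultdict
from typing import Iterable


def token_totals_by_trace(events: Iterable[dict]) -> dict:
    """Sum input and output tokens across all LLM calls in each trace."""
    totals = defaultdict(int)
    for event in events:
        if event.get("event_type") != "llm.response":
            continue
        totals[event["trace_id"]] += (
            event.get("input_tokens", 0) + event.get("output_tokens", 0)
        )
    return dict(totals)


def traces_over_budget(totals: dict, alert_threshold: int) -> list:
    """Return trace IDs whose total token usage exceeds the alert threshold."""
    return sorted(tid for tid, used in totals.items() if used > alert_threshold)


llm_events = [
    {"event_type": "llm.response", "trace_id": "a", "input_tokens": 9000, "output_tokens": 1000},
    {"event_type": "llm.response", "trace_id": "a", "input_tokens": 8500, "output_tokens": 500},
    {"event_type": "llm.response", "trace_id": "b", "input_tokens": 4000, "output_tokens": 600},
    {"event_type": "tool.call", "trace_id": "b"},
]
totals = token_totals_by_trace(llm_events)
flagged = traces_over_budget(totals, alert_threshold=18_000)
```

The same grouping can be keyed by model or node name instead of trace ID to attribute cost to a specific step.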

Latency percentiles. Monitor p50, p95, and p99 latency for each node in the workflow. Long-tail latency (p99) often reveals specific code paths (large tool responses, complex re-ranking, high context length LLM calls) that degrade user experience under load. Per-node latency attribution is possible only when spans are properly structured.
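
Per-node percentiles can be computed directly from node completion events. The sketch below uses the nearest-rank method, which is one of several percentile definitions; monitoring backends often use interpolation or approximate sketches instead. The event keys and function names are assumptions for this example.

```python
import math
from collections import defaultdict
from typing import Iterable


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    if not values:
        raise ValueError("cannot take a percentile of an empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_node(events: Iterable[dict]) -> dict:
    """Group node.complete durations by node and report p50, p95, and p99."""
    durations = defaultdict(list)
    for event in events:
        if event.get("event_type") == "node.complete" and "duration_ms" in event:
            durations[event["node_name"]].append(event["duration_ms"])
    return {
        node: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for node, values in durations.items()
    }


node_events = [
    {"event_type": "node.complete", "node_name": "fetch_patient_data", "duration_ms": ms}
    for ms in range(100, 1100, 100)
] + [
    {"event_type": "node.complete", "node_name": "evaluate_criteria", "duration_ms": ms}
    for ms in (2000, 2100, 2200, 9000)
]
report = latency_by_node(node_events)
```

In this sample the single 9000 ms outlier dominates the p95 and p99 of evaluate_criteria while leaving its p50 untouched, which is the pattern that end-to-end averages hide.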

Evaluation dataset drift. The golden evaluation dataset used to validate the pipeline should reflect current patient population and guideline content. If the dataset is not updated when guidelines change, evaluation scores will diverge from real-world accuracy. Schedule quarterly dataset reviews.

Privacy in traces. Traces capture tool inputs and LLM inputs, which in clinical contexts contain PHI. Configure trace collectors to redact or mask PHI fields before transmission to observability platforms. Verify that your observability vendor has a signed BAA if traces will contain PHI.
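
One simple form of redaction is key-based masking applied before a payload leaves the application. The sketch below walks nested dictionaries and lists and masks values whose key appears in a deny list. The key names in PHI_KEYS are hypothetical and would need to match your own schemas. Key-based masking does not catch PHI embedded in free text such as prompts or clinical notes, which needs separate handling.

```python
# Hypothetical deny list; real systems should derive this from their data model.
PHI_KEYS = frozenset({"patient_name", "dob", "mrn", "ssn", "address", "phone"})
MASK = "[REDACTED]"


def redact(value, phi_keys=PHI_KEYS):
    """Return a copy of value with PHI-keyed fields masked at any depth."""
    if isinstance(value, dict):
        return {
            key: MASK
            if isinstance(key, str) and key.lower() in phi_keys
            else redact(item, phi_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, phi_keys) for item in value]
    return value


trace_input = {
    "tool_name": "get_patient_summary",
    "input": {
        "MRN": "123456",
        "procedure_code": "95810",
        "contacts": [{"patient_name": "Jane Doe", "relation": "self"}],
    },
}
safe_input = redact(trace_input)
```

The function returns a new structure rather than mutating its argument, so the unredacted payload remains available to the workflow itself.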

Security Considerations

Trace data classification. In clinical workflows, traces contain PHI. Treat traces with the same access controls as clinical records: role-based access, audit logging of trace access, and retention periods aligned with your organization's HIPAA and records-retention policies.

Observability as attack surface. An observability platform with weak access controls exposes complete agent reasoning traces, including prompts, tool inputs, and tool outputs. A compromised trace store is equivalent to a compromised EHR query log. Apply principle of least privilege: developers access anonymized or de-identified traces in non-production; only compliance officers access production traces with PHI.

Healthcare Example

Educational Example — Illustrative Observability Configuration. Not intended for clinical decision making.

A Reference Healthcare Organization's agentic prior authorization system implements the following observability configuration:

| Dimension | Implementation | Alert Threshold |
| --- | --- | --- |
| Trace store | LangSmith (BAA executed) | N/A |
| Compliance audit log | Immutable CloudWatch Logs with S3 export | Any HITL trigger |
| Quality gate | Automated eval pipeline; criteria accuracy ≥ 0.85 | Pass rate < 0.80 |
| Token budget | Per-workflow token cap = 20,000 | Per-workflow > 18,000 |
| HITL trigger rate | Target: 15–25% of requests | > 35% (regression alert) |
| Latency p95 | Target: 45 seconds end-to-end | > 90 seconds |

PHI is redacted from trace metadata before transmission to LangSmith. Patient ID is stored as a hashed reference — lookups require access to the hash mapping, which is in the compliance store, not the trace store.
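
The text does not say how the hashed reference is produced. One common option, shown here purely as an assumption, is a keyed hash (HMAC-SHA256) whose secret key lives alongside the compliance store. A keyed hash resists the dictionary attacks that a plain hash of a short identifier would allow. The function name, the "pt_" prefix, and the truncation length are illustrative choices.

```python
import hashlib
import hmac


def patient_reference(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable, non-reversible reference for use in trace metadata."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "pt_" + digest[:32]


# Demo key only; a real key would come from a secrets manager.
demo_key = b"demo-key-not-for-production"
reference = patient_reference("P12345", demo_key)
```

The same patient always maps to the same reference, so traces remain filterable by patient, while resolving a reference back to an identifier requires the mapping or the key held outside the trace store.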

Common Mistakes

Logging the LLM response text in application logs. Application logs should capture structured event metadata, not the full LLM response body. Full response content belongs in the trace store (LangSmith), not CloudWatch or Datadog. This keeps log storage costs manageable and PHI exposure scoped.

No per-node latency. Logging only end-to-end workflow latency makes it impossible to identify which node is causing slow p99 responses. Span-based tracing with per-node timing is necessary.

Evaluation only on pass/fail. Binary pass/fail metrics do not reveal which quality dimension is degrading. Track sub-scores (citation accuracy, format compliance, criteria accuracy) separately to enable targeted debugging.

No baseline quality metric. Without a pre-deployment quality baseline, you cannot detect regression. Establish baseline metrics before go-live and alert when they drop by more than a defined threshold (e.g., 5 percentage points).
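
A baseline comparison can be as small as the sketch below, which reports every sub-score that has dropped by more than an allowed margin. The metric names, the sample values, and the function name detect_regressions are illustrative assumptions.

```python
def detect_regressions(baseline: dict, current: dict, max_drop: float = 0.05) -> dict:
    """Return sub-scores whose drop from baseline exceeds max_drop.

    Metrics missing from either side are skipped rather than treated as zero.
    """
    regressions = {}
    for name, baseline_value in baseline.items():
        if name not in current:
            continue
        drop = baseline_value - current[name]
        if drop > max_drop:
            regressions[name] = round(drop, 4)
    return regressions


baseline_scores = {
    "format_compliance": 0.99,
    "citation_accuracy": 0.92,
    "criteria_accuracy": 0.90,
}
current_scores = {
    "format_compliance": 0.98,
    "citation_accuracy": 0.84,
    "criteria_accuracy": 0.88,
}
regressed = detect_regressions(baseline_scores, current_scores)
```

Here only citation accuracy crosses the five-point margin, so an alert can name the degraded dimension instead of reporting a generic quality drop.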

Best Practices

  • Structure traces as parent-child span hierarchies — one root span per workflow, child spans per node and tool call
  • Instrument every LLM call with token counts, model version, and latency
  • Separate compliance audit logs (immutable, clinical events) from operational logs (application events)
  • Mask or hash PHI in all observability data before transmission to third-party platforms
  • Establish quality gate thresholds before go-live — evaluate every batch and alert on threshold violations
  • Run evaluation on sampled production traces, not only on pre-deployment golden datasets
  • Track HITL trigger rate as a proxy for agent confidence drift; a rising rate can indicate model or data regression
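
For the practice of evaluating sampled production traces, one possible approach is deterministic sampling keyed on the trace ID, so that every service makes the same keep-or-drop decision for a given trace. This is a sketch of that idea under stated assumptions, not a description of how any particular platform samples; the function name and the trace ID format are hypothetical.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether a trace belongs to the evaluation sample."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over 0 .. 2**64 - 1
    return bucket < int(sample_rate * 2**64)


sampled_count = sum(should_sample(f"trace-{i}", 0.10) for i in range(10_000))
```

Because the decision depends only on the trace ID, re-running the evaluation job selects the same traces, which keeps week-over-week comparisons stable.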

Interview Questions

Q1: How do you detect a quality regression in a production agentic clinical workflow without manual case review?

Category: System Design | Difficulty: Senior | Role: AI Architect / ML Engineer

Answer Framework:

Quality regression detection requires three layers working together.

First, structured tracing with span-level instrumentation captures every workflow run's inputs, outputs, and sub-component results. This is the raw data layer — without it, regression detection is impossible.

Second, an automated evaluation pipeline runs on sampled production traces. It computes quality metrics (format compliance, citation accuracy, determination consistency relative to ground truth) for each sampled run and aggregates them by time window. A sliding-window quality score gives a near-real-time quality signal without requiring manual review of every case.

Third, alerting is configured on the aggregate quality score: if the rolling pass rate drops more than a threshold (e.g., 5 percentage points) below the established baseline, an alert fires. The alert includes which sub-scores dropped, enabling targeted investigation.
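
The rolling pass-rate alert described in the answer could be sketched as follows. The class name, window size, and baseline value are assumptions for illustration. The alert is suppressed until the window is full so that a few early failures do not trigger it.

```python
from collections import deque
from typing import Optional


class RollingPassRate:
    """Track the quality-gate pass rate over the most recent N evaluated runs."""

    def __init__(self, window_size: int, baseline: float, max_drop: float = 0.05):
        self._results = deque(maxlen=window_size)
        self.baseline = baseline
        self.max_drop = max_drop

    @property
    def pass_rate(self) -> Optional[float]:
        if not self._results:
            return None
        return sum(self._results) / len(self._results)

    @property
    def alert(self) -> bool:
        """True when the window is full and the rate has fallen too far below baseline."""
        if len(self._results) < self._results.maxlen:
            return False
        return (self.baseline - self.pass_rate) > self.max_drop

    def record(self, passed: bool) -> bool:
        """Add one evaluation outcome and return the current alert state."""
        self._results.append(passed)
        return self.alert


monitor = RollingPassRate(window_size=10, baseline=0.90)
for outcome in [True] * 9 + [False]:
    monitor.record(outcome)
alert_before = monitor.alert        # window full, rate equals baseline
alert_after = monitor.record(False)  # oldest pass drops out, rate falls to 0.8
```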

Key Points to Hit:

  • Automated evaluation on sampled production traces is the scalable path
  • Sub-dimension metrics (not just overall score) are necessary for targeted debugging
  • A pre-deployment baseline is required — without it, there is no regression reference
  • LangSmith's evaluation datasets and automatic evaluation runs address this in the LangChain ecosystem

Key Takeaways

  • The three observability pillars for agents are: traces (per-run), metrics (aggregated), and logs (structured events)
  • Trace spans should form a parent-child hierarchy: one root span per workflow, child spans per node and tool call
  • LangSmith integrates automatically with LangGraph when LANGCHAIN_TRACING_V2=true is set
  • Compliance audit logs in clinical systems must be immutable and separate from operational logs
  • Automated evaluation pipelines running on sampled production traces provide scalable quality monitoring
  • PHI must be masked or hashed before transmission to third-party observability platforms

Glossary

| Term | Definition |
| --- | --- |
| Trace | An end-to-end record of a single agent workflow execution, composed of nested spans |
| Span | A single unit of work within a trace (one LLM call, one tool call, one node execution) |
| Trace ID | Unique identifier linking all spans belonging to the same workflow execution |
| Evaluation pipeline | Automated system that applies quality metrics to agent outputs and aggregates results |
| Quality gate | A minimum quality score threshold; below it, alerts fire or deployments are blocked |
| Golden dataset | A curated set of inputs with known-correct expected outputs, used for offline evaluation |
| RAGAS | Retrieval-Augmented Generation Assessment, a framework for evaluating RAG and agent quality |
| LangSmith | LangChain's observability and evaluation platform for LLM applications |

Further Reading

In This Repository:

External References:

  • LangSmith documentation — official guide to tracing, evaluation, and datasets
  • OpenTelemetry specification — standard for vendor-neutral distributed tracing
  • RAGAS documentation — retrieval and generation evaluation metrics
