AI Observability and Monitoring

Conceptual Explanation

LLM observability operates across three layers that must work together to provide production confidence in clinical AI systems.

Inference-Level Observability: Capturing the metadata and content signals from individual inference calls: latency, token counts, model version, prompt version, and quality signals (confidence scores, output length, structured field completeness). This layer produces the raw data that higher-level analytics aggregate.
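As a rough illustration of what one inference-level record might hold, the sketch below collects the signals listed above into a single structure. The class and field names are assumptions made for this example, not a prescribed schema.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InferenceRecord:
    """One row of inference-level telemetry (illustrative field names)."""
    model_version: str
    prompt_version: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    output_length_chars: int
    required_fields_present: int
    required_fields_total: int
    confidence: Optional[float] = None  # not every model exposes a confidence score
    timestamp: float = field(default_factory=time.time)

    @property
    def field_completeness(self) -> float:
        """Fraction of required structured fields found in the output."""
        if self.required_fields_total == 0:
            return 1.0
        return self.required_fields_present / self.required_fields_total


record = InferenceRecord(
    model_version="model-2024-06",      # hypothetical version labels
    prompt_version="discharge-v3",
    latency_ms=840.0,
    input_tokens=1200,
    output_tokens=350,
    output_length_chars=1900,
    required_fields_present=3,
    required_fields_total=4,
)
```

Records like this are what the pipeline-level and quality-level layers would later join and aggregate.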

Pipeline-Level Observability: For multi-step workflows (RAG pipelines, agentic chains, tool-calling sequences), capturing the full execution trace: which retrieval calls were made, which context was included, which tools were invoked, how many retry attempts occurred. Without pipeline-level tracing, debugging a quality failure in a multi-step clinical workflow is nearly impossible.
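To make the idea of an execution trace concrete, here is a minimal sketch that uses only the Python standard library. It is a stand-in for a real tracing SDK such as OpenTelemetry, not a use of one; `PipelineTrace` and its span dictionaries are hypothetical names invented for this example.

```python
import time
import uuid
from contextlib import contextmanager


class PipelineTrace:
    """Collects one span per pipeline step under a shared trace ID."""

    def __init__(self, workflow: str):
        self.trace_id = uuid.uuid4().hex
        self.workflow = workflow
        self.spans = []

    @contextmanager
    def span(self, name: str, **attributes):
        record = {"name": name, "attributes": dict(attributes), "status": "ok"}
        start = time.perf_counter()
        try:
            yield record  # the step can attach more attributes while it runs
        except Exception as exc:
            record["status"] = "error"
            record["attributes"]["error_type"] = type(exc).__name__
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)


trace = PipelineTrace("discharge_summary")
with trace.span("retrieval", top_k=5) as s:
    s["attributes"]["doc_ids"] = ["doc-17", "doc-42"]  # which context was included
with trace.span("generation", attempt=1):
    pass  # the model call would go here
```

Each span records the step name, its attributes, its duration, and whether it failed, which is the minimum needed to reconstruct what a multi-step workflow actually did.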

Quality-Level Observability: Aggregating inference-level signals over time to detect statistical degradation. This layer answers the question: is the system performing as well today as it was last week? Quality metrics include automated scores (semantic similarity to reference outputs, structured field completeness, hallucination detection signals) and human evaluation samples from the clinical quality team.

Core Architecture

Common Mistakes

Monitoring Only Infrastructure Metrics. Latency and error rate dashboards are necessary but insufficient for AI systems. An LLM-based clinical AI can operate within all infrastructure SLAs while producing clinically degraded outputs. Quality metrics must be first-class monitoring signals.

Synchronous Quality Scoring. Running quality evaluation synchronously (inline with the inference call) adds latency to clinical workflows. Quality scoring must be asynchronous: compute it after the output is delivered and write results to the quality store for offline aggregation.
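One way to keep scoring off the critical path is to schedule it as a background task once the output is ready. The sketch below assumes an asyncio-based service; `generate`, `score_output`, and the in-memory `quality_store` are placeholders invented for this example, and a production system would more likely hand the work to a durable queue.

```python
import asyncio

# In-memory stand-in for a real quality store (hypothetical).
quality_store = []

REQUIRED_HEADERS = ("Diagnosis:", "Medications:", "Follow-up:")


async def generate(prompt: str) -> str:
    """Stub for the model call; a real service would call its inference API here."""
    await asyncio.sleep(0)
    return "Diagnosis: pneumonia\nMedications: amoxicillin"


async def score_output(output_id: str, text: str) -> None:
    """Placeholder scorer: fraction of required section headers present."""
    present = sum(1 for header in REQUIRED_HEADERS if header in text)
    quality_store.append(
        {"output_id": output_id, "score": present / len(REQUIRED_HEADERS)}
    )


async def handle_request(output_id: str, prompt: str, background: set) -> str:
    text = await generate(prompt)
    # Schedule scoring without awaiting it, so the caller receives the output first.
    task = asyncio.create_task(score_output(output_id, text))
    background.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return text
```

The `background` set is there because the event loop keeps only weak references to tasks; the caller is expected to drain it on shutdown so pending scores are not lost.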

Absolute Quality Thresholds Instead of Drift. Individual LLM output quality scores are noisy. Setting an alert for "any output below 0.75 quality" produces constant noise for a system that legitimately produces a distribution of quality scores. Alert on statistical drift (7-day average drops 5% from 30-day baseline) rather than individual output quality.
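The drift comparison can be sketched in a few lines. This example assumes one mean quality score per day, ordered oldest first, and assumes the 30-day baseline is the trailing 30 days including the most recent week; a team could equally exclude the recent window from the baseline. The 10% critical level is the figure quoted later in this chapter, and the function name is hypothetical.

```python
from statistics import mean


def drift_status(daily_scores, recent_days=7, baseline_days=30,
                 warn_drop=0.05, critical_drop=0.10):
    """Compare the recent rolling average against the longer baseline.

    daily_scores: daily mean quality scores, oldest first.
    Returns "ok", "warning", "critical", or "insufficient_data".
    """
    if len(daily_scores) < baseline_days:
        return "insufficient_data"
    baseline = mean(daily_scores[-baseline_days:])
    recent = mean(daily_scores[-recent_days:])
    if baseline <= 0:
        return "insufficient_data"
    relative_drop = (baseline - recent) / baseline
    if relative_drop >= critical_drop:
        return "critical"
    if relative_drop >= warn_drop:
        return "warning"
    return "ok"


stable = [0.90] * 30
dipping = [0.90] * 23 + [0.80] * 7   # roughly an 8.7% relative drop
falling = [0.90] * 23 + [0.70] * 7   # roughly an 18% relative drop
```

Because the comparison is relative to the system's own baseline, a single low-scoring output does not move the result, but a sustained week of lower scores does.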

No PHI Protection in Traces. Storing raw clinical inputs in trace data is a HIPAA violation if trace backends are not covered by a BAA and appropriate access controls. Hash or omit PHI from trace attributes at the instrumentation layer, not as a post-processing step.
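A minimal sketch of sanitizing attributes before they reach any trace backend follows. The key lists are hypothetical and would need to match the organization's own PHI inventory. The example uses a keyed hash (HMAC-SHA256) instead of a bare hash, on the assumption that identifiers such as record numbers are short enough to be guessed by brute force if hashed without a secret.

```python
import hashlib
import hmac

# Hypothetical attribute names; adapt to the actual PHI inventory.
HASH_KEYS = {"patient_id", "mrn", "encounter_id"}   # keep joinable, but hashed
DROP_KEYS = {"patient_name", "dob", "note_text"}    # never emit at all


def sanitize_attributes(attrs: dict, secret_key: bytes) -> dict:
    """Return a copy of attrs that is safe to attach to a trace span."""
    clean = {}
    for key, value in attrs.items():
        if key in DROP_KEYS:
            continue
        if key in HASH_KEYS:
            digest = hmac.new(
                secret_key, str(value).encode("utf-8"), hashlib.sha256
            ).hexdigest()
            clean[key + "_hash"] = digest
        else:
            clean[key] = value
    return clean


safe = sanitize_attributes(
    {"mrn": "12345678", "note_text": "raw clinical text", "model_version": "m-1"},
    secret_key=b"example-key-load-from-a-secret-manager",
)
```

Calling this inside the instrumentation wrapper means raw identifiers never leave the process, and the same patient still maps to the same hash for correlating traces.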

Best Practices

  • Define quality SLAs before the first production deployment: what quality score, measured how, must be maintained for the system to remain in production
  • Use OpenTelemetry as the instrumentation standard for all clinical AI workflows; it provides vendor flexibility for the trace backend
  • Route 1-3% of production outputs to a human review queue to maintain calibrated quality labels alongside automated scoring
  • Configure drift alerts on 7-day rolling averages relative to 30-day baselines, not on individual output scores
  • Build the quality scoring pipeline as shared platform infrastructure; all clinical AI use cases should feed the same scoring and drift detection system
  • Hash all PHI identifiers in trace attributes and review trace access controls before the first production inference
  • Run weekly quality reports to the clinical governance committee; do not rely on incident-driven quality communication
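The human review sampling in the list above could be implemented in several ways. One possible sketch is deterministic, hash-based sampling, so that the same output is always either in or out of the sample regardless of which service instance handles it. The function name and the 2% default are assumptions chosen from within the 1-3% range.

```python
import hashlib


def route_to_human_review(output_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministically select roughly sample_rate of outputs for review."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


selected = [
    oid for oid in (f"output-{i}" for i in range(10_000))
    if route_to_human_review(oid)
]
```

Random sampling with a fixed seed would also work; the hash-based variant is shown only because it needs no shared state.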

Alternatives

LangSmith (LangChain): A managed LLM observability platform with native support for LangChain and LangGraph traces, evaluation datasets, and human review workflows. Provides most of the pipeline-level and quality-level observability described in this chapter without custom infrastructure development. Suitable for organizations that have adopted LangChain/LangGraph as the orchestration framework.

Arize AI / Phoenix: Phoenix is the open-source LLM observability platform from Arize AI. It supports OpenTelemetry-compatible trace ingestion and provides built-in evaluation metrics including hallucination detection and semantic similarity scoring. Appropriate for organizations that require on-premises observability for PHI protection.

Weights & Biases (Weave): A general ML experiment tracking platform with LLM observability features. Strong for tracking quality metrics over model training iterations; less mature for production inference observability at clinical scale.

Trade-offs

Approach | Quality Visibility | PHI Risk | Implementation Cost | Vendor Lock-in
Infrastructure metrics only | None | None | Low | None
OTel + custom quality scorer | High | Managed | High | None
Managed LLM observability platform | High | Requires BAA review | Medium | Medium
Open-source (Phoenix) | Medium | On-premises | Medium | Low
Manual clinical audit only | Low (delayed) | None | Medium (clinical staff time) | None

Interview Questions

Q: How do you detect that a clinical AI system's output quality has degraded in production before a clinician reports a problem?

Category: Architecture | Difficulty: Senior | Role: AI Architect

Answer Framework:

Structure the answer around a four-layer detection approach: automated quality scoring, statistical drift detection, human review sampling, and behavioral signal monitoring.

Automated quality scoring: every production output is scored asynchronously by a quality evaluation pipeline. For clinical documentation AI, the scorer applies structural completeness checks (required fields present), clinical entity coverage (key diagnoses and medications mentioned in the source are reflected in the output), and NLI-based factual consistency (claims in the output are supported by the input context, not hallucinated).
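As a small illustration of the entity coverage check, the sketch below measures what fraction of source entities reappear in the output. It uses naive case-insensitive substring matching, which is an assumption made to keep the example short; a real scorer would rely on clinical entity extraction and normalization, and the NLI-based consistency check is not shown.

```python
def entity_coverage(source_entities, output_text: str) -> float:
    """Fraction of source entities (diagnoses, medications) mentioned in the output."""
    if not source_entities:
        return 1.0
    text = output_text.lower()
    found = [entity for entity in source_entities if entity.lower() in text]
    return len(found) / len(source_entities)


coverage = entity_coverage(
    ["metformin", "lisinopril", "hypertension", "type 2 diabetes"],
    "Patient with Type 2 diabetes and hypertension, continued on metformin.",
)
```

In this example the summary omits one of four source entities, so the coverage score flags a possible omission for closer review.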

Statistical drift detection: individual quality scores are aggregated into 7-day rolling averages and compared against the 30-day baseline. A drift threshold (typically 5% relative drop for warning, 10% for critical) triggers an alert before the cumulative impact becomes visible to clinicians.

Human review sampling: 1-3% of production outputs are routed to a clinical review queue where credentialed reviewers evaluate quality on a standardized rubric. This provides ground truth labels that calibrate the automated scorer and catches failure modes that automated scoring misses.

Behavioral signals: indirect quality indicators that surface in application-layer metrics, such as clinician edit rate (fraction of AI outputs modified before use), time-to-accept (if clinicians take significantly longer to review outputs, they are finding more problems), and explicit thumbs-up/down feedback where present.
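The behavioral signals can be computed from ordinary application events. The sketch below assumes each event records whether the clinician edited the output and how long acceptance took; the event shape and function name are hypothetical.

```python
from statistics import median


def behavioral_signals(events):
    """Summarize clinician behavior over a window of review events.

    events: list of dicts with "edited" (bool) and "seconds_to_accept" (float).
    Returns None when there are no events in the window.
    """
    if not events:
        return None
    edit_rate = sum(1 for e in events if e["edited"]) / len(events)
    return {
        "edit_rate": edit_rate,
        "median_seconds_to_accept": median(e["seconds_to_accept"] for e in events),
    }


signals = behavioral_signals([
    {"edited": False, "seconds_to_accept": 10.0},
    {"edited": True, "seconds_to_accept": 40.0},
    {"edited": False, "seconds_to_accept": 20.0},
    {"edited": False, "seconds_to_accept": 30.0},
])
```

These values are most useful when tracked over the same rolling windows as the automated quality scores, so a rise in edit rate can be lined up against a drop in scored quality.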

Key Points to Hit:

  • Automated quality scoring is asynchronous and does not add latency
  • Statistical drift (7-day vs. 30-day comparison) is more reliable than absolute thresholds
  • Human review at 1-3% sample rate provides ground truth calibration
  • Clinician edit rate is the most clinically meaningful behavioral signal

Q: What does a well-designed clinical AI SLA look like, and how does it differ from a traditional software SLA?

Category: Architecture | Difficulty: Senior | Role: AI Architect / Engineering Manager

Answer Framework:

A traditional software SLA covers three dimensions: availability (99.9% uptime), latency (p95 < 500ms), and error rate (< 0.1%). These are measurable from infrastructure metrics alone.

A clinical AI SLA must add a fourth dimension: quality. The quality SLA specifies that a defined fraction of AI outputs (say, 92% of discharge summaries) must score above a minimum quality threshold (say, 0.82 on the structured completeness scorer) over a 7-day rolling window. If the quality SLA is breached, the same incident response process that triggers for an availability breach applies.

Additionally, for generative clinical AI, the SLA may specify a human evaluation standard: at least 90% of sampled outputs in the human review queue must be rated "acceptable without major revision" by credentialed clinical reviewers. This provides a clinically interpretable SLA that governance committees can evaluate.

The SLA must also specify the hallucination rate threshold: the fraction of outputs in which the quality scorer detects a factually unsupported claim must remain below a defined rate (say, < 2%). Breaching this threshold triggers a governance review, not just an operations response.
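The quality and hallucination dimensions can be expressed as a simple check over one rolling window of scored outputs. This sketch reuses the example thresholds from the text (92% of outputs at or above 0.82, hallucination rate under 2%); treating "above" as "at or above" is an assumption, as are the function name and return shape.

```python
def evaluate_quality_sla(scores, hallucination_flags,
                         min_score=0.82, min_pass_fraction=0.92,
                         max_hallucination_rate=0.02):
    """Evaluate the quality SLA over one rolling window of scored outputs.

    scores: quality score per output in the window.
    hallucination_flags: one bool per output, True if an unsupported claim was detected.
    """
    if not scores or not hallucination_flags:
        raise ValueError("SLA window contains no scored outputs")
    pass_fraction = sum(1 for s in scores if s >= min_score) / len(scores)
    hallucination_rate = sum(1 for f in hallucination_flags if f) / len(hallucination_flags)
    hallucination_breach = hallucination_rate >= max_hallucination_rate
    return {
        "pass_fraction": pass_fraction,
        "hallucination_rate": hallucination_rate,
        "quality_breach": pass_fraction < min_pass_fraction,
        "hallucination_breach": hallucination_breach,
        # Per the text, a hallucination breach escalates beyond operations.
        "governance_review": hallucination_breach,
    }


healthy = evaluate_quality_sla(
    scores=[0.90] * 95 + [0.50] * 5,
    hallucination_flags=[True] * 1 + [False] * 99,
)
breached = evaluate_quality_sla(
    scores=[0.90] * 90 + [0.50] * 10,
    hallucination_flags=[True] * 3 + [False] * 97,
)
```

Running this on the same 7-day window used for drift detection keeps the SLA report and the alerting logic consistent with each other.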

Key Points to Hit:

  • Quality dimension is absent from traditional SLAs but essential for clinical AI
  • SLA thresholds are measured over rolling windows, not point-in-time
  • Human evaluation provides the clinically interpretable quality standard
  • Hallucination rate is a distinct SLA dimension for generative clinical AI
  • SLA breach triggers governance review, not just infrastructure incident response

Key Takeaways

  • LLM-based clinical AI can fail silently: degraded quality produces no infrastructure alerts, only statistical shifts in output quality that require dedicated monitoring to detect
  • Quality metrics must be first-class observability signals alongside latency and error rate
  • Asynchronous quality scoring keeps evaluation out of the critical path and enables comprehensive per-output assessment
  • Statistical drift detection (7-day vs. 30-day rolling comparison) is more reliable than absolute quality thresholds for detecting genuine degradation
  • PHI must never appear in trace attributes; hash identifiers at the instrumentation layer before they are emitted
  • Human review sampling at 1-3% provides calibrated ground truth that automated scoring alone cannot provide
  • Clinical AI SLAs must include a quality dimension, not just availability and latency