Evaluation and Benchmarking

Section: 01-AI-Foundations Status: COMPLETE Last Updated: 2026-06-30 Difficulty: Advanced


Executive Summary

AI systems that cannot be measured cannot be improved, governed, or trusted. Evaluation is the engineering discipline that answers "does this AI system actually work?" — not based on impressions or cherry-picked examples, but based on systematic measurement against defined quality criteria. In enterprise and clinical contexts, evaluation is also a governance requirement: before deploying an AI system that influences clinical decisions, you must demonstrate its quality. This chapter covers LLM evaluation frameworks, RAG-specific metrics (RAGAS), clinical AI evaluation, and the infrastructure needed to make evaluation continuous rather than a one-time gate.


Trade-offs and Considerations

Evaluation Pitfalls

Pitfall Description Mitigation
Golden set contamination Golden set questions are similar to training/indexing data, producing inflated scores Ensure golden set is held out and not used for any development decisions
Metric-hacking Optimizing the system for the evaluation metric at the expense of overall quality Use multiple diverse metrics; supplement automated metrics with human evaluation
Stale golden set Golden set questions become easy over time as the system is tuned to them Quarterly golden set refresh with new questions from production
Ignoring edge cases Golden set represents easy, common queries — edge cases are underrepresented Deliberately add challenging, ambiguous, and edge case queries
RAGAS as a truth RAGAS scores are themselves estimated by an LLM — they have their own error rate Treat RAGAS as a proxy, not ground truth; validate against human scores periodically

Evaluation at Scale

At a Reference Healthcare Organization processing thousands of daily patient encounters, even a 1% sample generates a meaningful number of queries per day for evaluation. This creates a rich ongoing evaluation dataset but requires:

  • Automated logging of all AI queries, responses, and retrieved contexts
  • PHI stripping before logging (de-identify query and response)
  • Storage infrastructure for evaluation history (PostgreSQL + S3)
  • Dashboard tooling (LangSmith, Grafana) for trend visualization

Interview Questions

Q1: How would you design a continuous evaluation system for a clinical AI system after initial deployment?

Category: Architecture / System Design Difficulty: Principal Role: AI Architect

Answer Framework:

Post-deployment evaluation has three layers: automated metrics, human sampling, and drift detection.

For automated metrics: every production query is logged (with PHI stripped) along with the AI response and the retrieved context chunks. Nightly, a RAGAS evaluation job samples a stratified subset (200 queries representative of query category distribution) and computes faithfulness, answer relevancy, context precision, and context recall. Safety rule checks run on all production outputs in real-time — dose recommendation violations and diagnostic statements trigger immediate alerts to the AI clinical review committee.

For human sampling: 25 de-identified query-response pairs are sent to the clinical informatics review panel weekly. Reviewers score on clinical accuracy (1–5) and clinical utility (1–5). Scores below 3.5 trigger a root cause analysis. This human signal is the ground truth that calibrates the automated metrics.

For drift detection: statistical process control on the automated metrics (CUSUM or EWMA control charts) detects gradual quality degradation that point-in-time comparisons miss. Alerts fire when cumulative drift exceeds 10% from the baseline established at deployment. Separately, monitor for distribution shift in query types — if the proportion of pharmacology questions suddenly increases, that may indicate the system is being used outside its intended scope, requiring targeted evaluation.

The governance mechanism: all evaluation results are reported monthly to the AI Clinical Review Committee. The committee has authority to suspend the AI system pending investigation if safety metrics are violated or quality drops below the pre-deployment gate thresholds.


Q2: What is the RAGAS "faithfulness" metric and why is it particularly important for clinical AI?

Category: Technical Depth Difficulty: Mid-Level Role: AI Architect / ML Engineer

Answer Framework:

Faithfulness measures whether the statements in the generated answer are supported by the retrieved context — it detects hallucination within a RAG system specifically. RAGAS computes it by extracting individual claims from the generated answer and checking each claim against the retrieved context using an LLM judge. A faithfulness score of 0.90 means 90% of the factual statements in the generated answers are supported by the retrieved context; 10% are generated from the model's parametric memory or invented.

In clinical AI, faithfulness is the most safety-critical metric for a specific reason: a hallucinated drug interaction or clinical fact that contradicts the retrieved evidence is the most dangerous failure mode. A RAG system with low faithfulness is actively dangerous — it is presenting generated clinical content as if it were drawn from authoritative sources, while the actual retrieved sources contradict or don't support the answer.

The practical threshold for clinical systems should be meaningfully higher than for general knowledge systems. For a general Q&A chatbot, faithfulness of 0.80 is acceptable. For a clinical RAG system that informs medication and treatment decisions, I would set the minimum acceptable faithfulness at 0.88–0.90 and treat anything below 0.85 as a system incident requiring investigation and potential suspension.

Faithfulness also has a specific failure mode that's worth understanding: the LLM may refuse to answer based on the retrieved context (correct behavior when context is insufficient) and instead answer from its own knowledge, producing a factually correct answer that nonetheless has low faithfulness. This is actually a prompt engineering problem, not a retrieval problem — the system prompt must explicitly instruct the model to say "insufficient context" rather than generating from parametric memory.


Further Reading

In This Repository: