Evaluation and Benchmarking

Section: 01-AI-Foundations Status: COMPLETE Last Updated: 2026-06-30 Difficulty: Advanced


Trade-offs and Considerations

Evaluation Pitfalls

Pitfall Description Mitigation
Golden set contamination Golden set questions are similar to training/indexing data, producing inflated scores Ensure golden set is held out and not used for any development decisions
Metric-hacking Optimizing the system for the evaluation metric at the expense of overall quality Use multiple diverse metrics; supplement automated metrics with human evaluation
Stale golden set Golden set questions become easy over time as the system is tuned to them Quarterly golden set refresh with new questions from production
Ignoring edge cases Golden set represents easy, common queries — edge cases are underrepresented Deliberately add challenging, ambiguous, and edge case queries
RAGAS as a truth RAGAS scores are themselves estimated by an LLM — they have their own error rate Treat RAGAS as a proxy, not ground truth; validate against human scores periodically

Evaluation at Scale

At a Reference Healthcare Organization processing thousands of daily patient encounters, even a 1% sample generates a meaningful number of queries per day for evaluation. This creates a rich ongoing evaluation dataset but requires:

  • Automated logging of all AI queries, responses, and retrieved contexts
  • PHI stripping before logging (de-identify query and response)
  • Storage infrastructure for evaluation history (PostgreSQL + S3)
  • Dashboard tooling (LangSmith, Grafana) for trend visualization

Comparison Table

Evaluation Type Automated PHI-Safe Clinical Accuracy Cost Latency
RAGAS Yes Requires PHI handling Proxy measure Low (LLM cost) Hours
Safety rule checks Yes Yes (pattern matching) Partial Very Low Minutes
Latency monitoring Yes Yes No Very Low Real-time
Human clinical review No Requires de-identification High High (clinical time) Days
A/B testing Yes Requires PHI handling Behavioral signal Medium Ongoing

Interview Questions

Q1: How would you design a continuous evaluation system for a clinical AI system after initial deployment?

Category: Architecture / System Design Difficulty: Principal Role: AI Architect

Answer Framework:

Post-deployment evaluation has three layers: automated metrics, human sampling, and drift detection.

For automated metrics: every production query is logged (with PHI stripped) along with the AI response and the retrieved context chunks. Nightly, a RAGAS evaluation job samples a stratified subset (200 queries representative of query category distribution) and computes faithfulness, answer relevancy, context precision, and context recall. Safety rule checks run on all production outputs in real-time — dose recommendation violations and diagnostic statements trigger immediate alerts to the AI clinical review committee.

For human sampling: 25 de-identified query-response pairs are sent to the clinical informatics review panel weekly. Reviewers score on clinical accuracy (1–5) and clinical utility (1–5). Scores below 3.5 trigger a root cause analysis. This human signal is the ground truth that calibrates the automated metrics.

For drift detection: statistical process control on the automated metrics (CUSUM or EWMA control charts) detects gradual quality degradation that point-in-time comparisons miss. Alerts fire when cumulative drift exceeds 10% from the baseline established at deployment. Separately, monitor for distribution shift in query types — if the proportion of pharmacology questions suddenly increases, that may indicate the system is being used outside its intended scope, requiring targeted evaluation.

The governance mechanism: all evaluation results are reported monthly to the AI Clinical Review Committee. The committee has authority to suspend the AI system pending investigation if safety metrics are violated or quality drops below the pre-deployment gate thresholds.


Q2: What is the RAGAS "faithfulness" metric and why is it particularly important for clinical AI?

Category: Technical Depth Difficulty: Mid-Level Role: AI Architect / ML Engineer

Answer Framework:

Faithfulness measures whether the statements in the generated answer are supported by the retrieved context — it detects hallucination within a RAG system specifically. RAGAS computes it by extracting individual claims from the generated answer and checking each claim against the retrieved context using an LLM judge. A faithfulness score of 0.90 means 90% of the factual statements in the generated answers are supported by the retrieved context; 10% are generated from the model's parametric memory or invented.

In clinical AI, faithfulness is the most safety-critical metric for a specific reason: a hallucinated drug interaction or clinical fact that contradicts the retrieved evidence is the most dangerous failure mode. A RAG system with low faithfulness is actively dangerous — it is presenting generated clinical content as if it were drawn from authoritative sources, while the actual retrieved sources contradict or don't support the answer.

The practical threshold for clinical systems should be meaningfully higher than for general knowledge systems. For a general Q&A chatbot, faithfulness of 0.80 is acceptable. For a clinical RAG system that informs medication and treatment decisions, I would set the minimum acceptable faithfulness at 0.88–0.90 and treat anything below 0.85 as a system incident requiring investigation and potential suspension.

Faithfulness also has a specific failure mode that's worth understanding: the LLM may refuse to answer based on the retrieved context (correct behavior when context is insufficient) and instead answer from its own knowledge, producing a factually correct answer that nonetheless has low faithfulness. This is actually a prompt engineering problem, not a retrieval problem — the system prompt must explicitly instruct the model to say "insufficient context" rather than generating from parametric memory.