AI Vendor Evaluation

Conceptual Explanation

Vendor evaluation in enterprise AI has two distinct phases that organizations frequently conflate:

Phase 1 — Qualification: Determining which vendors are eligible for clinical AI use. Eligibility requires a HIPAA Business Associate Agreement, acceptable data residency, acceptable data retention and training policies, and security posture compatible with the organization's risk tolerance. Vendors that do not meet qualification criteria are eliminated before technical evaluation begins.

Phase 2 — Selection: Among qualified vendors, comparing model capability, performance, cost, operational maturity, and strategic fit for the organization's specific use cases and infrastructure.

Conflating these phases, for example by comparing benchmark scores across vendors before confirming HIPAA qualification, is a common mistake. It can lead an organization to spend evaluation effort on, and even favor, a technically superior vendor that is legally ineligible to process clinical PHI.
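The ordering of the two phases can be sketched in Python as a filter that runs before any ranking. This is a minimal sketch, not a compliance tool: the `Vendor` fields, the `qualifies` and `select` helpers, and the vendor names and scores are all hypothetical, and real qualification depends on legal and security review rather than boolean flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vendor:
    # All fields are illustrative assumptions, not a standard schema.
    name: str
    has_signed_baa: bool
    data_residency_ok: bool
    retention_and_training_ok: bool
    security_posture_ok: bool
    capability_score: float  # Phase 2 input; ignored during qualification


def qualifies(v: Vendor) -> bool:
    """Phase 1: every criterion must hold; capability is not considered."""
    return (
        v.has_signed_baa
        and v.data_residency_ok
        and v.retention_and_training_ok
        and v.security_posture_ok
    )


def select(vendors: list[Vendor]) -> list[Vendor]:
    """Phase 2: rank only the vendors that passed Phase 1."""
    qualified = [v for v in vendors if qualifies(v)]
    return sorted(qualified, key=lambda v: v.capability_score, reverse=True)


# Hypothetical candidates: vendor_b scores highest but has no BAA.
vendors = [
    Vendor("vendor_a", True, True, True, True, capability_score=0.81),
    Vendor("vendor_b", False, True, True, True, capability_score=0.93),
    Vendor("vendor_c", True, True, True, True, capability_score=0.77),
]
ranked = [v.name for v in select(vendors)]
print(ranked)
```

In this example the highest-scoring vendor never reaches the ranking step, which is the point of gating: its capability score is irrelevant until it qualifies.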

Core Architecture

Common Mistakes

Evaluating on Generic Benchmarks. MMLU and HumanEval measure general knowledge and code generation; their scores do not predict clinical AI quality. Organizations that select vendors from public benchmark leaderboards without running use-case-specific evaluations may find that the "top benchmark" model underperforms on their own clinical documentation tasks.

Signing a BAA Without Reading Retention Clauses. Some enterprise BAAs include retention provisions that allow the vendor to retain inference content for 30 or 90 days for security monitoring purposes. Organizations that do not negotiate or review these clauses may inadvertently allow PHI retention exceeding their own data minimization policies.

Selecting a Vendor Before Confirming Rate Limits. Default API rate limits for enterprise AI vendors may be orders of magnitude below the rate required for production clinical AI at hospital scale. Confirm that the vendor can provision adequate rate limits, and include rate limit SLA commitments in the contract.
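A back-of-the-envelope capacity check makes the gap concrete. The sketch below uses hypothetical numbers throughout (request volume, tokens per request, peak factor, and the default limit); substitute the organization's own measurements and the vendor's published or contracted limits.

```python
def required_tokens_per_minute(
    requests_per_day: int,
    tokens_per_request: int,
    peak_factor: float,
    active_minutes_per_day: int = 24 * 60,
) -> float:
    """Peak tokens per minute, given average daily load and a peak multiplier."""
    average_tpm = requests_per_day * tokens_per_request / active_minutes_per_day
    return average_tpm * peak_factor


# Hypothetical workload: 36,000 requests/day, 4,000 tokens each,
# with peak traffic at 3x the daily average.
needed = required_tokens_per_minute(36_000, 4_000, peak_factor=3.0)

# Hypothetical default limit; real limits vary by vendor, model, and tier.
default_limit_tpm = 40_000

ratio = needed / default_limit_tpm
print(f"needed={needed:,.0f} TPM, default={default_limit_tpm:,} TPM, ratio={ratio:.1f}x")
```

A ratio above 1 means the default tier cannot carry peak load, and the required figure is the number to write into the contract as a rate limit minimum.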

Ignoring the Egress Vendor. For organizations running AI workloads on a cloud provider other than the AI vendor (e.g., running the application on AWS but calling the Anthropic API), cross-cloud data egress adds latency and cost. Evaluate total latency including network transit, not just model inference time.
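One way to capture total latency is to time each request from the caller's side and compare it with whatever inference time the vendor reports. The sketch below is an assumption-laden illustration: `measure_end_to_end` and `fake_vendor_call` are hypothetical names, the vendor call is simulated with a sleep, and not every vendor exposes a server-side inference time.

```python
import time
from typing import Callable, Tuple


def measure_end_to_end(call: Callable[[], float]) -> Tuple[float, float]:
    """Time a vendor call from the caller's side.

    `call` is a hypothetical function that performs one request and returns
    the inference time (in seconds) reported by the vendor, if available.
    Returns (total_seconds, overhead_seconds), where overhead is everything
    outside model inference: network transit, TLS, gateways, queuing.
    """
    start = time.perf_counter()
    inference_seconds = call()
    total_seconds = time.perf_counter() - start
    return total_seconds, max(total_seconds - inference_seconds, 0.0)


def fake_vendor_call() -> float:
    # Stand-in for a real API call: the sleep simulates transit plus inference.
    time.sleep(0.05)
    return 0.03  # pretend the vendor reported 30 ms of inference time


total, overhead = measure_end_to_end(fake_vendor_call)
print(f"total={total * 1000:.0f} ms, non-inference overhead={overhead * 1000:.0f} ms")
```

Running the same measurement from the production network location, rather than from a developer laptop, is what makes the overhead figure meaningful.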

Best Practices

  • Always complete the HIPAA qualification gate before beginning technical evaluation — eliminate non-qualifying vendors first
  • Run use-case-specific evaluations on de-identified clinical data rather than relying on public benchmarks
  • Model cost at production scale (millions of monthly tokens) before finalizing vendor selection
  • Negotiate BAA retention clauses explicitly: PHI should not be retained for training or for longer than operationally necessary
  • Confirm rate limit capacity for production scale before signing — include rate limit minimums in the contract
  • Design for multi-vendor through an AI gateway even if initially deploying single-vendor, to preserve optionality
  • Re-evaluate vendor selection annually — model capabilities, pricing, and compliance commitments evolve
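The cost-modeling practice in the list above comes down to simple arithmetic that is worth running before contract talks. The sketch below assumes a linear per-token price with separate input and output rates; every price and volume in it is hypothetical and chosen only to show the calculation.

```python
def monthly_cost_usd(
    requests_per_month: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Projected monthly spend from per-token list prices."""
    input_tokens = requests_per_month * input_tokens_per_request
    output_tokens = requests_per_month * output_tokens_per_request
    return (
        input_tokens / 1_000_000 * usd_per_million_input
        + output_tokens / 1_000_000 * usd_per_million_output
    )


# Hypothetical prices and volumes: 3,000 input and 500 output tokens per
# request, at $3 and $15 per million tokens respectively.
pilot = monthly_cost_usd(2_000, 3_000, 500, 3.0, 15.0)
production = monthly_cost_usd(1_000_000, 3_000, 500, 3.0, 15.0)
print(f"pilot: ${pilot:,.2f}/month, production: ${production:,.2f}/month")
```

A linear model like this ignores committed-use discounts and pricing tiers, so treat the production figure as a starting point for negotiation rather than a forecast.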

Alternatives

The evaluation in this chapter covers the four primary enterprise AI cloud and API vendors. Additional alternatives include:

  • Self-hosted open-source models (Llama 3, Mistral): Eliminate vendor dependency and data sharing, but require significant GPU infrastructure, ML engineering capacity, and an internal model evaluation and safety program. Appropriate for organizations with extreme data sensitivity or sovereignty requirements.
  • Cohere Enterprise: Strong retrieval and embeddings focus; evaluate for RAG-heavy use cases.
  • Specialized clinical AI vendors (Nuance, Ambient AI): Domain-specific clinical AI products that sit on top of foundation models. Not general-purpose LLM vendors; evaluated separately as point solutions.

Trade-offs

Dimension                  | Direct API (Anthropic) | Cloud-Integrated (Azure OAI, Bedrock) | Self-Hosted
Latest model access        | Immediate              | Delayed (cloud integration lag)       | Depends on open-source ecosystem
HIPAA BAA                  | Direct BAA required    | Covered by cloud provider HIPAA BAA   | Not applicable (no third party)
Existing cloud integration | Additional vendor      | Unified with cloud platform           | Native
Rate limits                | Negotiate directly     | Cloud provider quotas                 | Limited by own GPU capacity
Pricing control            | Direct negotiation     | Cloud commitment tiers                | GPU infrastructure cost
Lock-in risk               | Medium                 | Medium-High (cloud ecosystem)         | Low (model portability)
Operational complexity     | Low                    | Low                                   | High

Interview Questions

Q: A hospital CISO asks you to explain what due diligence is required before allowing PHI to be sent to an external LLM API. What is the minimum required?

Category: Architecture / Compliance | Difficulty: Senior | Role: AI Architect / FDE

Answer Framework:

The minimum required before PHI can be transmitted to an external LLM API is a signed HIPAA Business Associate Agreement with the vendor. Without a BAA, transmitting PHI to an external service is a potential HIPAA violation regardless of the vendor's security posture.

Beyond the BAA, the due diligence checklist includes: confirming the vendor's data retention policy (PHI should not be retained for training purposes), reviewing data residency to confirm PHI stays in acceptable regions, and assessing the vendor under the organization's risk management program (reviewing SOC 2 Type II or equivalent security certifications).

For an AI vendor specifically: confirm that inference requests are not used to train the model by default, and that the organization can request deletion of any retained inference content. These provisions may require explicit negotiation — they are not universally included in default enterprise terms.

Key Points to Hit:

  • BAA is a legal prerequisite, not optional
  • Model training opt-out is an AI-specific HIPAA concern not present in traditional vendor relationships
  • Data residency matters if the organization has state-level regulations beyond HIPAA
  • SOC 2 Type II is the baseline security certification to require

Q: How would you design a proof-of-concept to compare two LLM vendors for a clinical documentation use case?

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

A rigorous POC requires three elements: a representative dataset, an objective evaluation rubric, and identical conditions across vendors.

The dataset must consist of de-identified clinical cases representative of the production use case, not synthetic examples, on which models may perform differently than on real clinical text. For discharge summary generation, this means real (de-identified) encounter data with a complexity distribution similar to what production will process.

The evaluation rubric must be defined before running the POC, not after seeing the results. For clinical documentation, the rubric covers: required section presence, format compliance, clinical accuracy (ideally reviewed by a clinical panel rather than automated scoring alone), and hallucination rate (presence of clinically plausible but factually unsupported claims).
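The mechanical parts of such a rubric, such as required section presence, can be automated and fixed in code before any vendor output is seen. The sketch below is illustrative only: the section names, the `missing_sections` helper, and the "Heading:" line convention are assumptions, and the sample text is invented. Clinical accuracy and hallucination review still need a clinical panel.

```python
import re

# Hypothetical required sections for a discharge summary; agree on the real
# list with clinical stakeholders before the POC starts.
REQUIRED_SECTIONS = (
    "Admission Diagnosis",
    "Hospital Course",
    "Discharge Medications",
    "Follow-up Instructions",
)


def missing_sections(document: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return required headings that never appear as 'Heading:' at a line start."""
    missing = []
    for heading in required:
        pattern = rf"^\s*{re.escape(heading)}\s*:"
        if not re.search(pattern, document, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(heading)
    return missing


# Invented sample output, not real clinical data.
sample = """Admission Diagnosis: Community-acquired pneumonia
Hospital Course: Treated with IV antibiotics, improved by day 3.
Discharge Medications: Amoxicillin-clavulanate
"""
print(missing_sections(sample))
```

Committing a check like this to version control before the first vendor run gives a timestamped record that the rubric was not adjusted after seeing results.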

Identical conditions mean the same system prompt (or, if testing vendor-specific prompt optimization, clearly documented prompt differences), the same input context format, runs at the same time of day to avoid diurnal latency variation, and a sample size sufficient for statistical significance (typically 100+ cases for use cases with binary pass/fail criteria).
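To see why sample size matters for binary pass/fail criteria, it helps to compute a confidence interval around the observed pass rate. The sketch below uses the Wilson score interval, which is one common choice and not necessarily the method the evaluation team will adopt; the pass counts are hypothetical.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical POC results: the same 90% observed pass rate at two sample sizes.
for passes, n in [(18, 20), (90, 100)]:
    low, high = wilson_interval(passes, n)
    print(f"n={n}: pass rate {passes / n:.0%}, 95% interval {low:.2f} to {high:.2f}")
```

The interval at 20 cases is much wider than at 100, so two vendors with similar observed pass rates on a small sample cannot be reliably distinguished.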

Measure: latency (P50 and P95), cost per request, and quality scores. Report all three together — a cheaper, slower vendor may still be the right choice for asynchronous workflows where latency is not the constraint.
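A small summary function keeps the three measurements side by side for each vendor. This is a sketch under stated assumptions: `RunResult` and `summarize` are hypothetical names, the data is synthetic, and the percentile method (`statistics.quantiles` with the inclusive method) is one reasonable choice among several.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunResult:
    # One POC request; field names are illustrative.
    latency_s: float
    cost_usd: float
    passed_rubric: bool


def summarize(results: list[RunResult]) -> dict[str, float]:
    """Report latency percentiles, mean cost, and pass rate together."""
    latencies = [r.latency_s for r in results]
    # quantiles(n=100) returns 99 cut points; index 49 is P50, index 94 is P95.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "p50_latency_s": cuts[49],
        "p95_latency_s": cuts[94],
        "mean_cost_usd": statistics.fmean(r.cost_usd for r in results),
        "pass_rate": sum(r.passed_rubric for r in results) / len(results),
    }


# Synthetic results for one vendor: 100 requests, every tenth one fails.
results = [
    RunResult(latency_s=i / 10, cost_usd=0.02, passed_rubric=(i % 10 != 0))
    for i in range(1, 101)
]
summary = summarize(results)
print(summary)
```

Producing the same dictionary for each vendor makes it straightforward to present latency, cost, and quality in a single comparison table.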

Key Points to Hit:

  • De-identified real data, not synthetic
  • Rubric defined before running (avoid confirmation bias)
  • Identical conditions across vendors
  • Sample size sufficient for statistical significance
  • Measure latency, cost, AND quality

Key Takeaways

  • Vendor qualification (HIPAA BAA, data residency, training opt-out) must precede technical evaluation — ineligible vendors should not consume evaluation resources
  • A signed HIPAA BAA with the AI vendor is a legal prerequisite before PHI can be transmitted to any external LLM API
  • Confirm explicitly that inference requests are excluded from model training — this is not universally true by default
  • Use-case-specific evaluations on de-identified clinical data predict clinical AI quality better than public benchmarks
  • Model cost at production scale, not at current volume — per-token economics change substantially with high request volumes
  • Design for multi-vendor strategy through an AI gateway even when starting with a single vendor
  • Negotiate rate limits, pricing tiers, and BAA retention clauses before signing — post-signature negotiating leverage is minimal