AI Vendor Evaluation
Executive Summary
Selecting an AI vendor for a healthcare enterprise is a decision with a 3–5 year operational horizon, HIPAA liability consequences, and technical lock-in risks that are difficult to reverse once clinical systems depend on a vendor's proprietary infrastructure. The decision is not primarily about which model currently benchmarks highest — model capabilities change quarterly. It is about which vendor relationship, contractual structure, data handling commitment, and platform architecture best match the organization's operational requirements, compliance obligations, and strategic direction. This chapter provides an evaluation framework for the four primary enterprise AI providers — Anthropic, Azure OpenAI Service, Google Vertex AI, and AWS Bedrock — applied to a healthcare context.
Learning Objectives
After reading this chapter, you will be able to:
- Apply a structured evaluation framework to compare enterprise AI vendors across contractual, technical, and operational dimensions
- Identify the HIPAA-specific evaluation criteria that distinguish clinical AI vendor selection from general enterprise AI vendor selection
- Design a proof-of-concept evaluation that produces comparable, objective results across multiple vendors
- Articulate the trade-offs between single-vendor and multi-vendor AI strategies at enterprise scale
Business Problem
An enterprise that commits to a single AI vendor without evaluating alternatives, or that evaluates vendors only on model benchmark scores, frequently discovers 18 months into deployment that: the vendor's HIPAA Business Associate Agreement contains data handling provisions that are incompatible with the organization's compliance requirements; the proprietary SDK and API patterns have created a migration cost that makes switching economically infeasible; the model tier required for clinical quality is priced above budget at scale; or the vendor's operational SLA does not meet the availability requirement for a clinical decision support workflow.
These failures do not mean the chosen vendor was technically inferior; they result from evaluating the wrong dimensions at selection time. A sound evaluation framework weighs the vendor relationship and contractual structure alongside model capabilities.
Why This Technology Exists
Enterprise AI vendors emerged from two different directions: foundation model developers that built enterprise commercial offerings around their research models (Anthropic, OpenAI), and cloud platform providers that integrated foundation model access into their existing enterprise cloud and compliance infrastructure (Microsoft Azure, Google Cloud, Amazon Web Services).
The distinction matters for healthcare organizations. Foundation model developers offer direct access to their latest models with the most rapid capability releases, but require the organization to build compliance infrastructure (BAA, PHI handling, audit logging) on top of the vendor's API. Cloud platform providers offer model access within existing HIPAA-compliant cloud frameworks, with pre-existing BAAs that cover the AI services alongside storage, compute, and networking — but the integrated model catalog is subject to the cloud provider's release schedule, which may lag the foundation model developer's direct API.
Conceptual Explanation
Vendor evaluation in enterprise AI has two distinct phases that organizations frequently conflate:
Phase 1 — Qualification: Determining which vendors are eligible for clinical AI use. Eligibility requires a HIPAA Business Associate Agreement, acceptable data residency, acceptable data retention and training policies, and security posture compatible with the organization's risk tolerance. Vendors that do not meet qualification criteria are eliminated before technical evaluation begins.
Phase 2 — Selection: Among qualified vendors, comparing model capability, performance, cost, operational maturity, and strategic fit for the organization's specific use cases and infrastructure.
Conflating these phases, for example by comparing benchmark scores across vendors before confirming HIPAA qualification, is a common mistake: the organization can end up favoring a technically superior vendor that is legally ineligible to process clinical PHI.
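The two-phase split can be made explicit in tooling so that no vendor reaches technical evaluation without passing qualification. The following is a minimal sketch, not a complete compliance check: the `VendorQualification` fields, the `max_retention_days` threshold, and the vendor names are illustrative assumptions, and the real inputs would come from contract and legal review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorQualification:
    """Phase 1 facts about one vendor (hypothetical field names)."""
    vendor: str
    baa_signed: bool
    supports_required_regions: bool
    excluded_from_training_by_default: bool
    retention_days: int  # how long inference content is retained


def qualification_failures(v: VendorQualification, max_retention_days: int) -> list[str]:
    """Return the failed criteria; an empty list means the vendor qualifies."""
    failures = []
    if not v.baa_signed:
        failures.append("no HIPAA BAA")
    if not v.supports_required_regions:
        failures.append("required regions not supported")
    if not v.excluded_from_training_by_default:
        failures.append("inference data not excluded from training by default")
    if v.retention_days > max_retention_days:
        failures.append(
            f"retention of {v.retention_days} days exceeds policy of {max_retention_days}"
        )
    return failures


def qualified_vendors(candidates: list[VendorQualification], max_retention_days: int) -> list[str]:
    """Phase 1 gate: only vendors with no failures proceed to Phase 2."""
    return [
        v.vendor for v in candidates
        if not qualification_failures(v, max_retention_days)
    ]


# Hypothetical candidates, for illustration only
vendor_a = VendorQualification("Vendor A", True, True, True, retention_days=7)
vendor_b = VendorQualification("Vendor B", True, True, False, retention_days=90)
print(qualified_vendors([vendor_a, vendor_b], max_retention_days=30))
```

Returning the list of failed criteria, rather than a bare boolean, leaves a record of why a vendor was eliminated, which is useful when the decision is revisited.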
Core Architecture
Components
HIPAA Business Associate Agreement (BAA)
A BAA is the contractual instrument that governs how a vendor may process, store, and transmit Protected Health Information (PHI) on behalf of a Covered Entity. For clinical AI use cases that process PHI, a signed BAA with the AI vendor is a legal prerequisite, not a preference.
Key BAA provisions to evaluate:
- PHI data retention: How long does the vendor retain inference request content? Acceptable: short retention (hours to days) for debugging, no retention for training.
- Training opt-out: By default, some AI vendors use API request content to improve their models. Healthcare organizations must confirm that clinical inference data is excluded from training by default, not by opt-out.
- Subprocessors: The BAA should disclose and govern all subprocessors (third parties the vendor may share PHI with for service delivery).
- Breach notification: The BAA must specify notification timelines in the event of a security incident.
- Audit rights: The organization should have the right to audit or request evidence of the vendor's security controls.
Data Residency
HIPAA does not prescribe data residency requirements, but healthcare organizations subject to state-level health data laws, or with multi-national operations subject to GDPR or local healthcare data laws, may face residency requirements that constrain which cloud regions can process PHI. Evaluate whether the vendor supports inference in the required regions.
Model Capability Evaluation
Benchmark scores on standard datasets (MMLU, HellaSwag, HumanEval) are necessary but insufficient for clinical AI vendor selection. They measure general knowledge, commonsense inference, and code generation, not clinical reasoning. The relevant question is how each vendor's model performs on the organization's specific clinical use cases, with its own data.
Clinical capability dimensions:
- Clinical reasoning accuracy: Does the model reason correctly about clinical scenarios relevant to the use case?
- Hallucination rate: Does the model generate clinically plausible but factually incorrect claims?
- Citation adherence: When grounded in clinical guidelines, does the model accurately apply them?
- Format compliance: Does the model reliably produce structured output (discharge summaries, prior auth forms) in the required format?
- Instruction following: Does the model consistently follow clinical documentation constraints?
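One way to make the dimensions above comparable across vendors is to fix a weighting and a minimum acceptable score before any results are seen. The sketch below assumes each dimension has already been scored between 0 and 1 (by panel review or automated checks); the weights, the `floor` value, and the dimension keys are hypothetical and would be set by the organization's clinical reviewers.

```python
from typing import Optional

# Hypothetical weights; a real rubric would be agreed with the clinical panel.
DIMENSION_WEIGHTS = {
    "clinical_reasoning": 0.35,
    "hallucination_free": 0.25,
    "citation_adherence": 0.15,
    "format_compliance": 0.15,
    "instruction_following": 0.10,
}


def rubric_score(
    scores: dict[str, float],
    weights: dict[str, float] = DIMENSION_WEIGHTS,
    floor: float = 0.8,
) -> Optional[float]:
    """
    Combine per-dimension scores (0.0 to 1.0) into a weighted average.
    Returns None when any dimension falls below `floor`, so that a strong
    average cannot mask a failing dimension.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    if any(scores[dim] < floor for dim in weights):
        return None
    total_weight = sum(weights.values())
    return sum(weights[dim] * scores[dim] for dim in weights) / total_weight
```

The hard floor reflects a design choice: for clinical documentation, a high hallucination rate is arguably disqualifying no matter how well the model formats its output.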
Implementation Patterns
The Parallel POC Framework
A rigorous vendor evaluation runs a parallel proof-of-concept on a representative subset of a real clinical use case. The following pattern provides comparable, objective results:
```python
# Educational Example: Parallel Vendor Evaluation Framework
# Illustrates how to run the same clinical evaluation across multiple LLM vendors
# Educational disclaimer: Not intended for clinical use
from dataclasses import dataclass
from typing import Callable, Optional
import time


@dataclass
class ClinicalEvalCase:
    """A single clinical evaluation case with expected output criteria."""
    case_id: str
    use_case: str  # e.g., "discharge_summary"
    input_context: str  # Structured clinical context (de-identified)
    required_sections: list[str]
    forbidden_content: list[str]
    reference_output: Optional[str] = None


@dataclass
class VendorEvalResult:
    """Result of evaluating one case against one vendor's model."""
    case_id: str
    vendor: str
    model_id: str
    output: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    sections_present: bool
    forbidden_content_absent: bool
    passed: bool


def evaluate_vendor_on_case(
    vendor_name: str,
    model_id: str,
    inference_fn: Callable[[str, str], tuple[str, int, int]],
    case: ClinicalEvalCase,
    system_prompt: str,
) -> VendorEvalResult:
    """
    Run a single evaluation case against a specific vendor's model.

    inference_fn: (system_prompt, user_prompt) -> (output_text, input_tokens, output_tokens)
    """
    # Time only the inference call so latency is comparable across vendors
    start = time.monotonic()
    output, input_tokens, output_tokens = inference_fn(system_prompt, case.input_context)
    latency_ms = (time.monotonic() - start) * 1000

    # Case-insensitive substring checks against the rubric
    output_lower = output.lower()
    sections_present = all(
        section.lower() in output_lower
        for section in case.required_sections
    )
    forbidden_content_absent = not any(
        phrase.lower() in output_lower
        for phrase in case.forbidden_content
    )

    return VendorEvalResult(
        case_id=case.case_id,
        vendor=vendor_name,
        model_id=model_id,
        output=output,
        latency_ms=latency_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        sections_present=sections_present,
        forbidden_content_absent=forbidden_content_absent,
        passed=sections_present and forbidden_content_absent,
    )
```
Cost Modeling at Scale
Point-in-time pricing comparisons are unreliable for vendor selection because AI API pricing changes frequently. More useful is a cost model that applies the vendor's current pricing structure to production-representative call patterns:
```python
# Educational Example: Vendor Cost Model
# Illustrative: verify current pricing in vendor documentation
from dataclasses import dataclass


@dataclass
class VendorCostModel:
    """
    Cost model for a single LLM vendor at projected production volume.
    All prices illustrative; verify current rates in vendor documentation.
    """
    vendor: str
    model_tier: str
    input_price_per_million_tokens: float  # $ per 1M input tokens
    output_price_per_million_tokens: float  # $ per 1M output tokens
    cache_read_price_per_million_tokens: float

    def monthly_cost(
        self,
        daily_requests: int,
        avg_input_tokens: int,
        avg_output_tokens: int,
        cache_hit_rate: float = 0.0,
    ) -> float:
        """Estimate monthly cost for a given usage pattern."""
        monthly_requests = daily_requests * 30

        # Split input tokens into the cached and uncached share per request
        uncached_input = avg_input_tokens * (1 - cache_hit_rate)
        cached_input = avg_input_tokens * cache_hit_rate

        input_cost = (
            monthly_requests * uncached_input / 1_000_000
            * self.input_price_per_million_tokens
        )
        cache_cost = (
            monthly_requests * cached_input / 1_000_000
            * self.cache_read_price_per_million_tokens
        )
        output_cost = (
            monthly_requests * avg_output_tokens / 1_000_000
            * self.output_price_per_million_tokens
        )
        return input_cost + cache_cost + output_cost
```
Enterprise Considerations
Multi-Vendor Strategy: A multi-vendor approach that uses one vendor for primary inference and a second for fallback provides resilience against vendor outages, negotiation leverage at contract renewal, and flexibility to route different use cases to the most cost-effective model for that use case. The cost is integration complexity. The AI gateway described in Chapter 6 is the enabling infrastructure for multi-vendor strategy — without gateway abstraction, multi-vendor quickly becomes an unmaintainable integration sprawl.
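The primary-plus-fallback pattern described above can be sketched in a few lines. This is a simplified illustration, not the Chapter 6 gateway design: the `InferenceFn` signature and the two stub functions are hypothetical stand-ins for real vendor clients, and a production gateway would catch vendor-specific error types and apply timeouts rather than catching every exception.

```python
from typing import Callable, Sequence

# (system_prompt, user_prompt) -> output text; hypothetical common signature
InferenceFn = Callable[[str, str], str]


def call_with_fallback(
    providers: Sequence[tuple[str, InferenceFn]],
    system_prompt: str,
    user_prompt: str,
) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, output) from the first success."""
    errors = []
    for name, inference_fn in providers:
        try:
            return name, inference_fn(system_prompt, user_prompt)
        except Exception as exc:  # simplification: real code would narrow this
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub clients standing in for real vendor SDK calls
def primary_stub(system_prompt: str, user_prompt: str) -> str:
    raise TimeoutError("primary vendor unavailable")


def fallback_stub(system_prompt: str, user_prompt: str) -> str:
    return "draft discharge summary"


served_by, output = call_with_fallback(
    [("primary", primary_stub), ("fallback", fallback_stub)],
    system_prompt="You are a clinical documentation assistant.",
    user_prompt="De-identified encounter context goes here.",
)
```

Returning the provider name alongside the output matters for audit logging: the record of which vendor processed a given request should not be lost when a fallback occurs. Every provider in the list must also have passed the qualification gate, since a fallback call transmits the same PHI.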
Contract Negotiation at Enterprise Scale: Enterprise healthcare organizations deploying AI at scale (millions of monthly tokens across multiple use cases) have negotiating leverage. Committed spend agreements can reduce effective per-token pricing significantly. Negotiate before committing to a vendor, not after. Key terms to negotiate: enterprise discount tiers, rate limit increases, dedicated capacity guarantees, and contractual SLA commitments.
API Stability: Model versions are deprecated, APIs change, SDK breaking changes are introduced. Evaluate the vendor's history of API stability, deprecation notice periods, and backward compatibility commitments. For clinical AI, a model version being deprecated without adequate notice is an operational risk — the successor model must be evaluated for clinical quality before it goes into production.
Support Model: Enterprise healthcare organizations require support SLA commitments beyond consumer-tier response times. Evaluate: What is the vendor's enterprise support tier? What is the committed response time for production-impacting incidents? Is a dedicated technical account manager available?
Security Considerations
- PHI handling in inference requests: Confirm in writing that the vendor does not use inference request content for model training. This is a HIPAA privacy requirement, not a preference.
- Encryption at rest and in transit: Confirm that inference requests are encrypted in transit (TLS 1.2+) and that any persisted data (for debugging, logging) is encrypted at rest with keys the organization can control or audit.
- Logging and audit trail: Confirm the vendor provides per-request audit logs with sufficient metadata for HIPAA audit purposes.
- Penetration testing and certifications: Evaluate vendor security certifications (SOC 2 Type II, ISO 27001, HITRUST) and the recency of penetration testing.
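The TLS 1.2+ requirement above can also be enforced on the client side, for any HTTP client the organization writes itself. The sketch below uses only Python's standard `ssl` module; vendor SDKs generally manage their own HTTP transport, so whether and how a custom context can be supplied to a given SDK is something to confirm in that SDK's documentation.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() enables certificate and hostname verification
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_tls_context()
```

The resulting context can be passed to standard-library clients, for example `http.client.HTTPSConnection(host, context=context)`.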
Healthcare Example
Educational Example — Illustrative Workflow. Not intended for clinical decision making.
The Reference Healthcare Organization evaluates three qualified vendors for its AI platform — Anthropic API, Azure OpenAI Service, and AWS Bedrock — against a discharge summary generation use case. The organization has already signed a BAA with Microsoft Azure for its EHR infrastructure; the question is whether to extend that BAA to Azure OpenAI Service or establish a direct relationship with Anthropic and/or AWS.
Qualification gate results:
| Criterion | Anthropic API | Azure OpenAI Service | AWS Bedrock |
|---|---|---|---|
| HIPAA BAA available | Yes | Yes (via Azure HIPAA BAA) | Yes (via AWS HIPAA BAA) |
| Data residency (US) | Yes | Yes | Yes |
| PHI excluded from training | Yes (by default for API) | Yes | Yes |
| SOC 2 Type II | Yes | Yes | Yes |
| Qualified | Yes | Yes | Yes |
Phase 2 evaluation results (discharge summary use case, de-identified test set, 100 cases):
| Dimension | Anthropic (Claude Opus) | Azure OAI (GPT-4o) | AWS Bedrock (Claude Sonnet) |
|---|---|---|---|
| Section completion rate | 97% | 92% | 95% |
| Clinical accuracy (panel review) | 4.6/5.0 | 4.3/5.0 | 4.5/5.0 |
| P50 latency | 2.1s | 2.4s | 2.3s |
| P95 latency | 5.8s | 7.1s | 6.4s |
| Format compliance | 96% | 89% | 94% |
| Illustrative cost per 1K requests | $X.XX | $X.XX | $X.XX (see vendor docs) |
All pricing figures are illustrative. Verify current pricing in vendor documentation before cost modeling.
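The cost-per-1K-requests row can be derived from each vendor's per-token prices and the average token counts measured during the POC. The sketch below shows the arithmetic only; the token counts and prices in the example call are made-up round numbers, not any vendor's actual rates.

```python
def cost_per_thousand_requests(
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost in dollars of 1,000 requests at the given average token counts."""
    # Dollars per request, scaled by one million
    per_request_scaled = (
        avg_input_tokens * input_price_per_million
        + avg_output_tokens * output_price_per_million
    )
    return per_request_scaled * 1_000 / 1_000_000


# Hypothetical inputs: 3,000 input tokens and 800 output tokens per request,
# at $2.00 per 1M input tokens and $10.00 per 1M output tokens.
example_cost = cost_per_thousand_requests(3_000, 800, 2.00, 10.00)
print(f"${example_cost:.2f} per 1K requests")  # $14.00 with these made-up inputs
```

Note that output tokens dominate the total in this example even though there are fewer of them, which is why average output length should be measured during the POC rather than estimated.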
Strategic fit assessment:
The organization already uses Azure for EHR hosting, Active Directory, and storage. Azure OpenAI Service would allow the AI service to operate under the existing enterprise HIPAA BAA and Azure RBAC. However, the Anthropic API configuration scored highest on clinical accuracy in panel review (4.6/5.0, versus 4.3/5.0 for Azure OpenAI Service), along with the highest section completion and format compliance rates, and the organization's AI governance policy requires the highest-quality model available for clinical documentation. The final decision: use Anthropic API (with a direct BAA) as the primary inference endpoint for Tier 1 clinical use cases, and Azure OpenAI Service for Tier 2 (administrative) use cases where the existing Azure relationship simplifies compliance operations.
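A tiered decision like this one is easiest to keep consistent when it is written down as a routing policy rather than scattered across applications. The following sketch assumes hypothetical tier names, endpoint identifiers, and a hypothetical Tier 2 use case (`appointment_reminder`); only the Tier 1 and Tier 2 split itself comes from the example above.

```python
# Which endpoint serves each tier (identifiers are illustrative)
ROUTING_POLICY = {
    "tier_1_clinical": "anthropic_api",
    "tier_2_administrative": "azure_openai",
}

# Which tier each use case belongs to (assignments are illustrative)
USE_CASE_TIERS = {
    "discharge_summary": "tier_1_clinical",
    "appointment_reminder": "tier_2_administrative",
}


def endpoint_for(use_case: str) -> str:
    """Resolve a use case to its endpoint; unclassified use cases are rejected."""
    if use_case not in USE_CASE_TIERS:
        raise ValueError(f"use case {use_case!r} has no assigned tier")
    return ROUTING_POLICY[USE_CASE_TIERS[use_case]]
```

Rejecting unclassified use cases, instead of defaulting them to one endpoint, forces each new use case through a tiering decision before it can send any request.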
Common Mistakes
Evaluating on Generic Benchmarks. MMLU and HumanEval scores do not predict clinical AI quality. Organizations that select vendors based on public benchmark leaderboards without running use-case-specific evaluations may find that the "top benchmark" model underperforms for their specific clinical documentation tasks.
Signing a BAA Without Reading Retention Clauses. Some enterprise BAAs include retention provisions that allow the vendor to retain inference content for 30 or 90 days for security monitoring purposes. Organizations that do not negotiate or review these clauses may inadvertently allow PHI retention exceeding their own data minimization policies.
Selecting a Vendor Before Confirming Rate Limits. Default API rate limits for enterprise AI vendors may be orders of magnitude below the rate required for production clinical AI at hospital scale. Confirm that the vendor can provision adequate rate limits, and include rate limit SLA commitments in the contract.
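Before the rate limit conversation with a vendor, it helps to translate projected volume into the per-minute figures that rate limits are usually expressed in. The sketch below is a rough estimate under stated assumptions: traffic is spread over `active_hours_per_day`, and peaks reach `peak_factor` times the average rate. Both defaults are placeholders to be replaced with observed traffic patterns.

```python
import math


def required_rate_limits(
    daily_requests: int,
    avg_tokens_per_request: int,
    active_hours_per_day: float = 12.0,  # assumption: most traffic in a 12-hour window
    peak_factor: float = 3.0,            # assumption: peak is 3x the average rate
) -> tuple[int, int]:
    """Estimate the peak (requests per minute, tokens per minute) to provision for."""
    average_rpm = daily_requests / (active_hours_per_day * 60)
    peak_rpm = math.ceil(average_rpm * peak_factor)
    peak_tpm = peak_rpm * avg_tokens_per_request
    return peak_rpm, peak_tpm


# Hypothetical hospital-scale volume: 36,000 requests per day at 4,000 tokens each
peak_rpm, peak_tpm = required_rate_limits(36_000, 4_000)
print(peak_rpm, peak_tpm)  # 150 600000
```

Comparing these two numbers against the vendor's default limits shows quickly whether a limit increase needs to be part of the contract.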
Ignoring Cross-Cloud Egress. For organizations running their application on a cloud provider other than the AI vendor (e.g., running on AWS but calling Anthropic API), cross-cloud data egress adds latency and cost. Evaluate total latency including network transit, not just model inference time.
Best Practices
- Always complete the HIPAA qualification gate before beginning technical evaluation — eliminate non-qualifying vendors first
- Run use-case-specific evaluations on de-identified clinical data rather than relying on public benchmarks
- Model cost at production scale (millions of monthly tokens) before finalizing vendor selection
- Negotiate BAA retention clauses explicitly: PHI should not be retained for training or for longer than operationally necessary
- Confirm rate limit capacity for production scale before signing — include rate limit minimums in the contract
- Design for multi-vendor through an AI gateway even if initially deploying single-vendor, to preserve optionality
- Re-evaluate vendor selection annually — model capabilities, pricing, and compliance commitments evolve
Alternatives
The evaluation in this chapter covers the four primary enterprise AI cloud and API vendors. Additional alternatives include:
- Self-hosted open-source models (Llama 3, Mistral): Eliminate vendor dependency and data sharing, but require significant GPU infrastructure, ML engineering capacity, and an internal model evaluation and safety program. Appropriate for organizations with extreme data sensitivity or sovereignty requirements.
- Cohere Enterprise: Strong retrieval and embeddings focus; evaluate for RAG-heavy use cases.
- Specialized clinical AI vendors (Nuance, Ambient AI): Domain-specific clinical AI products that sit on top of foundation models. Not general-purpose LLM vendors; evaluated separately as point solutions.
Trade-offs
| Dimension | Direct API (Anthropic) | Cloud-Integrated (Azure OAI, Bedrock) | Self-Hosted |
|---|---|---|---|
| Latest model access | Immediate | Delayed (cloud integration lag) | Depends on open-source ecosystem |
| HIPAA BAA | Direct BAA required | Covered by cloud provider HIPAA BAA | Not applicable (no third party) |
| Existing cloud integration | Additional vendor | Unified with cloud platform | Native |
| Rate limits | Negotiate directly | Cloud provider quotas | Limited by own GPU capacity |
| Pricing control | Direct negotiation | Cloud commitment tiers | GPU infrastructure cost |
| Lock-in risk | Medium | Medium-High (cloud ecosystem) | Low (model portability) |
| Operational complexity | Low | Low | High |
Interview Questions
Q: A hospital CISO asks you to explain what due diligence is required before allowing PHI to be sent to an external LLM API. What is the minimum required?
Category: Architecture / Compliance | Difficulty: Senior | Role: AI Architect / FDE
Answer Framework:
The minimum required before PHI can be transmitted to an external LLM API is a signed HIPAA Business Associate Agreement with the vendor. Without a BAA, transmitting PHI to an external service is a potential HIPAA violation regardless of the vendor's security posture.
Beyond the BAA, the due diligence checklist includes: confirming the vendor's data retention policy (PHI should not be retained for training purposes), reviewing data residency to confirm PHI stays in acceptable regions, and confirming the vendor is covered under the organization's risk management program (SOC 2 Type II or equivalent security certifications reviewed).
For an AI vendor specifically: confirm that inference requests are not used to train the model by default, and that the organization can request deletion of any retained inference content. These provisions may require explicit negotiation — they are not universally included in default enterprise terms.
Key Points to Hit:
- BAA is a legal prerequisite, not optional
- Model training opt-out is an AI-specific HIPAA concern not present in traditional vendor relationships
- Data residency matters if the organization has state-level regulations beyond HIPAA
- SOC 2 Type II is the baseline security certification to require
Q: How would you design a proof-of-concept to compare two LLM vendors for a clinical documentation use case?
Category: System Design | Difficulty: Senior | Role: AI Architect
Answer Framework:
A rigorous POC requires three elements: a representative dataset, an objective evaluation rubric, and identical conditions across vendors.
The dataset must consist of de-identified clinical cases representative of the production use case, not synthetic examples, on which models may perform differently than on real clinical text. For discharge summary generation, this means real (de-identified) encounter data with a complexity distribution similar to what production will process.
The evaluation rubric must be defined before running the POC, not after seeing the results. For clinical documentation, the rubric covers: required section presence, format compliance, clinical accuracy (ideally reviewed by a clinical panel rather than automated scoring alone), and hallucination rate (presence of clinically plausible but factually unsupported claims).
Identical conditions means the same system prompt (or, if testing vendor-specific prompt optimization, clearly documented prompt differences), the same input context format, runs at the same time of day to avoid diurnal latency variation, and a sample size large enough for statistical significance (typically 100+ cases for use cases with binary pass/fail criteria).
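Whether a given sample size is large enough can be checked by putting a confidence interval around each pass rate. The sketch below uses the Wilson score interval, one common choice for binomial proportions, and is offered as an illustration, not a prescribed method. The example inputs reuse the illustrative 97% and 92% section completion rates on 100 cases from the healthcare example.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("require n > 0 and 0 <= passes <= n")
    p = passes / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    return center - half_width, center + half_width


low_a, high_a = wilson_interval(97, 100)
low_b, high_b = wilson_interval(92, 100)
print(f"97/100: {low_a:.3f} to {high_a:.3f}")
print(f"92/100: {low_b:.3f} to {high_b:.3f}")
```

With these inputs the two intervals overlap, which suggests that 100 cases may not be enough to separate vendors whose pass rates differ by only a few points. Overlap is a rough screen, not a formal hypothesis test, but it is a useful prompt to enlarge the test set before drawing conclusions.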
Measure: latency (P50 and P95), cost per request, and quality scores. Report all three together — a cheaper, slower vendor may still be the right choice for asynchronous workflows where latency is not the constraint.
Key Points to Hit:
- De-identified real data, not synthetic
- Rubric defined before running (avoid confirmation bias)
- Identical conditions across vendors
- Sample size sufficient for statistical significance
- Measure latency, cost, AND quality
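Reporting latency, cost, and quality together is easier when one function produces all three from the same per-case results. The sketch below is one possible shape for that summary: the `(passed, latency_ms, cost_usd)` tuple layout is an assumption made for brevity, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results: Sequence[tuple[bool, float, float]]) -> dict[str, float]:
    """Summarize one vendor's results, given (passed, latency_ms, cost_usd) per case."""
    latencies = [latency for _, latency, _ in results]
    return {
        "pass_rate": sum(1 for passed, _, _ in results if passed) / len(results),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "mean_cost_usd": sum(cost for _, _, cost in results) / len(results),
    }
```

Running `summarize` once per vendor over the same case set yields rows that can be placed side by side, so that no single metric is read in isolation.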
Key Takeaways
- Vendor qualification (HIPAA BAA, data residency, training opt-out) must precede technical evaluation — ineligible vendors should not consume evaluation resources
- A signed HIPAA BAA with the AI vendor is a legal prerequisite before PHI can be transmitted to any external LLM API
- Confirm explicitly that inference requests are excluded from model training — this is not universally true by default
- Use-case-specific evaluations on de-identified clinical data predict clinical AI quality better than public benchmarks
- Model cost at production scale, not at current volume — per-token economics change substantially with high request volumes
- Design for multi-vendor strategy through an AI gateway even when starting with a single vendor
- Negotiate rate limits, pricing tiers, and BAA retention clauses before signing — post-signature negotiating leverage is minimal
Glossary
Business Associate Agreement (BAA): A HIPAA-required contract between a Covered Entity and a vendor (Business Associate) that processes PHI, governing how the vendor may use, store, and disclose that PHI.
Model training opt-out: A provision that prevents a vendor from using inference request content to improve its model. Some vendors enable this by default for enterprise API customers; others require explicit configuration.
Data residency: The requirement that PHI remain within a specified geographic region (e.g., US-only) during processing and storage.
Rate limit: A vendor-imposed ceiling on the number of API requests or tokens that can be processed per minute or per day. Relevant to production planning for high-volume clinical AI use cases.
Further Reading
- Chapter 2: AI Governance — Governance requirements that shape vendor selection criteria
- Chapter 4: Cost Management — Token economics and cost modeling at production scale
- Chapter 6: AI Platform Architecture — AI gateway as the enabling infrastructure for multi-vendor strategy
- Anthropic Enterprise Documentation — Current enterprise terms and BAA information
- AWS Bedrock HIPAA Compliance — AWS HIPAA eligible services documentation
- Azure OpenAI Service HIPAA — Azure compliance documentation