AI Vendor Evaluation
Executive Summary
Selecting an AI vendor for a healthcare enterprise is a decision with a 3–5 year operational horizon, HIPAA liability consequences, and technical lock-in risks that are difficult to reverse once clinical systems depend on a vendor's proprietary infrastructure. The decision is not primarily about which model currently benchmarks highest — model capabilities change quarterly. It is about which vendor relationship, contractual structure, data handling commitment, and platform architecture best match the organization's operational requirements, compliance obligations, and strategic direction. This chapter provides an evaluation framework for the four primary enterprise AI providers — Anthropic, Azure OpenAI Service, Google Vertex AI, and AWS Bedrock — applied to a healthcare context.
Learning Objectives
After reading this chapter, you will be able to:
- Apply a structured evaluation framework to compare enterprise AI vendors across contractual, technical, and operational dimensions
- Identify the HIPAA-specific evaluation criteria that distinguish clinical AI vendor selection from general enterprise AI vendor selection
- Design a proof-of-concept evaluation that produces comparable, objective results across multiple vendors
- Articulate the trade-offs between single-vendor and multi-vendor AI strategies at enterprise scale
Business Problem
An enterprise that commits to a single AI vendor without evaluating alternatives, or that evaluates vendors only on model benchmark scores, frequently discovers 18 months into deployment that: the vendor's HIPAA Business Associate Agreement contains data handling provisions that are incompatible with the organization's compliance requirements; the proprietary SDK and API patterns have created a migration cost that makes switching economically infeasible; the model tier required for clinical quality is priced above budget at scale; or the vendor's operational SLA does not meet the availability requirement for a clinical decision support workflow.
These failures do not mean the chosen vendor was technically inferior; they result from evaluating the wrong dimensions at selection time. A sound evaluation framework assesses the vendor relationship and contractual structure alongside model capabilities.
Why This Technology Exists
Enterprise AI vendors emerged from two different directions: foundation model developers that built enterprise commercial offerings around their research models (Anthropic, OpenAI), and cloud platform providers that integrated foundation model access into their existing enterprise cloud and compliance infrastructure (Microsoft Azure, Google Cloud, Amazon Web Services).
The distinction matters for healthcare organizations. Foundation model developers offer direct access to their latest models with the most rapid capability releases, but require the organization to build compliance infrastructure (BAA, PHI handling, audit logging) on top of the vendor's API. Cloud platform providers offer model access within existing HIPAA-compliant cloud frameworks, with pre-existing BAAs that cover the AI services alongside storage, compute, and networking — but the integrated model catalog is subject to the cloud provider's release schedule, which may lag the foundation model developer's direct API.
Conceptual Explanation
Vendor evaluation in enterprise AI has two distinct phases that organizations frequently conflate:
Phase 1 — Qualification: Determining which vendors are eligible for clinical AI use. Eligibility requires a HIPAA Business Associate Agreement, acceptable data residency, acceptable data retention and training policies, and security posture compatible with the organization's risk tolerance. Vendors that do not meet qualification criteria are eliminated before technical evaluation begins.
Phase 2 — Selection: Among qualified vendors, comparing model capability, performance, cost, operational maturity, and strategic fit for the organization's specific use cases and infrastructure.
Conflating these phases, for example by comparing benchmark scores across vendors before confirming HIPAA qualification, is a common mistake: the evaluation can end up favoring a technically superior vendor that is legally ineligible to process clinical PHI.
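The two-phase structure can be sketched as a hard gate that runs before any scoring. This is a minimal illustration, not a compliance tool: the `VendorProfile` fields, the retention threshold, and the three placeholder vendors are all assumptions made for the example, and real qualification depends on legal review of the actual contract terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorProfile:
    """Hypothetical qualification facts gathered during Phase 1."""
    name: str
    baa_available: bool
    us_data_residency: bool
    phi_excluded_from_training: bool
    retention_days: int  # longest period inference content may be retained


def qualification_failures(vendor: VendorProfile, max_retention_days: int) -> list[str]:
    """Return the failed criteria; an empty list means the vendor qualifies."""
    failures = []
    if not vendor.baa_available:
        failures.append("no HIPAA BAA")
    if not vendor.us_data_residency:
        failures.append("data residency not acceptable")
    if not vendor.phi_excluded_from_training:
        failures.append("inference content may be used for training")
    if vendor.retention_days > max_retention_days:
        failures.append("retention exceeds policy")
    return failures


def qualify(vendors, max_retention_days):
    """Split vendors into qualified ones and rejected ones with reasons."""
    qualified, rejected = [], {}
    for vendor in vendors:
        failures = qualification_failures(vendor, max_retention_days)
        if failures:
            rejected[vendor.name] = failures
        else:
            qualified.append(vendor)
    return qualified, rejected


# Placeholder vendors, not real providers.
candidates = [
    VendorProfile("Vendor A", True, True, True, 30),
    VendorProfile("Vendor B", True, True, False, 0),
    VendorProfile("Vendor C", False, True, True, 0),
]
qualified, rejected = qualify(candidates, max_retention_days=30)
print([v.name for v in qualified])
print(rejected)
```

Only the vendors in `qualified` would move on to Phase 2, so benchmark and POC effort is never spent on a vendor that cannot be used for PHI.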
Core Architecture
Enterprise Considerations
Multi-Vendor Strategy: A multi-vendor approach that uses one vendor for primary inference and a second for fallback provides resilience against vendor outages, negotiation leverage at contract renewal, and flexibility to route different use cases to the most cost-effective model for that use case. The cost is integration complexity. The AI gateway described in Chapter 6 is the enabling infrastructure for multi-vendor strategy — without gateway abstraction, multi-vendor quickly becomes an unmaintainable integration sprawl.
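The primary-plus-fallback pattern can be sketched in a few lines. This is an illustrative sketch only, not the gateway design from Chapter 6: `complete_with_fallback`, the `Provider` type, and the two stub functions are hypothetical names, and the stubs stand in for real vendor SDK calls.

```python
from typing import Callable, Sequence

# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in priority order and return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_stub(prompt: str) -> str:
    # Stand-in for the primary vendor's client; simulates an outage.
    raise TimeoutError("simulated outage")


def fallback_stub(prompt: str) -> str:
    # Stand-in for the fallback vendor's client.
    return f"[fallback] summary for: {prompt}"


used, text = complete_with_fallback(
    "de-identified encounter note",
    [("primary", primary_stub), ("fallback", fallback_stub)],
)
print(used, text)
```

Because callers depend only on `complete_with_fallback`, adding or reordering vendors is a configuration change rather than an application change, which is the abstraction the gateway is meant to provide.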
Contract Negotiation at Enterprise Scale: Enterprise healthcare organizations deploying AI at scale (millions of monthly tokens across multiple use cases) have negotiating leverage. Committed spend agreements can reduce effective per-token pricing significantly. Negotiate before committing to a vendor, not after. Key terms to negotiate: enterprise discount tiers, rate limit increases, dedicated capacity guarantees, and contractual SLA commitments.
API Stability: Model versions are deprecated, APIs change, and SDKs introduce breaking changes. Evaluate the vendor's history of API stability, deprecation notice periods, and backward compatibility commitments. For clinical AI, deprecation of a model version without adequate notice is an operational risk, because the successor model must be evaluated for clinical quality before it goes into production.
Support Model: Enterprise healthcare organizations require support SLA commitments beyond consumer-tier response times. Evaluate: What is the vendor's enterprise support tier? What is the committed response time for production-impacting incidents? Is a dedicated technical account manager available?
Healthcare Example
Educational Example — Illustrative Workflow. Not intended for clinical decision making.
The Reference Healthcare Organization evaluates three qualified vendors for its AI platform — Anthropic API, Azure OpenAI Service, and AWS Bedrock — against a discharge summary generation use case. The organization has already signed a BAA with Microsoft Azure for its EHR infrastructure; the question is whether to extend that BAA to Azure OpenAI Service or establish a direct relationship with Anthropic and/or AWS.
Qualification gate results:
| Criterion | Anthropic API | Azure OpenAI Service | AWS Bedrock |
|---|---|---|---|
| HIPAA BAA available | Yes | Yes (via Azure HIPAA BAA) | Yes (via AWS HIPAA BAA) |
| Data residency (US) | Yes | Yes | Yes |
| PHI excluded from training | Yes (by default for API) | Yes | Yes |
| SOC 2 Type II | Yes | Yes | Yes |
| Qualified | Yes | Yes | Yes |
Phase 2 evaluation results (discharge summary use case, de-identified test set, 100 cases):
| Dimension | Anthropic (Claude Opus) | Azure OAI (GPT-4o) | AWS Bedrock (Claude Sonnet) |
|---|---|---|---|
| Section completion rate | 97% | 92% | 95% |
| Clinical accuracy (panel review) | 4.6/5.0 | 4.3/5.0 | 4.5/5.0 |
| P50 latency | 2.1s | 2.4s | 2.3s |
| P95 latency | 5.8s | 7.1s | 6.4s |
| Format compliance | 96% | 89% | 94% |
| Illustrative cost per 1K requests | $X.XX | $X.XX | $X.XX (see vendor docs) |
All pricing figures are illustrative. Verify current pricing in vendor documentation before cost modeling.
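One way to combine the Phase 2 dimensions into a single comparable number is a weighted composite score. The sketch below uses the figures from the table above; the weights, the 10-second P95 latency budget, and the normalization are assumptions chosen for illustration, and cost is left out because the table gives only placeholders.

```python
# Phase 2 results copied from the table above (cost omitted: placeholder values only).
RESULTS = {
    "Anthropic (Claude Opus)": {"completion": 0.97, "accuracy": 4.6, "p95_s": 5.8, "format": 0.96},
    "Azure OAI (GPT-4o)": {"completion": 0.92, "accuracy": 4.3, "p95_s": 7.1, "format": 0.89},
    "AWS Bedrock (Claude Sonnet)": {"completion": 0.95, "accuracy": 4.5, "p95_s": 6.4, "format": 0.94},
}

# Hypothetical weights and latency budget; an organization would set its own.
WEIGHTS = {"accuracy": 0.4, "completion": 0.2, "format": 0.2, "latency": 0.2}
P95_BUDGET_S = 10.0


def composite_score(row: dict) -> float:
    """Normalize each dimension to the 0..1 range, then apply the weights."""
    normalized = {
        "accuracy": row["accuracy"] / 5.0,  # panel score out of 5.0
        "completion": row["completion"],
        "format": row["format"],
        # Lower latency is better: 1.0 at zero latency, 0.0 at or beyond the budget.
        "latency": max(0.0, 1.0 - row["p95_s"] / P95_BUDGET_S),
    }
    return sum(WEIGHTS[key] * normalized[key] for key in WEIGHTS)


scores = {name: round(composite_score(row), 3) for name, row in RESULTS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]}")
```

A different set of weights can change the ranking, so the weights should be fixed before results are collected rather than tuned afterward.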
Strategic fit assessment:
The organization already uses Azure for EHR hosting, Active Directory, and storage. Azure OpenAI Service would allow the AI service to operate under the existing enterprise HIPAA BAA and Azure RBAC. However, Anthropic's model scored higher on clinical accuracy in panel review (4.6 versus 4.3 out of 5.0 for Azure OpenAI Service), and the organization's AI governance policy requires the highest-quality model available for clinical documentation. The final decision: use Anthropic API (with a direct BAA) as the primary inference endpoint for Tier 1 clinical use cases, and Azure OpenAI Service for Tier 2 (administrative) use cases where the existing Azure relationship simplifies compliance operations.
Common Mistakes
Evaluating on Generic Benchmarks. MMLU and HumanEval scores do not predict clinical AI quality. Organizations that select vendors based on public benchmark leaderboards without running use-case-specific evaluations may find that the "top benchmark" model underperforms for their specific clinical documentation tasks.
Signing a BAA Without Reading Retention Clauses. Some enterprise BAAs include retention provisions that allow the vendor to retain inference content for 30 or 90 days for security monitoring purposes. Organizations that do not negotiate or review these clauses may inadvertently allow PHI retention exceeding their own data minimization policies.
Selecting a Vendor Before Confirming Rate Limits. Default API rate limits for enterprise AI vendors may be orders of magnitude below the rate required for production clinical AI at hospital scale. Confirm that the vendor can provision adequate rate limits, and include rate limit SLA commitments in the contract.
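A back-of-the-envelope capacity estimate makes the gap concrete. The sketch below is illustrative: the request volume, peak-hour share, tokens per request, headroom factor, and the default limit are all hypothetical inputs, not figures from any vendor.

```python
import math


def required_capacity(daily_requests: int, peak_hour_share: float,
                      tokens_per_request: int, headroom: float = 1.5) -> tuple[int, int]:
    """Estimate peak requests per minute (RPM) and tokens per minute (TPM)."""
    peak_rpm = daily_requests * peak_hour_share / 60  # requests per minute in the busiest hour
    rpm = math.ceil(round(peak_rpm * headroom, 6))    # add headroom for bursts
    return rpm, rpm * tokens_per_request


# Hypothetical hospital workload: 20,000 requests per day, 15% of them in the peak hour.
rpm, tpm = required_capacity(daily_requests=20_000, peak_hour_share=0.15,
                             tokens_per_request=6_000)

DEFAULT_TPM_LIMIT = 40_000  # hypothetical default limit, not a real vendor figure
shortfall = tpm / DEFAULT_TPM_LIMIT
print(f"need {rpm} RPM and {tpm:,} TPM, {shortfall:.2f}x the assumed default limit")
```

Running the same arithmetic with the organization's own volumes gives the minimum rate limit to write into the contract.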
Ignoring Cross-Cloud Egress. For organizations running their application on a cloud provider other than the AI vendor (e.g., running on AWS but calling the Anthropic API), cross-cloud data egress adds latency and cost. Evaluate total latency including network transit, not just model inference time.
Best Practices
- Always complete the HIPAA qualification gate before beginning technical evaluation — eliminate non-qualifying vendors first
- Run use-case-specific evaluations on de-identified clinical data rather than relying on public benchmarks
- Model cost at production scale (millions of monthly tokens) before finalizing vendor selection
- Negotiate BAA retention clauses explicitly: PHI should not be retained for training or for longer than operationally necessary
- Confirm rate limit capacity for production scale before signing — include rate limit minimums in the contract
- Design for multi-vendor through an AI gateway even if initially deploying single-vendor, to preserve optionality
- Re-evaluate vendor selection annually — model capabilities, pricing, and compliance commitments evolve
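Modeling cost at production scale reduces to simple arithmetic once request volume and token counts are estimated. The sketch below is a minimal model under stated assumptions: the per-million-token prices, the token counts, the volumes, and the committed-spend discount are placeholders, not actual vendor pricing.

```python
def monthly_cost_usd(requests_per_month: int, input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float,
                     committed_discount: float = 0.0) -> float:
    """Estimate monthly spend from per-request token counts and per-million-token prices."""
    cost_per_request = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return requests_per_month * cost_per_request * (1.0 - committed_discount)


# Placeholder prices (USD per million tokens); verify against current vendor documentation.
PRICE_IN, PRICE_OUT = 2.0, 10.0

pilot = monthly_cost_usd(5_000, 5_000, 1_000, PRICE_IN, PRICE_OUT)
production = monthly_cost_usd(600_000, 5_000, 1_000, PRICE_IN, PRICE_OUT)
negotiated = monthly_cost_usd(600_000, 5_000, 1_000, PRICE_IN, PRICE_OUT,
                              committed_discount=0.20)

print(f"pilot: ${pilot:,.2f}  production: ${production:,.2f}  with discount: ${negotiated:,.2f}")
```

Comparing the pilot and production figures side by side shows why a tier that looks affordable during a pilot needs to be re-costed at full volume before selection is finalized.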
Alternatives
The evaluation in this chapter covers the four primary enterprise AI cloud and API vendors. Additional alternatives include:
- Self-hosted open-source models (Llama 3, Mistral): Eliminate vendor dependency and data sharing, but require significant GPU infrastructure, ML engineering capacity, and an internal model evaluation and safety program. Appropriate for organizations with extreme data sensitivity or sovereignty requirements.
- Cohere Enterprise: Strong retrieval and embeddings focus; evaluate for RAG-heavy use cases.
- Specialized clinical AI vendors (Nuance, Ambient AI): Domain-specific clinical AI products that sit on top of foundation models. Not general-purpose LLM vendors; evaluated separately as point solutions.
Trade-offs
| Dimension | Direct API (Anthropic) | Cloud-Integrated (Azure OAI, Bedrock) | Self-Hosted |
|---|---|---|---|
| Latest model access | Immediate | Delayed (cloud integration lag) | Depends on open-source ecosystem |
| HIPAA BAA | Direct BAA required | Covered by cloud provider HIPAA BAA | Not applicable (no third party) |
| Existing cloud integration | Additional vendor | Unified with cloud platform | Native |
| Rate limits | Negotiate directly | Cloud provider quotas | Limited by own GPU capacity |
| Pricing control | Direct negotiation | Cloud commitment tiers | GPU infrastructure cost |
| Lock-in risk | Medium | Medium-High (cloud ecosystem) | Low (model portability) |
| Operational complexity | Low | Low | High |
Interview Questions
Q: A hospital CISO asks you to explain what due diligence is required before allowing PHI to be sent to an external LLM API. What is the minimum required?
Category: Architecture / Compliance; Difficulty: Senior; Role: AI Architect / FDE
Answer Framework:
The minimum required before PHI can be transmitted to an external LLM API is a signed HIPAA Business Associate Agreement with the vendor. Without a BAA, transmitting PHI to an external service is a potential HIPAA violation regardless of the vendor's security posture.
Beyond the BAA, the due diligence checklist includes: confirming the vendor's data retention policy (PHI should not be retained for training purposes), reviewing data residency to confirm PHI stays in acceptable regions, and confirming the vendor is covered under the organization's risk management program (SOC 2 Type II or equivalent security certifications reviewed).
For an AI vendor specifically: confirm that inference requests are not used to train the model by default, and that the organization can request deletion of any retained inference content. These provisions may require explicit negotiation — they are not universally included in default enterprise terms.
Key Points to Hit:
- BAA is a legal prerequisite, not optional
- Model training opt-out is an AI-specific HIPAA concern not present in traditional vendor relationships
- Data residency matters if the organization has state-level regulations beyond HIPAA
- SOC 2 Type II is the baseline security certification to require
Q: How would you design a proof-of-concept to compare two LLM vendors for a clinical documentation use case?
Category: System Design; Difficulty: Senior; Role: AI Architect
Answer Framework:
A rigorous POC requires three elements: a representative dataset, an objective evaluation rubric, and identical conditions across vendors.
The dataset must consist of de-identified clinical cases representative of the production use case, not synthetic examples, on which models may perform differently than on real clinical text. For discharge summary generation, this means real (de-identified) encounter data with a complexity distribution similar to what production will process.
The evaluation rubric must be defined before running the POC, not after seeing the results. For clinical documentation, the rubric covers: required section presence, format compliance, clinical accuracy (ideally reviewed by a clinical panel rather than automated scoring alone), and hallucination rate (presence of clinically plausible but factually unsupported claims).
Identical conditions mean: the same system prompt (or, if testing vendor-specific prompt optimization, clearly documented prompt differences); the same input context format; runs at the same time of day to avoid diurnal latency variation; and a sample size sufficient for statistical significance (typically 100+ cases for use cases with binary pass/fail criteria).
Measure: latency (P50 and P95), cost per request, and quality scores. Report all three together — a cheaper, slower vendor may still be the right choice for asynchronous workflows where latency is not the constraint.
Key Points to Hit:
- De-identified real data, not synthetic
- Rubric defined before running (avoid confirmation bias)
- Identical conditions across vendors
- Sample size sufficient for statistical significance
- Measure latency, cost, AND quality
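The sample-size and latency points in the POC answer above can be made concrete with two small helpers. This is a sketch under stated assumptions: it uses the Wilson score interval as one common way to put an uncertainty range around a pass rate, and a nearest-rank percentile for latency; the function names and the latency samples are hypothetical.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binary pass rate (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def percentile_nearest_rank(values, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Pass rates of 97/100 and 92/100, as in the section completion row of the healthcare example.
lo_a, hi_a = wilson_interval(97, 100)
lo_b, hi_b = wilson_interval(92, 100)
print(f"vendor A: {lo_a:.3f} to {hi_a:.3f}")
print(f"vendor B: {lo_b:.3f} to {hi_b:.3f}")
print("intervals overlap:", lo_a < hi_b)

# Hypothetical latency samples in seconds.
latencies = [1.8, 2.0, 2.1, 2.2, 2.4, 2.6, 3.0, 3.9, 5.1, 6.2]
print("P50:", percentile_nearest_rank(latencies, 50))
print("P95:", percentile_nearest_rank(latencies, 95))
```

With 100 cases, the intervals for 97% and 92% overlap, so a gap of that size on a single binary criterion may not by itself separate two vendors. Overlapping intervals are only a rough check, but they indicate when a larger sample or additional criteria are worth the effort.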
Key Takeaways
- Vendor qualification (HIPAA BAA, data residency, training opt-out) must precede technical evaluation — ineligible vendors should not consume evaluation resources
- A signed HIPAA BAA with the AI vendor is a legal prerequisite before PHI can be transmitted to any external LLM API
- Confirm explicitly that inference requests are excluded from model training — this is not universally true by default
- Use-case-specific evaluations on de-identified clinical data predict clinical AI quality better than public benchmarks
- Model cost at production scale, not at current volume — per-token economics change substantially with high request volumes
- Design for multi-vendor strategy through an AI gateway even when starting with a single vendor
- Negotiate rate limits, pricing tiers, and BAA retention clauses before signing — post-signature negotiating leverage is minimal
Further Reading
- Chapter 2: AI Governance — Governance requirements that shape vendor selection criteria
- Chapter 4: Cost Management — Token economics and cost modeling at production scale
- Chapter 6: AI Platform Architecture — AI gateway as the enabling infrastructure for multi-vendor strategy
- Anthropic Enterprise Documentation — Current enterprise terms and BAA information
- AWS Bedrock HIPAA Compliance — AWS HIPAA eligible services documentation
- Azure OpenAI Service HIPAA — Azure compliance documentation