AI Vendor Evaluation

Executive Summary

Selecting an AI vendor for a healthcare enterprise is a decision with a 3–5 year operational horizon, HIPAA liability consequences, and technical lock-in risks that are difficult to reverse once clinical systems depend on a vendor's proprietary infrastructure. The decision is not primarily about which model currently benchmarks highest — model capabilities change quarterly. It is about which vendor relationship, contractual structure, data handling commitment, and platform architecture best match the organization's operational requirements, compliance obligations, and strategic direction. This chapter provides an evaluation framework for the four primary enterprise AI providers — Anthropic, Azure OpenAI Service, Google Vertex AI, and AWS Bedrock — applied to a healthcare context.

Learning Objectives

After reading this chapter, you will be able to:

Apply a structured evaluation framework to compare enterprise AI vendors across contractual, technical, and operational dimensions
Identify the HIPAA-specific evaluation criteria that distinguish clinical AI vendor selection from general enterprise AI vendor selection
Design a proof-of-concept evaluation that produces comparable, objective results across multiple vendors
Articulate the trade-offs between single-vendor and multi-vendor AI strategies at enterprise scale

Business Problem

An enterprise that commits to a single AI vendor without evaluating alternatives, or that evaluates vendors only on model benchmark scores, frequently discovers 18 months into deployment that: the vendor's HIPAA Business Associate Agreement contains data handling provisions that are incompatible with the organization's compliance requirements; the proprietary SDK and API patterns have created a migration cost that makes switching economically infeasible; the model tier required for clinical quality is priced above budget at scale; or the vendor's operational SLA does not meet the availability requirement for a clinical decision support workflow.

These failures do not mean the chosen vendor was technically inferior. They result from evaluating the wrong dimensions at selection time. A sound evaluation framework weighs the vendor relationship and contractual structure alongside model capabilities.

Why This Technology Exists

Enterprise AI vendors emerged from two different directions: foundation model developers that built enterprise commercial offerings around their research models (Anthropic, OpenAI), and cloud platform providers that integrated foundation model access into their existing enterprise cloud and compliance infrastructure (Microsoft Azure, Google Cloud, Amazon Web Services).

The distinction matters for healthcare organizations. Foundation model developers offer direct access to their latest models with the most rapid capability releases, but require the organization to execute a separate BAA and build its own compliance infrastructure (PHI handling, audit logging) around the vendor's API. Cloud platform providers offer model access within cloud frameworks the organization may already use for regulated workloads, under an existing BAA that can cover AI services alongside storage, compute, and networking, provided the specific AI service is on the provider's list of BAA-covered services. The trade-off is that the integrated model catalog follows the cloud provider's release schedule, which may lag the foundation model developer's direct API.

Conceptual Explanation

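Vendor evaluation in enterprise AI has two distinct phases that organizations frequently conflate (the Core Architecture diagram further splits the second into technical evaluation and strategic fit):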

Phase 1 — Qualification: Determining which vendors are eligible for clinical AI use. Eligibility requires a HIPAA Business Associate Agreement, acceptable data residency, acceptable data retention and training policies, and security posture compatible with the organization's risk tolerance. Vendors that do not meet qualification criteria are eliminated before technical evaluation begins.

Phase 2 — Selection: Among qualified vendors, comparing model capability, performance, cost, operational maturity, and strategic fit for the organization's specific use cases and infrastructure.

Conflating these phases, for example by comparing benchmark scores across vendors before confirming HIPAA qualification, is a common mistake. It can lead the organization to favor a technically superior vendor that is legally ineligible to process clinical PHI.
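The qualification gate can be sketched as a simple filter that runs before any technical comparison. This is a minimal illustration, not a compliance tool: the `VendorProfile` record, its field names, and the vendor names are hypothetical, and the real inputs come from legal and security review rather than booleans typed into code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorProfile:
    """Hypothetical record of the Phase 1 facts gathered for one vendor."""
    name: str
    baa_available: bool
    data_residency_ok: bool
    phi_used_for_training: bool  # True if PHI may train models and cannot be opted out
    security_posture_ok: bool


def qualification_failures(vendor: VendorProfile) -> list[str]:
    """Return the failed Phase 1 criteria; an empty list means the vendor is qualified."""
    failures = []
    if not vendor.baa_available:
        failures.append("no HIPAA BAA")
    if not vendor.data_residency_ok:
        failures.append("data residency unacceptable")
    if vendor.phi_used_for_training:
        failures.append("PHI used for model training")
    if not vendor.security_posture_ok:
        failures.append("security posture unacceptable")
    return failures


def qualified_vendors(vendors: list[VendorProfile]) -> list[VendorProfile]:
    """Keep only vendors that pass every criterion; only these proceed to Phase 2."""
    return [v for v in vendors if not qualification_failures(v)]


if __name__ == "__main__":
    candidates = [
        VendorProfile("vendor-a", True, True, False, True),
        VendorProfile("vendor-b", False, True, False, True),
    ]
    for v in candidates:
        print(v.name, qualification_failures(v) or "qualified")
```

Returning the list of failed criteria, rather than a bare pass/fail, keeps a record of why a vendor was eliminated.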

Core Architecture

flowchart TD
    START["AI Vendor Evaluation\nInitiated"]

    subgraph "Phase 1 — Qualification Gate"
        Q1{"HIPAA BAA\nAvailable?"}
        Q2{"Data Residency\nAcceptable?"}
        Q3{"PHI Used for\nModel Training?"}
        Q4{"Security Posture\nAcceptable?"}
        DISQ["Vendor Disqualified\nFor Clinical AI"]
        QUAL["Vendor Qualified\nProceed to Phase 2"]
    end

    subgraph "Phase 2 — Technical Evaluation"
        T1["Model Quality\nEvaluation"]
        T2["Latency &\nThroughput Benchmarks"]
        T3["Cost Modeling\n(at production scale)"]
        T4["SDK & Integration\nAssessment"]
        T5["Operational\nMaturity Review"]
    end

    subgraph "Phase 3 — Strategic Fit"
        S1["Vendor Roadmap\nAlignment"]
        S2["Cloud Platform\nAlignment"]
        S3["Lock-in Risk\nAssessment"]
        S4["Procurement\n& Contracting"]
    end

    DECIDE["Vendor Selection\nDecision"]

    START --> Q1
    Q1 -->|No| DISQ
    Q1 -->|Yes| Q2
    Q2 -->|No| DISQ
    Q2 -->|Yes| Q3
    Q3 -->|PHI used for training\nand cannot be opted out| DISQ
    Q3 -->|PHI not used for training| Q4
    Q4 -->|Fails| DISQ
    Q4 -->|Passes| QUAL
    QUAL --> T1 & T2 & T3
    T1 & T2 & T3 --> T4 --> T5
    T5 --> S1 & S2 & S3
    S1 & S2 & S3 --> S4 --> DECIDE

Enterprise Considerations

Multi-Vendor Strategy: A multi-vendor approach that uses one vendor for primary inference and a second for fallback provides resilience against vendor outages, negotiation leverage at contract renewal, and flexibility to route different use cases to the most cost-effective model for that use case. The cost is integration complexity. The AI gateway described in Chapter 6 is the enabling infrastructure for multi-vendor strategy — without gateway abstraction, multi-vendor quickly becomes an unmaintainable integration sprawl.
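The primary-plus-fallback pattern can be sketched as a small routing function. This is an illustration under stated assumptions, not the Chapter 6 gateway: `generate_with_fallback`, `AllProvidersFailed`, and the provider callables are hypothetical names, and real vendor SDK calls would sit inside the callables.

```python
from typing import Callable, Sequence

# A provider is modeled as a (name, callable) pair; the callable takes a prompt
# and returns generated text. Real vendor SDK calls would live inside the callable.
Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def generate_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, text) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a production gateway would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def primary_stub(prompt: str) -> str:
        raise TimeoutError("simulated outage")

    def fallback_stub(prompt: str) -> str:
        return f"draft for: {prompt}"

    print(generate_with_fallback("case 1", [("primary", primary_stub), ("fallback", fallback_stub)]))
```

Because callers see only the `(name, text)` result, adding or reordering vendors does not touch application code. Returning the provider name also lets audit logs record which vendor handled each request.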

Contract Negotiation at Enterprise Scale: Enterprise healthcare organizations deploying AI at scale (millions of monthly tokens across multiple use cases) have negotiating leverage. Committed spend agreements can reduce effective per-token pricing significantly. Negotiate before committing to a vendor, not after. Key terms to negotiate: enterprise discount tiers, rate limit increases, dedicated capacity guarantees, and contractual SLA commitments.

API Stability: Model versions are deprecated, APIs change, SDK breaking changes are introduced. Evaluate the vendor's history of API stability, deprecation notice periods, and backward compatibility commitments. For clinical AI, a model version being deprecated without adequate notice is an operational risk — the successor model must be evaluated for clinical quality before it goes into production.
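One way to make "adequate notice" concrete is to compare a vendor's deprecation notice period against the time the organization needs to re-evaluate a successor model. The sketch below is a rough planning aid. The function names, the 14-day buffer, and the dates in the usage example are all assumptions for illustration.

```python
from datetime import date, timedelta


def latest_evaluation_start(retirement: date, evaluation_days: int, buffer_days: int = 14) -> date:
    """Latest date on which successor-model evaluation can begin and still finish before retirement."""
    return retirement - timedelta(days=evaluation_days + buffer_days)


def notice_is_adequate(announced: date, retirement: date,
                       evaluation_days: int, buffer_days: int = 14) -> bool:
    """True if the deprecation announcement leaves enough time to evaluate the successor."""
    return announced <= latest_evaluation_start(retirement, evaluation_days, buffer_days)


if __name__ == "__main__":
    # Hypothetical dates: announcement six months before retirement, 60-day clinical evaluation.
    print(latest_evaluation_start(date(2025, 7, 1), evaluation_days=60))
    print(notice_is_adequate(date(2025, 1, 1), date(2025, 7, 1), evaluation_days=60))
```

Running this check against a vendor's past deprecation announcements gives a simple, comparable measure of whether its notice periods would have been workable.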

Support Model: Enterprise healthcare organizations require support SLA commitments beyond consumer-tier response times. Evaluate: What is the vendor's enterprise support tier? What is the committed response time for production-impacting incidents? Is a dedicated technical account manager available?

Healthcare Example

Educational Example — Illustrative Workflow. Not intended for clinical decision making.

The Reference Healthcare Organization evaluates three qualified vendors for its AI platform — Anthropic API, Azure OpenAI Service, and AWS Bedrock — against a discharge summary generation use case. The organization has already signed a BAA with Microsoft Azure for its EHR infrastructure; the question is whether to extend that BAA to Azure OpenAI Service or establish a direct relationship with Anthropic and/or AWS.

Qualification gate results:

Criterion	Anthropic API	Azure OpenAI Service	AWS Bedrock
HIPAA BAA available	Yes	Yes (via Azure HIPAA BAA)	Yes (via AWS HIPAA BAA)
Data residency (US)	Yes	Yes	Yes
PHI excluded from training	Yes (by default for API)	Yes	Yes
SOC 2 Type II	Yes	Yes	Yes
Qualified	Yes	Yes	Yes

Phase 2 evaluation results (discharge summary use case, de-identified test set, 100 cases):

Dimension	Anthropic (Claude Opus)	Azure OAI (GPT-4o)	AWS Bedrock (Claude Sonnet)
Section completion rate	97%	92%	95%
Clinical accuracy (panel review)	4.6/5.0	4.3/5.0	4.5/5.0
P50 latency	2.1s	2.4s	2.3s
P95 latency	5.8s	7.1s	6.4s
Format compliance	96%	89%	94%
Illustrative cost per 1K requests	$X.XX	$X.XX	$X.XX (see vendor docs)

All pricing figures are illustrative. Verify current pricing in vendor documentation before cost modeling.
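The P50 and P95 rows in the table above are percentiles over per-request timings. A minimal sketch of that calculation follows, using the nearest-rank definition. The sample latencies are made up for illustration, and other percentile definitions (for example, interpolated ones) give slightly different values on small samples.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical end-to-end latencies in seconds for ten test cases.
    latencies = [1.8, 2.0, 2.1, 2.2, 2.3, 2.6, 3.0, 3.9, 5.1, 6.4]
    print("P50:", percentile(latencies, 50))
    print("P95:", percentile(latencies, 95))
```

With only 100 cases per vendor, P95 rests on the slowest handful of requests, so small differences between vendors at that percentile deserve caution.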

Strategic fit assessment:

The organization already uses Azure for EHR hosting, Active Directory, and storage. Azure OpenAI Service would allow the AI service to operate under the existing enterprise HIPAA BAA and Azure RBAC. However, the Anthropic API configuration scored higher than Azure OpenAI Service on clinical accuracy in panel review (4.6 versus 4.3 out of 5.0), and the organization's AI governance policy requires the highest-quality model available for clinical documentation. The final decision: use Anthropic API (with a direct BAA) as the primary inference endpoint for Tier 1 clinical use cases, and Azure OpenAI Service for Tier 2 (administrative) use cases where the existing Azure relationship simplifies compliance operations.

Common Mistakes

Evaluating on Generic Benchmarks. MMLU and HumanEval scores do not predict clinical AI quality. Organizations that select vendors based on public benchmark leaderboards without running use-case-specific evaluations may find that the "top benchmark" model underperforms for their specific clinical documentation tasks.

Signing a BAA Without Reading Retention Clauses. Some enterprise BAAs include retention provisions that allow the vendor to retain inference content for 30 or 90 days for security monitoring purposes. Organizations that do not negotiate or review these clauses may inadvertently allow PHI retention exceeding their own data minimization policies.

Selecting a Vendor Before Confirming Rate Limits. Default API rate limits for enterprise AI vendors may be orders of magnitude below the rate required for production clinical AI at hospital scale. Confirm that the vendor can provision adequate rate limits, and include rate limit SLA commitments in the contract.
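A back-of-the-envelope capacity check makes the rate limit conversation with a vendor concrete. The sketch below assumes the limit is expressed in tokens per minute and that input and output tokens both count toward it. Vendors define and meter limits differently, so treat the function names, the 1.5 safety factor, and the volumes in the usage example as assumptions.

```python
def peak_requests_per_minute(daily_requests: int, peak_hour_share: float) -> float:
    """Requests per minute during the busiest hour, given that hour's share of daily volume."""
    return daily_requests * peak_hour_share / 60


def required_tokens_per_minute(requests_per_minute: float,
                               avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Token throughput needed at peak, counting both input and output tokens."""
    return requests_per_minute * (avg_input_tokens + avg_output_tokens)


def has_headroom(required_tpm: float, provisioned_tpm: float, safety_factor: float = 1.5) -> bool:
    """True if the provisioned limit covers peak demand with a safety margin."""
    return provisioned_tpm >= required_tpm * safety_factor


if __name__ == "__main__":
    # Hypothetical workload: 12,000 requests per day, 10% of them in the busiest hour.
    rpm = peak_requests_per_minute(12_000, 0.10)
    tpm = required_tokens_per_minute(rpm, avg_input_tokens=3_000, avg_output_tokens=500)
    print(rpm, tpm, has_headroom(tpm, provisioned_tpm=100_000))
```

The peak figure, not the daily average, is the number to bring into contract negotiation.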

Ignoring Cross-Cloud Network Transit. For organizations running their application on a cloud provider other than the AI vendor's (e.g., running on AWS but calling Anthropic API), cross-cloud data egress adds latency and cost. Evaluate total latency including network transit, not just model inference time.

Best Practices

Always complete the HIPAA qualification gate before beginning technical evaluation — eliminate non-qualifying vendors first
Run use-case-specific evaluations on de-identified clinical data rather than relying on public benchmarks
Model cost at production scale (millions of monthly tokens) before finalizing vendor selection
Negotiate BAA retention clauses explicitly: PHI should not be retained for training or for longer than operationally necessary
Confirm rate limit capacity for production scale before signing — include rate limit minimums in the contract
Design for multi-vendor through an AI gateway even if initially deploying single-vendor, to preserve optionality
Re-evaluate vendor selection annually — model capabilities, pricing, and compliance commitments evolve
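Cost modeling at production scale can start from a simple per-token formula. The sketch below assumes separate input and output prices quoted per million tokens plus a flat committed-spend discount. The function name is hypothetical and every price in the usage example is a placeholder, not a real vendor price.

```python
def monthly_cost_usd(requests_per_month: int,
                     avg_input_tokens: int,
                     avg_output_tokens: int,
                     input_price_per_mtok: float,
                     output_price_per_mtok: float,
                     discount: float = 0.0) -> float:
    """Estimated monthly spend; prices are per million tokens, discount is a fraction in [0, 1)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    input_cost = requests_per_month * avg_input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = requests_per_month * avg_output_tokens / 1_000_000 * output_price_per_mtok
    return (input_cost + output_cost) * (1.0 - discount)


if __name__ == "__main__":
    # Placeholder prices for illustration only; take real figures from vendor documentation.
    for volume in (10_000, 100_000, 1_000_000):
        print(volume, monthly_cost_usd(volume, 3_000, 500, 2.0, 10.0, discount=0.2))
```

Running the same function per vendor with projected production volumes, rather than pilot volumes, shows where a model tier stops fitting the budget.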

Alternatives

The evaluation in this chapter covers the four primary enterprise AI cloud and API vendors. Additional alternatives include:

Self-hosted open-source models (Llama 3, Mistral): Eliminate vendor dependency and data sharing, but require significant GPU infrastructure, ML engineering capacity, and an internal model evaluation and safety program. Appropriate for organizations with extreme data sensitivity or sovereignty requirements.
Cohere Enterprise: Strong retrieval and embeddings focus; evaluate for RAG-heavy use cases.
Specialized clinical AI vendors (Nuance, Ambient AI): Domain-specific clinical AI products that sit on top of foundation models. Not general-purpose LLM vendors; evaluated separately as point solutions.

Trade-offs

Dimension	Direct API (Anthropic)	Cloud-Integrated (Azure OAI, Bedrock)	Self-Hosted
Latest model access	Immediate	Delayed (cloud integration lag)	Depends on open-source ecosystem
HIPAA BAA	Direct BAA required	Covered by cloud provider HIPAA BAA	Not applicable on-premises (no third party); cloud-hosted GPUs still require the cloud provider's BAA
Existing cloud integration	Additional vendor	Unified with cloud platform	Native
Rate limits	Negotiate directly	Cloud provider quotas	Limited by own GPU capacity
Pricing control	Direct negotiation	Cloud commitment tiers	GPU infrastructure cost
Lock-in risk	Medium	Medium-High (cloud ecosystem)	Low (model portability)
Operational complexity	Low	Low	High

Interview Questions

Q: A hospital CISO asks you to explain what due diligence is required before allowing PHI to be sent to an external LLM API. What is the minimum required?

Category: Architecture / Compliance | Difficulty: Senior | Role: AI Architect / FDE

Answer Framework:

The minimum required before PHI can be transmitted to an external LLM API is a signed HIPAA Business Associate Agreement with the vendor. Without a BAA, transmitting PHI to an external service is a potential HIPAA violation regardless of the vendor's security posture.

Beyond the BAA, the due diligence checklist includes: confirming the vendor's data retention policy (PHI should not be retained for training purposes), reviewing data residency to confirm PHI stays in acceptable regions, and assessing the vendor under the organization's risk management program, including review of a SOC 2 Type II report or equivalent security attestation.

For an AI vendor specifically: confirm that inference requests are not used to train the model by default, and that the organization can request deletion of any retained inference content. These provisions may require explicit negotiation — they are not universally included in default enterprise terms.

Key Points to Hit:

BAA is a legal prerequisite, not optional
Model training opt-out is an AI-specific HIPAA concern not present in traditional vendor relationships
Data residency matters if the organization has state-level regulations beyond HIPAA
SOC 2 Type II is the baseline security certification to require

Q: How would you design a proof-of-concept to compare two LLM vendors for a clinical documentation use case?

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

A rigorous POC requires three elements: a representative dataset, an objective evaluation rubric, and identical conditions across vendors.

The dataset must consist of de-identified clinical cases representative of the production use case, not synthetic examples, on which models may perform differently than on real clinical text. For discharge summary generation, this means real (de-identified) encounter data with a complexity distribution similar to what production will process.

The evaluation rubric must be defined before running the POC, not after seeing the results. For clinical documentation, the rubric covers: required section presence, format compliance, clinical accuracy (ideally reviewed by a clinical panel rather than automated scoring alone), and hallucination rate (presence of clinically plausible but factually unsupported claims).
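The automatable part of such a rubric, required section presence, can be fixed in code before any vendor output is seen. The sketch below is illustrative only: the section names and the "Heading:" convention are assumptions, and clinical accuracy and hallucination rate still need panel review.

```python
# Hypothetical required headings for a discharge summary; a real rubric would
# come from the organization's documentation standards.
REQUIRED_SECTIONS = ("Diagnosis", "Hospital Course", "Medications", "Follow-up")


def missing_sections(summary: str, required: tuple[str, ...] = REQUIRED_SECTIONS) -> list[str]:
    """Return required headings not found in the summary (matched case-insensitively as 'Heading:')."""
    lowered = summary.lower()
    return [name for name in required if f"{name.lower()}:" not in lowered]


def section_completion_rate(summaries: list[str],
                            required: tuple[str, ...] = REQUIRED_SECTIONS) -> float:
    """Fraction of summaries that contain every required section heading."""
    if not summaries:
        raise ValueError("summaries must not be empty")
    complete = sum(1 for s in summaries if not missing_sections(s, required))
    return complete / len(summaries)


if __name__ == "__main__":
    outputs = [
        "Diagnosis: ...\nHospital Course: ...\nMedications: ...\nFollow-up: ...",
        "Diagnosis: ...\nHospital Course: ...\nMedications: ...",
    ]
    print(section_completion_rate(outputs))
```

Committing this check to version control before the POC runs gives a timestamped record that the rubric was not adjusted after seeing results.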

Identical conditions means the same system prompt (or, if testing vendor-specific prompt optimization, clearly documented prompt differences), the same input context format, and runs at the same time of day to avoid diurnal latency variation. The sample size must also be large enough for statistical significance (typically 100+ cases for use cases with binary pass/fail criteria).
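To judge whether a sample is large enough, it helps to put a confidence interval around each pass rate. The sketch below uses the Wilson score interval for a binomial proportion at roughly 95% confidence (z = 1.96). The pass counts in the usage example are illustrative, and comparing intervals for overlap is a conservative heuristic, not a formal hypothesis test.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passes/n (z = 1.96 gives about 95% confidence)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("require n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Illustrative pass counts out of 100 cases for two vendors.
    for label, passes in (("vendor A", 97), ("vendor B", 92)):
        low, high = wilson_interval(passes, 100)
        print(f"{label}: {low:.3f} to {high:.3f}")
```

At 100 cases, the intervals for pass rates a few points apart typically overlap. That is a signal to enlarge the test set before treating the difference as decisive.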

Measure: latency (P50 and P95), cost per request, and quality scores. Report all three together — a cheaper, slower vendor may still be the right choice for asynchronous workflows where latency is not the constraint.

Key Points to Hit:

De-identified real data, not synthetic
Rubric defined before running (avoid confirmation bias)
Identical conditions across vendors
Sample size sufficient for statistical significance
Measure latency, cost, AND quality

Key Takeaways

Vendor qualification (HIPAA BAA, data residency, training opt-out) must precede technical evaluation — ineligible vendors should not consume evaluation resources
A signed HIPAA BAA with the AI vendor is a legal prerequisite before PHI can be transmitted to any external LLM API
Confirm explicitly that inference requests are excluded from model training — this is not universally true by default
Use-case-specific evaluations on de-identified clinical data predict clinical AI quality better than public benchmarks
Model cost at production scale, not at current volume — per-token economics change substantially with high request volumes
Design for multi-vendor strategy through an AI gateway even when starting with a single vendor
Negotiate rate limits, pricing tiers, and BAA retention clauses before signing — post-signature negotiating leverage is minimal
