AI Vendor Evaluation

Executive Summary

Selecting an AI vendor for a healthcare enterprise is a decision with a 3–5 year operational horizon, HIPAA liability consequences, and technical lock-in risks that are difficult to reverse once clinical systems depend on a vendor's proprietary infrastructure. The decision is not primarily about which model currently benchmarks highest — model capabilities change quarterly. It is about which vendor relationship, contractual structure, data handling commitment, and platform architecture best match the organization's operational requirements, compliance obligations, and strategic direction. This chapter provides an evaluation framework for the four primary enterprise AI providers — Anthropic, Azure OpenAI Service, Google Vertex AI, and AWS Bedrock — applied to a healthcare context.

Learning Objectives

After reading this chapter, you will be able to:

Apply a structured evaluation framework to compare enterprise AI vendors across contractual, technical, and operational dimensions
Identify the HIPAA-specific evaluation criteria that distinguish clinical AI vendor selection from general enterprise AI vendor selection
Design a proof-of-concept evaluation that produces comparable, objective results across multiple vendors
Articulate the trade-offs between single-vendor and multi-vendor AI strategies at enterprise scale

Business Problem

An enterprise that commits to a single AI vendor without evaluating alternatives, or that evaluates vendors only on model benchmark scores, frequently discovers 18 months into deployment that: the vendor's HIPAA Business Associate Agreement contains data handling provisions that are incompatible with the organization's compliance requirements; the proprietary SDK and API patterns have created a migration cost that makes switching economically infeasible; the model tier required for clinical quality is priced above budget at scale; or the vendor's operational SLA does not meet the availability requirement for a clinical decision support workflow.

These failures do not mean the chosen vendor was technically inferior; they result from evaluating the wrong dimensions at selection time. A sound evaluation framework weighs the vendor relationship and contractual structure alongside model capabilities.

Why This Technology Exists

Enterprise AI vendors emerged from two different directions: foundation model developers that built enterprise commercial offerings around their research models (Anthropic, OpenAI), and cloud platform providers that integrated foundation model access into their existing enterprise cloud and compliance infrastructure (Microsoft Azure, Google Cloud, Amazon Web Services).

The distinction matters for healthcare organizations. Foundation model developers offer direct access to their latest models with the most rapid capability releases, but require the organization to build compliance infrastructure (BAA, PHI handling, audit logging) on top of the vendor's API. Cloud platform providers offer model access within existing HIPAA-compliant cloud frameworks, with pre-existing BAAs that cover the AI services alongside storage, compute, and networking — but the integrated model catalog is subject to the cloud provider's release schedule, which may lag the foundation model developer's direct API.

Conceptual Explanation

Vendor evaluation in enterprise AI has two distinct phases that organizations frequently conflate:

Phase 1 — Qualification: Determining which vendors are eligible for clinical AI use. Eligibility requires a HIPAA Business Associate Agreement, acceptable data residency, acceptable data retention and training policies, and security posture compatible with the organization's risk tolerance. Vendors that do not meet qualification criteria are eliminated before technical evaluation begins.

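Phase 2 — Selection: Among qualified vendors, comparing model capability, performance, cost, operational maturity, and strategic fit for the organization's specific use cases and infrastructure. The flowchart under Core Architecture splits this phase into a technical evaluation stage and a strategic fit stage.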

Conflating these phases, for example by comparing benchmark scores across vendors before confirming HIPAA qualification, is a common mistake. It can lead an organization to favor a technically superior vendor that is legally ineligible for clinical PHI processing.

Core Architecture

flowchart TD
    START["AI Vendor Evaluation\nInitiated"]

    subgraph "Phase 1 — Qualification Gate"
        Q1{"HIPAA BAA\nAvailable?"}
        Q2{"Data Residency\nAcceptable?"}
        Q3{"PHI Used for\nModel Training?"}
        Q4{"Security Posture\nAcceptable?"}
        DISQ["Vendor Disqualified\nFor Clinical AI"]
        QUAL["Vendor Qualified\nProceed to Phase 2"]
    end

    subgraph "Phase 2 — Technical Evaluation"
        T1["Model Quality\nEvaluation"]
        T2["Latency &\nThroughput Benchmarks"]
        T3["Cost Modeling\n(at production scale)"]
        T4["SDK & Integration\nAssessment"]
        T5["Operational\nMaturity Review"]
    end

    subgraph "Phase 3 — Strategic Fit"
        S1["Vendor Roadmap\nAlignment"]
        S2["Cloud Platform\nAlignment"]
        S3["Lock-in Risk\nAssessment"]
        S4["Procurement\n& Contracting"]
    end

    DECIDE["Vendor Selection\nDecision"]

    START --> Q1
    Q1 -->|No| DISQ
    Q1 -->|Yes| Q2
    Q2 -->|No| DISQ
    Q2 -->|Yes| Q3
    Q3 -->|PHI used for training\nand cannot be opted out| DISQ
    Q3 -->|PHI not used for training| Q4
    Q4 -->|Fails| DISQ
    Q4 -->|Passes| QUAL
    QUAL --> T1 & T2 & T3
    T1 & T2 & T3 --> T4 --> T5
    T5 --> S1 & S2 & S3
    S1 & S2 & S3 --> S4 --> DECIDE
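
The qualification gate in the flowchart can be read as a short series of yes/no checks. The following is a minimal sketch of that reading, not a compliance tool. The class name, field names, and the example vendors ("Vendor A", "Vendor B") are hypothetical, and the answers to each question would in practice come from legal and security review rather than from code.

```python
# Educational Example: Phase 1 qualification gate as data plus a check.
# All names here are hypothetical; not intended for compliance decisions.
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorQualification:
    """Answers to the four Phase 1 questions for one vendor."""
    vendor: str
    baa_available: bool
    data_residency_acceptable: bool
    phi_used_for_training: bool  # True if PHI is trained on and cannot be excluded
    security_posture_acceptable: bool


def disqualification_reasons(q: VendorQualification) -> list[str]:
    """Return every failed criterion; an empty list means the vendor qualifies."""
    reasons = []
    if not q.baa_available:
        reasons.append("no HIPAA BAA available")
    if not q.data_residency_acceptable:
        reasons.append("data residency not acceptable")
    if q.phi_used_for_training:
        reasons.append("PHI used for model training")
    if not q.security_posture_acceptable:
        reasons.append("security posture not acceptable")
    return reasons


candidates = [
    VendorQualification("Vendor A", True, True, False, True),
    VendorQualification("Vendor B", True, True, True, True),
]

# Only vendors with no failed criteria proceed to Phase 2.
qualified = [q.vendor for q in candidates if not disqualification_reasons(q)]
print(qualified)
```

Collecting all failed criteria, rather than stopping at the first, gives procurement a complete record of why a vendor was excluded.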

Components

HIPAA Business Associate Agreement (BAA)

A BAA is the contractual instrument that governs how a vendor may process, store, and transmit Protected Health Information (PHI) on behalf of a Covered Entity. For clinical AI use cases that process PHI, a signed BAA with the AI vendor is a legal prerequisite, not a preference.

Key BAA provisions to evaluate:

PHI data retention: How long does the vendor retain inference request content? Acceptable: short retention (hours to days) for debugging, no retention for training.
Training opt-out: By default, some AI vendors use API request content to improve their models. Healthcare organizations must confirm that clinical inference data is excluded from training by default, not by opt-out.
Subprocessors: The BAA should disclose and govern all subprocessors (third parties the vendor may share PHI with for service delivery).
Breach notification: The BAA must specify notification timelines in the event of a security incident.
Audit rights: The organization should have the right to audit or request evidence of the vendor's security controls.

Data Residency

HIPAA does not prescribe data residency requirements, but healthcare organizations in regulated states, or with multi-national operations subject to GDPR or local healthcare data laws, may have data residency requirements that constrain which cloud regions can process PHI. Evaluate whether the vendor supports inference in the required regions.

Model Capability Evaluation

Benchmark scores on standard datasets (MMLU, HellaSwag, HumanEval) are useful context but insufficient for clinical AI vendor selection. They measure general knowledge, commonsense inference, and code generation, not clinical reasoning. The relevant evaluation is: on the organization's specific clinical use cases, with its specific data, how does each vendor's model perform?

Clinical capability dimensions:

Clinical reasoning accuracy: Does the model reason correctly about clinical scenarios relevant to the use case?
Hallucination rate: Does the model generate clinically plausible but factually incorrect claims?
Citation adherence: When grounded in clinical guidelines, does the model accurately apply them?
Format compliance: Does the model reliably produce structured output (discharge summaries, prior auth forms) in the required format?
Instruction following: Does the model consistently follow clinical documentation constraints?
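
Of the dimensions above, format compliance is the easiest to automate. The sketch below checks that each required section appears as a heading on its own line and in the required order. It assumes a heading is the section name alone on a line, optionally followed by a colon; that convention, the function names, and the sample text are all illustrative assumptions, and real output formats would need their own rules.

```python
# Educational Example: heading-based format compliance check.
# Assumes headings are the section name alone on a line (optional colon).
import re


def find_section_offsets(output: str, required_sections: list[str]) -> dict[str, int]:
    """Map each required section to the offset of its heading line, or -1 if absent."""
    offsets = {}
    for section in required_sections:
        pattern = re.compile(
            rf"^[ \t]*{re.escape(section)}[ \t]*:?[ \t]*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(output)
        offsets[section] = match.start() if match else -1
    return offsets


def sections_complete_and_ordered(output: str, required_sections: list[str]) -> bool:
    """True only if every heading is present and headings follow the required order."""
    found = find_section_offsets(output, required_sections)
    positions = [found[s] for s in required_sections]
    return all(p >= 0 for p in positions) and positions == sorted(positions)


required = ["Diagnosis", "Medications", "Follow-up"]

# Made-up sample outputs, not real clinical text.
structured = (
    "Diagnosis:\nCommunity-acquired pneumonia\n\n"
    "Medications\nAmoxicillin\n\n"
    "Follow-up:\nPrimary care in 7 days\n"
)
prose_only = "The patient was told about medications and follow-up after the diagnosis."

print(sections_complete_and_ordered(structured, required))
print(sections_complete_and_ordered(prose_only, required))
```

A heading-based check is stricter than a substring match: the prose-only sample mentions every section name in passing yet contains no section headings.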

Implementation Patterns

The Parallel POC Framework

A rigorous vendor evaluation runs a parallel proof-of-concept on a representative subset of a real clinical use case. The following pattern provides comparable, objective results:

python

# Educational Example — Parallel Vendor Evaluation Framework
# Illustrates how to run the same clinical evaluation across multiple LLM vendors
# Educational disclaimer: Not intended for clinical use

from dataclasses import dataclass
from typing import Callable, Optional
import time


@dataclass
class ClinicalEvalCase:
    """A single clinical evaluation case with expected output criteria."""
    case_id: str
    use_case: str          # e.g., "discharge_summary"
    input_context: str     # Structured clinical context (de-identified)
    required_sections: list[str]
    forbidden_content: list[str]
    reference_output: Optional[str] = None


@dataclass
class VendorEvalResult:
    """Result of evaluating one case against one vendor's model."""
    case_id: str
    vendor: str
    model_id: str
    output: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    sections_present: bool
    forbidden_content_absent: bool
    passed: bool


def evaluate_vendor_on_case(
    vendor_name: str,
    model_id: str,
    inference_fn: Callable[[str, str], tuple[str, int, int]],
    case: ClinicalEvalCase,
    system_prompt: str,
) -> VendorEvalResult:
    """
    Run a single evaluation case against a specific vendor's model.
    inference_fn: (system_prompt, user_prompt) -> (output_text, input_tokens, output_tokens)
    """
    start = time.monotonic()
    output, input_tokens, output_tokens = inference_fn(system_prompt, case.input_context)
    latency_ms = (time.monotonic() - start) * 1000

    sections_present = all(
        section.lower() in output.lower()
        for section in case.required_sections
    )
    forbidden_content_absent = not any(
        phrase.lower() in output.lower()
        for phrase in case.forbidden_content
    )

    return VendorEvalResult(
        case_id=case.case_id,
        vendor=vendor_name,
        model_id=model_id,
        output=output,
        latency_ms=latency_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        sections_present=sections_present,
        forbidden_content_absent=forbidden_content_absent,
        passed=sections_present and forbidden_content_absent,
    )
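
Per-case results become comparable across vendors only once they are aggregated. The following sketch summarizes pass rate and P50/P95 latency per vendor. To stay self-contained it defines a small `EvalRecord` type (a hypothetical stand-in carrying the `vendor`, `latency_ms`, and `passed` fields of `VendorEvalResult`) and uses a nearest-rank percentile, which is one of several reasonable percentile definitions.

```python
# Educational Example: aggregate per-case results into per-vendor metrics.
# EvalRecord is a hypothetical stand-in for the fields of VendorEvalResult.
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalRecord:
    vendor: str
    latency_ms: float
    passed: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_vendor(records: list[EvalRecord]) -> dict[str, dict[str, float]]:
    """Group records by vendor and compute pass rate and latency percentiles."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.vendor].append(record)

    summary = {}
    for vendor, rows in grouped.items():
        latencies = [r.latency_ms for r in rows]
        summary[vendor] = {
            "cases": len(rows),
            "pass_rate": sum(r.passed for r in rows) / len(rows),
            "p50_latency_ms": percentile(latencies, 50),
            "p95_latency_ms": percentile(latencies, 95),
        }
    return summary


# Made-up measurements for a single hypothetical vendor.
records = [
    EvalRecord("vendor_a", 100.0, True),
    EvalRecord("vendor_a", 200.0, True),
    EvalRecord("vendor_a", 300.0, False),
    EvalRecord("vendor_a", 400.0, True),
]
summary = summarize_by_vendor(records)
print(summary["vendor_a"])
```

With only a handful of cases the P95 value is simply the slowest observation, which is one reason a POC needs a larger sample before tail latency figures are meaningful.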

Cost Modeling at Scale

Point-in-time pricing comparisons are unreliable for vendor selection because AI API pricing changes frequently. More useful is a cost model that applies the vendor's current pricing structure to production-representative call patterns:

python

# Educational Example — Vendor Cost Model
# Illustrative — verify current pricing in vendor documentation

from dataclasses import dataclass


@dataclass
class VendorCostModel:
    """
    Cost model for a single LLM vendor at projected production volume.
    All prices illustrative — verify current rates in vendor documentation.
    """
    vendor: str
    model_tier: str
    input_price_per_million_tokens: float   # $ per 1M input tokens
    output_price_per_million_tokens: float  # $ per 1M output tokens
    cache_read_price_per_million_tokens: float

    def monthly_cost(
        self,
        daily_requests: int,
        avg_input_tokens: int,
        avg_output_tokens: int,
        cache_hit_rate: float = 0.0,
    ) -> float:
        """Estimate monthly cost for a given usage pattern."""
        monthly_requests = daily_requests * 30
        uncached_input = avg_input_tokens * (1 - cache_hit_rate)
        cached_input = avg_input_tokens * cache_hit_rate

        input_cost = (
            monthly_requests * uncached_input / 1_000_000
            * self.input_price_per_million_tokens
        )
        cache_cost = (
            monthly_requests * cached_input / 1_000_000
            * self.cache_read_price_per_million_tokens
        )
        output_cost = (
            monthly_requests * avg_output_tokens / 1_000_000
            * self.output_price_per_million_tokens
        )
        return input_cost + cache_cost + output_cost
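
A cost model is most useful when it is swept across the assumptions that are least certain. The sketch below repeats the arithmetic of `monthly_cost` as a plain function so that it runs on its own, then varies the cache hit rate. The prices and volumes are hypothetical placeholders, not any vendor's actual rates. The model also ignores cache-write charges, which some vendors may bill separately.

```python
# Educational Example: sensitivity of monthly cost to cache hit rate.
# All prices and volumes are hypothetical; verify rates in vendor documentation.

def monthly_cost(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price: float,       # $ per 1M uncached input tokens
    output_price: float,      # $ per 1M output tokens
    cache_read_price: float,  # $ per 1M cached input tokens
    cache_hit_rate: float = 0.0,
) -> float:
    """Estimate monthly cost, treating cache_hit_rate as the cached share of input tokens."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    monthly_requests = daily_requests * 30
    input_millions = monthly_requests * avg_input_tokens / 1_000_000
    output_millions = monthly_requests * avg_output_tokens / 1_000_000
    return (
        input_millions * (1 - cache_hit_rate) * input_price
        + input_millions * cache_hit_rate * cache_read_price
        + output_millions * output_price
    )


HYPOTHETICAL_PRICES = {"input_price": 3.00, "output_price": 15.00, "cache_read_price": 0.30}

# 2,000 requests per day, 3,000 input and 800 output tokens per request.
sweep = {
    rate: monthly_cost(2_000, 3_000, 800, cache_hit_rate=rate, **HYPOTHETICAL_PRICES)
    for rate in (0.0, 0.25, 0.5, 0.75)
}
for rate, cost in sweep.items():
    print(f"cache hit rate {rate:.0%}: ${cost:,.2f} per month")
```

In this illustrative scenario output tokens dominate the bill, so caching lowers cost but cannot remove most of it. The same sweep with an organization's own token counts may look quite different.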

Enterprise Considerations

Multi-Vendor Strategy: A multi-vendor approach that uses one vendor for primary inference and a second for fallback provides resilience against vendor outages, negotiation leverage at contract renewal, and flexibility to route different use cases to the most cost-effective model for that use case. The cost is integration complexity. The AI gateway described in Chapter 6 is the enabling infrastructure for multi-vendor strategy — without gateway abstraction, multi-vendor quickly becomes an unmaintainable integration sprawl.

Contract Negotiation at Enterprise Scale: Enterprise healthcare organizations deploying AI at scale (millions of monthly tokens across multiple use cases) have negotiating leverage. Committed spend agreements can reduce effective per-token pricing significantly. Negotiate before committing to a vendor, not after. Key terms to negotiate: enterprise discount tiers, rate limit increases, dedicated capacity guarantees, and contractual SLA commitments.

API Stability: Model versions are deprecated, APIs change, SDK breaking changes are introduced. Evaluate the vendor's history of API stability, deprecation notice periods, and backward compatibility commitments. For clinical AI, a model version being deprecated without adequate notice is an operational risk — the successor model must be evaluated for clinical quality before it goes into production.

Support Model: Enterprise healthcare organizations require support SLA commitments beyond consumer-tier response times. Evaluate: What is the vendor's enterprise support tier? What is the committed response time for production-impacting incidents? Is a dedicated technical account manager available?
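
The primary-plus-fallback arrangement described under Multi-Vendor Strategy can be sketched as an ordered list of providers tried in turn. This is a simplified illustration, not the gateway design from Chapter 6: the function names, the stub providers, and the broad exception handling are assumptions, and a production gateway would catch narrower vendor-specific errors and add timeouts, retries, and audit logging.

```python
# Educational Example: ordered fallback across vendors.
# Provider functions here are stubs; no real vendor SDK is called.
from typing import Callable

InferenceFn = Callable[[str, str], str]  # (system_prompt, user_prompt) -> output text


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def infer_with_fallback(
    providers: list[tuple[str, InferenceFn]],
    system_prompt: str,
    user_prompt: str,
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, output) from the first success."""
    errors = []
    for name, inference_fn in providers:
        try:
            return name, inference_fn(system_prompt, user_prompt)
        except Exception as exc:  # a real gateway would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_stub(system_prompt: str, user_prompt: str) -> str:
    raise TimeoutError("simulated outage")


def fallback_stub(system_prompt: str, user_prompt: str) -> str:
    return "draft summary"


served_by, output = infer_with_fallback(
    [("primary_vendor", primary_stub), ("fallback_vendor", fallback_stub)],
    "You draft discharge summaries.",
    "de-identified encounter context",
)
print(served_by, output)
```

Returning the provider name alongside the output matters for clinical use: the fallback model should have passed the same qualification gate and clinical quality evaluation as the primary, and the record of which model produced a document belongs in the audit trail.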

Security Considerations

PHI handling in inference requests: Confirm in writing that the vendor does not use inference request content for model training. This is a HIPAA privacy requirement, not a preference.
Encryption at rest and in transit: Confirm that inference requests are encrypted in transit (TLS 1.2+) and that any persisted data (for debugging, logging) is encrypted at rest with keys the organization can control or audit.
Logging and audit trail: Confirm the vendor provides per-request audit logs with sufficient metadata for HIPAA audit purposes.
Penetration testing and certifications: Evaluate vendor security certifications (SOC 2 Type II, ISO 27001, HITRUST) and the recency of penetration testing.
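
The TLS 1.2+ requirement above can also be enforced on the client side, so that a misconfigured endpoint fails closed rather than negotiating an older protocol. The sketch below uses Python's standard `ssl` module. The helper name `make_tls_context` is hypothetical, and vendor SDKs typically ship their own HTTP clients, so how (or whether) a custom context can be supplied is something to confirm in each SDK's documentation.

```python
# Educational Example: client-side TLS floor using the standard library.
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Build a context that verifies certificates and refuses anything below TLS 1.2."""
    context = ssl.create_default_context()  # certificate and hostname checks on by default
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


tls_context = make_tls_context()
print(tls_context.minimum_version)
```

A context built this way can be passed to standard-library clients such as `http.client.HTTPSConnection(host, context=tls_context)`.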

Healthcare Example

Educational Example — Illustrative Workflow. Not intended for clinical decision making.

The Reference Healthcare Organization evaluates three candidate vendors for its AI platform (Anthropic API, Azure OpenAI Service, and AWS Bedrock) against a discharge summary generation use case. The organization has already signed a BAA with Microsoft Azure for its EHR infrastructure; the question is whether to extend that BAA to Azure OpenAI Service or to establish a direct relationship with Anthropic, AWS, or both.

Qualification gate results:

Criterion	Anthropic API	Azure OpenAI Service	AWS Bedrock
HIPAA BAA available	Yes	Yes (via Azure HIPAA BAA)	Yes (via AWS HIPAA BAA)
Data residency (US)	Yes	Yes	Yes
PHI excluded from training	Yes (by default for API)	Yes	Yes
SOC 2 Type II	Yes	Yes	Yes
Qualified	Yes	Yes	Yes

Phase 2 evaluation results (discharge summary use case, de-identified test set, 100 cases):

Dimension	Anthropic (Claude Opus)	Azure OAI (GPT-4o)	AWS Bedrock (Claude Sonnet)
Section completion rate	97%	92%	95%
Clinical accuracy (panel review)	4.6/5.0	4.3/5.0	4.5/5.0
P50 latency	2.1s	2.4s	2.3s
P95 latency	5.8s	7.1s	6.4s
Format compliance	96%	89%	94%
Illustrative cost per 1K requests	$X.XX	$X.XX	$X.XX (see vendor docs)

All pricing figures are illustrative. Verify current pricing in vendor documentation before cost modeling.

Strategic fit assessment:

The organization already uses Azure for EHR hosting, Active Directory, and storage. Azure OpenAI Service would allow the AI service to operate under the existing enterprise HIPAA BAA and Azure RBAC. However, the Anthropic API configuration scored highest on clinical accuracy in panel review (4.6 out of 5.0, versus 4.3 for Azure OpenAI Service), and the organization's AI governance policy requires the highest-quality model available for clinical documentation. The final decision: use Anthropic API (with a direct BAA) as the primary inference endpoint for Tier 1 clinical use cases, and Azure OpenAI Service for Tier 2 (administrative) use cases where the existing Azure relationship simplifies compliance operations.

Common Mistakes

Evaluating on Generic Benchmarks. MMLU and HumanEval scores do not predict clinical AI quality. Organizations that select vendors based on public benchmark leaderboards without running use-case-specific evaluations may find that the "top benchmark" model underperforms for their specific clinical documentation tasks.

Signing a BAA Without Reading Retention Clauses. Some enterprise BAAs include retention provisions that allow the vendor to retain inference content for 30 or 90 days for security monitoring purposes. Organizations that do not negotiate or review these clauses may inadvertently allow PHI retention exceeding their own data minimization policies.

Selecting a Vendor Before Confirming Rate Limits. Default API rate limits for enterprise AI vendors may be orders of magnitude below the rate required for production clinical AI at hospital scale. Confirm that the vendor can provision adequate rate limits, and include rate limit SLA commitments in the contract.

Ignoring Cross-Cloud Network Transit. For organizations whose application runs on a different cloud than the AI vendor's endpoint (e.g., an application on AWS calling the Anthropic API), cross-cloud traffic adds latency and data egress cost. Evaluate total latency including network transit, not just model inference time.
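
The rate limit check described above reduces to simple arithmetic once peak load is estimated. The sketch below derives required requests-per-minute and tokens-per-minute figures and compares them with a vendor's default limits. The function names, the 1.5x headroom factor, and every number are hypothetical. Vendors also differ in how they count tokens against limits (for example, separate input and output limits), so the combined token figure here is a simplification.

```python
# Educational Example: required rate limits versus vendor defaults.
# All figures are hypothetical; confirm actual limits with the vendor.
import math


def required_rate_limits(
    peak_requests_per_minute: float,
    avg_input_tokens: int,
    avg_output_tokens: int,
    headroom: float = 1.5,
) -> dict[str, int]:
    """Peak load multiplied by a safety factor, in requests and tokens per minute."""
    tokens_per_request = avg_input_tokens + avg_output_tokens
    return {
        "requests_per_minute": math.ceil(peak_requests_per_minute * headroom),
        "tokens_per_minute": math.ceil(
            peak_requests_per_minute * tokens_per_request * headroom
        ),
    }


def shortfall(required: dict[str, int], vendor_limits: dict[str, int]) -> dict[str, int]:
    """How far each vendor limit falls below the requirement (0 if sufficient)."""
    return {key: max(0, required[key] - vendor_limits.get(key, 0)) for key in required}


required = required_rate_limits(
    peak_requests_per_minute=40, avg_input_tokens=3_000, avg_output_tokens=800
)
default_limits = {"requests_per_minute": 50, "tokens_per_minute": 100_000}
gap = shortfall(required, default_limits)
print(required, gap)
```

A non-zero gap is the figure to bring into contract negotiation as a rate limit minimum.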

Best Practices

Always complete the HIPAA qualification gate before beginning technical evaluation — eliminate non-qualifying vendors first
Run use-case-specific evaluations on de-identified clinical data rather than relying on public benchmarks
Model cost at production scale (millions of monthly tokens) before finalizing vendor selection
Negotiate BAA retention clauses explicitly: PHI should not be retained for training or for longer than operationally necessary
Confirm rate limit capacity for production scale before signing — include rate limit minimums in the contract
Design for multi-vendor through an AI gateway even if initially deploying single-vendor, to preserve optionality
Re-evaluate vendor selection annually — model capabilities, pricing, and compliance commitments evolve

Alternatives

The evaluation in this chapter covers the four primary enterprise AI cloud and API vendors. Additional alternatives include:

Self-hosted open-source models (Llama 3, Mistral): Eliminate vendor dependency and data sharing, but require significant GPU infrastructure, ML engineering capacity, and an internal model evaluation and safety program. Appropriate for organizations with extreme data sensitivity or sovereignty requirements.
Cohere Enterprise: Strong retrieval and embeddings focus; evaluate for RAG-heavy use cases.
Specialized clinical AI vendors (Nuance, Ambient AI): Domain-specific clinical AI products that sit on top of foundation models. Not general-purpose LLM vendors; evaluated separately as point solutions.

Trade-offs

Dimension	Direct API (Anthropic)	Cloud-Integrated (Azure OAI, Bedrock)	Self-Hosted
Latest model access	Immediate	Delayed (cloud integration lag)	Depends on open-source ecosystem
HIPAA BAA	Direct BAA required	Covered by cloud provider HIPAA BAA	Not applicable (no third party)
Existing cloud integration	Additional vendor	Unified with cloud platform	Native
Rate limits	Negotiate directly	Cloud provider quotas	Limited by own GPU capacity
Pricing control	Direct negotiation	Cloud commitment tiers	GPU infrastructure cost
Lock-in risk	Medium	Medium-High (cloud ecosystem)	Low (model portability)
Operational complexity	Low	Low	High

Interview Questions

Q: A hospital CISO asks you to explain what due diligence is required before allowing PHI to be sent to an external LLM API. What is the minimum required?

Category: Architecture / Compliance
Difficulty: Senior
Role: AI Architect / FDE

Answer Framework:

The minimum required before PHI can be transmitted to an external LLM API is a signed HIPAA Business Associate Agreement with the vendor. Without a BAA, transmitting PHI to an external service is a potential HIPAA violation regardless of the vendor's security posture.

Beyond the BAA, the due diligence checklist includes: confirming the vendor's data retention policy (PHI should not be retained for training purposes), reviewing data residency to confirm PHI stays in acceptable regions, and confirming the vendor is covered under the organization's risk management program (SOC 2 Type II or equivalent security certifications reviewed).

For an AI vendor specifically: confirm that inference requests are not used to train the model by default, and that the organization can request deletion of any retained inference content. These provisions may require explicit negotiation — they are not universally included in default enterprise terms.

Key Points to Hit:

BAA is a legal prerequisite, not optional
Model training opt-out is an AI-specific HIPAA concern not present in traditional vendor relationships
Data residency matters if the organization has state-level regulations beyond HIPAA
SOC 2 Type II is the baseline security certification to require

Q: How would you design a proof-of-concept to compare two LLM vendors for a clinical documentation use case?

Category: System Design
Difficulty: Senior
Role: AI Architect

Answer Framework:

A rigorous POC requires three elements: a representative dataset, an objective evaluation rubric, and identical conditions across vendors.

The dataset must consist of de-identified clinical cases representative of the production use case, not synthetic examples, on which models may perform differently than on real clinical text. For discharge summary generation, this means real (de-identified) encounter data with a complexity distribution similar to what production will process.

The evaluation rubric must be defined before running the POC, not after seeing the results. For clinical documentation, the rubric covers: required section presence, format compliance, clinical accuracy (ideally reviewed by a clinical panel rather than automated scoring alone), and hallucination rate (presence of clinically plausible but factually unsupported claims).

Conditions must be identical across vendors: the same system prompt (or, if vendor-specific prompt optimization is being tested, clearly documented prompt differences), the same input context format, runs at the same time of day to avoid diurnal latency variation, and a sample large enough to support statistical comparison (typically 100+ cases for use cases with binary pass/fail criteria).

Measure: latency (P50 and P95), cost per request, and quality scores. Report all three together — a cheaper, slower vendor may still be the right choice for asynchronous workflows where latency is not the constraint.

Key Points to Hit:

De-identified real data, not synthetic
Rubric defined before running (avoid confirmation bias)
Identical conditions across vendors
Sample size sufficient for statistical significance
Measure latency, cost, AND quality
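
To make the sample size point concrete, the sketch below computes a 95% Wilson score interval for a binary pass rate and applies it to two rates from the Healthcare Example (97 and 92 passes out of 100 cases). The Wilson interval is one common choice for proportions and is an assumption here, not a method the chapter prescribes. Because both vendors are run on the same cases, a paired test such as McNemar's would be a more sensitive comparison.

```python
# Educational Example: confidence interval for a pass rate on n cases.
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("require n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low_a, high_a = wilson_interval(97, 100)
low_b, high_b = wilson_interval(92, 100)
intervals_overlap = low_a <= high_b and low_b <= high_a

print(f"97/100: {low_a:.3f} to {high_a:.3f}")
print(f"92/100: {low_b:.3f} to {high_b:.3f}")
print("intervals overlap:", intervals_overlap)
```

At 100 cases the two intervals overlap, so a five-point gap in pass rate is suggestive rather than conclusive. This is why the rubric should fix the sample size and the decision threshold before the POC is run.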

Key Takeaways

Vendor qualification (HIPAA BAA, data residency, training opt-out) must precede technical evaluation — ineligible vendors should not consume evaluation resources
A signed HIPAA BAA with the AI vendor is a legal prerequisite before PHI can be transmitted to any external LLM API
Confirm explicitly that inference requests are excluded from model training — this is not universally true by default
Use-case-specific evaluations on de-identified clinical data predict clinical AI quality better than public benchmarks
Model cost at production scale, not at current volume — per-token economics change substantially with high request volumes
Design for multi-vendor strategy through an AI gateway even when starting with a single vendor
Negotiate rate limits, pricing tiers, and BAA retention clauses before signing — post-signature negotiating leverage is minimal

Glossary

Business Associate Agreement (BAA): A HIPAA-required contract between a Covered Entity and a vendor (Business Associate) that processes PHI, governing how the vendor may use, store, and disclose that PHI.

Model training opt-out: A provision that prevents a vendor from using inference request content to improve its model. Some vendors enable this by default for enterprise API customers; others require explicit configuration.

Data residency: The requirement that PHI remain within a specified geographic region (e.g., US-only) during processing and storage.

Rate limit: A vendor-imposed ceiling on the number of API requests or tokens that can be processed per minute or per day. Relevant to production planning for high-volume clinical AI use cases.
