AI Platform Architecture

Conceptual Explanation

An AI platform is not a monolith. It is a set of shared services, each providing a specific capability, that individual AI applications use through well-defined interfaces. The key insight is that the platform does not own the AI use cases — the product teams do. The platform provides the infrastructure that makes each use case more secure, more observable, and cheaper to build and operate.

The platform's responsibilities cluster around four concerns:

Control Plane: Governs AI access. It determines which applications can call which models, and with which prompts, and enforces those rules at the AI gateway.

Data Plane: The AI requests and responses themselves. The platform handles routing, retry, fallback, and load balancing so that application teams do not have to implement them.

Management Plane: Configuration, versioning, and lifecycle management of prompts, models, and evaluation datasets.

Observability Plane: Unified tracing, quality metrics, cost attribution, and alerting across all AI use cases.
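
Of the four planes above, the data plane's retry and fallback behaviour is the easiest to show in code. The following is a minimal sketch, not a production router: `UpstreamError`, `call_with_fallback`, and the backend callables are hypothetical names introduced for illustration, and a real gateway would also handle timeouts and load balancing.

```python
import time
from typing import Callable, Optional, Sequence


class UpstreamError(Exception):
    """Hypothetical error raised when a model endpoint fails transiently."""


def call_with_fallback(
    prompt: str,
    backends: Sequence[Callable[[str], str]],
    max_retries: int = 2,
    base_delay_s: float = 0.1,
) -> str:
    """Try each backend in order, retrying each one before falling back."""
    last_error: Optional[Exception] = None
    for backend in backends:
        for attempt in range(max_retries + 1):
            try:
                return backend(prompt)
            except UpstreamError as exc:
                last_error = exc
                # Exponential backoff before the next attempt or fallback.
                time.sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError("all backends failed") from last_error
```

Because the application only passes a prompt and receives a response, the order of backends and the retry policy can change inside the platform without any application code changing.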

Core Architecture

graph TD
    subgraph "Clinical Applications"
        A1["Discharge Summary\nService"]
        A2["Prior Auth\nService"]
        A3["Clinical Coding\nAssist"]
        A4["Future Use Cases\n[N+1...]"]
    end
    subgraph "AI Gateway - Control Plane"
        GW["AI Gateway\nAuth · Rate Limit · Route"]
        AR["Model Router\nTier Selection"]
        PC["Prompt Cache\nManager"]
        CA["Cost Attributor\nDept · Use Case Tagging"]
        AL["Audit Logger\nImmutable Log"]
    end
    subgraph "Management Plane"
        PR["Prompt Registry\nVersioned Prompts"]
        MR["Model Registry\nApproved Models + Versions"]
        EP["Evaluation Pipeline\nCI/CD for AI Quality"]
        DS["Dataset Store\nGolden Sets + Eval Data"]
    end
    subgraph "Shared AI Services"
        ES["Embedding Service\nClinical Vector Store"]
        VS["Vector Database\nClinical Knowledge"]
        LG["LLM Clients\nAnthropic · Azure OAI"]
    end
    subgraph "Observability Plane"
        OT["OpenTelemetry\nCollector"]
        TS["Time-Series DB\nMetrics + Cost"]
        QS["Quality Scorer\nAsync Eval"]
        DB["Clinical AI\nDashboard"]
    end
    A1 & A2 & A3 & A4 --> GW
    GW --> AR --> PC --> CA --> AL
    CA --> TS
    AL --> OT
    GW --> PR
    GW --> MR
    GW --> ES
    ES --> VS
    AR --> LG
    OT --> DB
    TS --> DB
    QS --> DB
    EP --> DS
    EP --> MR

Common Mistakes

Building the Platform Before the First Use Case. Platform design requires real requirements, which only emerge from building and operating actual AI use cases. Organizations that invest 6 months in platform design before deploying their first use case build platforms that do not match real needs. The correct approach: build use case 1 without a platform, identify the repeated patterns, extract them into a platform before use case 2.

Platform as a Bottleneck. If every use case requires a platform team member to make changes — onboarding, prompt deployment, model version updates — the platform becomes a bottleneck. Platform design must prioritize self-serve workflows. If a clinical AI team cannot add their prompts to the registry without platform team intervention, the registry is not a platform component — it is a managed service.

Ignoring Prompt Registry. The most commonly underestimated platform component is the prompt registry. Organizations that treat prompts as application configuration find themselves unable to audit which prompt was in production at the time of a clinical incident, unable to roll back a prompt change without a code deployment, and unable to run A/B tests on prompt variants. Prompt management is not a development convenience — it is a governance requirement.
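
The three capabilities named above (auditing which prompt was live at a given time, rolling back without a code deployment, and keeping variants side by side) follow from storing immutable versions plus a deployment history. The sketch below is an in-memory illustration under that assumption; `PromptRegistry` and its methods are hypothetical names, and a real registry would need durable storage, approvals, and access control. It also assumes deployments are recorded in chronological order.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PromptRegistry:
    """In-memory sketch; not a durable or audited store."""
    versions: dict = field(default_factory=dict)     # prompt_id -> list of texts
    deployments: dict = field(default_factory=dict)  # prompt_id -> list of (time, version)

    def publish(self, prompt_id: str, text: str) -> int:
        """Store a new immutable version and return its 1-based number."""
        self.versions.setdefault(prompt_id, []).append(text)
        return len(self.versions[prompt_id])

    def deploy(self, prompt_id: str, version: int, at: datetime) -> None:
        """Record that `version` went live at `at`; a rollback deploys an older version."""
        if not 1 <= version <= len(self.versions.get(prompt_id, [])):
            raise ValueError("unknown version")
        self.deployments.setdefault(prompt_id, []).append((at, version))

    def active_at(self, prompt_id: str, when: datetime):
        """Return (version, text) in production at `when`, or None if nothing was live."""
        history = self.deployments.get(prompt_id, [])
        times = [t for t, _ in history]
        i = bisect_right(times, when)  # number of deployments at or before `when`
        if i == 0:
            return None
        version = history[i - 1][1]
        return version, self.versions[prompt_id][version - 1]
```

Rollback here is just another `deploy` call pointing at an earlier version, which is why it needs no application release, and `active_at` answers the incident-review question directly.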

One Registry for All Vendors. If the model registry tracks approved model versions per vendor, it must also track HIPAA Business Associate Agreement (BAA) status per model, per vendor. A model covered by a BAA on one vendor's platform is not automatically covered when it is accessed through another vendor. The registry must prevent accidental PHI routing to endpoints without BAA coverage.
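
One way to make the registry enforce this rule is to key entries by vendor and model together and check BAA status before routing any request that contains PHI. The following is a sketch under that assumption; the vendor and model names, `ModelEntry`, `PhiRoutingError`, and `resolve_endpoint` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    vendor: str
    model: str
    version: str
    baa_covered: bool  # recorded per vendor and model, not per model name alone


class PhiRoutingError(Exception):
    """Raised when a PHI-bearing request targets an endpoint without BAA coverage."""


# Illustrative entries only: the same model name appears under two vendors.
REGISTRY = {
    ("vendor-a", "model-x"): ModelEntry("vendor-a", "model-x", "2024-01", True),
    ("vendor-b", "model-x"): ModelEntry("vendor-b", "model-x", "2024-01", False),
}


def resolve_endpoint(vendor: str, model: str, contains_phi: bool) -> ModelEntry:
    """Return the approved entry, refusing PHI traffic to non-BAA endpoints."""
    entry = REGISTRY.get((vendor, model))
    if entry is None:
        raise KeyError(f"{vendor}/{model} is not an approved model")
    if contains_phi and not entry.baa_covered:
        raise PhiRoutingError(f"{vendor}/{model} has no BAA coverage")
    return entry
```

Keying on the pair rather than the model name is the design choice that matters: a lookup by model name alone could not distinguish the covered endpoint from the uncovered one.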

Best Practices

Deploy the AI gateway as the first platform component, before the first production AI use case
Issue virtual keys per application; never share keys across applications
Store all LLM vendor master keys in a secrets manager with rotation policy
Design the prompt registry for self-serve deployment: the prompt owner, not the platform team, deploys prompt versions
Build the evaluation pipeline as a CI/CD step that runs automatically on every prompt or model version change
Index all clinical knowledge sources (guidelines, formularies, criteria) into a single shared vector store, not per-use-case stores
Measure platform value in use case delivery speed, not platform feature completeness

Alternatives

Cloud-Native AI Platforms: Major cloud providers offer managed AI platform capabilities, such as Azure AI Studio (since rebranded under the Azure AI Foundry name), Amazon SageMaker, and Google Vertex AI. These reduce infrastructure management overhead but increase vendor lock-in and may not support the full governance customization required for clinical AI. Evaluate them against the organization's cloud strategy and BAA coverage.

LangChain + LangSmith: A popular ecosystem that provides many platform capabilities (prompt management, tracing, evaluation) with a lower initial investment than a fully custom platform. LangChain is the open-source orchestration layer; LangSmith is a commercial managed service that handles observability. Appropriate for organizations committed to the LangChain ecosystem.

LiteLLM Proxy: An open-source AI gateway that supports multiple LLM vendors, virtual keys, rate limiting, and spend tracking. A practical starting point that can be extended with custom middleware for clinical AI requirements.

Trade-offs

Dimension	No Platform	Partial Platform	Full Platform
Use case 1 delivery speed	Fastest	Fast	Slower (platform investment)
Use case N delivery speed	Slow (rebuild)	Medium	Fastest
Security consistency	Low	Medium	High
Governance auditability	Low	Medium	High
Team autonomy	High	Medium	High (with self-serve)
Cost attribution accuracy	None	Partial	Complete
Platform maintenance burden	None	Low	Medium

Interview Questions

Q: Design an internal AI platform architecture for a hospital system that needs to operate 10 clinical AI use cases securely.

Category: System Design · Difficulty: Principal · Role: AI Architect

Answer Framework:

Begin with the control plane. The AI gateway is the non-negotiable first component: it is the security boundary, the cost attribution point, the governance enforcement layer, and the abstraction that allows vendor flexibility. Every LLM call from every clinical AI application must traverse the gateway — enforced at the network layer, not by convention.

The gateway issues virtual API keys per application, maps them to master vendor keys stored in a secrets manager, and enforces rate limits and token budgets per key. All LLM audit records flow through the gateway to a HIPAA-compliant log store.
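
The virtual-key mechanism described above can be sketched as a small lookup table that sits in front of the vendor keys. This is an illustration only: `KeyStore`, `VirtualKey`, and the `vk-` prefix are hypothetical, the master keys are passed in as a plain dict where a real gateway would read them from a secrets manager, and only a token budget is enforced (no time-windowed rate limit).

```python
import secrets
from dataclasses import dataclass


@dataclass
class VirtualKey:
    app_id: str
    vendor: str
    token_budget: int
    tokens_used: int = 0
    revoked: bool = False


class KeyStore:
    def __init__(self, master_keys: dict):
        # Assumption: in practice these come from a secrets manager, not memory.
        self._master_keys = master_keys
        self._virtual = {}

    def issue(self, app_id: str, vendor: str, token_budget: int) -> str:
        """Create a virtual key for one application and return it."""
        key = "vk-" + secrets.token_hex(16)
        self._virtual[key] = VirtualKey(app_id, vendor, token_budget)
        return key

    def revoke(self, key: str) -> None:
        """Disable one application's key without touching the master key."""
        self._virtual[key].revoked = True

    def authorize(self, key: str, tokens: int) -> str:
        """Charge `tokens` to the key's budget and return the vendor master key."""
        vk = self._virtual.get(key)
        if vk is None or vk.revoked:
            raise PermissionError("unknown or revoked virtual key")
        if vk.tokens_used + tokens > vk.token_budget:
            raise PermissionError("token budget exceeded")
        vk.tokens_used += tokens
        return self._master_keys[vk.vendor]
```

The application never sees the master key, so revoking or re-budgeting one application is a change to a single `VirtualKey` record rather than a vendor key rotation.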

The management plane provides the prompt registry and model registry. The prompt registry is version-controlled and self-serve: clinical AI teams deploy prompt versions through a CI/CD workflow, not by contacting the platform team. The model registry lists approved model versions with their HIPAA BAA status and approved use cases.

Shared AI services: a single embedding service with a shared clinical knowledge vector store. Every use case that needs semantic search queries the same vector store, with no per-use-case indexes for common clinical content.

The evaluation pipeline runs quality evaluation for every proposed prompt or model version change, gating deployment if metrics fall below threshold. This is the CI/CD system for AI quality.
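
A deployment gate of this kind reduces to two steps: score the candidate against a golden set, then compare each metric to its threshold. The sketch below assumes a golden set of dicts with `input` and `expected` fields and scorer functions that return a number per example; `evaluate`, `gate_deployment`, and the metric names are hypothetical.

```python
from statistics import mean
from typing import Callable, Mapping, Sequence


def evaluate(
    generate: Callable[[str], str],
    golden_set: Sequence[Mapping[str, str]],
    scorers: Mapping[str, Callable[[str, str], float]],
) -> dict:
    """Run the candidate once per example and average each scorer's result."""
    outputs = [(generate(ex["input"]), ex["expected"]) for ex in golden_set]
    return {
        name: mean(score(out, exp) for out, exp in outputs)
        for name, score in scorers.items()
    }


def gate_deployment(scores: Mapping[str, float], thresholds: Mapping[str, float]):
    """Return (passed, failures); a missing metric counts as a failure."""
    failures = {}
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value < minimum:
            failures[metric] = (value, minimum)
    return (not failures, failures)
```

In a CI/CD workflow the boolean would decide whether the job succeeds, and the `failures` mapping would be what the prompt owner sees in the job output.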

Key Points to Hit:

AI gateway as the mandatory network boundary, enforced at infrastructure level
Virtual keys per application enable revocation and per-app rate limiting
Prompt registry is a governance requirement, not a convenience
Shared embedding service prevents N divergent vector stores
Evaluation pipeline makes governance automated rather than manual
Self-serve design prevents the platform team from becoming a bottleneck

Q: What are the top three platform capabilities that provide the most leverage for a healthcare organization scaling from 2 to 10 clinical AI use cases?

Category: Architecture · Difficulty: Senior · Role: AI Architect / FDE

Answer Framework:

First: the AI gateway with virtual keys and cost attribution. Without the gateway, each additional use case adds a new HIPAA surface area, a new API key management problem, and a new cost tracking gap. With the gateway, each new use case automatically inherits the security, cost attribution, and audit logging that were built once.

Second: the prompt registry. As use cases multiply, prompts become ungovernable without a registry. Clinical incidents will require reconstructing which prompt was in production at a given time. Prompt changes will be deployed informally without evaluation. The registry enforces discipline and makes incident investigation tractable.

Third: the shared evaluation pipeline. Manual evaluation cannot scale to 10 use cases without dedicated QA staff. An automated pipeline that runs quality evaluation on every model and prompt change, with results reported to the governance committee, is the scalable alternative to manual review.
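
The cost attribution named in the first capability above amounts to tagging every request at the gateway and aggregating spend by tag. The following sketch assumes each request is logged with a department and use-case tag; `UsageRecord`, `attribute_costs`, the model names, and the prices are placeholders rather than real vendor pricing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class UsageRecord:
    department: str
    use_case: str
    model: str
    input_tokens: int
    output_tokens: int


# Placeholder prices in USD per 1,000 (input, output) tokens; not real pricing.
PRICE_PER_1K = {"model-large": (0.01, 0.03), "model-small": (0.001, 0.002)}


def attribute_costs(records: Iterable[UsageRecord]) -> dict:
    """Sum estimated spend per (department, use_case) tag pair."""
    totals = defaultdict(float)
    for r in records:
        in_price, out_price = PRICE_PER_1K[r.model]
        cost = r.input_tokens / 1000 * in_price + r.output_tokens / 1000 * out_price
        totals[(r.department, r.use_case)] += cost
    return dict(totals)
```

Because the tags are attached at the gateway, a new use case shows up in the totals as soon as it sends its first request, with no extra instrumentation in the application.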

Key Points to Hit:

AI gateway: security and cost attribution that scales horizontally across use cases
Prompt registry: governance enforcement that becomes non-negotiable at scale
Evaluation pipeline: quality assurance that cannot remain manual beyond 2–3 use cases

Key Takeaways

An AI platform provides shared infrastructure that makes each additional AI use case faster, cheaper, and more governable to build than the previous one
The AI gateway is the most critical platform component: it enforces security, governance, cost attribution, and audit logging at the organizational boundary
Virtual API keys per application enable fine-grained rate limiting, cost attribution, and revocation without rotating master vendor keys
The prompt registry is a governance requirement for clinical AI, not a development convenience: every production prompt must be versioned, approved, and rollback-capable
The evaluation pipeline is the CI/CD system for AI quality — it must gate every model and prompt deployment
Self-serve platform design is the primary defense against the platform team becoming a bottleneck to the AI program
Platform investment produces compounding returns: the cost per new use case drops significantly after the first two
