AI Platform Architecture
Conceptual Explanation
An AI platform is not a monolith. It is a set of shared services, each providing a specific capability, that individual AI applications use through well-defined interfaces. The key insight is that the platform does not own the AI use cases; the product teams do. The platform provides the infrastructure that makes each use case more secure, more observable, and cheaper to build and operate.
The platform's responsibilities cluster around four concerns:
Control Plane: Governance of AI access. Who can call which model? With which prompts? Enforced at the AI gateway.
Data Plane: The AI requests and responses themselves. The platform handles routing, retry, fallback, and load balancing without the application team managing it.
Management Plane: Configuration, versioning, and lifecycle management of prompts, models, and evaluation datasets.
Observability Plane: Unified tracing, quality metrics, cost attribution, and alerting across all AI use cases.
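As a rough illustration of the data plane's retry and fallback behavior, the sketch below tries each provider in order and moves on when one keeps failing. The function `call_with_fallback`, the provider callables, and the parameter names are hypothetical stand-ins, not part of any specific gateway product.

```python
import time
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the fallback chain has failed."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
    max_attempts: int = 2,
    backoff_seconds: float = 0.0,
) -> tuple[str, str]:
    """Try each provider in order, retrying each up to max_attempts times.

    Returns (provider_name, response_text) from the first successful call.
    """
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(1, max_attempts + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real gateway would catch narrower error types
                errors.append(f"{name} attempt {attempt}: {exc}")
                if attempt < max_attempts:
                    time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real vendor clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def stable_secondary(prompt: str) -> str:
    return f"echo: {prompt}"
```

With `[("primary", flaky_primary), ("secondary", stable_secondary)]` as the provider chain, the call falls through to the secondary after the primary exhausts its attempts. The application team only sees the final response, which is the point of placing this logic in the platform.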
Core Architecture
Common Mistakes
Building the Platform Before the First Use Case. Platform design requires real requirements, which only emerge from building and operating actual AI use cases. Organizations that invest 6 months in platform design before deploying their first use case build platforms that do not match real needs. The correct approach: build use case 1 without a platform, identify the repeated patterns, and extract them into a platform before use case 2.
Platform as a Bottleneck. If every use case requires a platform team member to make changes (onboarding, prompt deployment, model version updates), the platform becomes a bottleneck. Platform design must prioritize self-serve workflows. If a clinical AI team cannot add their prompts to the registry without platform team intervention, the registry is not a platform component; it is a managed service.
Ignoring the Prompt Registry. The most commonly underestimated platform component is the prompt registry. Organizations that treat prompts as application configuration find themselves unable to audit which prompt was in production at the time of a clinical incident, unable to roll back a prompt change without a code deployment, and unable to run A/B tests on prompt variants. Prompt management is not a development convenience; it is a governance requirement.
One Registry for All Vendors. If the model registry tracks approved model versions per vendor, it must also track HIPAA BAA status per model, per vendor. A model covered by a BAA on one vendor's platform is not automatically covered on another vendor's platform. The registry must prevent accidental PHI routing to non-BAA-covered endpoints.
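One way to picture the per-vendor BAA check is a registry keyed on the (vendor, model) pair that refuses to resolve a non-covered endpoint for PHI traffic. This is a minimal sketch: `ModelRegistry`, `ModelEntry`, `PhiRoutingError`, and the vendor and model names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    vendor: str
    model: str
    version: str
    baa_covered: bool  # recorded per (vendor, model), never per model alone


class PhiRoutingError(Exception):
    """Raised when PHI would be sent to an endpoint without BAA coverage."""


class ModelRegistry:
    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], ModelEntry] = {}

    def register(self, entry: ModelEntry) -> None:
        self._entries[(entry.vendor, entry.model)] = entry

    def resolve(self, vendor: str, model: str, contains_phi: bool) -> ModelEntry:
        """Return the approved entry, or raise if the route is not allowed."""
        entry = self._entries.get((vendor, model))
        if entry is None:
            raise KeyError(f"{model} on {vendor} is not an approved model")
        if contains_phi and not entry.baa_covered:
            raise PhiRoutingError(f"{model} on {vendor} has no BAA coverage")
        return entry


registry = ModelRegistry()
# The same model name appears twice with different BAA status per vendor.
registry.register(ModelEntry("vendor-a", "model-x", "2024-01", baa_covered=True))
registry.register(ModelEntry("vendor-b", "model-x", "2024-01", baa_covered=False))
```

Here `registry.resolve("vendor-b", "model-x", contains_phi=True)` raises `PhiRoutingError`, while the same model on `vendor-a` resolves normally. Keying on the pair rather than the model name is what keeps one vendor's BAA from being assumed for another.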
Best Practices
- Deploy the AI gateway as the first platform component, before the first production AI use case
- Issue virtual keys per application, never share keys across applications
- Store all LLM vendor master keys in a secrets manager with rotation policy
- Design the prompt registry for self-serve deployment: the prompt owner, not the platform team, deploys prompt versions
- Build the evaluation pipeline as a CI/CD step that runs automatically on every prompt or model version change
- Index all clinical knowledge sources (guidelines, formularies, criteria) into a single shared vector store, not per-use-case stores
- Measure platform value in use case delivery speed, not platform feature completeness
Alternatives
Cloud-Native AI Platforms: Major cloud providers offer managed AI platform capabilities: Azure AI Studio, AWS SageMaker, Google Vertex AI. These reduce infrastructure management overhead but increase vendor lock-in and may not support the full governance customization required for clinical AI. Evaluate against the organization's cloud strategy and BAA coverage.
LangChain + LangSmith: A popular open-source ecosystem that provides many platform capabilities (prompt management, tracing, evaluation) with a lower initial investment than a fully custom platform. The LangSmith managed service handles observability; LangChain provides the orchestration layer. Appropriate for organizations committed to the LangChain ecosystem.
LiteLLM Proxy: An open-source AI gateway that supports multiple LLM vendors, virtual keys, rate limiting, and spend tracking. A practical starting point that can be extended with custom middleware for clinical AI requirements.
Trade-offs
| Dimension | No Platform | Partial Platform | Full Platform |
|---|---|---|---|
| Use case 1 delivery speed | Fastest | Fast | Slower (platform investment) |
| Use case N delivery speed | Slow (rebuild) | Medium | Fastest |
| Security consistency | Low | Medium | High |
| Governance auditability | Low | Medium | High |
| Team autonomy | High | Medium | High (with self-serve) |
| Cost attribution accuracy | None | Partial | Complete |
| Platform maintenance burden | None | Low | Medium |
Interview Questions
Q: Design an internal AI platform architecture for a hospital system that needs to operate 10 clinical AI use cases securely.
Category: System Design | Difficulty: Principal | Role: AI Architect
Answer Framework:
Begin with the control plane. The AI gateway is the non-negotiable first component: it is the security boundary, the cost attribution point, the governance enforcement layer, and the abstraction that allows vendor flexibility. Every LLM call from every clinical AI application must traverse the gateway, enforced at the network layer rather than by convention.
The gateway issues virtual API keys per application, maps to master vendor keys stored in a secrets manager, and enforces rate limits and token budgets per key. All LLM audit records flow through the gateway to a HIPAA-compliant log store.
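The virtual-key mechanics can be sketched roughly as follows, assuming an in-memory store and a plain dict standing in for the secrets manager. `Gateway`, `VirtualKey`, and the exception names are hypothetical; rate limiting and audit logging are left out to keep the example short.

```python
import secrets
from dataclasses import dataclass


@dataclass
class VirtualKey:
    app: str
    vendor: str
    token_budget: int
    tokens_used: int = 0
    revoked: bool = False


class KeyRejected(Exception):
    """Unknown or revoked virtual key."""


class BudgetExceeded(Exception):
    """The request would push the key past its token budget."""


class Gateway:
    def __init__(self, master_keys: dict[str, str]) -> None:
        # Assumption: in production these come from a secrets manager, not a dict.
        self._master_keys = master_keys
        self._virtual_keys: dict[str, VirtualKey] = {}

    def issue_key(self, app: str, vendor: str, token_budget: int) -> str:
        token = "vk-" + secrets.token_hex(16)
        self._virtual_keys[token] = VirtualKey(app, vendor, token_budget)
        return token

    def revoke(self, token: str) -> None:
        # Revokes one application without rotating the vendor master key.
        self._virtual_keys[token].revoked = True

    def authorize(self, token: str, requested_tokens: int) -> str:
        """Check the key and budget, record usage, return the vendor master key."""
        key = self._virtual_keys.get(token)
        if key is None or key.revoked:
            raise KeyRejected("unknown or revoked virtual key")
        if key.tokens_used + requested_tokens > key.token_budget:
            raise BudgetExceeded(f"{key.app} would exceed {key.token_budget} tokens")
        key.tokens_used += requested_tokens
        return self._master_keys[key.vendor]

    def usage_by_app(self) -> dict[str, int]:
        """Token usage per application, the basis for cost attribution."""
        usage: dict[str, int] = {}
        for key in self._virtual_keys.values():
            usage[key.app] = usage.get(key.app, 0) + key.tokens_used
        return usage
```

Applications only ever hold the `vk-` token. The master key is looked up inside the gateway at call time, so revoking or budgeting one application never touches the others.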
The management plane provides the prompt registry and model registry. The prompt registry is version-controlled and self-serve: clinical AI teams deploy prompt versions through a CI/CD workflow, not by contacting the platform team. The model registry lists approved model versions with their HIPAA BAA status and approved use cases.
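To make the audit and rollback requirements concrete, here is a minimal in-memory sketch of a prompt registry that keeps immutable versions plus a deployment history. `PromptRegistry`, `Deployment`, and their methods are hypothetical names; a real registry would persist to a database and sit behind a CI/CD workflow with approvals.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    version: int
    text: str
    deployed_by: str
    deployed_at: datetime


class PromptRegistry:
    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}        # immutable version texts
        self._history: dict[str, list[Deployment]] = {}  # chronological deployments

    def deploy(self, name: str, text: str, deployed_by: str,
               at: Optional[datetime] = None) -> Deployment:
        """Store a new version and make it the active one."""
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        return self._record(name, len(versions), deployed_by, at)

    def rollback(self, name: str, to_version: int, deployed_by: str,
                 at: Optional[datetime] = None) -> Deployment:
        """Re-activate an earlier version without any code deployment."""
        if not 1 <= to_version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name} has no version {to_version}")
        return self._record(name, to_version, deployed_by, at)

    def _record(self, name: str, version: int, deployed_by: str,
                at: Optional[datetime]) -> Deployment:
        dep = Deployment(version, self._versions[name][version - 1],
                         deployed_by, at or datetime.now(timezone.utc))
        self._history.setdefault(name, []).append(dep)
        return dep

    def active_at(self, name: str, when: datetime) -> Optional[Deployment]:
        """Which prompt was in production at `when`? Assumes ordered history."""
        active = None
        for dep in self._history.get(name, []):
            if dep.deployed_at <= when:
                active = dep
        return active
```

During an incident review, `active_at("discharge-summary", incident_time)` answers the question of which prompt text was live and who deployed it. Because a rollback is recorded as another deployment entry, the history itself remains append-only.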
Shared AI services: a single embedding service with a shared clinical knowledge vector store. Every use case that needs semantic search queries the same vector store, with no per-use-case indexes for common clinical content.
The evaluation pipeline runs quality evaluation for every proposed prompt or model version change, gating deployment if metrics fall below threshold. This is the CI/CD system for AI quality.
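A deployment gate of this kind can be sketched in a few lines. The metric (`exact_match_rate`), the threshold values, and the function names below are illustrative assumptions; a clinical pipeline would use richer metrics and a curated evaluation dataset.

```python
from typing import Callable, Sequence


def exact_match_rate(
    generate: Callable[[str], str],
    dataset: Sequence[tuple[str, str]],
) -> float:
    """Fraction of (input, expected) pairs the candidate answers exactly."""
    hits = sum(1 for question, expected in dataset
               if generate(question).strip() == expected)
    return hits / len(dataset)


def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return a list of failures; an empty list means deployment may proceed."""
    failures: list[str] = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return failures


def ci_exit_code(failures: list[str]) -> int:
    """Non-zero exit code fails the CI job and blocks the deployment."""
    return 1 if failures else 0
```

In a CI job, the candidate prompt or model version would be wrapped as `generate`, scored against the evaluation dataset, and the job would exit with `ci_exit_code(gate(metrics, thresholds))`. A missing metric counts as a failure, so a broken evaluation run cannot silently pass the gate.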
Key Points to Hit:
- AI gateway as the mandatory network boundary, enforced at infrastructure level
- Virtual keys per application enable revocation and per-app rate limiting
- Prompt registry is a governance requirement, not a convenience
- Shared embedding service prevents N divergent vector stores
- Evaluation pipeline makes governance automated rather than manual
- Self-serve design prevents the platform team from becoming a bottleneck
Q: What are the top three platform capabilities that provide the most leverage for a healthcare organization scaling from 2 to 10 clinical AI use cases?
Category: Architecture | Difficulty: Senior | Role: AI Architect / FDE
Answer Framework:
First: the AI gateway with virtual keys and cost attribution. Without the gateway, each additional use case adds a new HIPAA surface area, a new API key management problem, and a new cost tracking gap. With the gateway, each new use case automatically inherits the security, cost attribution, and audit logging that were built once.
Second: the prompt registry. As use cases multiply, prompts become ungovernable without a registry. Clinical incidents will require reconstructing which prompt was in production at a given time. Prompt changes will be deployed informally without evaluation. The registry enforces discipline and makes incident investigation tractable.
Third: the shared evaluation pipeline. Manual evaluation cannot scale to 10 use cases without dedicated QA staff. An automated pipeline that runs quality evaluation on every model and prompt change, with results reported to the governance committee, is the scalable alternative to manual review.
Key Points to Hit:
- AI gateway: security and cost attribution that scales horizontally across use cases
- Prompt registry: governance enforcement that becomes non-negotiable at scale
- Evaluation pipeline: quality assurance that cannot remain manual beyond 2 to 3 use cases
Key Takeaways
- An AI platform provides shared infrastructure that makes each additional AI use case faster, cheaper, and more governable to build than the previous one
- The AI gateway is the most critical platform component: it enforces security, governance, cost attribution, and audit logging at the organizational boundary
- Virtual API keys per application enable fine-grained rate limiting, cost attribution, and revocation without rotating master vendor keys
- The prompt registry is a governance requirement for clinical AI, not a development convenience: every production prompt must be versioned, approved, and rollback-capable
- The evaluation pipeline is the CI/CD system for AI quality: it must gate every model and prompt deployment
- Self-serve platform design is the primary defense against the platform team becoming a bottleneck to the AI program
- Platform investment produces compounding returns: the cost per new use case drops significantly after the first two