API Design for AI Services

Executive Summary

Exposing AI capabilities as well-designed APIs is the mechanism by which a central AI platform team enables dozens of application teams to safely consume AI features without each team implementing rate limiting, authentication, model selection, or error handling independently. A poorly designed AI API forces every consumer to reimplement these concerns, produces inconsistent error handling across the product, and makes it impossible to evolve the underlying AI model without breaking all consumers. This chapter covers the design principles and implementation patterns for AI service APIs that are stable, versioned, and observable at enterprise scale.

Learning Objectives

  • Design AI service APIs with appropriate request/response schemas for LLM workloads
  • Apply versioning strategies that allow AI model upgrades without breaking API consumers
  • Implement streaming-aware API patterns for real-time token delivery
  • Design error responses that distinguish AI-specific failure modes from infrastructure errors

Business Problem

A Reference Healthcare Organization's AI platform team builds a clinical RAG service. Without a well-designed API contract, each consuming application — the EHR integration, the patient portal, the clinical mobile app — implements its own prompt construction, error handling, and retry logic. When the AI model is upgraded, each consumer must be updated independently. When rate limits are introduced, each consumer is impacted differently. The absence of an API contract makes the AI platform unmanageable at scale.

API Design Principles for AI Services

AI services have properties that traditional REST API design does not fully account for:

Streaming responses: LLM responses stream token-by-token. Buffering the full response before returning it to the client adds unnecessary latency for interactive use cases. The API must support Server-Sent Events (SSE) or WebSocket streaming natively.

Long-running jobs: Some AI operations (document analysis, multi-step reasoning) take longer than the roughly 30 seconds that a synchronous REST call can reasonably hold a connection open. The API must support async job submission with status polling or webhook delivery.
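
The submit-then-poll flow can be sketched with an in-memory job registry. This is a minimal illustration, not a production design: `JobStore`, `Job`, and `poll_until_done` are hypothetical names, the status values are assumed, and a real service would persist jobs in a durable store behind HTTP endpoints.

```python
import asyncio
import uuid
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, Optional


@dataclass
class Job:
    job_id: str
    status: str = "queued"            # queued -> running -> succeeded | failed
    result: Optional[Any] = None
    error_code: Optional[str] = None


class JobStore:
    """Hypothetical in-memory registry; a real service would use a durable store."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}
        self._tasks: Dict[str, asyncio.Task] = {}

    def submit(self, work: Callable[[], Awaitable[Any]]) -> Job:
        """Register the job and return at once; the work runs in the background."""
        job = Job(job_id=str(uuid.uuid4()))
        self._jobs[job.job_id] = job
        # Keep a reference so the task is not garbage-collected mid-flight.
        self._tasks[job.job_id] = asyncio.create_task(self._run(job, work))
        return job

    async def _run(self, job: Job, work: Callable[[], Awaitable[Any]]) -> None:
        job.status = "running"
        try:
            job.result = await work()
            job.status = "succeeded"
        except Exception:
            # Surface a stable error code, not the internal exception text.
            job.status = "failed"
            job.error_code = "AI_MODEL_UNAVAILABLE"

    def get(self, job_id: str) -> Optional[Job]:
        return self._jobs.get(job_id)


async def poll_until_done(store: JobStore, job_id: str,
                          interval_s: float = 0.01, max_polls: int = 100) -> Job:
    """Client-side polling loop: check status until the job reaches a terminal state."""
    for _ in range(max_polls):
        job = store.get(job_id)
        if job is not None and job.status in ("succeeded", "failed"):
            return job
        await asyncio.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

In an HTTP API, `submit` would typically sit behind a POST that returns the `job_id` immediately, and `get` behind a status endpoint that the client polls.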

Non-deterministic outputs: The same input may produce different outputs across calls. Consumers must not rely on deterministic AI responses. The API should provide request_id and optionally the model version to enable output comparison.

Model opacity: Consumers should not couple to a specific model name. The API should expose capability tiers (e.g., analysis, summary, classification) rather than model IDs, so the platform team can upgrade the underlying model without changing the API.
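
One way to keep model names out of the contract is a server-side routing table that maps each capability to an internal model. The sketch below is illustrative only: `ModelRoute`, `resolve_route`, and `public_metadata` are hypothetical names, the model IDs are placeholders, and the capability-to-tier assignments are assumptions rather than anything the platform prescribes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelRoute:
    model_tier: str   # tier label that may be returned to consumers
    model_id: str     # internal identifier; never returned to consumers


# Assumed routing table. Swapping a model_id here is invisible to API consumers.
_ROUTES: Dict[str, ModelRoute] = {
    "clinical_rag": ModelRoute("frontier", "internal-model-a"),
    "document_summary": ModelRoute("standard", "internal-model-b"),
    "code_lookup": ModelRoute("efficient", "internal-model-c"),
}


def resolve_route(capability: str) -> ModelRoute:
    """Look up the internal route for a capability requested by a consumer."""
    try:
        return _ROUTES[capability]
    except KeyError:
        raise ValueError(f"Unknown capability: {capability!r}") from None


def public_metadata(capability: str) -> Dict[str, str]:
    """Return only the fields that are safe to expose in an API response."""
    route = resolve_route(capability)
    return {"capability": capability, "model_tier": route.model_tier}
```

Because consumers only ever see `capability` and `model_tier`, a model upgrade becomes a one-line change to the routing table.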

Request and Response Schema

python
from pydantic import BaseModel, Field, validator
from typing import Optional, Literal
from datetime import datetime
import uuid

# Educational example — not for clinical use

class ClinicalRAGRequest(BaseModel):
    """
    Standard request schema for the Clinical Knowledge RAG API.
    Versioned: v1.
    """
    query: str = Field(
        ...,
        min_length=5,
        max_length=2000,
        description="Clinical question or query"
    )
    capability: Literal["clinical_rag", "document_summary", "code_lookup"] = Field(
        description="AI capability to invoke — determines model tier and retrieval strategy"
    )
    context: Optional[dict] = Field(
        default=None,
        description="Optional structured context (e.g., patient encounter context for personalized RAG)"
    )
    retrieval_config: Optional[dict] = Field(
        default=None,
        description="Optional retrieval overrides: max_chunks, source_filter, date_filter"
    )
    stream: bool = Field(
        default=False,
        description="If true, response is delivered as Server-Sent Events"
    )
    request_id: Optional[str] = Field(
        default_factory=lambda: str(uuid.uuid4()),
        description="Client-provided idempotency key; generated if not provided"
    )
    
    @validator("query")
    def query_must_not_contain_mrn_pattern(cls, v):
        """
        Basic validation: reject queries that appear to contain MRN-like patterns.
        This is a safety check, not a substitute for PHI authorization controls.
        """
        import re
        if re.search(r'\bMRN[:\s#]*\d{6,10}\b', v, re.IGNORECASE):
            raise ValueError(
                "Query appears to contain a patient MRN. "
                "Do not embed patient identifiers in queries. "
                "Provide patient context via the 'context' field with appropriate PHI authorization."
            )
        return v


class SourceCitation(BaseModel):
    document_title: str
    source_organization: str
    section: Optional[str] = None          # Explicit default keeps the field optional under Pydantic v1 and v2
    effective_date: Optional[str] = None
    relevance_score: float


class ClinicalRAGResponse(BaseModel):
    """
    Standard response schema for the Clinical Knowledge RAG API.
    """
    request_id: str
    response: str
    citations: list[SourceCitation]
    model_tier: str          # "frontier" | "standard" | "efficient"
    cached: bool             # True if response was served from semantic cache
    cached_at: Optional[str] = None  # ISO timestamp if cached; absent otherwise
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    created_at: str


class AIErrorResponse(BaseModel):
    """
    Structured error response with AI-specific error codes.
    """
    request_id: str
    error_code: str
    error_message: str
    retry_after_seconds: Optional[int] = None   # Set when the client should wait before retrying
    fallback_available: bool
    created_at: str

# AI-specific error codes
AI_ERROR_CODES = {
    "AI_TIMEOUT":           "AI inference exceeded timeout threshold",
    "AI_RATE_LIMITED":      "Token budget or rate limit exceeded",
    "AI_CONTENT_FILTERED":  "Request or response blocked by content safety policy",
    "AI_CONTEXT_TOO_LONG":  "Request exceeds maximum context window",
    "AI_MODEL_UNAVAILABLE": "Target model or provider temporarily unavailable",
    "AI_PHI_UNAUTHORIZED":  "Request contains PHI context; team not authorized for PHI access",
}
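
On the consumer side, structured error codes let a shared client library decide whether and when to retry. The following is a sketch under stated assumptions: which codes count as retryable is a judgment call made here for illustration, and `retry_delay_seconds` with its backoff parameters is a hypothetical helper, not part of the API above.

```python
import random
from typing import Optional

# Assumed classification: transient failures are retried, while policy or
# input failures (content filter, context length, PHI authorization) are not.
RETRYABLE_CODES = {"AI_TIMEOUT", "AI_RATE_LIMITED", "AI_MODEL_UNAVAILABLE"}


def retry_delay_seconds(
    error_code: str,
    retry_after_seconds: Optional[int],
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: bool = True,
) -> Optional[float]:
    """Return how long to wait before retrying, or None if retrying is pointless.

    A server-provided retry_after_seconds takes precedence. Otherwise the
    delay grows exponentially with the attempt number, up to `cap`.
    """
    if error_code not in RETRYABLE_CODES:
        return None
    if retry_after_seconds is not None:
        return float(retry_after_seconds)
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # Full jitter spreads out retries from many clients.
        delay = random.uniform(0.0, delay)
    return delay
```

Centralizing this logic in one client library is what spares each consuming team from writing its own retry handling.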

Streaming API Pattern

python
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
import asyncio
import json

app = FastAPI(title="Clinical AI API", version="1.0")

@app.post("/v1/clinical-rag")
async def clinical_rag(request: ClinicalRAGRequest):
    """
    Clinical Knowledge RAG endpoint.
    
    Supports both synchronous (JSON response) and streaming (SSE) modes.
    
    Educational Example — Not intended for clinical decision making.
    """
    if request.stream:
        return StreamingResponse(
            _stream_rag_response(request),
            media_type="text/event-stream",
            headers={
                "Cache-Control": "no-cache",
                "X-Accel-Buffering": "no",   # Disable nginx buffering for SSE
            }
        )
    
    # Synchronous mode
    result = await _execute_rag(request)
    return ClinicalRAGResponse(**result)


async def _stream_rag_response(request: ClinicalRAGRequest):
    """
    Generator for Server-Sent Events streaming.
    Yields tokens as they are generated; sends a [DONE] event at completion.
    """
    request_id = request.request_id
    
    # SSE event format: "data: {json}\n\n"
    try:
        async for token in _stream_llm(request):
            event = json.dumps({"request_id": request_id, "token": token})
            yield f"data: {event}\n\n"
        
        # Send completion event with metadata
        done_event = json.dumps({
            "request_id": request_id,
            "type": "done",
            "citations": [],    # Populated from retrieval layer
        })
        yield f"data: {done_event}\n\n"
        yield "data: [DONE]\n\n"
    
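    except Exception:
        # Do not echo the exception text to the client: it can leak internal
        # or model details. Log it server-side instead (omitted here).
        error_event = json.dumps({
            "request_id": request_id,
            "type": "error",
            "error_code": "AI_MODEL_UNAVAILABLE",
            "error_message": "The AI service encountered an error while streaming.",
        })
        yield f"data: {error_event}\n\n"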


async def _execute_rag(request: ClinicalRAGRequest) -> dict:
    """Execute RAG pipeline and return result dict."""
    # Implementation connects to RAG pipeline and returns structured result
    ...


async def _stream_llm(request: ClinicalRAGRequest):
    """Stream tokens from LLM."""
    # Implementation connects to LLM streaming endpoint
    ...
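
For completeness, here is how a consumer might read the event stream produced above. It is a simplified sketch: `parse_sse_events` and `collect_answer` are hypothetical helpers, and the parser assumes one `data:` line per event, which matches the server code in this section but not every SSE feature (multi-line data, `event:` and `id:` fields are ignored).

```python
import json
from typing import Iterable, Iterator, List, Tuple


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from 'data: ...' lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # Skip blank separator lines and anything that is not data.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_answer(lines: Iterable[str]) -> Tuple[str, List[dict]]:
    """Assemble streamed tokens into text and pick up citations from the done event."""
    tokens: List[str] = []
    citations: List[dict] = []
    for event in parse_sse_events(lines):
        event_type = event.get("type")
        if event_type == "error":
            # Surface the structured error code to the caller.
            raise RuntimeError(event.get("error_code", "UNKNOWN"))
        if event_type == "done":
            citations = event.get("citations", [])
        elif "token" in event:
            tokens.append(event["token"])
    return "".join(tokens), citations
```

An interactive client would render each token as it arrives rather than collecting them; `collect_answer` only shows how the three event shapes (token, done, error) are told apart.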

API Versioning Strategy

AI APIs require explicit versioning because AI model upgrades can change output format, citation structure, and response behavior — changes that are not backwards compatible even if the API schema is unchanged.

text
/v1/clinical-rag      ← Current stable version
/v2/clinical-rag      ← Next version with breaking changes (schema or behavior)

Version strategy:
- Major version (v1 → v2): breaking schema changes or significant behavior changes
- Deprecation period: 6 months minimum for clinical integrations
- Response header: X-API-Version: 1.2.3 for minor version tracking
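- Model upgrade: does NOT require a version bump if the output schema and observable behavior are unchanged; an upgrade that changes output format or citation structure is a breaking change and needs a new major version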
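
The version header and deprecation window can be attached to every response with a small helper. This is a sketch with assumptions: `version_headers` and `add_months` are hypothetical names, "6 months" is interpreted as calendar months, and the retirement date is announced through the standard `Sunset` response header (RFC 8594), which the strategy above does not itself mandate.

```python
import calendar
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def add_months(dt: datetime, months: int) -> datetime:
    """Add calendar months, clamping the day (e.g., Aug 31 + 6 months -> Feb 28)."""
    month_index = dt.month - 1 + months
    year = dt.year + month_index // 12
    month = month_index % 12 + 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)


def version_headers(
    api_version: str,
    deprecated_on: Optional[datetime] = None,
    window_months: int = 6,
) -> Dict[str, str]:
    """Build response headers for version tracking and deprecation notice.

    deprecated_on must be timezone-aware; None means the version is not deprecated.
    """
    headers = {"X-API-Version": api_version}
    if deprecated_on is not None:
        sunset = add_months(deprecated_on.astimezone(timezone.utc), window_months)
        # Sunset carries an HTTP-date, e.g. "... 01 Jul 2026 00:00:00 GMT".
        headers["Sunset"] = format_datetime(sunset, usegmt=True)
    return headers
```

Emitting the retirement date on every response from a deprecated version gives consumers a machine-readable signal long before the endpoint disappears.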

Error Handling Design

python
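import uuid
from datetime import datetime

from fastapi import Request
from fastapi.responses import JSONResponse

# `app` and `AIErrorResponse` are defined in the earlier examples.


class RateLimitExceeded(Exception):
    """Placeholder (assumption): substitute the exception your rate limiter raises."""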

@app.exception_handler(Exception)
async def ai_exception_handler(request: Request, exc: Exception) -> JSONResponse:
    """
    Unified exception handler that maps internal errors to AI-specific error codes.
    Never expose internal stack traces or model details in error responses.
    """
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    
    if isinstance(exc, TimeoutError):
        return JSONResponse(
            status_code=503,
            content=AIErrorResponse(
                request_id=request_id,
                error_code="AI_TIMEOUT",
                error_message="The AI service did not respond within the required time. Please retry.",
                retry_after_seconds=5,
                fallback_available=True,
                created_at=datetime.utcnow().isoformat(),
            ).dict()
        )
    
    if isinstance(exc, RateLimitExceeded):
        return JSONResponse(
            status_code=429,
            content=AIErrorResponse(
                request_id=request_id,
                error_code="AI_RATE_LIMITED",
                error_message="Token budget exceeded. Reduce request frequency or contact AI Platform team.",
                retry_after_seconds=60,
                fallback_available=False,
                created_at=datetime.utcnow().isoformat(),
            ).dict()
        )
    
    # Generic AI error — do not expose internal details
    return JSONResponse(
        status_code=500,
        content=AIErrorResponse(
            request_id=request_id,
            error_code="AI_MODEL_UNAVAILABLE",
            error_message="The AI service encountered an error. The team has been notified.",
            retry_after_seconds=30,
            fallback_available=True,
            created_at=datetime.utcnow().isoformat(),
        ).dict()
    )

Enterprise Considerations

API contract as a stability commitment: Downstream clinical systems (EHR integrations, clinical mobile apps) cannot tolerate breaking API changes between deployments. The AI API must be explicitly versioned, and old versions must be maintained through a deprecation window of at least 6 months for healthcare integrations.

Request ID propagation: Every AI API request must include a request_id that flows through the AI gateway, the LLM provider call, and the response. This enables end-to-end tracing when a clinician reports an incorrect AI recommendation: the request_id links the user-visible response to the specific LLM call, prompt version, and retrieved documents.
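
Within a single service, one way to carry the request_id through every log line and outbound call is a context variable. The sketch below uses only the standard library; `bind_request_id`, `outbound_headers`, and `RequestIdFilter` are hypothetical names, and the `X-Request-ID` header matches the one read by the exception handler earlier in this chapter.

```python
import contextvars
import logging
import uuid
from typing import Dict, Optional

# Holds the current request's ID for the duration of its async context.
_request_id: contextvars.ContextVar = contextvars.ContextVar("request_id", default="-")


def bind_request_id(incoming: Optional[str] = None) -> str:
    """Adopt the caller's request_id, or generate one if none was supplied."""
    rid = incoming or str(uuid.uuid4())
    _request_id.set(rid)
    return rid


def outbound_headers() -> Dict[str, str]:
    """Headers to attach to downstream calls (gateway, retrieval, LLM provider)."""
    return {"X-Request-ID": _request_id.get()}


class RequestIdFilter(logging.Filter):
    """Stamp every log record with the current request_id."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = _request_id.get()
        return True
```

With the filter attached to a handler, a log format such as `%(request_id)s %(message)s` makes every line searchable by the ID a clinician's report refers to.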

Common Mistakes

1. Exposing model names in the API. model: "gpt-4o" in the request schema forces every consumer to know and specify the model. When the platform upgrades the model, all consumers must update. Expose capability tiers, not model names.

2. Not returning citations with clinical responses. A clinical AI response that does not cite its sources cannot be verified by a clinician and is not suitable for clinical use. The API must always include source citations in responses to clinical queries.

3. No streaming support for interactive use cases. An AI API that buffers the full response before returning adds 5–30 seconds of perceived latency to an interactive clinical workflow. Streaming (SSE) is not optional for interactive clinical AI features.
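
The second mistake in the list above can be guarded against at the API boundary by refusing to return an uncited answer. This is an illustrative sketch using plain dataclasses rather than the Pydantic models shown earlier; `DraftResponse`, `require_citations`, `MissingCitationsError`, and the threshold parameters are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Citation:
    document_title: str
    source_organization: str
    relevance_score: float


@dataclass
class DraftResponse:
    request_id: str
    response: str
    citations: List[Citation] = field(default_factory=list)


class MissingCitationsError(ValueError):
    """Raised when a clinical answer lacks enough usable source citations."""


def require_citations(
    draft: DraftResponse,
    min_citations: int = 1,
    min_relevance: float = 0.0,
) -> DraftResponse:
    """Return the draft unchanged if it is adequately cited; otherwise raise."""
    usable = [c for c in draft.citations if c.relevance_score >= min_relevance]
    if len(usable) < min_citations:
        raise MissingCitationsError(
            f"request {draft.request_id}: {len(usable)} usable citation(s), "
            f"need at least {min_citations}"
        )
    return draft
```

A handler could call `require_citations` just before serializing the response and map the exception to a structured error, so an uncited answer never reaches a clinician.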

Key Takeaways

  • Expose AI capability tiers (not model names) to decouple consumers from model upgrades
  • All AI API responses must include request_id for end-to-end traceability
  • Clinical AI responses must always include source citations
  • Streaming (SSE) is required for interactive AI features where users are watching the response generate
  • Maintain at least 6-month deprecation windows for healthcare API versions

Further Reading