Networking and AI API Gateway Design

Executive Summary

The AI API gateway is the single most important infrastructure component between client applications and the LLM layer: it enforces authentication, applies rate limiting, routes to the appropriate model based on request characteristics, provides observability across all AI calls, and implements circuit breaking when downstream providers degrade. Without a gateway, AI consumption is uncontrolled, costs are unpredictable, failure handling is inconsistent across teams, and compliance is impossible to enforce uniformly. This chapter covers the design, implementation, and operational patterns for enterprise AI API gateways, including the specific requirements imposed by clinical AI deployments.

Learning Objectives

Design a multi-provider AI gateway with authentication, rate limiting, and intelligent routing
Implement circuit breakers on LLM API calls to protect downstream systems
Configure request/response logging with PHI-safe redaction for clinical deployments
Apply cost controls through token budget enforcement and model tier routing

Business Problem

At a Reference Healthcare Organization with dozens of development teams consuming AI APIs, ungated access to LLM providers produces predictable outcomes: one team's runaway embedding job exhausts the monthly token budget, another team accidentally logs PHI in API request bodies, a third team's application surfaces LLM 429 errors directly to clinical users, and there is no organization-wide visibility into which AI capabilities are being used, by whom, and at what cost.

An AI gateway solves these problems at the infrastructure layer, enforcing policy uniformly without requiring each team to implement rate limiting, circuit breaking, and PHI redaction independently.

Why This Technology Exists

The LLM API ecosystem emerged as a collection of vendor-specific APIs with no standard authentication, no centralized rate limiting, and no cross-provider failover. Early enterprise AI deployments required each application team to implement rate limiting, retry logic, and observability independently — producing inconsistent behavior and duplicated implementation across teams.

The AI gateway pattern was borrowed from the broader API gateway pattern (Kong, Apigee, AWS API Gateway) and adapted for the specific requirements of LLM traffic: token-based rate limiting (requests per minute is insufficient; token throughput per minute is the correct unit), semantic routing (route to GPT-4 for complex tasks and GPT-3.5 for simple tasks based on request content), and streaming-aware observability (LLM responses stream token-by-token; gateway logging must handle this without blocking the stream).

Architecture

graph TD subgraph "Client Applications" EHR_APP["EHR Integration\n(CDS Hooks)"] MOBILE["Clinical Mobile App"] PORTAL["Patient Portal"] INTERNAL["Internal Tools\n(Dev / Analytics)"] end subgraph "AI API Gateway" AUTH["Authentication\n(mTLS / API Key / JWT)"] AUTHZ["Authorization\n(Team quotas, model access control)"] RATELIMIT["Rate Limiting\n(Tokens per minute per team)"] ROUTER["Intelligent Router\nModel selection by tier / task"] CIRCUIT["Circuit Breaker\nPer-provider health tracking"] REDACT["PHI Redaction\n(Pre-logging scrubber)"] LOG["Audit Logger\nRequest metadata + token counts"] end subgraph "AI Providers" ANTHROPIC["Anthropic API\n(claude-opus-4-8)"] AZURE_OAI["Azure OpenAI\n(GPT-4o)"] VERTEX["Google Vertex AI\n(Gemini)"] SELF_HOSTED["Self-Hosted vLLM\n(On-premises GPU)"] end subgraph "Observability" PROMETHEUS["Prometheus\nToken usage, latency, errors"] GRAFANA["Grafana\nCost and usage dashboards"] ALERT["Alerting\nBudget thresholds, error rates"] end EHR_APP & MOBILE & PORTAL & INTERNAL --> AUTH AUTH --> AUTHZ --> RATELIMIT --> ROUTER ROUTER --> CIRCUIT CIRCUIT --> ANTHROPIC & AZURE_OAI & VERTEX & SELF_HOSTED ROUTER --> REDACT --> LOG --> PROMETHEUS --> GRAFANA PROMETHEUS --> ALERT

Gateway Implementation

python

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional
import asyncio
import time

# Educational example — not for clinical use

class ModelTier(Enum):
    FRONTIER = "frontier"        # claude-opus-4-8, gpt-4o — highest capability, highest cost
    STANDARD = "standard"        # claude-sonnet-4-6, gpt-4o-mini — balanced
    EFFICIENT = "efficient"      # claude-haiku-4-5, gpt-3.5-turbo — lowest cost
    SELF_HOSTED = "self_hosted"  # On-premises GPU inference


@dataclass
class TeamConfig:
    """Per-team rate limits and model access control."""
    team_id: str
    monthly_token_budget: int           # Total tokens per month across all models
    tokens_per_minute_limit: int        # Rate limit (token-level, not request-level)
    allowed_model_tiers: list[ModelTier]
    allowed_use_cases: list[str]        # e.g., ["clinical_rag", "admin_summarization"]
    phi_access: bool = False            # Whether this team can process PHI


@dataclass
class CircuitBreakerState:
    provider: str
    is_open: bool = False              # True = circuit is broken (blocking requests)
    failure_count: int = 0
    last_failure_time: Optional[datetime] = None
    half_open_until: Optional[datetime] = None


class AIAPIGateway:
    """
    Enterprise AI API gateway with authentication, rate limiting,
    intelligent routing, circuit breaking, and PHI-safe logging.
    
    Educational example — not for clinical use.
    """
    
    def __init__(
        self,
        team_configs: dict[str, TeamConfig],
        circuit_breaker_threshold: int = 5,    # failures before opening circuit
        circuit_breaker_timeout: int = 60,     # seconds to wait before half-open
    ):
        self.team_configs = team_configs
        self.circuit_breakers: dict[str, CircuitBreakerState] = {
            "anthropic": CircuitBreakerState("anthropic"),
            "azure_openai": CircuitBreakerState("azure_openai"),
            "vertex": CircuitBreakerState("vertex"),
            "self_hosted": CircuitBreakerState("self_hosted"),
        }
        self.cb_threshold = circuit_breaker_threshold
        self.cb_timeout = circuit_breaker_timeout
        self._token_usage: dict[str, list[tuple[datetime, int]]] = {}  # team_id → [(timestamp, tokens)]
    
    async def route_request(
        self,
        team_id: str,
        messages: list[dict],
        requested_model_tier: ModelTier,
        use_case: str,
        max_tokens: int = 1024,
        phi_context: bool = False,     # Caller must declare if request may contain PHI
    ) -> dict:
        """
        Route an AI request through the gateway with full policy enforcement.
        Returns the LLM response dict.
        """
        
        # 1. Authentication and authorization
        team_config = self._authenticate_team(team_id)
        self._authorize_request(team_config, requested_model_tier, use_case, phi_context)
        
        # 2. Rate limiting (token-level)
        self._check_rate_limit(team_id, team_config, max_tokens)
        
        # 3. Intelligent model selection
        provider, model = self._select_model(requested_model_tier, team_config)
        
        # 4. Circuit breaker check
        self._check_circuit_breaker(provider)
        
        # 5. Execute request with circuit breaker tracking
        try:
            response = await self._execute_request(provider, model, messages, max_tokens)
            self._record_success(provider)
        except Exception as e:
            self._record_failure(provider)
            raise
        
        # 6. Record token usage for budget tracking
        total_tokens = response.get("usage", {}).get("total_tokens", max_tokens)
        self._record_token_usage(team_id, total_tokens)
        
        # 7. Emit metrics
        await self._emit_metrics(team_id, provider, model, use_case, response)
        
        return response
    
    def _authenticate_team(self, team_id: str) -> TeamConfig:
        """Validate team API key and return team configuration."""
        if team_id not in self.team_configs:
            raise PermissionError(f"Unknown team: {team_id}")
        return self.team_configs[team_id]
    
    def _authorize_request(
        self,
        team_config: TeamConfig,
        model_tier: ModelTier,
        use_case: str,
        phi_context: bool,
    ) -> None:
        """Enforce model and use-case authorization."""
        if model_tier not in team_config.allowed_model_tiers:
            raise PermissionError(
                f"Team {team_config.team_id} is not authorized for {model_tier.value} models"
            )
        if use_case not in team_config.allowed_use_cases:
            raise PermissionError(
                f"Team {team_config.team_id} is not authorized for use case: {use_case}"
            )
        if phi_context and not team_config.phi_access:
            raise PermissionError(
                f"Team {team_config.team_id} is not authorized to process PHI. "
                "Contact AI Platform team to request PHI access with appropriate compliance review."
            )
    
    def _check_rate_limit(
        self,
        team_id: str,
        team_config: TeamConfig,
        requested_tokens: int,
    ) -> None:
        """Enforce token-per-minute rate limit using sliding window."""
        now = datetime.utcnow()
        window_start = now - timedelta(minutes=1)
        
        if team_id not in self._token_usage:
            self._token_usage[team_id] = []
        
        # Remove entries outside the sliding window
        self._token_usage[team_id] = [
            (ts, tokens) for ts, tokens in self._token_usage[team_id]
            if ts >= window_start
        ]
        
        current_window_tokens = sum(tokens for _, tokens in self._token_usage[team_id])
        
        if current_window_tokens + requested_tokens > team_config.tokens_per_minute_limit:
            raise Exception(
                f"Rate limit exceeded for team {team_id}. "
                f"Used {current_window_tokens}/{team_config.tokens_per_minute_limit} tokens/min."
            )
    
    def _check_circuit_breaker(self, provider: str) -> None:
        """Check circuit breaker state for the target provider."""
        cb = self.circuit_breakers[provider]
        
        if not cb.is_open:
            return
        
        # Check if half-open window has passed
        if cb.half_open_until and datetime.utcnow() >= cb.half_open_until:
            # Allow one test request through (half-open state)
            return
        
        raise Exception(
            f"Circuit breaker is open for provider {provider}. "
            f"Provider is experiencing failures. "
            f"Retry after {cb.half_open_until.isoformat() if cb.half_open_until else 'N/A'}."
        )
    
    def _record_success(self, provider: str) -> None:
        """Reset circuit breaker on success."""
        cb = self.circuit_breakers[provider]
        cb.is_open = False
        cb.failure_count = 0
        cb.half_open_until = None
    
    def _record_failure(self, provider: str) -> None:
        """Record failure; open circuit if threshold exceeded."""
        cb = self.circuit_breakers[provider]
        cb.failure_count += 1
        cb.last_failure_time = datetime.utcnow()
        
        if cb.failure_count >= self.cb_threshold:
            cb.is_open = True
            cb.half_open_until = datetime.utcnow() + timedelta(seconds=self.cb_timeout)
    
    def _select_model(
        self,
        requested_tier: ModelTier,
        team_config: TeamConfig,
    ) -> tuple[str, str]:
        """
        Select provider and model based on tier, provider health, and team access.
        
        Model ID mapping is illustrative — verify current model IDs in official docs.
        """
        tier_to_models = {
            ModelTier.FRONTIER: [
                ("anthropic", "claude-opus-4-8"),
                ("azure_openai", "gpt-4o"),
            ],
            ModelTier.STANDARD: [
                ("anthropic", "claude-sonnet-4-6"),
                ("azure_openai", "gpt-4o-mini"),
            ],
            ModelTier.EFFICIENT: [
                ("anthropic", "claude-haiku-4-5-20251001"),
            ],
            ModelTier.SELF_HOSTED: [
                ("self_hosted", "clinical-llm"),
            ],
        }
        
        candidates = tier_to_models.get(requested_tier, [])
        
        for provider, model in candidates:
            cb = self.circuit_breakers.get(provider)
            if cb and cb.is_open:
                continue  # Skip providers with open circuit breakers
            return provider, model
        
        raise Exception(
            f"No healthy provider available for tier {requested_tier.value}. "
            f"All providers may be circuit-broken."
        )
    
    async def _execute_request(
        self,
        provider: str,
        model: str,
        messages: list[dict],
        max_tokens: int,
    ) -> dict:
        """Execute the request against the selected provider."""
        if provider == "anthropic":
            return await self._call_anthropic(model, messages, max_tokens)
        elif provider == "azure_openai":
            return await self._call_azure_openai(model, messages, max_tokens)
        elif provider == "self_hosted":
            return await self._call_self_hosted(model, messages, max_tokens)
        raise ValueError(f"Unknown provider: {provider}")
    
    def _record_token_usage(self, team_id: str, tokens: int) -> None:
        """Record token consumption for rate limiting and budget tracking."""
        self._token_usage.setdefault(team_id, []).append((datetime.utcnow(), tokens))
    
    async def _emit_metrics(
        self,
        team_id: str,
        provider: str,
        model: str,
        use_case: str,
        response: dict,
    ) -> None:
        """Emit metrics for observability. PHI must not appear in metric labels."""
        # Implementation: increment Prometheus counters
        # Labels: team_id, provider, model, use_case (never include request content)
        pass
    
    # Provider-specific callers omitted for brevity — see cloud-ai-platforms.md
    async def _call_anthropic(self, model, messages, max_tokens): ...
    async def _call_azure_openai(self, model, messages, max_tokens): ...
    async def _call_self_hosted(self, model, messages, max_tokens): ...

PHI-Safe Request Logging

Logging AI request bodies in clinical deployments requires PHI redaction before any data reaches the logging pipeline.

python

import re
from typing import Any

# HIPAA Safe Harbor redaction patterns (18 identifiers)
# This is a simplified illustrative example — consult a healthcare data expert
# for production PHI de-identification.

PHI_REDACTION_PATTERNS = [
    # Names (very rough — context-dependent; NLP-based detection is more reliable)
    (r'\bDr\.\s+[A-Z][a-z]+\s+[A-Z][a-z]+\b', '[PROVIDER_NAME]'),
    # MRN patterns
    (r'\bMRN[:\s#]*\d{6,10}\b', '[MRN]'),
    # Date of birth
    (r'\bDOB[:\s]*\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', '[DOB]'),
    # Phone numbers
    (r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b', '[PHONE]'),
    # SSN
    (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]'),
    # Email
    (r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', '[EMAIL]'),
    # ZIP codes
    (r'\b\d{5}(?:-\d{4})?\b', '[ZIP]'),
]


def redact_phi_from_log_entry(content: str) -> str:
    """
    Apply PHI redaction patterns to a string before logging.
    
    This is a simplified pattern-based redactor. Production healthcare
    deployments should use a trained NER model or dedicated PHI detection
    service for more accurate de-identification.
    
    Educational Example — Not for clinical use.
    """
    redacted = content
    for pattern, replacement in PHI_REDACTION_PATTERNS:
        redacted = re.sub(pattern, replacement, redacted, flags=re.IGNORECASE)
    return redacted


def build_phi_safe_log_entry(
    team_id: str,
    provider: str,
    model: str,
    use_case: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    phi_context: bool,
) -> dict:
    """
    Build a log entry that is safe to write to the logging pipeline.
    Never include request or response content in logs for PHI-handling teams.
    """
    return {
        "timestamp": datetime.utcnow().isoformat(),
        "team_id": team_id,
        "provider": provider,
        "model": model,
        "use_case": use_case,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "latency_ms": round(latency_ms, 1),
        "phi_context": phi_context,
        # NEVER log: messages, response content, system prompt content (may contain PHI)
    }

Enterprise Considerations

Gateway as cost control: Token budgets per team, per use case, and per month are enforced at the gateway layer. When a team's monthly budget is 80% consumed, the gateway can automatically downgrade model tier (from Frontier to Standard) and notify the team lead rather than blocking requests.

Streaming response handling: LLM responses stream token-by-token. A naïve gateway design attempts to buffer the entire response before forwarding it to the client — this defeats the latency benefit of streaming and increases memory pressure. Production gateways must implement streaming passthrough, which complicates metrics collection (total tokens are not known until the stream ends).

Multi-region deployment: For clinical AI with high availability requirements, deploy the gateway in two regions with active-active load balancing. Circuit breaker state must be synchronized across gateway instances (use Redis for shared state).

Cost attribution: The gateway is the authoritative source for AI cost attribution by team, use case, and model. Emit token counts, provider, and model to the cost management platform on every request. This data drives showback reports for department budget owners.

Common Mistakes

1. Request-level rate limiting for LLM traffic. Limiting to 100 requests per minute ignores that a single request may consume 10,000 tokens while another consumes 100. Token-per-minute limits provide meaningful budget control; request limits do not.

2. Logging request content for PHI-handling teams. Even with "best effort" redaction in the application, logging full request bodies for PHI-processing teams risks PHI appearing in logs. For PHI-handling teams, log only token counts, provider, model, and use-case metadata — never request content.

3. Single-provider design without circuit breakers. If the application calls the LLM provider directly with no circuit breaker, an API outage surfaces as HTTP 500 errors to clinical users. A gateway with circuit breaking can failover to a secondary provider or serve a graceful degradation response.

4. Not enforcing PHI access at the gateway. If PHI access is checked only in application code, a team can bypass it by calling the LLM API directly. Enforce PHI access at the gateway (where it's controlled infrastructure) rather than trusting application-layer enforcement.

Best Practices

Rate limit AI traffic by tokens per minute, not requests per minute
Enforce PHI access control and data residency rules at the gateway layer, not application layer
Never log request or response content for PHI-handling teams; log only metadata (tokens, provider, model, use case)
Implement circuit breakers with automatic failover to secondary providers
Emit cost attribution metrics (tokens × provider × model) on every request for budget management
Deploy circuit breaker state in shared Redis to synchronize across gateway instances

Trade-offs

Approach	Strengths	Weaknesses
Custom gateway (this chapter)	Full control, PHI-specific logic	Development and operational cost
LiteLLM (open source)	Multi-provider routing out of box	Less control over PHI-specific features
Kong AI Gateway	Mature plugin ecosystem	Additional licensing cost
AWS API Gateway + Bedrock	Fully managed, IAM integration	AWS-only; limited cross-provider routing

Interview Questions

Q: Your AI gateway is receiving 10,000 tokens per minute from one team, which is within their rate limit. However, this is causing 429 errors from the upstream LLM provider. How do you address this?

Category: System Design Difficulty: Senior Role: AI Architect

Answer Framework:

The root cause is that the team's token budget is within limits, but the upstream provider's organization-level rate limit is being exhausted by the combination of all teams. This is a total throughput problem, not a per-team problem.

The solution has two layers: short-term and structural.

Short-term: The gateway should implement request queuing with backpressure rather than immediately returning 429 to the client. When the upstream provider returns 429, the gateway queues the request and retries after the Retry-After header duration, up to a configurable wait timeout. From the client's perspective, the request completes with higher latency rather than failing.

Structural: Negotiate a higher organization-level rate limit with the LLM provider, or implement multi-provider routing. With multi-provider routing, when Provider A is rate limited, the gateway routes to Provider B (Azure OpenAI in addition to Anthropic), subject to model equivalence requirements. This requires model equivalence assessment for the specific use cases — the organization cannot blindly substitute one LLM for another in clinical contexts without validating output quality equivalence.

Key Points to Hit:

Per-team limits do not prevent aggregate organization-level rate limit exhaustion
Request queuing with retry-after as the short-term response
Multi-provider routing as the structural response
Model equivalence validation for clinical use cases before substituting providers

Key Takeaways

The AI API gateway is the enforcement point for authentication, authorization, rate limiting, PHI access control, circuit breaking, and cost attribution
Rate limiting must be token-based, not request-based, to provide meaningful budget control
PHI-handling teams must have log content redacted or omitted; log only metadata
Circuit breakers protect clinical users from LLM provider degradation; automatic failover maintains availability
The gateway is the authoritative source for AI cost attribution data across teams and use cases

Glossary

Circuit Breaker: A fault tolerance pattern that stops sending requests to a downstream service after a failure threshold is exceeded, allowing the service time to recover before re-attempting traffic.

Token Budget: An organization-defined limit on the number of LLM tokens a team may consume per time period, used to control AI spending.

PHI Access Control: Gateway-enforced authorization that restricts which teams and use cases can process requests containing Protected Health Information.

Model Tier Routing: Selecting a model tier (Frontier, Standard, Efficient) based on the request's complexity or urgency, balancing cost against capability.

Networking and AI API Gateway Design#

Executive Summary#

Learning Objectives#

Business Problem#

Why This Technology Exists#

Architecture#

Gateway Implementation#

PHI-Safe Request Logging#

Enterprise Considerations#

Common Mistakes#

Best Practices#

Trade-offs#

Interview Questions#

Q: Your AI gateway is receiving 10,000 tokens per minute from one team, which is within their rate limit. However, this is causing 429 errors from the upstream LLM provider. How do you address this?#

Key Takeaways#

Glossary#

Further Reading#