Networking and AI API Gateway Design
Executive Summary
The AI API gateway is the single most important infrastructure component between client applications and the LLM layer: it enforces authentication, applies rate limiting, routes to the appropriate model based on request characteristics, provides observability across all AI calls, and implements circuit breaking when downstream providers degrade. Without a gateway, AI consumption is uncontrolled, costs are unpredictable, failure handling is inconsistent across teams, and compliance is impossible to enforce uniformly. This chapter covers the design, implementation, and operational patterns for enterprise AI API gateways, including the specific requirements imposed by clinical AI deployments.
Learning Objectives
- Design a multi-provider AI gateway with authentication, rate limiting, and intelligent routing
- Implement circuit breakers on LLM API calls to protect downstream systems
- Configure request/response logging with PHI-safe redaction for clinical deployments
- Apply cost controls through token budget enforcement and model tier routing
Business Problem
At a Reference Healthcare Organization with dozens of development teams consuming AI APIs, ungated access to LLM providers produces predictable outcomes: one team's runaway embedding job exhausts the monthly token budget, another team accidentally logs PHI in API request bodies, a third team's application surfaces LLM 429 errors directly to clinical users, and there is no organization-wide visibility into which AI capabilities are being used, by whom, and at what cost.
An AI gateway solves these problems at the infrastructure layer, enforcing policy uniformly without requiring each team to implement rate limiting, circuit breaking, and PHI redaction independently.
Why This Technology Exists
The LLM API ecosystem emerged as a collection of vendor-specific APIs with no standard authentication, no centralized rate limiting, and no cross-provider failover. Early enterprise AI deployments required each application team to implement rate limiting, retry logic, and observability independently, producing inconsistent behavior and duplicated implementation across teams.
The AI gateway pattern was borrowed from the broader API gateway pattern (Kong, Apigee, AWS API Gateway) and adapted for the specific requirements of LLM traffic: token-based rate limiting (requests per minute is insufficient; token throughput per minute is the correct unit), semantic routing (route to GPT-4 for complex tasks and GPT-3.5 for simple tasks based on request content), and streaming-aware observability (LLM responses stream token-by-token; gateway logging must handle this without blocking the stream).
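Semantic routing can be illustrated with a small heuristic. The sketch below is an assumption-laden example, not a production router: the `Tier` enum, the `COMPLEX_HINTS` keyword list, the 2,000-token threshold, and the rule of thumb of roughly four characters per token are all hypothetical choices made for illustration. A real gateway would more likely use a tokenizer and a trained classifier.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    EFFICIENT = "efficient"


# Hypothetical keyword list: phrases that suggest multi-step reasoning.
COMPLEX_HINTS = ("differential", "step by step", "compare", "explain why")


def estimate_tokens(text: str) -> int:
    """Rough token estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def choose_tier(messages: list[dict], long_prompt_tokens: int = 2000) -> Tier:
    """Pick a tier from prompt length and keyword hints (illustrative only)."""
    text = " ".join(m.get("content", "") for m in messages)
    # Long prompts go to the more capable tier.
    if estimate_tokens(text) >= long_prompt_tokens:
        return Tier.FRONTIER
    # So do prompts that contain a reasoning-style keyword.
    lowered = text.lower()
    if any(hint in lowered for hint in COMPLEX_HINTS):
        return Tier.FRONTIER
    return Tier.EFFICIENT
```

A short summarization prompt would fall through to `Tier.EFFICIENT`, while a prompt asking to compare two treatment plans would be sent to `Tier.FRONTIER`.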
Architecture
Gateway Implementation
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional
import asyncio
import time

# Educational example: not for clinical use


class ModelTier(Enum):
    FRONTIER = "frontier"        # claude-opus-4-8, gpt-4o: highest capability, highest cost
    STANDARD = "standard"        # claude-sonnet-4-6, gpt-4o-mini: balanced
    EFFICIENT = "efficient"      # claude-haiku-4-5, gpt-3.5-turbo: lowest cost
    SELF_HOSTED = "self_hosted"  # On-premises GPU inference


@dataclass
class TeamConfig:
    """Per-team rate limits and model access control."""
    team_id: str
    monthly_token_budget: int     # Total tokens per month across all models
    tokens_per_minute_limit: int  # Rate limit (token-level, not request-level)
    allowed_model_tiers: list[ModelTier]
    allowed_use_cases: list[str]  # e.g., ["clinical_rag", "admin_summarization"]
    phi_access: bool = False      # Whether this team can process PHI


@dataclass
class CircuitBreakerState:
    provider: str
    is_open: bool = False  # True = circuit is broken (blocking requests)
    failure_count: int = 0
    last_failure_time: Optional[datetime] = None
    half_open_until: Optional[datetime] = None
class AIAPIGateway:
    """
    Enterprise AI API gateway with authentication, rate limiting,
    intelligent routing, circuit breaking, and PHI-safe logging.
    Educational example: not for clinical use.
    """

    def __init__(
        self,
        team_configs: dict[str, TeamConfig],
        circuit_breaker_threshold: int = 5,  # failures before opening circuit
        circuit_breaker_timeout: int = 60,   # seconds to wait before half-open
    ):
        self.team_configs = team_configs
        self.circuit_breakers: dict[str, CircuitBreakerState] = {
            "anthropic": CircuitBreakerState("anthropic"),
            "azure_openai": CircuitBreakerState("azure_openai"),
            "vertex": CircuitBreakerState("vertex"),
            "self_hosted": CircuitBreakerState("self_hosted"),
        }
        self.cb_threshold = circuit_breaker_threshold
        self.cb_timeout = circuit_breaker_timeout
        # team_id -> [(timestamp, tokens)]
        self._token_usage: dict[str, list[tuple[datetime, int]]] = {}
    async def route_request(
        self,
        team_id: str,
        messages: list[dict],
        requested_model_tier: ModelTier,
        use_case: str,
        max_tokens: int = 1024,
        phi_context: bool = False,  # Caller must declare if request may contain PHI
    ) -> dict:
        """
        Route an AI request through the gateway with full policy enforcement.
        Returns the LLM response dict.
        """
        # 1. Authentication and authorization
        team_config = self._authenticate_team(team_id)
        self._authorize_request(team_config, requested_model_tier, use_case, phi_context)

        # 2. Rate limiting (token-level)
        self._check_rate_limit(team_id, team_config, max_tokens)

        # 3. Intelligent model selection
        provider, model = self._select_model(requested_model_tier, team_config)

        # 4. Circuit breaker check
        self._check_circuit_breaker(provider)

        # 5. Execute request with circuit breaker tracking
        try:
            response = await self._execute_request(provider, model, messages, max_tokens)
            self._record_success(provider)
        except Exception:
            self._record_failure(provider)
            raise

        # 6. Record token usage for budget tracking
        #    (falls back to max_tokens if the provider reports no usage)
        total_tokens = response.get("usage", {}).get("total_tokens", max_tokens)
        self._record_token_usage(team_id, total_tokens)

        # 7. Emit metrics
        await self._emit_metrics(team_id, provider, model, use_case, response)
        return response
    def _authenticate_team(self, team_id: str) -> TeamConfig:
        """
        Look up the team configuration for a known team.
        API key validation is omitted in this example; only the team ID is checked.
        """
        if team_id not in self.team_configs:
            raise PermissionError(f"Unknown team: {team_id}")
        return self.team_configs[team_id]

    def _authorize_request(
        self,
        team_config: TeamConfig,
        model_tier: ModelTier,
        use_case: str,
        phi_context: bool,
    ) -> None:
        """Enforce model, use-case, and PHI authorization."""
        if model_tier not in team_config.allowed_model_tiers:
            raise PermissionError(
                f"Team {team_config.team_id} is not authorized for {model_tier.value} models"
            )
        if use_case not in team_config.allowed_use_cases:
            raise PermissionError(
                f"Team {team_config.team_id} is not authorized for use case: {use_case}"
            )
        if phi_context and not team_config.phi_access:
            raise PermissionError(
                f"Team {team_config.team_id} is not authorized to process PHI. "
                "Contact AI Platform team to request PHI access with appropriate compliance review."
            )
    def _check_rate_limit(
        self,
        team_id: str,
        team_config: TeamConfig,
        requested_tokens: int,
    ) -> None:
        """Enforce token-per-minute rate limit using sliding window."""
        now = datetime.utcnow()
        window_start = now - timedelta(minutes=1)
        if team_id not in self._token_usage:
            self._token_usage[team_id] = []

        # Remove entries outside the sliding window
        self._token_usage[team_id] = [
            (ts, tokens) for ts, tokens in self._token_usage[team_id]
            if ts >= window_start
        ]

        current_window_tokens = sum(tokens for _, tokens in self._token_usage[team_id])
        if current_window_tokens + requested_tokens > team_config.tokens_per_minute_limit:
            raise Exception(
                f"Rate limit exceeded for team {team_id}. "
                f"Used {current_window_tokens}/{team_config.tokens_per_minute_limit} tokens/min."
            )
    def _check_circuit_breaker(self, provider: str) -> None:
        """Check circuit breaker state for the target provider."""
        cb = self.circuit_breakers[provider]
        if not cb.is_open:
            return
        # Half-open state: once the cooldown has elapsed, let requests through
        # as probes. A success closes the circuit; a failure re-opens it for
        # another cooldown. This simple version does not cap the number of
        # concurrent probe requests.
        if cb.half_open_until and datetime.utcnow() >= cb.half_open_until:
            return
        raise Exception(
            f"Circuit breaker is open for provider {provider}. "
            f"Provider is experiencing failures. "
            f"Retry after {cb.half_open_until.isoformat() if cb.half_open_until else 'N/A'}."
        )

    def _record_success(self, provider: str) -> None:
        """Reset circuit breaker on success."""
        cb = self.circuit_breakers[provider]
        cb.is_open = False
        cb.failure_count = 0
        cb.half_open_until = None

    def _record_failure(self, provider: str) -> None:
        """Record failure; open the circuit once the threshold is reached."""
        cb = self.circuit_breakers[provider]
        cb.failure_count += 1
        cb.last_failure_time = datetime.utcnow()
        if cb.failure_count >= self.cb_threshold:
            cb.is_open = True
            cb.half_open_until = datetime.utcnow() + timedelta(seconds=self.cb_timeout)
    def _select_model(
        self,
        requested_tier: ModelTier,
        team_config: TeamConfig,
    ) -> tuple[str, str]:
        """
        Select provider and model based on tier, provider health, and team access.
        Model ID mapping is illustrative; verify current model IDs in official docs.
        """
        tier_to_models = {
            ModelTier.FRONTIER: [
                ("anthropic", "claude-opus-4-8"),
                ("azure_openai", "gpt-4o"),
            ],
            ModelTier.STANDARD: [
                ("anthropic", "claude-sonnet-4-6"),
                ("azure_openai", "gpt-4o-mini"),
            ],
            ModelTier.EFFICIENT: [
                ("anthropic", "claude-haiku-4-5-20251001"),
            ],
            ModelTier.SELF_HOSTED: [
                ("self_hosted", "clinical-llm"),
            ],
        }
        candidates = tier_to_models.get(requested_tier, [])
        now = datetime.utcnow()
        for provider, model in candidates:
            cb = self.circuit_breakers.get(provider)
            if cb and cb.is_open:
                # Skip providers whose circuit is open, unless the cooldown has
                # elapsed. Without this exception an open provider would never
                # be selected again and its circuit could never close.
                if cb.half_open_until is None or now < cb.half_open_until:
                    continue
            return provider, model
        raise Exception(
            f"No healthy provider available for tier {requested_tier.value}. "
            f"All providers may be circuit-broken."
        )
    async def _execute_request(
        self,
        provider: str,
        model: str,
        messages: list[dict],
        max_tokens: int,
    ) -> dict:
        """Execute the request against the selected provider."""
        if provider == "anthropic":
            return await self._call_anthropic(model, messages, max_tokens)
        elif provider == "azure_openai":
            return await self._call_azure_openai(model, messages, max_tokens)
        elif provider == "self_hosted":
            return await self._call_self_hosted(model, messages, max_tokens)
        raise ValueError(f"Unknown provider: {provider}")

    def _record_token_usage(self, team_id: str, tokens: int) -> None:
        """Record token consumption for rate limiting and budget tracking."""
        self._token_usage.setdefault(team_id, []).append((datetime.utcnow(), tokens))

    async def _emit_metrics(
        self,
        team_id: str,
        provider: str,
        model: str,
        use_case: str,
        response: dict,
    ) -> None:
        """Emit metrics for observability. PHI must not appear in metric labels."""
        # Implementation: increment Prometheus counters
        # Labels: team_id, provider, model, use_case (never include request content)
        pass

    # Provider-specific callers omitted for brevity; see cloud-ai-platforms.md
    async def _call_anthropic(self, model, messages, max_tokens): ...
    async def _call_azure_openai(self, model, messages, max_tokens): ...
    async def _call_self_hosted(self, model, messages, max_tokens): ...

PHI-Safe Request Logging
Logging AI request bodies in clinical deployments requires PHI redaction before any data reaches the logging pipeline.
import re
from datetime import datetime

# Redaction patterns for a subset of the 18 HIPAA Safe Harbor identifiers.
# This is a simplified illustrative example; consult a healthcare data expert
# for production PHI de-identification.
PHI_REDACTION_PATTERNS = [
    # Names (very rough and context-dependent; NLP-based detection is more reliable)
    (r'\bDr\.\s+[A-Z][a-z]+\s+[A-Z][a-z]+\b', '[PROVIDER_NAME]'),
    # MRN patterns
    (r'\bMRN[:\s#]*\d{6,10}\b', '[MRN]'),
    # Date of birth
    (r'\bDOB[:\s]*\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', '[DOB]'),
    # Phone numbers
    (r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b', '[PHONE]'),
    # SSN
    (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]'),
    # Email
    (r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', '[EMAIL]'),
    # ZIP codes (also matches any other standalone 5-digit number)
    (r'\b\d{5}(?:-\d{4})?\b', '[ZIP]'),
]


def redact_phi_from_log_entry(content: str) -> str:
    """
    Apply PHI redaction patterns to a string before logging.
    This is a simplified pattern-based redactor. Production healthcare
    deployments should use a trained NER model or dedicated PHI detection
    service for more accurate de-identification.
    Educational example: not for clinical use.
    """
    redacted = content
    for pattern, replacement in PHI_REDACTION_PATTERNS:
        # re.IGNORECASE applies to every pattern, so the name pattern also
        # matches lowercase words after "Dr." and errs toward over-redaction.
        redacted = re.sub(pattern, replacement, redacted, flags=re.IGNORECASE)
    return redacted
def build_phi_safe_log_entry(
    team_id: str,
    provider: str,
    model: str,
    use_case: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    phi_context: bool,
) -> dict:
    """
    Build a log entry that is safe to write to the logging pipeline.
    Never include request or response content in logs for PHI-handling teams.
    """
    return {
        "timestamp": datetime.utcnow().isoformat(),
        "team_id": team_id,
        "provider": provider,
        "model": model,
        "use_case": use_case,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "latency_ms": round(latency_ms, 1),
        "phi_context": phi_context,
        # NEVER log: messages, response content, system prompt content (may contain PHI)
    }

Enterprise Considerations
Gateway as cost control: Token budgets per team, per use case, and per month are enforced at the gateway layer. When a team's monthly budget is 80% consumed, the gateway can automatically downgrade model tier (from Frontier to Standard) and notify the team lead rather than blocking requests.
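The soft-limit behavior described above can be sketched as a small policy function. This is a minimal illustration under stated assumptions: the `Tier` enum, the `DOWNGRADE` map, and the function name are hypothetical, and the Standard-to-Efficient step is an extension beyond the Frontier-to-Standard downgrade mentioned in the text.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    STANDARD = "standard"
    EFFICIENT = "efficient"


# Hypothetical downgrade map: each tier's next cheaper tier.
DOWNGRADE = {Tier.FRONTIER: Tier.STANDARD, Tier.STANDARD: Tier.EFFICIENT}


def apply_budget_policy(
    requested: Tier,
    tokens_used_this_month: int,
    monthly_budget: int,
    soft_threshold: float = 0.8,
) -> tuple[Tier, bool]:
    """Return (tier to use, whether to notify the team lead)."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    usage_ratio = tokens_used_this_month / monthly_budget
    if usage_ratio < soft_threshold:
        return requested, False
    # At or above the soft threshold: serve from a cheaper tier instead of
    # blocking. The lowest tier has no cheaper option and stays unchanged.
    return DOWNGRADE.get(requested, requested), True
```

The gateway would call this before model selection and send the notification out of band so the request path is not delayed.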
Streaming response handling: LLM responses stream token-by-token. A naïve gateway design attempts to buffer the entire response before forwarding it to the client; this defeats the latency benefit of streaming and increases memory pressure. Production gateways must implement streaming passthrough, which complicates metrics collection (total tokens are not known until the stream ends).
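One way to picture streaming passthrough is an async generator that forwards each chunk as it arrives and reports a total only when the stream finishes. The sketch below makes several assumptions: `passthrough_stream`, `fake_upstream`, and the `on_complete` callback are hypothetical names, and the chunk count stands in for whatever usage figure the provider reports at the end of a stream.

```python
import asyncio
from typing import AsyncIterator, Callable


async def passthrough_stream(
    upstream: AsyncIterator[str],
    on_complete: Callable[[int], None],
) -> AsyncIterator[str]:
    """Forward chunks immediately; report the chunk count once the stream ends."""
    chunks_seen = 0
    try:
        async for chunk in upstream:
            chunks_seen += 1
            yield chunk  # forwarded to the client without buffering
    finally:
        # Runs when the stream is exhausted or closed early.
        on_complete(chunks_seen)


async def fake_upstream() -> AsyncIterator[str]:
    """Stand-in for a provider's streaming response."""
    for piece in ["Hel", "lo", "!"]:
        await asyncio.sleep(0)
        yield piece


async def demo() -> tuple[list[str], list[int]]:
    totals: list[int] = []
    received = [c async for c in passthrough_stream(fake_upstream(), totals.append)]
    return received, totals


received, totals = asyncio.run(demo())
```

Metrics emission happens in the `finally` block, so the client sees the first chunk as soon as the provider sends it, and the gateway still records a total for the request.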
Multi-region deployment: For clinical AI with high availability requirements, deploy the gateway in two regions with active-active load balancing. Circuit breaker state must be synchronized across gateway instances (use Redis for shared state).
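The shared-state idea can be sketched without committing to a particular client library. In the example below, `KeyValueStore`, `InMemoryStore`, and `SharedCircuitBreaker` are hypothetical names; the in-memory store is only a stand-in for a shared backend such as Redis. The read-modify-write in `record_failure` is not atomic, so a real shared store would need an atomic increment or a transaction.

```python
import json
import time
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for a shared store that all gateway instances would use."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SharedCircuitBreaker:
    """Circuit breaker whose state lives in the store, not in the instance."""

    def __init__(self, store: KeyValueStore, provider: str,
                 threshold: int = 5, cooldown_s: float = 60.0) -> None:
        self.store = store
        self.key = f"cb:{provider}"
        self.threshold = threshold
        self.cooldown_s = cooldown_s

    def _load(self) -> dict:
        raw = self.store.get(self.key)
        return json.loads(raw) if raw else {"failures": 0, "open_until": 0.0}

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        state = self._load()
        state["failures"] += 1
        if state["failures"] >= self.threshold:
            state["open_until"] = now + self.cooldown_s
        self.store.set(self.key, json.dumps(state))

    def record_success(self) -> None:
        self.store.set(self.key, json.dumps({"failures": 0, "open_until": 0.0}))

    def allows_request(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now >= self._load()["open_until"]
```

Because both gateway instances read the same key, failures observed in one region open the circuit for the other region as well.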
Cost attribution: The gateway is the authoritative source for AI cost attribution by team, use case, and model. Emit token counts, provider, and model to the cost management platform on every request. This data drives showback reports for department budget owners.
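A showback report is essentially an aggregation over per-request usage records. The following sketch assumes a hypothetical `UsageRecord` shape and a made-up `PRICE_PER_1K` table with placeholder model names; real prices and record fields would come from the provider contract and the cost management platform.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    team_id: str
    use_case: str
    model: str
    prompt_tokens: int
    completion_tokens: int


# Hypothetical prices in USD per 1,000 tokens: (prompt, completion).
PRICE_PER_1K = {"model-a": (0.003, 0.015), "model-b": (0.001, 0.002)}


def showback(records: list[UsageRecord]) -> dict[tuple[str, str, str], float]:
    """Sum estimated cost per (team, use case, model)."""
    totals: dict[tuple[str, str, str], float] = defaultdict(float)
    for r in records:
        prompt_price, completion_price = PRICE_PER_1K[r.model]
        cost = (r.prompt_tokens / 1000) * prompt_price
        cost += (r.completion_tokens / 1000) * completion_price
        totals[(r.team_id, r.use_case, r.model)] += cost
    return dict(totals)


records = [
    UsageRecord("radiology", "clinical_rag", "model-a", 2000, 1000),
    UsageRecord("radiology", "clinical_rag", "model-a", 1000, 0),
    UsageRecord("billing", "admin_summarization", "model-b", 1000, 1000),
]
report = showback(records)
```

Keeping prompt and completion tokens separate matters because providers commonly price the two differently.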
Common Mistakes
1. Request-level rate limiting for LLM traffic. Limiting to 100 requests per minute ignores that a single request may consume 10,000 tokens while another consumes 100. Token-per-minute limits provide meaningful budget control; request limits do not.
2. Logging request content for PHI-handling teams. Even with "best effort" redaction in the application, logging full request bodies for PHI-processing teams risks PHI appearing in logs. For PHI-handling teams, log only token counts, provider, model, and use-case metadata β never request content.
3. Single-provider design without circuit breakers. If the application calls the LLM provider directly with no circuit breaker, an API outage surfaces as HTTP 500 errors to clinical users. A gateway with circuit breaking can fail over to a secondary provider or serve a graceful degradation response.
4. Not enforcing PHI access at the gateway. If PHI access is checked only in application code, a team can bypass it by calling the LLM API directly. Enforce PHI access at the gateway (where it's controlled infrastructure) rather than trusting application-layer enforcement.
Best Practices
- Rate limit AI traffic by tokens per minute, not requests per minute
- Enforce PHI access control and data residency rules at the gateway layer, not application layer
- Never log request or response content for PHI-handling teams; log only metadata (tokens, provider, model, use case)
- Implement circuit breakers with automatic failover to secondary providers
- Emit cost attribution metrics (tokens × provider × model) on every request for budget management
- Store circuit breaker state in a shared store such as Redis so that it stays synchronized across gateway instances
Trade-offs
| Approach | Strengths | Weaknesses |
|---|---|---|
| Custom gateway (this chapter) | Full control, PHI-specific logic | Development and operational cost |
| LiteLLM (open source) | Multi-provider routing out of box | Less control over PHI-specific features |
| Kong AI Gateway | Mature plugin ecosystem | Additional licensing cost |
| AWS API Gateway + Bedrock | Fully managed, IAM integration | AWS-only; limited cross-provider routing |
Interview Questions
Q: Your AI gateway is receiving 10,000 tokens per minute from one team, which is within their rate limit. However, this is causing 429 errors from the upstream LLM provider. How do you address this?
Category: System Design | Difficulty: Senior | Role: AI Architect
Answer Framework:
The root cause is that each team stays within its own limit, but the combined traffic of all teams exhausts the upstream provider's organization-level rate limit. This is a total throughput problem, not a per-team problem.
The solution has two layers: short-term and structural.
Short-term: The gateway should implement request queuing with backpressure rather than immediately returning 429 to the client. When the upstream provider returns 429, the gateway queues the request and retries after the Retry-After header duration, up to a configurable wait timeout. From the client's perspective, the request completes with higher latency rather than failing.
Structural: Negotiate a higher organization-level rate limit with the LLM provider, or implement multi-provider routing. With multi-provider routing, when Provider A is rate limited, the gateway routes to Provider B (for example, Azure OpenAI in addition to Anthropic). This requires a model equivalence assessment for the specific use cases; the organization cannot blindly substitute one LLM for another in clinical contexts without validating output quality equivalence.
Key Points to Hit:
- Per-team limits do not prevent aggregate organization-level rate limit exhaustion
- Request queuing with retry-after as the short-term response
- Multi-provider routing as the structural response
- Model equivalence validation for clinical use cases before substituting providers
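The short-term response (wait for Retry-After instead of failing immediately) might look like the sketch below. `UpstreamRateLimited` and `call_with_backpressure` are hypothetical names, and the error class is an assumed wrapper around a provider's HTTP 429 response; a production version would also bound queue depth and add jitter.

```python
import asyncio
from typing import Awaitable, Callable


class UpstreamRateLimited(Exception):
    """Hypothetical error raised when the provider answers HTTP 429."""

    def __init__(self, retry_after_s: float) -> None:
        super().__init__(f"rate limited; retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


async def call_with_backpressure(
    send: Callable[[], Awaitable[dict]],
    max_wait_s: float = 30.0,
) -> dict:
    """Retry on 429, sleeping for Retry-After, until max_wait_s is used up."""
    waited = 0.0
    while True:
        try:
            return await send()
        except UpstreamRateLimited as err:
            if waited + err.retry_after_s > max_wait_s:
                raise  # wait budget exhausted: surface the 429 to the caller
            await asyncio.sleep(err.retry_after_s)
            waited += err.retry_after_s


state = {"attempts": 0}


async def flaky_send() -> dict:
    """Stand-in upstream call: rate limited twice, then succeeds."""
    state["attempts"] += 1
    if state["attempts"] < 3:
        raise UpstreamRateLimited(retry_after_s=0.01)
    return {"ok": True, "attempts": state["attempts"]}


result = asyncio.run(call_with_backpressure(flaky_send, max_wait_s=1.0))
```

From the client's side the call simply takes longer; only when the configured wait budget runs out does the rate-limit error propagate.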
Key Takeaways
- The AI API gateway is the enforcement point for authentication, authorization, rate limiting, PHI access control, circuit breaking, and cost attribution
- Rate limiting must be token-based, not request-based, to provide meaningful budget control
- Request and response content must be omitted from logs for PHI-handling teams; log only metadata
- Circuit breakers protect clinical users from LLM provider degradation; automatic failover maintains availability
- The gateway is the authoritative source for AI cost attribution data across teams and use cases
Glossary
Circuit Breaker: A fault tolerance pattern that stops sending requests to a downstream service after a failure threshold is exceeded, allowing the service time to recover before re-attempting traffic.
Token Budget: An organization-defined limit on the number of LLM tokens a team may consume per time period, used to control AI spending.
PHI Access Control: Gateway-enforced authorization that restricts which teams and use cases can process requests containing Protected Health Information.
Model Tier Routing: Selecting a model tier (Frontier, Standard, Efficient) based on the request's complexity or urgency, balancing cost against capability.
Further Reading
- Cloud AI Platforms: Provider-specific API patterns the gateway abstracts
- Security Considerations: Authentication and authorization patterns
- Observability and Monitoring: Metrics emitted by the gateway
- HIPAA and AI: PHI requirements that drive gateway PHI access controls