Networking and AI API Gateway Design

Common Mistakes

1. Request-level rate limiting for LLM traffic. Limiting to 100 requests per minute ignores that a single request may consume 10,000 tokens while another consumes 100. Token-per-minute limits provide meaningful budget control; request limits do not.

2. Logging request content for PHI-handling teams. Even with "best effort" redaction in the application, logging full request bodies for PHI-processing teams risks PHI appearing in logs. For PHI-handling teams, log only token counts, provider, model, and use-case metadata — never request content.

3. Single-provider design without circuit breakers. If the application calls the LLM provider directly with no circuit breaker, a provider outage surfaces as HTTP 500 errors to clinical users. A gateway with circuit breaking can fail over to a secondary provider or serve a graceful degradation response.

4. Not enforcing PHI access at the gateway. If PHI access is checked only in application code, a team can bypass it by calling the LLM API directly. Enforce PHI access at the gateway, which is controlled infrastructure, rather than trusting application-layer enforcement.
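The token-based limit from mistake 1 can be sketched as a token bucket whose unit is LLM tokens rather than requests. This is a minimal, single-process illustration, not a definitive implementation: the class name `TokenBudgetLimiter`, its parameters, and the idea of passing an estimated token count per request are all assumptions made for the example.

```python
import time


class TokenBudgetLimiter:
    """Token-bucket limiter measured in LLM tokens per minute (hypothetical helper)."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.available = float(tokens_per_minute)  # start with a full budget
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.last_refill = clock()

    def _refill(self) -> None:
        # Add back budget in proportion to elapsed time, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.available = min(self.capacity,
                             self.available + elapsed * self.refill_per_second)
        self.last_refill = now

    def try_consume(self, estimated_tokens: int) -> bool:
        """Deduct the budget and return True if the request fits, else return False."""
        self._refill()
        if estimated_tokens <= self.available:
            self.available -= estimated_tokens
            return True
        return False


# Assumed usage: one limiter per team, checked before forwarding a request.
limiter = TokenBudgetLimiter(tokens_per_minute=10_000)
admitted = limiter.try_consume(9_500)   # one large request uses most of the budget
rejected = not limiter.try_consume(1_000)  # a second request no longer fits
```

Because the budget is charged in tokens, a single 10,000-token request and a hundred 100-token requests cost the same, which is the property a request-count limit lacks. A multi-instance gateway would need to keep this state in a shared store rather than in process memory.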

Best Practices

Rate limit AI traffic by tokens per minute, not requests per minute
Enforce PHI access control and data residency rules at the gateway layer, not the application layer
Never log request or response content for PHI-handling teams; log only metadata (tokens, provider, model, use case)
Implement circuit breakers with automatic failover to secondary providers
Emit cost attribution metrics (token counts broken down by provider and model) on every request for budget management
Store circuit breaker state in a shared Redis instance so that all gateway instances see the same state
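The circuit breaker and failover practices above can be sketched as follows. This is a simplified illustration under stated assumptions: `CircuitBreaker`, `call_with_failover`, the failure threshold, the cooldown, and the fallback message are hypothetical, and breaker state is held in process memory here, whereas the practice above calls for keeping it in shared Redis.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows trial calls after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has passed, let trial requests through (half-open).
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit


def call_with_failover(providers, breakers, request):
    """Try (name, callable) providers in order, skipping any whose circuit is open."""
    for name, call in providers:
        breaker = breakers[name]
        if not breaker.allow_request():
            continue
        try:
            response = call(request)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return {"provider": name, "response": response}
    # Graceful degradation instead of surfacing a raw HTTP 500.
    return {"provider": None, "response": "AI assistance is temporarily unavailable."}
```

The ordering of `providers` expresses the failover preference. In this sketch every request is allowed through once the cooldown expires; a production breaker would usually limit the half-open state to a single probe.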

Trade-offs

Approach	Strengths	Weaknesses
Custom gateway (this chapter)	Full control, PHI-specific logic	Development and operational cost
LiteLLM (open source)	Multi-provider routing out of box	Less control over PHI-specific features
Kong AI Gateway	Mature plugin ecosystem	Additional licensing cost
AWS API Gateway + Bedrock	Fully managed, IAM integration	AWS-only; limited cross-provider routing

Interview Questions

Q: Your AI gateway is receiving 10,000 tokens per minute from one team, which is within their rate limit. However, this is causing 429 errors from the upstream LLM provider. How do you address this?

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

The root cause is that the team's token budget is within limits, but the upstream provider's organization-level rate limit is being exhausted by the combination of all teams. This is a total throughput problem, not a per-team problem.

The solution has two layers: short-term and structural.

Short-term: The gateway should implement request queuing with backpressure rather than immediately returning 429 to the client. When the upstream provider returns 429, the gateway queues the request and retries after the Retry-After header duration, up to a configurable wait timeout. From the client's perspective, the request completes with higher latency rather than failing.
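The retry behavior described above might look like the following sketch. It is synchronous for brevity, whereas a real gateway would more likely hold requests in an asynchronous queue. The function name `call_with_backpressure`, the `send` callable returning `(status_code, headers, body)`, and the default values are assumptions for illustration only.

```python
import time


def call_with_backpressure(send, request, max_wait_seconds=30.0,
                           default_retry_after=1.0, sleep=time.sleep):
    """Retry upstream 429s, honoring Retry-After, within a total wait budget.

    `send` is a hypothetical callable returning (status_code, headers, body).
    """
    waited = 0.0
    while True:
        status, headers, body = send(request)
        if status != 429:
            return status, body
        try:
            delay = float(headers.get("Retry-After", default_retry_after))
        except ValueError:
            # Retry-After may also be an HTTP date; this sketch falls back to the default.
            delay = default_retry_after
        delay = max(0.0, delay)
        if waited + delay > max_wait_seconds:
            # Wait budget exhausted: surface the 429 to the client after all.
            return status, body
        sleep(delay)
        waited += delay
```

The `max_wait_seconds` budget corresponds to the configurable wait timeout: within it the client sees added latency, and beyond it the client receives the 429 rather than waiting indefinitely.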

Structural: Negotiate a higher organization-level rate limit with the LLM provider, or implement multi-provider routing. With multi-provider routing, when Provider A is rate limited, the gateway routes to Provider B (for example, Azure OpenAI alongside Anthropic). This requires a model equivalence assessment for each use case: the organization cannot blindly substitute one LLM for another in clinical contexts without validating that output quality is equivalent.
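One way to keep provider substitution tied to validation is to route only along pairs that have been explicitly approved per use case. The sketch below assumes a hypothetical allowlist, `VALIDATED_ROUTES`; the use-case names and the `model-a` / `model-b` identifiers are placeholders, not real model names.

```python
# Hypothetical allowlist: use case -> ordered (provider, model) pairs that have
# passed the organization's equivalence validation for that use case.
VALIDATED_ROUTES = {
    "discharge-summary": [("anthropic", "model-a"), ("azure-openai", "model-b")],
    "triage-notes": [("anthropic", "model-a")],  # no validated fallback yet
}


def select_route(use_case, rate_limited_providers):
    """Return the first validated (provider, model) not currently rate limited, or None."""
    for provider, model in VALIDATED_ROUTES.get(use_case, []):
        if provider not in rate_limited_providers:
            return provider, model
    return None


# Assumed usage: the primary provider is currently returning 429s.
fallback = select_route("discharge-summary", {"anthropic"})
no_fallback = select_route("triage-notes", {"anthropic"})
```

A `None` result means no validated alternative exists, so the gateway would queue or degrade rather than substitute an unvalidated model.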

Key Points to Hit:

Per-team limits do not prevent aggregate organization-level rate limit exhaustion
Request queuing with Retry-After handling as the short-term response
Multi-provider routing as the structural response
Model equivalence validation for clinical use cases before substituting providers

Key Takeaways

The AI API gateway is the enforcement point for authentication, authorization, rate limiting, PHI access control, circuit breaking, and cost attribution
Rate limiting must be token-based, not request-based, to provide meaningful budget control
For PHI-handling teams, omit request and response content from logs entirely; log only metadata
Circuit breakers protect clinical users from LLM provider degradation; automatic failover maintains availability
The gateway is the authoritative source for AI cost attribution data across teams and use cases
