Networking and AI API Gateway Design
Common Mistakes
1. Request-level rate limiting for LLM traffic. Limiting to 100 requests per minute ignores that a single request may consume 10,000 tokens while another consumes 100. Token-per-minute limits provide meaningful budget control; request limits do not.
2. Logging request content for PHI-handling teams. Even with "best effort" redaction in the application, logging full request bodies risks PHI appearing in logs. For these teams, log only token counts, provider, model, and use-case metadata; never log request or response content.
3. Single-provider design without circuit breakers. If the application calls the LLM provider directly with no circuit breaker, a provider outage surfaces as HTTP 500 errors to clinical users. A gateway with circuit breaking can fail over to a secondary provider or return a graceful degradation response.
4. Not enforcing PHI access at the gateway. If PHI access is checked only in application code, a team can bypass the check by calling the LLM API directly. Enforce PHI access at the gateway, which is controlled infrastructure, rather than trusting application-layer enforcement.
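To make the first mistake concrete, here is a minimal sketch of a per-team token budget. It assumes a sliding 60-second window and in-process state; the class name `TokenRateLimiter` and the injectable `clock` parameter are illustrative choices, not part of any particular gateway product. It also assumes the gateway can estimate a request's token count before forwarding it.

```python
import time
from collections import deque


class TokenRateLimiter:
    """Sliding-window limiter that budgets tokens, not requests, per team."""

    def __init__(self, tokens_per_minute, window_seconds=60.0, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.events = {}    # team -> deque of (timestamp, tokens)
        self.used = {}      # team -> tokens consumed inside the current window

    def _expire(self, team, now):
        """Drop usage events that have aged out of the window."""
        events = self.events.setdefault(team, deque())
        while events and now - events[0][0] >= self.window:
            _, tokens = events.popleft()
            self.used[team] -= tokens

    def try_consume(self, team, tokens):
        """Return True and record usage if the team still has token budget."""
        now = self.clock()
        self._expire(team, now)
        used = self.used.get(team, 0)
        if used + tokens > self.limit:
            return False
        self.events[team].append((now, tokens))
        self.used[team] = used + tokens
        return True
```

With a 10,000-token budget, a single 10,000-token request exhausts the window, while one hundred 100-token requests consume the same budget. A request-count limit would treat those two workloads very differently.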
Best Practices
- Rate limit AI traffic by tokens per minute, not requests per minute
- Enforce PHI access control and data residency rules at the gateway layer, not application layer
- Never log request or response content for PHI-handling teams; log only metadata (tokens, provider, model, use case)
- Implement circuit breakers with automatic failover to secondary providers
- Emit cost attribution metrics (tokens × provider × model) on every request for budget management
- Store circuit breaker state in a shared Redis instance so that all gateway instances see the same open or closed state
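The circuit breaker and failover practices above can be sketched as follows. This is a simplified illustration under stated assumptions: the breaker opens after a fixed number of consecutive failures, state is kept in process memory rather than in shared Redis, and the provider callables are hypothetical stand-ins for real client calls.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; permits a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.reset_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_failover(providers, breakers, prompt):
    """Try providers in priority order, skipping any whose circuit is open.

    `providers` maps a name to a callable(prompt) -> str (hypothetical clients).
    """
    for name, call in providers.items():
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = call(prompt)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return {"provider": name, "result": result}
    # Graceful degradation instead of surfacing an HTTP 500 to clinical users.
    return {"provider": None, "result": "AI assistance is temporarily unavailable."}
```

In a multi-instance deployment, the `failures` and `opened_at` fields would live in the shared store so that one instance's observations open the circuit for all of them. That wiring is omitted here.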
Trade-offs
| Approach | Strengths | Weaknesses |
|---|---|---|
| Custom gateway (this chapter) | Full control, PHI-specific logic | Development and operational cost |
| LiteLLM (open source) | Multi-provider routing out of box | Less control over PHI-specific features |
| Kong AI Gateway | Mature plugin ecosystem | Additional licensing cost |
| AWS API Gateway + Bedrock | Fully managed, IAM integration | AWS-only; limited cross-provider routing |
Interview Questions
Q: Your AI gateway is receiving 10,000 tokens per minute from one team, which is within their rate limit. However, this is causing 429 errors from the upstream LLM provider. How do you address this?
Category: System Design | Difficulty: Senior | Role: AI Architect
Answer Framework:
The root cause is that, although this team is within its own token budget, the combined traffic from all teams is exhausting the upstream provider's organization-level rate limit. This is a total throughput problem, not a per-team problem.
The solution has two layers: short-term and structural.
Short-term: The gateway should implement request queuing with backpressure rather than immediately returning 429 to the client. When the upstream provider returns 429, the gateway queues the request and retries after the Retry-After header duration, up to a configurable wait timeout. From the client's perspective, the request completes with higher latency rather than failing.
Structural: Negotiate a higher organization-level rate limit with the LLM provider, or implement multi-provider routing. With multi-provider routing, when Provider A is rate limited, the gateway routes to Provider B (for example, Azure OpenAI in addition to Anthropic). This requires a model equivalence assessment for each use case: the organization cannot blindly substitute one LLM for another in clinical contexts without validating that output quality is equivalent.
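The short-term response can be sketched as a retry loop that honors the upstream wait hint within a bounded wait budget. This is a minimal illustration, not a full queue: `UpstreamRateLimited` is a hypothetical exception that the gateway's provider client would raise on HTTP 429, and the sketch assumes the Retry-After value has already been parsed into seconds (the header may also carry an HTTP date).

```python
import time


class UpstreamRateLimited(Exception):
    """Hypothetical error raised when the provider answers HTTP 429."""

    def __init__(self, retry_after_seconds):
        super().__init__("upstream returned 429")
        self.retry_after_seconds = retry_after_seconds


def call_with_backpressure(send, request, max_wait_seconds=30.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Retry on upstream 429, honoring Retry-After, until the wait budget runs out."""
    deadline = clock() + max_wait_seconds
    while True:
        try:
            return send(request)
        except UpstreamRateLimited as err:
            remaining = deadline - clock()
            if err.retry_after_seconds > remaining:
                # Waiting would exceed the budget; surface the 429 to the client.
                raise
            sleep(err.retry_after_seconds)
```

The client sees added latency instead of an error as long as the provider recovers within `max_wait_seconds`. A production gateway would likely also cap the number of requests waiting at once so that the queue itself cannot grow without bound.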
Key Points to Hit:
- Per-team limits do not prevent aggregate organization-level rate limit exhaustion
- Request queuing that honors the Retry-After header as the short-term response
- Multi-provider routing as the structural response
- Model equivalence validation for clinical use cases before substituting providers
Key Takeaways
- The AI API gateway is the enforcement point for authentication, authorization, rate limiting, PHI access control, circuit breaking, and cost attribution
- Rate limiting must be token-based, not request-based, to provide meaningful budget control
- Request and response content must never be logged for PHI-handling teams; log only metadata
- Circuit breakers protect clinical users from LLM provider degradation; automatic failover maintains availability
- The gateway is the authoritative source for AI cost attribution data across teams and use cases
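As a closing illustration of the metadata-only logging and cost attribution points, the sketch below shows one possible record shape. The field names, the `UsageRecord` class, and the flat per-1,000-token price table are all assumptions made for the example; real provider pricing typically differs between input and output tokens.

```python
from collections import defaultdict
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class UsageRecord:
    """Metadata-only log entry: no prompt or completion text is stored."""
    team: str
    use_case: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int


def to_log_entry(record):
    """Serialize a record for logging; only the dataclass fields are emitted."""
    return asdict(record)


def attribute_cost(records, price_per_1k_tokens):
    """Sum cost per (team, provider, model) using caller-supplied prices."""
    totals = defaultdict(float)
    for r in records:
        price = price_per_1k_tokens[(r.provider, r.model)]
        tokens = r.input_tokens + r.output_tokens
        totals[(r.team, r.provider, r.model)] += tokens / 1000 * price
    return dict(totals)
```

Because the record type has no field for request or response text, content cannot reach the log through this path even by accident.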