LLM Serving Infrastructure

Conceptual Explanation

The KV Cache

During transformer inference, each attention layer computes a key vector and a value vector for every token in the context. These key-value (KV) pairs are stored and reused when generating every subsequent token in the same sequence; this store is the KV cache. Without it, every decoding step would have to recompute the keys and values for the entire prefix, so the per-token cost would grow quadratically with sequence length instead of linearly.
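The reuse described above can be illustrated with a toy single-head, single-layer attention loop. This is a minimal sketch, not how any serving engine implements it: the weight matrices, token vectors, and function names are made up for illustration, and it only counts key/value projections to show how much work the cache saves.

```python
import math


def project(x, w):
    """Return x @ w for a vector x and a square matrix w given as a list of rows."""
    return [sum(xi * wij for xi, wij in zip(x, column)) for column in zip(*w)]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all available positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def decode_without_cache(tokens, wq, wk, wv):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, wk) for x in prefix]
        values = [project(x, wv) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, projections


def decode_with_cache(tokens, wq, wk, wv):
    """Project each token once and append the result to the cache."""
    key_cache, value_cache, outputs, projections = [], [], [], 0
    for x in tokens:
        key_cache.append(project(x, wk))
        value_cache.append(project(x, wv))
        projections += 2
        outputs.append(attend(project(x, wq), key_cache, value_cache))
    return outputs, projections


# Hypothetical 2-dimensional weights and token embeddings, for illustration only.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, -1.0], [1.0, 0.5]]
W_V = [[2.0, 0.0], [0.0, 2.0]]
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

slow_outputs, slow_count = decode_without_cache(TOKENS, W_Q, W_K, W_V)
fast_outputs, fast_count = decode_with_cache(TOKENS, W_Q, W_K, W_V)
print(slow_count, fast_count)
```

Both loops produce the same attention outputs; the cached version simply trades memory for the projections it no longer repeats.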

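The KV cache is often the largest variable memory consumer during inference, since it grows with both context length and the number of concurrent sequences. A 70B-parameter model with a 32K context window can require approximately 80–120 GB of KV cache at full occupancy, depending on the attention layout (models with grouped-query attention need far less per sequence), the cache precision, and the batch size. Efficient KV cache management is therefore a central engineering challenge in LLM serving.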
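As a rough way to sanity-check figures like these, the cache size can be estimated from the model shape. The sketch below assumes an FP16 cache and uses illustrative layer and head counts for a 70B-class model; these are assumptions for the arithmetic, not values taken from the text above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing one key vector and one
    value vector per layer, per KV head, per token.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
CONTEXT = 32 * 1024  # 32K tokens

# Assumed shape: 80 layers, head dimension 128, FP16 values.
full_mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=CONTEXT)
grouped = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=CONTEXT)

print(f"64 KV heads: {full_mha / GIB:.0f} GiB per 32K-token sequence")
print(f" 8 KV heads: {grouped / GIB:.0f} GiB per 32K-token sequence")
```

Under these assumptions a single full-length sequence needs about 80 GiB with 64 KV heads and about 10 GiB with 8 KV heads, which is why the attention layout matters so much when sizing a deployment.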

Continuous Batching

Traditional batching processes a fixed batch of requests together, waiting for all to complete before starting the next batch. This is wasteful: short requests finish early but must wait for long requests before the GPU can take new work.

Continuous batching (also called iteration-level scheduling) processes requests together at the token level rather than the request level. When a request finishes generating, its GPU slot is immediately released to a waiting request without waiting for other requests in the batch to finish. This dramatically increases GPU utilization and throughput.

Traditional batching:
Request A (10 tokens) ████░░░░░░ wait for B
Request B (50 tokens) ████████████████████████████████████████████████████
Request C (30 tokens)                                            ████████████████████████████████

Continuous batching:
Request A (10 tokens) ████
Request B (50 tokens) ████████████████████████████████████████████████████
Request C (30 tokens)     ████████████████████████████████ (starts when A finishes)
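The difference shown in the diagram can be reproduced with a small simulation. This is a simplified sketch that assumes one token per request per iteration, a fixed number of slots, and no prefill cost; the function names are illustrative and not part of any serving engine.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Total iterations when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), slots):
        steps += max(lengths[start:start + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Total iterations when a freed slot is refilled at the next iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished requests leave.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Requests A, B, C from the diagram, with two GPU slots.
lengths = [10, 50, 30]
print(static_batching_steps(lengths, slots=2))
print(continuous_batching_steps(lengths, slots=2))
```

With two slots, the fixed-batch schedule needs 80 iterations (50 for the first batch, then 30 for request C alone), while the continuous schedule finishes in 50 because C starts as soon as A completes.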

PagedAttention (vLLM)

vLLM introduced PagedAttention, which manages KV cache memory much as an operating system's virtual memory manages RAM. Instead of reserving one contiguous region for each request's KV cache, which wastes memory through over-reservation and fragmentation, the cache is divided into fixed-size blocks (the analogue of pages) that can be allocated non-contiguously and on demand. The vLLM authors report that earlier systems wasted 60–80% of KV cache memory and that PagedAttention brings the waste to under 4%, which substantially increases the number of concurrent requests a given GPU can serve.
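The allocation idea can be sketched as a block table per sequence backed by a shared free list. This is an illustrative toy, not vLLM's implementation: the class name, method names, and block size are assumptions, and it tracks only block bookkeeping, not the attention kernel that reads from the blocks.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # all allocated blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Token slots that are allocated but unused (only in each last block)."""
        return sum(
            len(table) * self.block_size - self.seq_lens.get(seq_id, 0)
            for seq_id, table in self.block_tables.items()
        )


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(16):
    allocator.append_token("a")
for _ in range(16):
    allocator.append_token("b")
allocator.append_token("a")  # 17th token of "a" triggers a second block
print(allocator.block_tables, allocator.wasted_slots())
```

Sequence "a" ends up with two blocks that are not adjacent in physical memory, and the only unused slots are in its last, partially filled block.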

Core Architecture

Serving Architecture Patterns

Pattern 1 — Managed API (Anthropic, OpenAI, Azure OpenAI)

Application → AI Gateway → Managed Inference API → Response

All infrastructure managed by vendor. Pay per token. No GPU management, no model loading, no serving engine configuration.

Pattern 2 — Self-Hosted with vLLM

Application → AI Gateway → vLLM Inference Server → GPU Cluster

Organization runs its own model on its own (or cloud-rented) GPUs. Pay for GPU hours plus infrastructure. Full control of model, data, and latency.

Pattern 3 — Hybrid

Application → AI Gateway → {
    Frontier model API (for complex reasoning tasks),
    Self-hosted model (for high-volume routine tasks)
}

Routes requests based on task complexity and cost sensitivity. Frontier model for clinical reasoning; self-hosted for high-volume document classification.
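A gateway's routing rule for the hybrid pattern might look like the sketch below. The request fields, backend labels, and volume threshold are hypothetical, and the choice to send low-volume routine work to the managed API is an assumption made for this example rather than something the pattern prescribes.

```python
from dataclasses import dataclass

FRONTIER_API = "frontier-api"        # hypothetical backend label
SELF_HOSTED = "self-hosted-vllm"     # hypothetical backend label


@dataclass
class InferenceRequest:
    task: str
    needs_complex_reasoning: bool
    expected_daily_volume: int


def choose_backend(request, volume_threshold=10_000):
    """Pick a backend from task complexity first, then cost sensitivity."""
    if request.needs_complex_reasoning:
        return FRONTIER_API
    if request.expected_daily_volume >= volume_threshold:
        return SELF_HOSTED
    # Assumption: routine but low-volume work stays on the managed API.
    return FRONTIER_API


clinical = InferenceRequest("clinical reasoning", True, 200)
classification = InferenceRequest("document classification", False, 500_000)
print(choose_backend(clinical), choose_backend(classification))
```

In practice the complexity signal would come from the calling application or a lightweight classifier; here it is simply a boolean supplied with the request.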

Architecture Diagram

graph TD
    subgraph "Client Layer"
        APP["AI Application"]
        GW["AI Gateway\n(LiteLLM / custom)"]
    end
    subgraph "Managed Inference"
        ANT["Anthropic API\nclaude-opus-4-8 / sonnet / haiku"]
        OAI["Azure OpenAI\nGPT-4o / o1"]
    end
    subgraph "Self-Hosted Inference (vLLM)"
        VLLM["vLLM Server\nOpenAI-compatible API"]
        PG["PagedAttention\nKV Cache Manager"]
        CB["Continuous Batching\nScheduler"]
        GPU["GPU Cluster\nA100 / H100 / L40S"]
    end
    subgraph "Model Registry"
        WEIGHTS["Model Weights\nHugging Face / local"]
        QUANT["Quantized Models\nGPTQ / AWQ / INT8"]
    end
    APP --> GW
    GW -->|"Complex / PHI-sensitive"| ANT
    GW -->|"Azure-only clients"| OAI
    GW -->|"High volume / cost-sensitive"| VLLM
    VLLM --> PG & CB --> GPU
    WEIGHTS & QUANT --> VLLM

Common Mistakes

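1. Deploying vLLM without tuning --gpu-memory-utilization. This flag sets the fraction of each GPU's memory that vLLM may claim for weights, activations, and the KV cache. If the value is too low, little memory is left for the KV cache and requests queue unnecessarily; if it is too high, other processes sharing the GPU can trigger out-of-memory errors. A value around 0.90 (the default in recent vLLM releases, worth verifying for your version) is a reasonable starting point on a dedicated GPU; adjust it based on load tests.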

2. Not load testing before production. LLM serving performance is highly non-linear — throughput does not degrade gracefully under overload. Run load tests at 1×, 2×, and 3× expected peak traffic to characterize the failure mode before it occurs in production.

3. Calculating self-hosted ROI without including engineering time. GPU infrastructure costs are visible; the engineering time to operate the infrastructure is invisible in the comparison. Include infrastructure engineering FTE cost in the self-hosted TCO calculation.

4. Using quantization without quality evaluation. Quantization quality impact varies by model and by task. Run the evaluation suite on the quantized model before deploying to production.

5. No API fallback from self-hosted to managed. If the self-hosted GPU cluster goes down, the AI application must fall back to a managed API. An AI gateway with failover routing handles this automatically.
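The failover behavior in item 5 can be sketched as a thin wrapper that tries the self-hosted backend first and falls back to the managed API on connection-level failures. The function names and the stub backends are hypothetical; a real gateway would add timeouts, retries, and health checks on top of this.

```python
def call_with_fallback(prompt, primary, fallback,
                       retriable=(ConnectionError, TimeoutError)):
    """Call the primary backend; on a retriable error, use the fallback.

    Returns the response together with the name of the route that served it,
    so that failovers can be logged and alerted on.
    """
    try:
        return primary(prompt), "primary"
    except retriable:
        return fallback(prompt), "fallback"


# Stub backends standing in for real clients (illustration only).
def unreachable_self_hosted(prompt):
    raise ConnectionError("GPU cluster unreachable")


def managed_api(prompt):
    return f"managed:{prompt}"


response, route = call_with_fallback("classify this", unreachable_self_hosted, managed_api)
print(response, route)
```

Errors outside the retriable tuple, such as a malformed request, are deliberately left to propagate, since sending the same bad request to a second backend would not help.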

Best Practices

Use managed inference APIs until monthly API costs justify self-hosted infrastructure investment
If self-hosting, use vLLM with PagedAttention and continuous batching for production serving
Apply AWQ 4-bit quantization for 70B+ models to substantially reduce GPU memory requirements (4-bit weights take roughly a quarter of the space of FP16 weights) with minimal quality cost
Always place self-hosted inference behind an AI gateway for authentication, logging, and failover
Load test at 2–3× expected peak before declaring production readiness
Maintain a managed API fallback for self-hosted infrastructure — never make AI availability dependent on a single serving layer
Version-pin model weights in the same way production code versions are pinned
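The first practice above implies a break-even calculation, which can be sketched as follows. Every number in the example is a placeholder chosen for easy arithmetic, not a real price, and the cost model is deliberately simple: GPU hours plus engineering time on one side, per-token pricing on the other.

```python
def managed_monthly_cost(tokens_per_month, usd_per_million_tokens):
    """Monthly cost of a pay-per-token managed API."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def self_hosted_monthly_cost(gpu_count, usd_per_gpu_hour,
                             engineering_fte, usd_per_fte_month,
                             hours_per_month=730):
    """Monthly cost of self-hosting: GPU hours plus engineering time."""
    gpu_cost = gpu_count * usd_per_gpu_hour * hours_per_month
    people_cost = engineering_fte * usd_per_fte_month
    return gpu_cost + people_cost


def break_even_tokens(self_hosted_cost, usd_per_million_tokens):
    """Monthly token volume at which both options cost the same."""
    return self_hosted_cost / usd_per_million_tokens * 1_000_000


# Placeholder inputs: 8 GPUs at $2.50/hour, one engineer at $15,000/month,
# and a managed price of $10 per million tokens.
hosted = self_hosted_monthly_cost(8, 2.50, 1.0, 15_000)
print(f"self-hosted: ${hosted:,.0f}/month")
print(f"break-even: {break_even_tokens(hosted, 10.0):,.0f} tokens/month")
```

With these placeholder inputs the self-hosted option costs $29,600 per month and breaks even at 2.96 billion tokens per month. Leaving out the engineering term would cut the apparent break-even volume roughly in half, which is the distortion the TCO advice warns about.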

Trade-offs

Cost vs. quality: Self-hosted open-weight models are typically cheaper at scale but often trail frontier models in quality on demanding tasks. The right choice depends on the use case's quality threshold.

Control vs. operational burden: Self-hosted inference provides complete control over data, model, and latency — but requires GPU expertise, capacity planning, and 24/7 on-call coverage. Managed APIs eliminate this burden at higher per-token cost.

Latency vs. throughput: Single-request latency and batch throughput are in tension. vLLM's continuous batching optimizes for throughput; individual request time-to-first-token may increase slightly under high concurrency compared to single-request mode.

Interview Questions

Q: Explain continuous batching and why it significantly improves LLM inference throughput compared to traditional batching.

Category: Architecture
Difficulty: Senior
Role: AI Architect / ML Engineer

Answer Framework:

Traditional batching processes requests as a group: all requests in a batch start together and the batch completes when the longest request finishes. Short requests finish early but hold their GPU slot while waiting for long requests — wasting capacity.

Continuous batching (iteration-level scheduling) operates at the token level. Each iteration, the scheduler can add new requests to vacant slots and remove completed requests. When a short request finishes generating, its GPU slot is immediately reallocated to a waiting request — there is no artificial synchronization boundary between requests.

The result is significantly higher GPU utilization: rather than leaving slots idle while fast requests wait for slow ones, the GPU stays busy as new requests continuously fill vacated slots. Reported gains vary by workload, but continuous batching commonly achieves 2–10× higher throughput than traditional fixed batching at a comparable latency profile.

Key Points to Hit:

Traditional batching synchronizes at request level (group completes together)
Continuous batching operates at token level (vacant slots filled immediately)
GPU utilization is the key metric improved
2–10× throughput improvement in practice

Red Flags:

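Confusing continuous batching with dynamic batching (which varies batch size by grouping requests that arrive within a time window, but still schedules at the request level)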
Not understanding the GPU utilization implication

Key Takeaways

LLM serving performance is determined by KV cache management and batching strategy, not just model size
PagedAttention (vLLM) nearly eliminates KV cache memory waste, enabling 2–4× more concurrent requests per GPU
Continuous batching increases GPU utilization dramatically vs. traditional fixed batching
Managed APIs are economical for most enterprises; self-hosted is justified at high token volume and for data residency requirements
AWQ 4-bit quantization offers a strong quality-to-memory trade-off for production self-hosted deployments, provided it is validated on the target tasks
Always place inference behind an AI gateway and maintain a managed API fallback
Self-hosted TCO must include engineering time, not just GPU cost
