LLM Serving Infrastructure

Executive Summary

LLM serving infrastructure is the layer that determines whether a language model runs fast enough, cheaply enough, and reliably enough to support production AI systems at enterprise scale. The choice between managed inference APIs (Anthropic, OpenAI, Azure OpenAI) and self-hosted inference (vLLM, Ollama, TGI) is one of the most consequential infrastructure decisions an AI architect makes — it determines cost structure, latency profile, data residency, and operational burden for the lifetime of the deployment. This chapter covers the internals of LLM serving engines, the techniques that determine inference performance (continuous batching, KV cache, quantization), and the decision framework for choosing the right serving approach for a given enterprise context.

Learning Objectives

  • Explain how continuous batching and KV cache management determine LLM serving throughput and latency
  • Compare managed inference APIs vs. self-hosted serving across cost, latency, operational complexity, and compliance
  • Design a self-hosted LLM inference stack using vLLM with production-grade configuration
  • Quantify the performance impact of quantization techniques (INT8, GPTQ, AWQ) and apply them appropriately
  • Evaluate when self-hosted inference is economically justified vs. when managed APIs are the correct choice

Business Problem

LLM inference is computationally expensive. A single inference request for a large language model requires loading billions of parameters into GPU memory, performing matrix multiplications across those parameters for each generated token, and repeating this for every token in the output. At the throughput required for enterprise production — hundreds or thousands of concurrent requests — naive serving architectures become either slow, expensive, or both.

The infrastructure challenge is: how do you serve LLM inference at the latency and cost required for production use cases, while maintaining the reliability and compliance posture the enterprise environment requires?

Why This Technology Exists

Early LLM deployments used a simple architecture: load the model on a GPU, process one request at a time, return the result. This works for demos and low-traffic environments. At enterprise scale, three problems emerge:

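Throughput bottleneck: A single request occupies the GPU for the entire generation time, and other requests queue behind it. At 50 concurrent users the queue makes latency unacceptable: if each generation takes 10 seconds, for example, the last request in line waits more than eight minutes.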

Memory inefficiency: The KV (key-value) cache, the per-request attention state built up during inference, was traditionally pre-allocated for each request at its worst-case sequence length, so much of the reserved GPU memory was never used.

Cost at scale: API costs for frontier models at enterprise token volumes can reach $100K–$1M+ annually (illustrative — verify current API pricing). Self-hosted open-weight models can serve the same workload at infrastructure cost alone.

Specialized LLM serving engines (vLLM, Text Generation Inference, TensorRT-LLM) were built to address these problems through continuous batching and efficient memory management.

Conceptual Explanation

The KV Cache

During transformer inference, each attention layer computes a key vector and a value vector for every token in the context. These key-value pairs are stored and reused for all subsequent tokens in the same generation; this stored state is the KV cache. Without it, every decoding step would have to recompute the keys and values for all previous tokens, so the projection work alone would grow quadratically with sequence length over a full generation instead of linearly.
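The effect of caching can be illustrated by counting how many per-token key/value projections a single layer computes while generating N tokens. This is a minimal counting sketch, not a transformer implementation, and the function names are hypothetical.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    """Each decoding step recomputes K/V for every token seen so far."""
    return sum(step for step in range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens: int) -> int:
    """Each decoding step computes K/V only for the newest token."""
    return num_tokens


if __name__ == "__main__":
    n = 1000
    print(kv_projections_without_cache(n))  # N(N+1)/2 projections
    print(kv_projections_with_cache(n))     # N projections
```

The gap widens with sequence length, which is why every production serving engine keeps the cache and accepts its memory cost.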

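Beyond the model weights themselves, the KV cache is the primary consumer of GPU memory during inference. Its size scales linearly with the number of layers, the number of key-value heads, the head dimension, the sequence length, and the number of sequences resident at once. For a 70B-parameter model with a 32K context window, the cache can reach approximately 80–120 GB at full occupancy, although the exact figure depends on the attention layout (grouped-query attention shrinks it substantially) and on concurrency. Efficient KV cache management is the primary engineering challenge in LLM serving.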
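A back-of-the-envelope calculation shows where figures of this size come from. The model shape below (80 layers, head dimension 128, 64 or 8 key-value heads, FP16 cache values) is an assumed 70B-class configuration chosen for illustration; substitute the values from the actual model's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   num_sequences=1, bytes_per_value=2):
    """Bytes of KV cache: one key and one value vector per layer, KV head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * num_sequences


GIB = 1024 ** 3

# Assumed 70B-class shape: 80 layers, head dimension 128, FP16 (2-byte) cache values.
full_mha = kv_cache_bytes(80, 64, 128, 32 * 1024)  # 64 KV heads (no grouping)
gqa = kv_cache_bytes(80, 8, 128, 32 * 1024)        # 8 KV heads (grouped-query attention)

print(f"multi-head attention:    {full_mha / GIB:.0f} GiB per 32K-token sequence")
print(f"grouped-query attention: {gqa / GIB:.0f} GiB per 32K-token sequence")
```

Under these assumptions a single full-length sequence needs 80 GiB with 64 key-value heads and 10 GiB with 8, so the cache budget is driven as much by the attention layout and the number of concurrent sequences as by parameter count.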

Continuous Batching

Traditional batching processes a fixed batch of requests together, waiting for all to complete before starting the next batch. This is wasteful: short requests finish early but must wait for long requests before the GPU can take new work.

Continuous batching (also called iteration-level scheduling) processes requests together at the token level rather than the request level. When a request finishes generating, its GPU slot is immediately released to a waiting request without waiting for other requests in the batch to finish. This dramatically increases GPU utilization and throughput.
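The difference between the two scheduling policies can be sketched with a toy simulation that counts decode iterations. It assumes every iteration costs the same regardless of how many slots are occupied and ignores prefill; the function names and the workload are illustrative and not taken from any serving engine.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode iterations when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), slots):
        steps += max(lengths[start:start + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode iterations when freed slots are refilled at every iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any vacant slots before this iteration.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


if __name__ == "__main__":
    # Output lengths in tokens: short and long requests arriving alternately.
    workload = [10, 100, 10, 100, 10, 100, 10, 100]
    print("static:    ", static_batching_steps(workload, slots=2))
    print("continuous:", continuous_batching_steps(workload, slots=2))
```

With this alternating workload and two slots, the fixed-batch schedule takes 400 iterations while the continuous schedule takes 230, because a freed slot never idles while work is waiting.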

Implementation code omitted in the Playbook edition. For complete code examples, production patterns, and advanced implementation details, see the Enterprise AI Technical Reference.

PagedAttention (vLLM)

vLLM introduced PagedAttention, which manages KV cache memory much as an operating system's virtual memory manages RAM. Instead of reserving one contiguous region for each request's KV cache, which wastes memory through over-reservation and fragmentation, the cache is divided into fixed-size blocks (pages) that are allocated on demand and need not be contiguous. The vLLM authors report that this cuts KV cache memory waste from roughly 60–80% in earlier serving systems to under 4%, substantially increasing the number of concurrent requests a given GPU can serve.
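A toy block allocator illustrates the idea. This is a simplified sketch of on-demand, non-contiguous block allocation, not vLLM's implementation; the class, its methods, and the block size are hypothetical.

```python
class BlockAllocator:
    """Hands out fixed-size KV cache blocks on demand and tracks a block table per sequence."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        stored = self.lengths.get(seq_id, 0)
        # A new block is needed only when the sequence's last block is full
        # (or when this is the sequence's first token).
        if stored % self.block_size == 0:
            if not self.free:
                raise MemoryError("no free KV cache blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = stored + 1

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Token slots reserved but unused (only in each sequence's last block)."""
        return sum(len(blocks) * self.block_size - self.lengths[seq_id]
                   for seq_id, blocks in self.tables.items())


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=8, block_size=16)
    for _ in range(40):
        alloc.append_token("request-a")
    for _ in range(5):
        alloc.append_token("request-b")
    print(alloc.tables)          # block ids held by each request
    print(alloc.wasted_slots())  # unused slots in partially filled last blocks
```

Because a sequence never holds more than one partially filled block, waste is bounded by one block per sequence rather than by the gap between actual and maximum sequence length.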

Core Architecture

Serving Architecture Patterns

Pattern 1 — Managed API (Anthropic, OpenAI, Azure OpenAI)

Application → AI Gateway → Managed Inference API → Response

All infrastructure managed by vendor. Pay per token. No GPU management, no model loading, no serving engine configuration.

Pattern 2 — Self-Hosted with vLLM

Application → AI Gateway → vLLM Inference Server → GPU Cluster

Organization runs its own model on its own (or cloud-rented) GPUs. Pay for GPU hours plus infrastructure. Full control of model, data, and latency.

Pattern 3 — Hybrid

Application → AI Gateway → {
    Frontier model API (for complex reasoning tasks),
    Self-hosted model (for high-volume routine tasks)
}

The gateway routes requests based on task complexity and cost sensitivity: for example, a frontier model handles clinical reasoning while a self-hosted model handles high-volume document classification.
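The routing decision in the hybrid pattern can be sketched as a small policy function. The backend labels, task categories, and request type below are hypothetical placeholders; a real gateway would define its own taxonomy and might also weigh cost budgets or current load.

```python
from dataclasses import dataclass

FRONTIER_API = "frontier-api"     # hypothetical backend label
SELF_HOSTED = "self-hosted-vllm"  # hypothetical backend label

# Hypothetical set of task types judged to need frontier-model quality.
COMPLEX_TASKS = {"clinical_reasoning"}


@dataclass(frozen=True)
class InferenceRequest:
    task_type: str
    prompt: str


def choose_backend(request: InferenceRequest) -> str:
    """Send complex tasks to the frontier API and everything else to the self-hosted model."""
    if request.task_type in COMPLEX_TASKS:
        return FRONTIER_API
    return SELF_HOSTED


if __name__ == "__main__":
    print(choose_backend(InferenceRequest("clinical_reasoning", "Summarize the differential...")))
    print(choose_backend(InferenceRequest("document_classification", "Classify this referral...")))
```

Keeping the policy in one function at the gateway means the task-to-backend mapping can change without touching application code.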

Architecture Diagram

Enterprise Considerations

Make-vs-buy decision for inference: Self-hosted inference is economically justified when monthly API costs exceed the cost of GPU infrastructure + engineering time to operate it. The break-even point depends on token volume, model size, and GPU cost (illustrative — calculate for specific deployment using current pricing). For most enterprises below $50K/month in API costs, managed APIs are more economical.
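The break-even comparison can be sketched as simple arithmetic. Every dollar figure, GPU count, and staffing fraction below is a placeholder chosen to make the calculation concrete, not a quoted price; replace them with current pricing and the organization's own staffing costs.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def self_hosted_monthly_cost(gpu_count, gpu_hourly_usd, engineer_fte, fte_monthly_usd):
    """GPU rental plus the share of engineering time spent operating the stack."""
    return gpu_count * gpu_hourly_usd * HOURS_PER_MONTH + engineer_fte * fte_monthly_usd


def api_monthly_cost(tokens_per_month, usd_per_million_tokens):
    """Managed API cost at a blended per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(self_hosted_usd, usd_per_million_tokens):
    """Monthly token volume at which API spend equals the self-hosted cost."""
    return self_hosted_usd / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: 4 GPUs at $2.50/hour, half an engineer at $20,000/month,
    # and a blended API price of $10 per million tokens.
    hosted = self_hosted_monthly_cost(4, 2.50, 0.5, 20_000)
    print(f"self-hosted: ${hosted:,.0f}/month")
    print(f"break-even:  {break_even_tokens(hosted, 10.0):,.0f} tokens/month")
```

Below the break-even volume the managed API is cheaper under these assumptions; the comparison also ignores quality differences between models, which can dominate the decision.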

Model version management: Self-hosted models require explicit version pinning. Unlike managed APIs where the vendor manages model versions (with appropriate notice for deprecation), self-hosted teams must manage model weight storage, version tracking, and upgrade testing themselves.

GPU availability: Enterprise GPU procurement has lead times of weeks to months for on-premises and minutes to hours for cloud. Production architectures must plan for burst capacity and have API fallback for traffic spikes that exceed self-hosted capacity.

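Tensor parallelism requirements: Large models (70B+ parameters) must be split across multiple GPUs (tensor parallelism); vLLM's --tensor-parallel-size parameter controls the number of shards. A 70B model in FP16 needs roughly 140 GB for weights alone, so it is typically served on 4× A100 80GB GPUs to leave room for the KV cache. 4-bit AWQ quantization shrinks the weights to roughly 35–40 GB, which reduces the typical footprint to 2× A100 80GB.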
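The sizing logic can be sketched as follows. The 40 GB KV cache budget, the 0.90 usable-memory fraction, and the power-of-two rounding are assumptions made for illustration (the tensor-parallel size generally has to divide the model's attention head count evenly), and the model name is a placeholder. The helper only assembles a command line from the two flags discussed in this chapter; confirm flag names and the `vllm serve` entry point against the installed vLLM version.

```python
import math


def weight_gb(params_billion, bits_per_param):
    """Approximate weight memory in GB (ignores quantization scales and other overhead)."""
    return params_billion * bits_per_param / 8


def gpus_needed(params_billion, bits_per_param, kv_budget_gb,
                gpu_gb=80, memory_utilization=0.90):
    """Smallest power-of-two GPU count that holds the weights plus a KV cache budget."""
    required = weight_gb(params_billion, bits_per_param) + kv_budget_gb
    usable_per_gpu = gpu_gb * memory_utilization
    minimum = math.ceil(required / usable_per_gpu)
    return 1 << (minimum - 1).bit_length()  # round up to the next power of two


def vllm_serve_args(model, tensor_parallel_size, memory_utilization=0.90):
    """Assemble a launch command as an argument list (nothing is executed here)."""
    return ["vllm", "serve", model,
            "--tensor-parallel-size", str(tensor_parallel_size),
            "--gpu-memory-utilization", str(memory_utilization)]


if __name__ == "__main__":
    fp16 = gpus_needed(70, 16, kv_budget_gb=40)  # 140 GB weights + 40 GB cache budget
    awq = gpus_needed(70, 4, kv_budget_gb=40)    # 35 GB weights + 40 GB cache budget
    print("FP16 GPUs:", fp16)
    print("AWQ GPUs: ", awq)
    # "your-org/your-70b-model" is a placeholder, not a real model id.
    print(" ".join(vllm_serve_args("your-org/your-70b-model", fp16)))
```

With the assumed cache budget the sketch arrives at four GPUs for FP16 and two for 4-bit weights; a different budget or GPU type will change the answer.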

Healthcare Example

Educational Example — Not intended for clinical use.

A Reference Healthcare Organization evaluates self-hosted inference for its discharge summary AI. The driver is a concern that sending protected health information (PHI) to an external API requires a business associate agreement (BAA) with a third party and means the data leaves the hospital's network. The architecture evaluation:

  • Option A: Anthropic API with BAA — frontier model quality, $X/month at projected volume (illustrative), PHI leaves network under BAA
  • Option B: Self-hosted Llama-3.1-70B on 2× A100 80GB (AWQ quantization) — lower quality than frontier models (measured in POC), $Y/month GPU cost, PHI stays on-premises

POC evaluation of Option B against discharge summary quality criteria showed the edit rate for the open-weight model was 42% vs. 24% for the frontier model. The clinical informatics team determined the quality gap was clinically significant — physicians spent more time correcting the draft than the tool saved. Decision: Option A (API with BAA) for this quality-sensitive use case; Option B reserved for high-volume, lower-quality-threshold administrative tasks (medical coding, scheduling).

Common Mistakes

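1. Deploying vLLM without tuning --gpu-memory-utilization. This flag sets the fraction of each GPU's memory that vLLM may claim for weights, activations, and KV cache; the default in current releases is 0.90 (verify against the installed version). Too low a value starves the KV cache and causes requests to queue unnecessarily, while too high a value risks out-of-memory errors if anything else shares the GPU. Choose the value deliberately for each deployment and confirm it under load.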

2. Not load testing before production. LLM serving performance is highly non-linear — throughput does not degrade gracefully under overload. Run load tests at 1×, 2×, and 3× expected peak traffic to characterize the failure mode before it occurs in production.

3. Calculating self-hosted ROI without including engineering time. GPU infrastructure costs are visible; the engineering time to operate the infrastructure is invisible in the comparison. Include infrastructure engineering FTE cost in the self-hosted TCO calculation.

4. Using quantization without quality evaluation. Quantization quality impact varies by model and by task. Run the evaluation suite on the quantized model before deploying to production.

5. No API fallback from self-hosted to managed. If the self-hosted GPU cluster goes down, the AI application must fall back to a managed API. An AI gateway with failover routing handles this automatically.
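The failover behavior described in the last item can be sketched in a few lines. The backends here are plain callables standing in for real clients, and the function name is hypothetical; a production gateway would add timeouts, health checks, retry limits, and logging.

```python
from typing import Callable

Backend = Callable[[str], str]  # takes a prompt, returns generated text


def generate_with_fallback(prompt: str, primary: Backend, fallback: Backend) -> tuple[str, str]:
    """Try the primary backend; if it raises, send the same prompt to the fallback.

    Returns (backend_used, generated_text).
    """
    try:
        return "primary", primary(prompt)
    except Exception:
        # Any failure of the self-hosted path triggers the managed API path.
        return "fallback", fallback(prompt)


if __name__ == "__main__":
    def self_hosted_down(prompt: str) -> str:
        raise ConnectionError("GPU cluster unreachable")  # simulated outage

    def managed_api(prompt: str) -> str:
        return f"[managed] response to: {prompt}"  # stand-in for a real API call

    print(generate_with_fallback("Classify this document", self_hosted_down, managed_api))
```

If the fallback sends data to an external vendor, the same compliance constraints that motivated self-hosting still apply to the failover path.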

Best Practices

  • Use managed inference APIs until monthly API costs justify self-hosted infrastructure investment
  • If self-hosting, use vLLM with PagedAttention and continuous batching for production serving
  • Consider AWQ quantization for 70B+ models to roughly halve the number of GPUs required, and confirm the quality cost is acceptable on your own evaluation suite
  • Always place self-hosted inference behind an AI gateway for authentication, logging, and failover
  • Load test at 2–3× expected peak before declaring production readiness
  • Maintain a managed API fallback for self-hosted infrastructure — never make AI availability dependent on a single serving layer
  • Version-pin model weights in the same way production code versions are pinned

Trade-offs

Cost vs. quality: Self-hosted open-weight models are cheaper at scale but often trail frontier models in quality. The right choice depends on the use case's quality threshold.

Control vs. operational burden: Self-hosted inference provides complete control over data, model, and latency — but requires GPU expertise, capacity planning, and 24/7 on-call coverage. Managed APIs eliminate this burden at higher per-token cost.

Latency vs. throughput: Single-request latency and batch throughput are in tension. vLLM's continuous batching optimizes for throughput; individual request time-to-first-token may increase slightly under high concurrency compared to single-request mode.

Interview Questions

Q: Explain continuous batching and why it significantly improves LLM inference throughput compared to traditional batching.

Category: Architecture | Difficulty: Senior | Role: AI Architect / ML Engineer

Answer Framework:

Traditional batching processes requests as a group: all requests in a batch start together and the batch completes when the longest request finishes. Short requests finish early but hold their GPU slot while waiting for long requests — wasting capacity.

Continuous batching (iteration-level scheduling) operates at the token level. Each iteration, the scheduler can add new requests to vacant slots and remove completed requests. When a short request finishes generating, its GPU slot is immediately reallocated to a waiting request — there is no artificial synchronization boundary between requests.

The result is significantly higher GPU utilization: instead of GPU utilization dropping when fast requests finish and wait for slow requests, the GPU stays fully utilized as new requests continuously fill vacated slots. Empirically, continuous batching achieves 2–10× higher throughput than traditional fixed batching at the same latency profile.

Key Points to Hit:

  • Traditional batching synchronizes at request level (group completes together)
  • Continuous batching operates at token level (vacant slots filled immediately)
  • GPU utilization is the key metric improved
  • 2–10× throughput improvement in practice

Red Flags:

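  • Confusing continuous batching with dynamic batching, which varies the batch size by grouping requests that arrive within a time window but still runs each batch to completion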
  • Not understanding the GPU utilization implication

Key Takeaways

  • LLM serving performance is determined by KV cache management and batching strategy, not just model size
  • PagedAttention (vLLM) nearly eliminates KV cache memory waste, enabling 2–4× more concurrent requests per GPU
  • Continuous batching increases GPU utilization dramatically vs. traditional fixed batching
  • Managed APIs are economical for most enterprises; self-hosted is justified at high token volume and for data residency requirements
  • AWQ 4-bit quantization offers a strong quality-to-memory trade-off for production self-hosted deployments, provided quality is validated per task
  • Always place inference behind an AI gateway and maintain a managed API fallback
  • Self-hosted TCO must include engineering time, not just GPU cost

Further Reading