GPU Infrastructure for AI Inference

Executive Summary

GPU infrastructure decisions are among the highest-leverage architectural choices in enterprise AI deployments: the wrong GPU class, the wrong batching strategy, or the wrong quantization setting can produce 10× differences in throughput and cost. This chapter covers the engineering fundamentals of GPU-accelerated AI inference — GPU selection criteria, CUDA memory management, batch inference optimization, quantization trade-offs, and the operational practices required to run GPU infrastructure reliably in production. Healthcare and enterprise contexts are used throughout, including the self-hosted inference patterns required when data residency or PHI restrictions prevent use of managed cloud AI APIs.

Learning Objectives

Select GPU hardware for an inference workload based on model size, throughput requirements, and budget
Understand how batch size, sequence length, and quantization interact with GPU memory capacity
Configure vLLM for production inference with PagedAttention and continuous batching
Implement quantization (INT8, GPTQ, AWQ) and understand the quality impact for each
Monitor GPU utilization, KV cache fill rate, and throughput in production

Business Problem

A Reference Healthcare Organization deploying clinical decision support AI faces a specific constraint: under certain regulatory interpretations PHI cannot be sent to external cloud AI APIs, or the organization's HIPAA Business Associate Agreement does not cover the specific AI provider. Either case requires self-hosted inference, in which the organization operates its own LLM infrastructure within its on-premises or dedicated cloud environment. That in turn creates GPU infrastructure requirements that were previously outside clinical IT's operational domain.

Even organizations using managed cloud APIs need to understand GPU infrastructure when decisions arise about self-hosted model fine-tuning, offline batch inference for large document processing, and cost optimization through model quantization.

Why This Technology Exists

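LLM inference is dominated by large matrix multiplications, and the attention mechanism adds a cost that grows as O(n²) with sequence length. Prompt processing (prefill) is typically compute-bound, while token-by-token generation (decode) is typically limited by memory bandwidth, because the model weights and KV cache are re-read for every generated token. GPUs address both through massive parallelism and high-bandwidth memory: a modern AI accelerator can execute tens of thousands of floating-point operations in parallel per clock cycle, whereas a CPU core executes a handful. For a 70B-parameter LLM, the matrix multiplications in a single forward pass over a long prompt could take minutes on a CPU; high-end GPUs complete them in under a second.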
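The quadratic term can be checked with a back-of-the-envelope operation count. The sketch below counts only the two attention matrix products (QK^T and the weighted sum over V) for one full-sequence pass and ignores causal masking and the linear projections. The default dimensions (64 heads, head size 128, 80 layers) are assumptions chosen to resemble a Llama-style 70B model, not values taken from this chapter.

```python
def attention_score_flops(seq_len: int, head_dim: int = 128,
                          num_heads: int = 64, num_layers: int = 80) -> int:
    """Approximate FLOPs for the attention matrix products over one full sequence.

    Counts QK^T and softmax(QK^T) @ V only; projections and MLP layers,
    which scale linearly with seq_len, are deliberately left out.
    """
    # QK^T: seq_len x seq_len dot products of length head_dim,
    # counted as 2 FLOPs (multiply + add) per element.
    qk = 2 * seq_len * seq_len * head_dim
    # Attention weights times V has the same cost shape.
    av = 2 * seq_len * seq_len * head_dim
    return (qk + av) * num_heads * num_layers


for n in (1024, 2048, 4096):
    print(f"seq_len={n}: {attention_score_flops(n):.3e} FLOPs")

# Doubling the sequence length quadruples the attention cost.
print(attention_score_flops(4096) / attention_score_flops(2048))
```

Because the other layers scale linearly, the attention share of total compute grows with context length, which is one reason long-context workloads need separate capacity planning.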

The CUDA programming model exposes this parallelism to software, and vLLM's PagedAttention and continuous batching are optimizations that extract maximum throughput from GPU hardware for the inference workload. Without them, requests are served in small or fixed batches, KV cache memory is fragmented, and expensive hardware sits dramatically under-utilized.

Quantization Trade-offs

Quantization reduces the numerical precision of model weights and/or activations to reduce memory footprint and potentially improve throughput. Each quantization scheme makes different quality/efficiency trade-offs.
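As a rough illustration of the memory side of that trade-off, the sketch below computes weight memory from parameter count and bytes per parameter. It is arithmetic only: real quantized checkpoints also store scales and zero-points, and some layers are often kept in higher precision, so actual footprints run somewhat higher. The `BYTES_PER_PARAM` table and function name are illustrative.

```python
# Bytes used to store one weight at each precision (illustrative table).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight memory in decimal GB, ignoring quantization metadata."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(70e9, precision)
    print(f"70B parameters at {precision}: {gb:.0f} GB")
```

The memory saving is predictable from this arithmetic; the quality impact is not, and has to be measured on the target task.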

Implementation code omitted in the Playbook edition. For complete code examples, production patterns, and advanced implementation details, see the Enterprise AI Technical Reference.

Architecture Diagram

graph TD
    subgraph "Client Layer"
        CLI["Clinical Workstation"]
        API_GW["AI Gateway\n(rate limit, auth, routing)"]
    end
    subgraph "Inference Tier — High Priority"
        VLLM1["vLLM Server — Node 1\n2× A100 80GB\nContinuous batching"]
        VLLM2["vLLM Server — Node 2\n2× A100 80GB\nPrefix caching enabled"]
    end
    subgraph "Inference Tier — Batch"
        BATCH["Batch Inference Worker\nOff-peak hours\nSpot instances"]
    end
    subgraph "GPU Infrastructure"
        NODE1["GPU Node 1\n4× A100 SXM 80GB\n320GB VRAM total\nNVLink interconnect"]
        NODE2["GPU Node 2\n4× A100 SXM 80GB"]
        DCGM["DCGM Exporter\nGPU metrics"]
    end
    subgraph "Observability"
        PROM["Prometheus\nMetrics collection"]
        GRAFANA["Grafana\nGPU dashboards"]
        ALERT["AlertManager\nOOM / throughput alerts"]
    end
    CLI --> API_GW
    API_GW --> VLLM1 & VLLM2
    API_GW --> BATCH
    VLLM1 & VLLM2 --> NODE1 & NODE2
    BATCH --> NODE1
    NODE1 & NODE2 --> DCGM --> PROM --> GRAFANA
    PROM --> ALERT

Common Mistakes

1. Not accounting for KV cache in VRAM planning. Loading a 70B model in INT8 requires ~75 GB, but the KV cache for a long-context request adds significantly more VRAM. Planning only for model weights and ignoring KV cache leads to OOM errors in production.

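2. Setting gpu_memory_utilization too high. This setting is the fraction of each GPU's memory that vLLM claims for model weights, activations, and its pre-allocated KV cache. A value of 0.95 leaves almost no headroom for allocations outside that budget, such as the CUDA context, communication buffers, and activation spikes on unexpectedly long sequences. Production deployments should start at about 0.85 to reduce the risk of OOM errors.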

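3. Leaving request logging enabled on PHI-handling infrastructure. vLLM's API server can log request prompts, and older releases do so by default. On a PHI-processing server, this writes patient data to logs on disk. For clinical inference, always disable request logging (--disable-log-requests on releases that log by default) and verify the flag name and default for the vLLM version you deploy.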

4. Choosing quantization without evaluating on the target task. INT8 and 4-bit quantization behave differently on different tasks. A 4-bit quantized model may perform well on simple question-answering and poorly on complex clinical reasoning. Always evaluate quantized models against your specific use case.
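The evaluation in mistake 4 can be framed as a simple release gate: score the full-precision and quantized models on the same labeled task set and block deployment if the drop exceeds a tolerance. The sketch below is a minimal version under stated assumptions: models are plain callables from prompt to answer, scoring is exact match, and the 2-point tolerance and stub models are hypothetical. A clinical evaluation would normally use task-specific metrics and clinician review.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Exact-match accuracy of a model over labeled examples."""
    correct = sum(1 for prompt, expected in examples
                  if model(prompt).strip() == expected)
    return correct / len(examples)


def passes_quality_gate(baseline: Callable[[str], str],
                        quantized: Callable[[str], str],
                        examples: Sequence[Example],
                        max_drop: float = 0.02) -> bool:
    """True if the quantized model loses at most max_drop accuracy."""
    drop = accuracy(baseline, examples) - accuracy(quantized, examples)
    return drop <= max_drop


# Stand-in models for illustration only; real use would call inference endpoints.
examples = [("q1", "a"), ("q2", "b"), ("q3", "c"), ("q4", "d")]
baseline_stub = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}.get
quantized_stub = {"q1": "a", "q2": "b", "q3": "c", "q4": "x"}.get

print(accuracy(baseline_stub, examples), accuracy(quantized_stub, examples))
print(passes_quality_gate(baseline_stub, quantized_stub, examples))
```

Running the gate per task category, rather than on one pooled score, makes it easier to see the pattern described above, where simple question-answering holds up and complex reasoning degrades.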

Best Practices

Plan VRAM requirements to include model weights, KV cache, CUDA overhead, and headroom — not model weights alone
Set gpu_memory_utilization to 0.85 in vLLM for production deployments
Disable request logging on PHI-handling inference servers
Evaluate quantized models against your specific clinical task before deploying to production
Monitor KV cache utilization as the leading indicator of GPU capacity pressure
Use tensor parallelism (NVLink) for models that exceed single-GPU VRAM; use pipeline parallelism only as a fallback

Trade-offs

Deployment	VRAM	Throughput	Latency	Cost	PHI Suitability
Managed API (Anthropic, Azure)	None (provider)	High	Low-Medium	Per-token	Depends on BAA
Cloud GPU Instance (on-demand)	Provisioned	High	Low	Per-hour	Depends on region
Cloud GPU Instance (spot)	Provisioned	High	Low	Per-hour, ~70% below on-demand; interruptible	Same as on-demand
On-premises GPU	Capital expense	High	Low	CapEx + OpEx	Full control

Interview Questions

Q: A healthcare organization wants to deploy a self-hosted 70B-parameter LLM for clinical decision support. Estimate the GPU infrastructure required and describe the key configuration decisions.

Category: System Design
Difficulty: Principal
Role: AI Architect

Answer Framework:

A 70B-parameter model in FP16 requires approximately 140 GB of VRAM for model weights alone (illustrative — verify with the specific model and framework). Adding KV cache for a 4K context length at batch size 16 adds roughly 30–50 GB. Total VRAM requirement: approximately 170–200 GB for a production serving configuration.
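The KV cache term can be estimated from the model architecture rather than taken on faith. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head) and assumes a Llama-style 70B layout with 80 layers and head size 128. Those dimensions, and the comparison between 8 KV heads (grouped-query attention) and 64 KV heads (full multi-head attention), are assumptions for illustration.

```python
def kv_cache_gb(tokens: int, num_layers: int = 80, num_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in decimal GB for a given number of cached tokens.

    tokens is the total across the batch, e.g. batch_size * context_length.
    """
    # Factor of 2 covers both the key and the value tensor.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


batch_size, context_length = 16, 4096
tokens = batch_size * context_length

# Grouped-query attention (8 KV heads) versus full multi-head attention (64).
print(f"GQA: {kv_cache_gb(tokens):.1f} GB")
print(f"MHA: {kv_cache_gb(tokens, num_kv_heads=64):.1f} GB")
```

The two results differ by a factor of eight, which is why a KV cache figure should always be recomputed for the specific model, context length, and batch size being planned.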

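This requires either 4× GPUs with 80 GB VRAM each (with NVLink), or 2× with INT8 quantization (~75 GB for weights, about 105–125 GB once the KV cache is included). Three GPUs is not a practical option in either case, because the tensor-parallel degree normally has to divide the model's attention-head count evenly. I would recommend 4× 80 GB GPUs with NVLink in FP16 for clinical use because INT8 quantization requires empirical validation of quality impact on the specific clinical tasks being supported.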
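The GPU count for the FP16 case can be derived in two steps: divide the required VRAM by the usable VRAM per GPU, then round up to a count that works as a tensor-parallel degree. The sketch below assumes 80 GB GPUs, a 0.85 memory utilization setting, and a model with 64 attention heads. The rule that the degree must divide the head count evenly reflects how tensor-parallel serving frameworks commonly shard attention, and should be confirmed for the framework in use.

```python
import math


def min_gpus(required_gb: float, gpu_vram_gb: float = 80.0,
             utilization: float = 0.85, num_attention_heads: int = 64) -> int:
    """Smallest GPU count that fits required_gb and divides the head count."""
    usable_per_gpu = gpu_vram_gb * utilization
    # Step 1: capacity. How many GPUs are needed to hold everything?
    count = math.ceil(required_gb / usable_per_gpu)
    # Step 2: sharding. Increase until attention heads split evenly across GPUs.
    while num_attention_heads % count != 0:
        count += 1
    return count


for required in (170, 200):
    print(f"{required} GB required -> {min_gpus(required)} GPUs")
```

Under these assumptions, capacity alone would suggest three GPUs for both ends of the 170–200 GB range, and the sharding constraint pushes the answer to four.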

Configuration decisions: tensor parallel size = 4 (spreads model across all 4 GPUs), gpu_memory_utilization = 0.85, enable prefix caching for shared system prompt and retrieved context, disable request logging (PHI protection), implement TLS and API key authentication.
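One way to keep these decisions reviewable is to hold them in a small config object and generate the server command from it. The sketch below builds an argument list for `vllm serve`. The flags shown match vLLM's documented CLI at the time of writing and should be verified against the installed version. The model name is a placeholder, and `ServeConfig` and `build_vllm_command` are hypothetical helpers, not part of vLLM.

```python
from dataclasses import dataclass


@dataclass
class ServeConfig:
    model: str
    tensor_parallel_size: int = 4
    gpu_memory_utilization: float = 0.85
    enable_prefix_caching: bool = True


def build_vllm_command(cfg: ServeConfig) -> list[str]:
    """Return an argv list suitable for subprocess.run()."""
    cmd = [
        "vllm", "serve", cfg.model,
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
    ]
    if cfg.enable_prefix_caching:
        cmd.append("--enable-prefix-caching")
    # Request logging is intentionally not handled here: the flag name and
    # default differ between vLLM releases, so set it explicitly per version.
    # TLS termination and API keys are likewise left to deployment config.
    return cmd


cfg = ServeConfig(model="your-org/your-70b-model")  # placeholder model name
print(" ".join(build_vllm_command(cfg)))
```

Generating the command from a typed object makes it straightforward to assert in CI that PHI-relevant settings have not drifted between environments.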

Operational requirements: NVIDIA DCGM for GPU monitoring, alert on KV cache utilization >90%, alert on TTFT p95 >2 seconds, maintain a separate HIPAA audit log of user-level access.
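The two alert conditions above reduce to simple threshold checks. In production they would normally be written as Prometheus alerting rules over exported metrics; the Python sketch below only illustrates the logic, using a nearest-rank p95. The function names, alert labels, and sample values are hypothetical.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evaluate_alerts(kv_cache_utilization: float, ttft_seconds: list[float],
                    kv_threshold: float = 0.90,
                    ttft_p95_threshold: float = 2.0) -> list[str]:
    """Return the names of alerts that should fire for this window."""
    alerts = []
    if kv_cache_utilization > kv_threshold:
        alerts.append("kv_cache_utilization_high")
    if p95(ttft_seconds) > ttft_p95_threshold:
        alerts.append("ttft_p95_high")
    return alerts


# Hypothetical window: cache at 93%, two slow requests out of twenty.
window_ttft = [0.4] * 18 + [3.0, 3.0]
print(evaluate_alerts(0.93, window_ttft))
```

KV cache utilization tends to rise before latency does, so the first alert usually gives earlier warning than the second.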

Key Points to Hit:

VRAM calculation: model weights + KV cache + overhead
Tensor parallelism (NVLink) for multi-GPU scaling
PHI-specific configuration: disable request logging
Monitoring: KV cache utilization as the leading indicator

Key Takeaways

GPU VRAM planning must include model weights, KV cache, CUDA overhead, and 15% headroom — not model weights alone
vLLM's PagedAttention and continuous batching are the core throughput optimizations for production inference
INT8 quantization approximately halves weight VRAM relative to FP16, and 4-bit (GPTQ/AWQ) reduces it by ~75% with meaningful quality trade-offs; always evaluate against your specific task
KV cache utilization is the leading indicator of GPU capacity pressure; monitor it as a production SLI
PHI-handling inference servers require request logging to be disabled; implement a separate PHI-safe audit log
