GPU Infrastructure for AI Inference
Executive Summary
GPU infrastructure decisions are among the highest-leverage architectural choices in enterprise AI deployments: the wrong GPU class, the wrong batching strategy, or the wrong quantization setting can produce 10× differences in throughput and cost. This chapter covers the engineering fundamentals of GPU-accelerated AI inference: GPU selection criteria, CUDA memory management, batch inference optimization, quantization trade-offs, and the operational practices required to run GPU infrastructure reliably in production. Healthcare and enterprise contexts are used throughout, including the self-hosted inference patterns required when data residency or PHI restrictions prevent use of managed cloud AI APIs.
Learning Objectives
- Select GPU hardware for an inference workload based on model size, throughput requirements, and budget
- Understand how batch size, sequence length, and quantization interact with GPU memory capacity
- Configure vLLM for production inference with PagedAttention and continuous batching
- Implement quantization (INT8, GPTQ, AWQ) and understand the quality impact for each
- Monitor GPU utilization, KV cache fill rate, and throughput in production
Business Problem
A Reference Healthcare Organization deploying clinical decision support AI faces a specific constraint: PHI cannot be sent to external cloud AI APIs under certain regulatory interpretations, or the organization's HIPAA Business Associate Agreement does not cover the specific AI provider. This requires self-hosted inference: the organization operates its own LLM infrastructure within its on-premises or dedicated cloud environment. This creates GPU infrastructure requirements that were previously outside clinical IT's operational domain.
Even organizations using managed cloud APIs need to understand GPU infrastructure when decisions arise about self-hosted model fine-tuning, offline batch inference for large document processing, and cost optimization through model quantization.
Why This Technology Exists
LLM inference is computationally demanding: the attention mechanism has O(n²) complexity with respect to sequence length, and every forward pass runs large matrix multiplications over the model weights. GPUs accelerate this computation through massive parallelism: a modern AI accelerator can execute tens of thousands of parallel floating-point operations per clock cycle, whereas a CPU core executes at most a few dozen. For a 70B-parameter LLM, the matrix multiplications in a single forward pass can take minutes on a CPU; high-end GPU hardware completes them in under a second.
The CUDA programming model exposes this parallelism to software, and vLLM's PagedAttention and continuous batching are specific optimizations that extract maximum throughput from GPU hardware for the inference workload. Without them, the GPU sits dramatically under-utilized and inference throughput falls far short of what the hardware can deliver.
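To make the benefit of continuous batching concrete, the toy simulation below counts decode steps for the same set of requests under static batching (a batch runs until its longest request finishes) and continuous batching (a freed slot is refilled immediately). It is a simplified sketch, not vLLM's scheduler: it ignores prefill, memory limits, and preemption, and the request lengths are made-up values.

```python
def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)  # short requests wait for the longest one
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a finished request's slot is refilled right away."""
    pending = list(reversed(output_lengths))  # pop() then yields FIFO order
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while pending or active:
        # Fill any free slots before the next decode step.
        while pending and len(active) < batch_size:
            active.append(pending.pop())
        steps += 1
        # Each active request emits one token; drop those that just finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical output lengths (tokens) for eight requests.
lengths = [10, 200, 15, 20, 180, 12, 30, 25]
print("static:", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

The gap between the two counts grows with the variance in output length, which is why mixed workloads benefit most from continuous batching.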
Quantization Trade-offs
Quantization reduces the numerical precision of model weights and/or activations to reduce memory footprint and potentially improve throughput. Each quantization scheme makes different quality/efficiency trade-offs.
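As a rough illustration of the basic idea, the sketch below applies symmetric per-tensor INT8 quantization (absmax scaling) to a handful of made-up weights and measures the round-trip error. This is the simplest possible scheme, shown only to make the precision trade-off tangible; GPTQ and AWQ use more sophisticated, calibration-based methods that are not reproduced here.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Made-up example weights; real tensors hold millions of values.
weights = [0.02, -0.5, 0.31, 0.0, 0.127]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight now needs one byte instead of two (FP16), at the cost of a rounding error of at most half the scale per value. Whether that error matters depends on the task, which is why evaluation on the target workload is needed.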
Implementation code omitted in the Playbook edition. For complete code examples, production patterns, and advanced implementation details, see the Enterprise AI Technical Reference.
Architecture Diagram
Common Mistakes
1. Not accounting for KV cache in VRAM planning. Loading a 70B model in INT8 requires ~75 GB, but the KV cache for a long-context request adds significantly more VRAM. Planning only for model weights and ignoring KV cache leads to OOM errors in production.
2. Setting `gpu_memory_utilization` too high. A value of 0.95 leaves almost no headroom for allocations outside vLLM's preallocated memory pool, such as the CUDA context and activation spikes. Production deployments should set this to 0.85 to reduce the risk of OOM errors on unexpectedly long sequences.
3. Enabling request logging on PHI-handling infrastructure. Depending on the version, vLLM logs request prompts by default. On a PHI-processing server, this writes patient data to disk. Always pass `--disable-log-requests` (or confirm that request logging is off in your vLLM version) for clinical inference.
4. Choosing quantization without evaluating on the target task. INT8 and 4-bit quantization behave differently on different tasks. A 4-bit quantized model may perform well on simple question-answering and poorly on complex clinical reasoning. Always evaluate quantized models against your specific use case.
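The KV cache point in mistake 1 can be estimated with simple arithmetic: each token stores one key and one value vector per layer per KV head. The sketch below assumes that standard transformer layout; the layer count, KV head count, and head dimension are hypothetical values chosen for illustration, so substitute the figures from your model's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size in bytes for a full batch at full context.

    The leading factor of 2 covers the key tensor plus the value tensor.
    bytes_per_value=2 assumes FP16/BF16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3

# Hypothetical 80-layer model with head dimension 128 (assumed values).
# Grouped-query attention (8 KV heads) vs. full multi-head attention (64).
for kv_heads in (8, 64):
    size = kv_cache_bytes(num_layers=80, num_kv_heads=kv_heads, head_dim=128,
                          seq_len=4096, batch_size=16)
    print(f"{kv_heads} KV heads: {size / GIB:.1f} GiB")
```

The estimate scales linearly with context length and batch size, and it depends strongly on the number of KV heads, so two models with the same parameter count can have very different KV cache footprints.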
Best Practices
- Plan VRAM requirements to include model weights, KV cache, CUDA overhead, and headroom, not model weights alone
- Set `gpu_memory_utilization` to 0.85 in vLLM for production deployments
- Disable request logging on PHI-handling inference servers
- Evaluate quantized models against your specific clinical task before deploying to production
- Monitor KV cache utilization as the leading indicator of GPU capacity pressure
- Use tensor parallelism (NVLink) for models that exceed single-GPU VRAM; use pipeline parallelism only as a fallback
Trade-offs
| Deployment | VRAM | Throughput | Latency | Cost | PHI Suitability |
|---|---|---|---|---|---|
| Managed API (Anthropic, Azure) | None (provider) | High | Low-Medium | Per-token | Depends on BAA |
| Cloud GPU Instance (on-demand) | Provisioned | High | Low | Per-hour | Depends on region |
| Cloud GPU Instance (spot) | Provisioned | High | Low | ~70% discount | Same as on-demand |
| On-premises GPU | Capital expense | High | Low | CapEx + OpEx | Full control |
Interview Questions
Q: A healthcare organization wants to deploy a self-hosted 70B-parameter LLM for clinical decision support. Estimate the GPU infrastructure required and describe the key configuration decisions.
Category: System Design; Difficulty: Principal; Role: AI Architect
Answer Framework:
A 70B-parameter model in FP16 requires approximately 140 GB of VRAM for model weights alone (illustrative; verify with the specific model and framework). Adding KV cache for a 4K context length at batch size 16 adds roughly 30–50 GB. Total VRAM requirement: approximately 170–200 GB for a production serving configuration.
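This requires either 4× GPUs with 80 GB VRAM each (with NVLink), or 3× with INT8 quantization (~75 GB for weights), provided the serving framework supports that GPU count: tensor parallelism typically requires the GPU count to divide the model's attention-head count evenly. I would recommend 4× 80 GB GPUs with NVLink in FP16 for clinical use because INT8 quantization requires empirical validation of quality impact on the specific clinical tasks being supported.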
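The GPU count for the FP16 case can be sanity-checked with a few lines of arithmetic. The sketch below assumes 80 GB GPUs, a 0.85 memory utilization target, and a model with 64 attention heads (an assumed value); it also assumes the tensor-parallel degree must divide the head count evenly, which is a common framework constraint worth verifying for your serving stack.

```python
import math


def gpus_needed(total_vram_gb, gpu_vram_gb=80, utilization=0.85,
                num_attention_heads=64):
    """Return (minimum GPUs by capacity, GPUs after the divisibility rule)."""
    usable_per_gpu = gpu_vram_gb * utilization
    minimum = math.ceil(total_vram_gb / usable_per_gpu)
    if minimum > num_attention_heads:
        raise ValueError("requirement exceeds one GPU per attention head")
    # Round up to the next count that divides the attention-head count.
    count = minimum
    while num_attention_heads % count != 0:
        count += 1
    return minimum, count


# Upper end of the FP16 estimate above: ~200 GB total.
minimum, tensor_parallel = gpus_needed(200)
print("minimum by capacity:", minimum)
print("tensor-parallel friendly count:", tensor_parallel)
```

Under these assumptions, capacity alone and the divisibility rule can give different answers, which is one reason GPU counts for tensor parallelism tend to be powers of two.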
Configuration decisions: tensor parallel size = 4 (spreads model across all 4 GPUs), `gpu_memory_utilization` = 0.85, enable prefix caching for shared system prompt and retrieved context, disable request logging (PHI protection), implement TLS and API key authentication.
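These decisions map onto a vLLM server launch command roughly as sketched below. The helper, the model identifier, and the certificate paths are hypothetical, and the flag names should be verified against the installed vLLM version (for example, newer releases change how request logging is controlled). API key handling is deliberately left out so that no secret appears on the command line.

```python
import shlex


def build_vllm_command(model, tensor_parallel_size=4,
                       gpu_memory_utilization=0.85,
                       ssl_keyfile=None, ssl_certfile=None):
    """Assemble an argument list for launching a vLLM OpenAI-compatible server."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-prefix-caching",   # reuse KV cache for shared prompt prefixes
        "--disable-log-requests",    # keep prompts (PHI) out of server logs;
                                     # check this flag against your version
    ]
    if ssl_keyfile and ssl_certfile:
        args += ["--ssl-keyfile", ssl_keyfile, "--ssl-certfile", ssl_certfile]
    return args


# Hypothetical model name and certificate paths, for illustration only.
command = build_vllm_command(
    "your-org/your-70b-model",
    ssl_keyfile="/etc/inference/tls/server.key",
    ssl_certfile="/etc/inference/tls/server.crt",
)
print(shlex.join(command))
```

Building the command as a list makes it easy to review in code and to pass to a process supervisor without shell quoting problems.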
Operational requirements: NVIDIA DCGM for GPU monitoring, alert on KV cache utilization >90%, alert on TTFT p95 >2 seconds, maintain a separate HIPAA audit log of user-level access.
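The two alert thresholds above can be expressed as a small check. This is a minimal sketch that assumes the KV cache utilization (as a fraction) and a window of time-to-first-token samples have already been collected from your metrics pipeline; the function and alert names are hypothetical, and a real deployment would normally express these rules in its monitoring system instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(kv_cache_utilization, ttft_seconds,
                    kv_threshold=0.90, ttft_p95_threshold=2.0):
    """Return the names of the alerts that should fire."""
    alerts = []
    if kv_cache_utilization > kv_threshold:
        alerts.append("kv_cache_utilization")
    # Skip the latency check when no requests were observed in the window.
    if ttft_seconds and percentile(ttft_seconds, 95) > ttft_p95_threshold:
        alerts.append("ttft_p95")
    return alerts


# Made-up window: most requests are fast, two are slow.
window = [0.4] * 18 + [2.5, 3.0]
print(evaluate_alerts(kv_cache_utilization=0.93, ttft_seconds=window))
```

Tracking p95 instead of the mean keeps a small number of slow requests visible, which is usually where capacity pressure shows up first.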
Key Points to Hit:
- VRAM calculation: model weights + KV cache + overhead
- Tensor parallelism (NVLink) for multi-GPU scaling
- PHI-specific configuration: disable request logging
- Monitoring: KV cache utilization as the leading indicator
Key Takeaways
- GPU VRAM planning must include model weights, KV cache, CUDA overhead, and 15% headroom, not model weights alone
- vLLM's PagedAttention and continuous batching are the core throughput optimizations for production inference
- INT8 quantization approximately halves weight VRAM requirements relative to FP16; 4-bit (GPTQ/AWQ) reduces them by ~75% with meaningful quality trade-offs, so always evaluate against your specific task
- KV cache utilization is the leading indicator of GPU capacity pressure; monitor it as a production SLI
- PHI-handling inference servers require request logging to be disabled; implement a separate PHI-safe audit log
Further Reading
- LLM Serving Infrastructure: vLLM configuration and serving patterns
- Cloud AI Platforms: Managed API alternatives to self-hosted inference
- Healthcare AI Landscape: PHI and data residency requirements