GPU Infrastructure for AI Inference
Quantization Trade-offs
Quantization lowers the numerical precision of model weights and/or activations to shrink the memory footprint and potentially improve throughput. Each quantization scheme trades quality against efficiency differently.
QUANTIZATION_COMPARISON = {
    "FP16": {
        "bits": 16,
        "memory_reduction": "baseline",
        "quality": "Full quality; identical to training precision",
        "throughput": "Baseline throughput",
        "when_to_use": "Default for production inference when VRAM is available",
        "caveats": "Highest VRAM requirement of the options listed",
    },
    "INT8": {
        "bits": 8,
        "memory_reduction": "~50%",
        "quality": "Near-identical for most tasks; measure on your use case",
        "throughput": "Neutral to slight improvement on supported hardware",
        "when_to_use": "When FP16 doesn't fit; preferred quantization for accuracy-sensitive use cases",
        "caveats": "Calibration dataset affects quality; verify per-task performance",
    },
    "GPTQ": {
        "bits": "4 (grouped)",
        "memory_reduction": "~75%",
        "quality": "Noticeable degradation on complex reasoning tasks",
        "throughput": "Good throughput improvement",
        "when_to_use": "Hardware-constrained environments; less accuracy-critical tasks",
        "caveats": "Requires pre-quantized model; quality varies by model and task",
    },
    "AWQ": {
        "bits": "4 (channel-wise)",
        "memory_reduction": "~75%",
        "quality": "Better than GPTQ for same bit width; state of the art for 4-bit",
        "throughput": "Good throughput improvement",
        "when_to_use": "Preferred over GPTQ when 4-bit quantization is required",
        "caveats": "Requires pre-quantized model; calibration on representative data is important",
    },
    "GGUF": {
        "bits": "2-8 (configurable per layer)",
        "memory_reduction": "Variable",
        "quality": "Depends on bit depth; Q5/Q6 near-lossless",
        "throughput": "CPU-optimized; lower GPU throughput than native formats",
        "when_to_use": "CPU inference or consumer GPU hardware; Ollama deployments",
        "caveats": "Not optimized for high-throughput server inference",
    },
}
Architecture Diagram
Common Mistakes
1. Not accounting for KV cache in VRAM planning. Loading a 70B model in INT8 requires ~75 GB, but the KV cache for a long-context request adds significantly more VRAM. Planning only for model weights and ignoring KV cache leads to OOM errors in production.
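2. Setting gpu_memory_utilization too high. A value of 0.95 leaves almost no headroom outside the memory vLLM reserves for weights and KV cache, so activation spikes or other processes on the GPU can push it over the limit. Production deployments should set this to 0.85 to reduce the risk of OOM errors on unexpectedly long sequences.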
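3. Enabling request logging on PHI-handling infrastructure. Depending on the version, vLLM may log request details, including prompts, by default. On a PHI-processing server, this writes patient data to disk. Always pass --disable-log-requests for clinical inference, or confirm that request logging is off in the version you run.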
4. Choosing quantization without evaluating on the target task. INT8 and 4-bit quantization behave differently on different tasks. A 4-bit quantized model may perform well on simple question-answering and poorly on complex clinical reasoning. Always evaluate quantized models against your specific use case.
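The first two mistakes are mostly arithmetic. The following is a minimal sketch, assuming weight memory is roughly parameters × bits / 8 and that each GPU's usable memory is capped by the utilization fraction. The function names and the overhead figure are hypothetical, and real frameworks add overhead that this estimate ignores.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same precision and
    quantization metadata (scales, zero points) is ignored.
    """
    return params_billions * bits_per_weight / 8


def fits_in_budget(
    weights_gb: float,
    kv_cache_gb: float,
    overhead_gb: float,
    num_gpus: int,
    gpu_vram_gb: float,
    utilization: float = 0.85,
) -> bool:
    """Return True if weights + KV cache + overhead fit in the usable VRAM."""
    required = weights_gb + kv_cache_gb + overhead_gb
    usable = num_gpus * gpu_vram_gb * utilization
    return required <= usable


if __name__ == "__main__":
    int8_weights = weight_memory_gb(70, 8)
    # Weights alone already exceed one 80 GB GPU at 0.85 utilization.
    print(fits_in_budget(int8_weights, 0, 0, num_gpus=1, gpu_vram_gb=80))
    # FP16 weights plus an assumed 40 GB KV cache and 10 GB overhead on 4 GPUs.
    print(fits_in_budget(weight_memory_gb(70, 16), 40, 10, num_gpus=4, gpu_vram_gb=80))
```

Running the check with the KV cache set to zero is a quick way to see how misleading a weights-only plan is: a configuration that passes on weights alone can still fail once a realistic cache size is added.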
Best Practices
- Plan VRAM requirements to include model weights, KV cache, CUDA overhead, and headroom — not model weights alone
- Set gpu_memory_utilization to 0.85 in vLLM for production deployments
- Disable request logging on PHI-handling inference servers
- Evaluate quantized models against your specific clinical task before deploying to production
- Monitor KV cache utilization as the leading indicator of GPU capacity pressure
- Use tensor parallelism (NVLink) for models that exceed single-GPU VRAM; use pipeline parallelism only as a fallback
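The advice to evaluate quantized models before deployment can be turned into a simple acceptance gate. This is a sketch only: it assumes an exact-match accuracy metric and a hypothetical `max_drop` tolerance, neither of which comes from the text above. A clinical evaluation would normally use task-specific metrics and a much larger labeled set.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy; assumes one prediction per reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("evaluation set is empty")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def quantization_passes(
    baseline_preds: Sequence[str],
    quantized_preds: Sequence[str],
    references: Sequence[str],
    max_drop: float = 0.01,  # hypothetical tolerance; choose per task
) -> bool:
    """Accept the quantized model only if accuracy drops by at most max_drop."""
    drop = accuracy(baseline_preds, references) - accuracy(quantized_preds, references)
    return drop <= max_drop


if __name__ == "__main__":
    refs = ["a", "b", "c", "d"]
    fp16_out = ["a", "b", "c", "d"]
    int4_out = ["a", "b", "c", "x"]  # one regression on this toy set
    print(quantization_passes(fp16_out, int4_out, refs))
```

Comparing against the full-precision baseline on the same examples, rather than against an absolute score, isolates the effect of quantization from the difficulty of the task.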
Trade-offs
| Deployment | VRAM | Throughput | Latency | Cost | PHI Suitability |
|---|---|---|---|---|---|
| Managed API (Anthropic, Azure) | None (provider) | High | Low-Medium | Per-token | Depends on BAA |
| Cloud GPU Instance (on-demand) | Provisioned | High | Low | Per-hour | Depends on region |
| Cloud GPU Instance (spot) | Provisioned | High | Low | Per-hour, ~70% below on-demand | Same as on-demand |
| On-premises GPU | Owned hardware | High | Low | CapEx + OpEx | Full control |
Interview Questions
Q: A healthcare organization wants to deploy a self-hosted 70B-parameter LLM for clinical decision support. Estimate the GPU infrastructure required and describe the key configuration decisions.
Category: System Design | Difficulty: Principal | Role: AI Architect
Answer Framework:
A 70B-parameter model in FP16 requires approximately 140 GB of VRAM for model weights alone (illustrative — verify with the specific model and framework). Adding KV cache for a 4K context length at batch size 16 adds roughly 30–50 GB. Total VRAM requirement: approximately 170–200 GB for a production serving configuration.
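The KV cache term in this estimate can be approximated from the model architecture. The sketch below uses the common per-token formula (one key and one value tensor per layer) with illustrative architecture values that are assumptions, not figures from a specific model card. The result is very sensitive to the number of KV heads, so estimates vary widely between models.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # FP16/BF16 cache entries
) -> int:
    """Approximate KV cache size in bytes for a full batch at full context."""
    # Factor of 2: each layer stores one key and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed architecture: 80 layers, head dimension 128.
    grouped = kv_cache_bytes(80, 8, 128, context_len=4096, batch_size=16)
    full = kv_cache_bytes(80, 64, 128, context_len=4096, batch_size=16)
    print(f"8 KV heads (grouped-query attention): {grouped / GIB:.0f} GiB")
    print(f"64 KV heads (full multi-head attention): {full / GIB:.0f} GiB")
```

Under these assumptions the two configurations differ by a factor of eight, which is why a single rule-of-thumb range should be checked against the actual model configuration.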
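This requires either 4× GPUs with 80 GB VRAM each (with NVLink) in FP16, or 2× 80 GB GPUs with INT8 quantization (~75 GB for weights, roughly 105–125 GB with KV cache). Tensor-parallel sizes generally have to divide the model's attention head count evenly, which usually rules out odd counts such as 3. I would recommend 4× 80 GB GPUs with NVLink in FP16 for clinical use because INT8 quantization requires empirical validation of quality impact on the specific clinical tasks being supported.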
Configuration decisions: tensor_parallel_size = 4 (shards the model across all 4 GPUs), gpu_memory_utilization = 0.85, enable prefix caching for the shared system prompt and retrieved context, disable request logging (PHI protection), and implement TLS and API key authentication.
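These settings might translate into a vLLM server launch along the following lines. This is a sketch that only assembles the command line. The model identifier is a hypothetical placeholder, TLS and API key settings are left out, and flag names have changed between vLLM releases (request logging in particular), so check them against the version you deploy.

```python
import shlex
from typing import List


def build_vllm_args(
    model: str,
    tensor_parallel_size: int = 4,
    gpu_memory_utilization: float = 0.85,
) -> List[str]:
    """Assemble a `vllm serve` command line for a PHI-handling deployment."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-prefix-caching",   # reuse KV cache for shared prompt prefixes
        "--disable-log-requests",    # keep request content out of server logs
    ]


if __name__ == "__main__":
    # "org/model-70b" is a placeholder, not a real model identifier.
    print(shlex.join(build_vllm_args("org/model-70b")))
```

Building the argument list in code rather than in a shell script makes it straightforward to assert in CI that the PHI-related flags are always present.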
Operational requirements: NVIDIA DCGM for GPU monitoring, alerts on KV cache utilization >90% and on time-to-first-token (TTFT) p95 >2 seconds, and a separate HIPAA audit log of user-level access.
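The two alert thresholds can be expressed as a small check over collected metrics. This sketch assumes the KV cache utilization arrives as a fraction between 0 and 1 and the TTFT samples as seconds. The alert names are hypothetical, and in practice the rules would more likely live in the monitoring system than in application code.

```python
from typing import List, Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample set."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_alerts(
    kv_cache_utilization: float,
    ttft_seconds: Sequence[float],
    kv_threshold: float = 0.90,
    ttft_p95_threshold: float = 2.0,
) -> List[str]:
    """Return the names of alerts that should fire for the given metrics."""
    alerts = []
    if kv_cache_utilization > kv_threshold:
        alerts.append("kv_cache_utilization")
    if ttft_seconds and p95(ttft_seconds) > ttft_p95_threshold:
        alerts.append("ttft_p95")
    return alerts


if __name__ == "__main__":
    print(check_alerts(0.95, [0.4, 0.5, 0.6, 3.1]))
```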
Key Points to Hit:
- VRAM calculation: model weights + KV cache + overhead
- Tensor parallelism (NVLink) for multi-GPU scaling
- PHI-specific configuration: disable request logging
- Monitoring: KV cache utilization as the leading indicator
Key Takeaways
- GPU VRAM planning must include model weights, KV cache, CUDA overhead, and 15% headroom — not model weights alone
- vLLM's PagedAttention and continuous batching are the core throughput optimizations for production inference
- INT8 quantization approximately halves VRAM requirements; 4-bit (GPTQ/AWQ) reduces VRAM by ~75% with meaningful quality trade-offs — always evaluate against your specific task
- KV cache utilization is the leading indicator of GPU capacity pressure; monitor it as a production SLI
- PHI-handling inference servers require request logging to be disabled; implement a separate PHI-safe audit log