GPU Infrastructure for AI Inference

Quantization Trade-offs

Quantization lowers the numerical precision of model weights and/or activations to shrink the memory footprint and, on supported hardware, improve throughput. Each quantization scheme makes a different quality/efficiency trade-off.

python

QUANTIZATION_COMPARISON = {
    "FP16": {
        "bits": 16,
        "memory_reduction": "baseline",
        "quality": "Reference quality; no quantization loss relative to a 16-bit checkpoint",
        "throughput": "Baseline throughput",
        "when_to_use": "Default for production inference when VRAM is available",
        "caveats": "Highest VRAM requirement",
    },
    "INT8": {
        "bits": 8,
        "memory_reduction": "~50%",
        "quality": "Near-identical for most tasks; measure on your use case",
        "throughput": "Neutral to slight improvement on supported hardware",
        "when_to_use": "When FP16 doesn't fit; preferred quantization for accuracy-sensitive use cases",
        "caveats": "Calibration dataset affects quality; verify per-task performance",
    },
    "GPTQ": {
        "bits": "4 (grouped)",
        "memory_reduction": "~75%",
        "quality": "Noticeable degradation on complex reasoning tasks",
        "throughput": "Good throughput improvement",
        "when_to_use": "Hardware-constrained environments; less accuracy-critical tasks",
        "caveats": "Requires pre-quantized model; quality varies by model and task",
    },
    "AWQ": {
        "bits": "4 (channel-wise)",
        "memory_reduction": "~75%",
        "quality": "Better than GPTQ for same bit width; state of the art for 4-bit",
        "throughput": "Good throughput improvement",
        "when_to_use": "Preferred over GPTQ when 4-bit quantization is required",
        "caveats": "Requires pre-quantized model; calibration on representative data is important",
    },
    "GGUF": {
        "bits": "2-8 (configurable per layer)",
        "memory_reduction": "Variable",
        "quality": "Depends on bit depth; Q5/Q6 near-lossless",
        "throughput": "CPU-optimized; lower GPU throughput than native formats",
        "when_to_use": "CPU inference or consumer GPU hardware; Ollama deployments",
        "caveats": "Not optimized for high-throughput server inference",
    },
}
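The memory-reduction figures in the table follow from bytes per parameter. A minimal sketch, assuming weight memory scales linearly with bit width and ignoring the per-group scale and zero-point metadata that makes real 4-bit checkpoints somewhat larger. The function name is illustrative, not part of any library.

```python
def weight_memory_gb(params_billions: float, bits: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits / 8
    return params_billions * bytes_per_param


# Compare a 70B-parameter model at three bit widths.
for name, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(70, bits)
    reduction = 1 - bits / 16
    print(f"{name}: ~{gb:.0f} GB weights, {reduction:.0%} smaller than FP16")
```

These are weight-only figures; KV cache and runtime overhead come on top and are not reduced by weight quantization.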

Architecture Diagram

graph TD
    subgraph "Client Layer"
        CLI["Clinical Workstation"]
        API_GW["AI Gateway\n(rate limit, auth, routing)"]
    end
    subgraph "Inference Tier - High Priority"
        VLLM1["vLLM Server - Node 1\n2× A100 80GB\nContinuous batching"]
        VLLM2["vLLM Server - Node 2\n2× A100 80GB\nPrefix caching enabled"]
    end
    subgraph "Inference Tier - Batch"
        BATCH["Batch Inference Worker\nOff-peak hours\nSpot instances"]
    end
    subgraph "GPU Infrastructure"
        NODE1["GPU Node 1\n4× A100 SXM 80GB\n320GB VRAM total\nNVLink interconnect"]
        NODE2["GPU Node 2\n4× A100 SXM 80GB"]
        DCGM["DCGM Exporter\nGPU metrics"]
    end
    subgraph "Observability"
        PROM["Prometheus\nMetrics collection"]
        GRAFANA["Grafana\nGPU dashboards"]
        ALERT["AlertManager\nOOM / throughput alerts"]
    end
    CLI --> API_GW
    API_GW --> VLLM1 & VLLM2
    API_GW --> BATCH
    VLLM1 & VLLM2 --> NODE1 & NODE2
    BATCH --> NODE1
    NODE1 & NODE2 --> DCGM --> PROM --> GRAFANA
    PROM --> ALERT

Common Mistakes

1. Not accounting for KV cache in VRAM planning. Loading a 70B model in INT8 requires ~75 GB, but the KV cache for a long-context request adds significantly more VRAM. Planning only for model weights and ignoring KV cache leads to OOM errors in production.

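2. Setting gpu_memory_utilization too high. This vLLM setting is the fraction of each GPU's memory the server may claim for weights, activations, and KV cache. A value of 0.95 leaves almost no headroom for anything outside that budget, such as the CUDA context, other processes, or allocation spikes. A conservative starting point for production is 0.85, which reduces the risk of OOM errors on unexpectedly long sequences.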

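3. Enabling request logging on PHI-handling infrastructure. vLLM's API server can log request prompts, and some releases do so by default. On a PHI-processing server, this writes patient data into server logs. Pass --disable-log-requests for clinical inference, and confirm the logging default for the vLLM version you run.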

4. Choosing quantization without evaluating on the target task. INT8 and 4-bit quantization behave differently on different tasks. A 4-bit quantized model may perform well on simple question-answering and poorly on complex clinical reasoning. Always evaluate quantized models against your specific use case.
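The KV cache cost in the first mistake can be estimated from the model shape. A minimal sketch, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token. The shape used below (80 layers, 8 KV heads, head dimension 128) is a hypothetical 70B-class configuration with grouped-query attention, so read the real values from your model's config.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 covers the separate key and value tensors.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, batch_size: int, per_token_bytes: int) -> float:
    """Worst-case KV cache in GB if every sequence fills its context."""
    return context_tokens * batch_size * per_token_bytes / 1e9


# Hypothetical shape: 80 layers, 8 KV heads, head dimension 128, FP16 cache.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(f"{per_token} bytes per token")
print(f"{kv_cache_gb(4096, 16, per_token):.1f} GB at 4K context, batch size 16")
```

The result scales linearly with the number of KV heads, so a model without grouped-query attention needs several times more cache for the same context and batch size.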

Best Practices

Plan VRAM requirements to include model weights, KV cache, CUDA overhead, and headroom — not model weights alone
Set gpu_memory_utilization to 0.85 in vLLM for production deployments
Disable request logging on PHI-handling inference servers
Evaluate quantized models against your specific clinical task before deploying to production
Monitor KV cache utilization as the leading indicator of GPU capacity pressure
Use tensor parallelism (NVLink) for models that exceed single-GPU VRAM; use pipeline parallelism only as a fallback
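The first practice can be turned into a quick planning check. This is a sketch under stated assumptions: the 10 GB overhead figure is a placeholder rather than a measured value, and `vram_budget` is a hypothetical helper, not a vLLM function.

```python
def vram_budget(weights_gb: float, kv_cache_gb: float, overhead_gb: float,
                total_vram_gb: float, utilization: float = 0.85) -> dict:
    """Compare required VRAM with the usable share of installed VRAM."""
    required = weights_gb + kv_cache_gb + overhead_gb
    usable = total_vram_gb * utilization  # the remainder is reserved headroom
    return {
        "required_gb": required,
        "usable_gb": usable,
        "fits": required <= usable,
    }


# Illustrative numbers only: 140 GB weights, 50 GB KV cache, and an
# assumed 10 GB of CUDA/runtime overhead on 4 x 80 GB GPUs (320 GB).
print(vram_budget(140, 50, 10, total_vram_gb=320))
```

Running the same check with weights alone would pass on far less hardware, which is exactly the planning error the list warns against.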

Trade-offs

Deployment	VRAM	Throughput	Latency	Cost	PHI Suitability
Managed API (Anthropic, Azure)	None (provider)	High	Low-Medium	Per-token	Depends on BAA
Cloud GPU Instance (on-demand)	Provisioned	High	Low	Per-hour	Depends on region
Cloud GPU Instance (spot)	Provisioned	High	Low	~70% discount	Same as on-demand
On-premises GPU	Capital expense	High	Low	CapEx + OpEx	Full control
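The per-token versus per-hour cost columns can be compared with a simple break-even calculation. A rough sketch: both prices below are made-up placeholders, not quotes from any provider, and the calculation ignores engineering effort, idle time, and quality differences between models.

```python
def breakeven_tokens_per_hour(gpu_cost_per_hour: float,
                              api_price_per_million_tokens: float) -> float:
    """Hourly token volume above which a per-hour GPU beats per-token pricing."""
    return gpu_cost_per_hour / api_price_per_million_tokens * 1_000_000


# Placeholder prices: 10.0 per GPU-hour versus 5.0 per million tokens.
print(breakeven_tokens_per_hour(10.0, 5.0))
```

Below the break-even volume the managed API is cheaper on raw compute cost; above it, a provisioned instance is cheaper if it is kept busy.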

Interview Questions

Q: A healthcare organization wants to deploy a self-hosted 70B-parameter LLM for clinical decision support. Estimate the GPU infrastructure required and describe the key configuration decisions.

Category: System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

A 70B-parameter model in FP16 requires approximately 140 GB of VRAM for model weights alone (illustrative — verify with the specific model and framework). Adding KV cache for a 4K context length at batch size 16 adds roughly 30–50 GB. Total VRAM requirement: approximately 170–200 GB for a production serving configuration.

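This requires either 4× GPUs with 80 GB VRAM each (with NVLink) in FP16, or 2× with INT8 quantization (~75 GB for weights, which leaves room for the KV cache within 160 GB). A 3-GPU layout is usually not an option, because the tensor parallel size must evenly divide the model's attention head count. I would recommend 4× 80 GB GPUs with NVLink in FP16 for clinical use because INT8 quantization requires empirical validation of quality impact on the specific clinical tasks being supported.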
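GPU count is constrained by more than total VRAM: tensor-parallel serving generally requires the number of attention heads to be divisible by the tensor parallel size. The sketch below picks the smallest size that satisfies both constraints. The 64-head count, the 50 GB KV cache figure, and the function name are illustrative assumptions.

```python
from typing import Optional


def smallest_tensor_parallel_size(required_gb: float, vram_per_gpu_gb: float,
                                  num_attention_heads: int,
                                  utilization: float = 0.85,
                                  max_gpus: int = 8) -> Optional[int]:
    """Smallest GPU count that fits the budget and evenly divides the heads."""
    for tp in range(1, max_gpus + 1):
        if num_attention_heads % tp != 0:
            continue  # heads cannot be split evenly across this many GPUs
        if tp * vram_per_gpu_gb * utilization >= required_gb:
            return tp
    return None  # nothing up to max_gpus satisfies both constraints


# Illustrative inputs: 64 attention heads, 80 GB GPUs, 50 GB of KV cache.
fp16_gpus = smallest_tensor_parallel_size(140 + 50, 80, 64)  # FP16 weights
int8_gpus = smallest_tensor_parallel_size(75 + 50, 80, 64)   # INT8 weights
print(fp16_gpus, int8_gpus)
```

With these inputs a 3-GPU layout is skipped even where it would have enough memory, because 64 heads do not divide evenly by 3.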

Configuration decisions: tensor_parallel_size = 4 (spreads the model across all 4 GPUs), gpu_memory_utilization = 0.85, enable prefix caching for the shared system prompt and retrieved context, disable request logging (PHI protection), implement TLS and API key authentication.
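These decisions map onto a server launch command. The helper below is a hypothetical sketch that only assembles the argument list; the model name is a placeholder, and the flag names should be checked against `vllm serve --help` for the installed version, since the request-logging flag in particular has changed across releases. TLS and API key handling are left out and assumed to be configured separately or terminated at the gateway.

```python
import shlex


def build_vllm_command(model: str, tensor_parallel_size: int = 4,
                       gpu_memory_utilization: float = 0.85) -> list[str]:
    """Assemble a `vllm serve` argument list for a PHI-handling deployment."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-prefix-caching",
        "--disable-log-requests",  # keep prompts out of server logs
    ]


# "my-org/clinical-70b" is a placeholder model name, not a real checkpoint.
print(shlex.join(build_vllm_command("my-org/clinical-70b")))
```

Keeping the command in code makes the PHI-relevant flags reviewable and testable rather than buried in a shell script.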

Operational requirements: NVIDIA DCGM for GPU monitoring, alert on KV cache utilization >90%, alert on TTFT p95 >2 seconds, maintain a separate HIPAA audit log of user-level access.
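The two alert thresholds can be expressed as a small check. In production these would typically be Prometheus/AlertManager rules; the sketch below only illustrates the logic, and the function and alert names are hypothetical.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def evaluate_alerts(kv_cache_utilization: float, ttft_seconds: list[float],
                    kv_threshold: float = 0.90,
                    ttft_p95_threshold: float = 2.0) -> list[str]:
    """Return the names of alerts that should fire for this window."""
    alerts = []
    if kv_cache_utilization > kv_threshold:
        alerts.append("kv_cache_utilization_high")
    if ttft_seconds and p95(ttft_seconds) > ttft_p95_threshold:
        alerts.append("ttft_p95_high")
    return alerts


# Example window: cache at 93% and mostly fast responses with a slow tail.
print(evaluate_alerts(0.93, [0.4] * 18 + [2.5, 3.1]))
```

Evaluating the percentile over a window, rather than alerting on a single slow request, avoids paging on isolated outliers.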

Key Points to Hit:

VRAM calculation: model weights + KV cache + overhead
Tensor parallelism (NVLink) for multi-GPU scaling
PHI-specific configuration: disable request logging
Monitoring: KV cache utilization as the leading indicator

Key Takeaways

GPU VRAM planning must include model weights, KV cache, CUDA overhead, and 15% headroom — not model weights alone
vLLM's PagedAttention and continuous batching are the core throughput optimizations for production inference
INT8 quantization approximately halves weight memory relative to FP16; 4-bit (GPTQ/AWQ) reduces it by ~75% with meaningful quality trade-offs. Always evaluate against your specific task
KV cache utilization is the leading indicator of GPU capacity pressure; monitor it as a production SLI
PHI-handling inference servers require request logging to be disabled; implement a separate PHI-safe audit log
