GPU Infrastructure for AI Inference

Executive Summary

GPU infrastructure decisions are among the highest-leverage architectural choices in enterprise AI deployments: the wrong GPU class, the wrong batching strategy, or the wrong quantization setting can produce 10× differences in throughput and cost. This chapter covers the engineering fundamentals of GPU-accelerated AI inference: GPU selection criteria, CUDA memory management, batch inference optimization, quantization trade-offs, and the operational practices required to run GPU infrastructure reliably in production. Healthcare and enterprise contexts are used throughout, including the self-hosted inference patterns required when data residency or PHI restrictions prevent use of managed cloud AI APIs.

Learning Objectives

  • Select GPU hardware for an inference workload based on model size, throughput requirements, and budget
  • Understand how batch size, sequence length, and quantization interact with GPU memory capacity
  • Configure vLLM for production inference with PagedAttention and continuous batching
  • Implement quantization (INT8, GPTQ, AWQ) and understand the quality impact for each
  • Monitor GPU utilization, KV cache fill rate, and throughput in production

Business Problem

A Reference Healthcare Organization deploying clinical decision support AI faces a specific constraint: under certain regulatory interpretations PHI cannot be sent to external cloud AI APIs, or the organization's HIPAA Business Associate Agreement does not cover the specific AI provider. Either case requires self-hosted inference, in which the organization operates its own LLM infrastructure within its on-premises or dedicated cloud environment. That in turn creates GPU infrastructure requirements that were previously outside clinical IT's operational domain.

Even organizations using managed cloud APIs need to understand GPU infrastructure when decisions arise about self-hosted model fine-tuning, offline batch inference for large document processing, and cost optimization through model quantization.

Why This Technology Exists

LLM inference cost is dominated by large matrix multiplications, and the attention mechanism adds a term that grows as O(n²) with sequence length. GPUs accelerate this computation through massive parallelism: a modern AI accelerator can execute tens of thousands of parallel floating-point operations per clock cycle, whereas a CPU executes far fewer. For a 70B-parameter LLM, the matrix multiplications in a single forward pass over a long prompt can take minutes on a CPU; a high-end GPU configuration completes them in roughly a second or less (illustrative; actual timings depend on hardware, precision, and sequence length).

The CUDA programming model, vLLM's PagedAttention, and continuous batching are specific optimizations that extract maximum throughput from GPU hardware for the inference workload. PagedAttention reduces KV cache memory fragmentation, and continuous batching keeps the GPU busy by admitting new requests as earlier ones finish. Without them, expensive hardware sits largely idle and inference throughput falls far below what it can deliver.
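
The quadratic growth of attention cost can be made concrete with a little arithmetic. The sketch below counts only the two attention matrix multiplications (scores and weighted sum) for one full pass over a prompt, using the 70B-class shape that appears later in this chapter (80 layers, 64 heads, head dimension 128). The function name is hypothetical, and the count ignores causal masking, the projection and feed-forward layers, and any kernel-level optimizations, so treat it as an illustration of scaling rather than a performance model.

```python
def attention_matmul_flops(seq_len: int, num_layers: int, num_heads: int, head_dim: int) -> int:
    """Approximate FLOPs in the two attention matmuls for one pass over seq_len tokens."""
    # Q @ K^T is (n x d) @ (d x n): n*n*d multiply-adds, i.e. 2*n*n*d FLOPs per head.
    scores = 2 * seq_len * seq_len * head_dim
    # softmax(scores) @ V is (n x n) @ (n x d): the same cost again.
    weighted_sum = 2 * seq_len * seq_len * head_dim
    return (scores + weighted_sum) * num_heads * num_layers


# Illustrative 70B-class shape: 80 layers, 64 heads, head_dim 128.
SHAPE = dict(num_layers=80, num_heads=64, head_dim=128)

for n in (1024, 2048, 4096):
    tflops = attention_matmul_flops(n, **SHAPE) / 1e12
    print(f"{n:>5} tokens: {tflops:.1f} TFLOPs in attention matmuls")
```

Each doubling of sequence length quadruples the attention matmul cost, which is why context length has to be part of capacity planning rather than an afterthought.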

GPU Selection Framework

python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelRequirements:
    parameter_count_billions: float   # e.g., 70 for a 70B model
    context_length_tokens: int        # Maximum sequence length to support
    target_throughput_tokens_per_sec: int   # Production throughput target
    latency_p99_ms: int               # 99th percentile latency target
    precision: str                    # "fp16" | "int8" | "int4"


@dataclass
class GPUSpec:
    name: str
    vram_gb: float
    theoretical_tflops_fp16: float    # illustrative — verify current vendor specs
    nvlink: bool                      # Multi-GPU memory interconnect
    cloud_instance_type: str          # e.g., p4d.24xlarge (AWS)


# Approximate VRAM requirements by model and precision
# These are illustrative — verify against the specific model and framework version.
def estimate_vram_requirement_gb(
    param_count_billions: float,
    precision: str,
    context_length_tokens: int,
    batch_size: int = 1,
) -> dict:
    """
    Estimate GPU VRAM requirements for LLM inference.
    
    Illustrative estimates — verify against specific model and serving framework.
    """
    # Bytes per parameter by precision
    bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}.get(precision, 2)
    
    param_memory_gb = (param_count_billions * 1e9 * bytes_per_param) / (1024 ** 3)
    
    # KV cache estimate (rough): depends on num_layers, num_heads, head_dim, context_length
    # For a 70B model: ~80 layers, 64 heads, 128 head_dim
    # KV size per token per layer ≈ 2 * num_heads * head_dim * bytes_per_elem
    # This is a rough approximation; use the serving framework's VRAM calculator for accuracy
    kv_cache_gb_per_1k_tokens = param_count_billions * 0.05  # very rough estimate
    kv_cache_gb = (context_length_tokens / 1000) * kv_cache_gb_per_1k_tokens * batch_size
    
    # CUDA overhead and framework buffers
    overhead_gb = 2.0
    
    total_gb = param_memory_gb + kv_cache_gb + overhead_gb
    
    return {
        "param_memory_gb": round(param_memory_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "overhead_gb": overhead_gb,
        "total_estimated_gb": round(total_gb, 1),
        "note": "Illustrative estimate — verify with serving framework VRAM calculator"
    }


# Example estimates (illustrative — verify with official resources)
ILLUSTRATIVE_GPU_SELECTION_GUIDE = {
    "7B_fp16": {
        "description": "7B parameter model, FP16 precision",
        "min_vram_gb": 16,
        "recommended_gpu_class": "Single mid-range AI GPU (e.g., A10G class)",
        "quantization_option": "Run at FP16; INT8 for cost reduction"
    },
    "13B_fp16": {
        "description": "13B parameter model, FP16 precision",
        "min_vram_gb": 30,
        "recommended_gpu_class": "Single high-end GPU or dual mid-range",
        "quantization_option": "INT8 quantization to fit on single mid-range GPU"
    },
    "70B_fp16": {
        "description": "70B parameter model, FP16 precision",
        "min_vram_gb": 145,
        "recommended_gpu_class": "4× high-VRAM GPUs with NVLink",
        "quantization_option": "INT8 (2× fewer GPUs), INT4 (1-2 GPUs, quality trade-off)"
    },
    "70B_int8": {
        "description": "70B parameter model, INT8 quantization",
        "min_vram_gb": 75,
        "recommended_gpu_class": "2× high-VRAM GPUs with NVLink",
        "quantization_option": "INT4 to further reduce to single GPU (with quality loss)"
    },
}
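
The rough per-parameter KV cache heuristic above can be cross-checked against the architecture-based formula mentioned in the code comments. The sketch below is one way to do that. The function names are hypothetical. The first shape is the 70B example from the comments (80 layers, 64 heads, head dimension 128); the second assumes grouped-query attention with 8 KV heads, which is an illustrative value, so check the model's own configuration before relying on it.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers (factor 2 = K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(context_tokens: int, batch_size: int, **shape: int) -> float:
    """KV cache size in GiB when every sequence in the batch fills the context."""
    total_bytes = kv_cache_bytes_per_token(**shape) * context_tokens * batch_size
    return total_bytes / (1024 ** 3)


# Full multi-head attention: every query head has its own K/V head.
mha = kv_cache_gib(4096, 1, num_layers=80, num_kv_heads=64, head_dim=128)
# Grouped-query attention: assumed 8 KV heads shared across the 64 query heads.
gqa = kv_cache_gib(4096, 1, num_layers=80, num_kv_heads=8, head_dim=128)

print(f"MHA, 4096 tokens, batch 1: {mha:.2f} GiB")  # 10.00 GiB
print(f"GQA, 4096 tokens, batch 1: {gqa:.2f} GiB")  # 1.25 GiB
```

The eightfold gap between the two shapes shows why a single per-parameter heuristic can be far off: the number of KV heads matters as much as the parameter count.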

vLLM Configuration for Production

python
# Production vLLM launch for clinical inference
# Educational example — not for clinical use

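# Flag notes are kept outside the command: in a shell, a comment placed after a
# trailing backslash breaks the line continuation.
#   --max-model-len            maximum context length
#   --max-num-seqs             maximum concurrent sequences
#   --gpu-memory-utilization   fraction of each GPU's memory vLLM may claim for weights,
#                              activations, and KV cache; the remaining 15% is headroom
#   --tensor-parallel-size     number of GPUs the model is spread across (for 13B+)
#   --enable-prefix-caching    KV cache reuse for shared prefixes
#   --served-model-name        API alias (hides the model name from clients)
#   --api-key                  require authentication
#   --disable-log-requests     PHI: disable request logging
# Verify the model ID and flag names against the vLLM version you deploy.
VLLM_PRODUCTION_CONFIG = """
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \\
    --host 0.0.0.0 \\
    --port 8000 \\
    --dtype float16 \\
    --max-model-len 8192 \\
    --max-num-seqs 256 \\
    --gpu-memory-utilization 0.85 \\
    --tensor-parallel-size 2 \\
    --enable-prefix-caching \\
    --served-model-name clinical-llm \\
    --api-key "${VLLM_API_KEY}" \\
    --disable-log-requests \\
    --uvicorn-log-level warning
"""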

# Python client for vLLM-served model
from openai import OpenAI
import os

def create_vllm_client(base_url: str, api_key: str) -> OpenAI:
    """Create an OpenAI-compatible client for the self-hosted vLLM server."""
    return OpenAI(
        base_url=f"{base_url}/v1",
        api_key=api_key,
    )


def clinical_inference(
    client: OpenAI,
    system_prompt: str,
    user_message: str,
    model_alias: str = "clinical-llm",
    max_tokens: int = 1024,
    temperature: float = 0.1,    # Low temperature for clinical accuracy
) -> dict:
    """
    Run inference against self-hosted vLLM server.
    
    Educational Example — Not intended for clinical decision making.
    """
    response = client.chat.completions.create(
        model=model_alias,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        max_tokens=max_tokens,
        temperature=temperature,
    )
    
    return {
        "content": response.choices[0].message.content,
        "finish_reason": response.choices[0].finish_reason,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "model": response.model,
    }
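
Launch flags are easier to review and validate when they are built from a typed configuration rather than pasted as a shell string. The following is a minimal sketch of that idea, using only the flags already shown in this section. `VLLMServeConfig` and `build_vllm_argv` are hypothetical names, not part of vLLM, and the API key is deliberately left out of the argument list; how the secret is supplied is left to the deployment.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class VLLMServeConfig:
    """Hypothetical container for the launch flags used in this section."""
    model: str
    served_model_name: str = "clinical-llm"
    host: str = "0.0.0.0"
    port: int = 8000
    dtype: str = "float16"
    max_model_len: int = 8192
    max_num_seqs: int = 256
    gpu_memory_utilization: float = 0.85
    tensor_parallel_size: int = 2
    enable_prefix_caching: bool = True
    disable_log_requests: bool = True


def build_vllm_argv(cfg: VLLMServeConfig) -> list[str]:
    """Validate the config and return an argv list suitable for subprocess.run."""
    if not 0.0 < cfg.gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if cfg.tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    argv = [
        "vllm", "serve", cfg.model,
        "--host", cfg.host,
        "--port", str(cfg.port),
        "--dtype", cfg.dtype,
        "--max-model-len", str(cfg.max_model_len),
        "--max-num-seqs", str(cfg.max_num_seqs),
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
        "--served-model-name", cfg.served_model_name,
    ]
    # Boolean switches are appended only when enabled.
    if cfg.enable_prefix_caching:
        argv.append("--enable-prefix-caching")
    if cfg.disable_log_requests:
        argv.append("--disable-log-requests")
    return argv


cfg = VLLMServeConfig(model="meta-llama/Meta-Llama-3.1-8B-Instruct")  # verify model ID
print(shlex.join(build_vllm_argv(cfg)))
```

Keeping `disable_log_requests` as a default of `True` means a PHI-handling deployment has to opt in to request logging explicitly rather than forget to turn it off.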

Quantization Trade-offs

Quantization reduces the numerical precision of model weights and/or activations to reduce memory footprint and potentially improve throughput. Each quantization scheme makes different quality/efficiency trade-offs.

python
QUANTIZATION_COMPARISON = {
    "FP16": {
        "bits": 16,
        "memory_reduction": "baseline",
        "quality": "Full quality — identical to training precision",
        "throughput": "Baseline throughput",
        "when_to_use": "Default for production inference when VRAM is available",
        "caveats": "Most VRAM requirement",
    },
    "INT8": {
        "bits": 8,
        "memory_reduction": "~50%",
        "quality": "Near-identical for most tasks; measure on your use case",
        "throughput": "Neutral to slight improvement on supported hardware",
        "when_to_use": "When FP16 doesn't fit; preferred quantization for accuracy-sensitive use cases",
        "caveats": "Calibration dataset affects quality; verify per-task performance",
    },
    "GPTQ": {
        "bits": "4 (grouped)",
        "memory_reduction": "~75%",
        "quality": "Noticeable degradation on complex reasoning tasks",
        "throughput": "Good throughput improvement",
        "when_to_use": "Hardware-constrained environments; less accuracy-critical tasks",
        "caveats": "Requires pre-quantized model; quality varies by model and task",
    },
    "AWQ": {
        "bits": "4 (channel-wise)",
        "memory_reduction": "~75%",
        "quality": "Better than GPTQ for same bit width; state of the art for 4-bit",
        "throughput": "Good throughput improvement",
        "when_to_use": "Preferred over GPTQ when 4-bit quantization is required",
        "caveats": "Requires pre-quantized model; calibration on representative data is important",
    },
    "GGUF": {
        "bits": "2-8 (configurable per layer)",
        "memory_reduction": "Variable",
        "quality": "Depends on bit depth; Q5/Q6 near-lossless",
        "throughput": "CPU-optimized; lower GPU throughput than native formats",
        "when_to_use": "CPU inference or consumer GPU hardware; Ollama deployments",
        "caveats": "Not optimized for high-throughput server inference",
    },
}
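
To make the memory/quality trade-off concrete, the sketch below shows the simplest form of weight quantization: symmetric absmax INT8 with a single scale for the whole tensor. This is not GPTQ or AWQ, which use calibration data and finer-grained scaling; it only illustrates where the precision loss comes from. The function names and the sample weights are made up for the example.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: map the largest magnitude to 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from INT8 values and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up values for illustration
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

for w, r in zip(weights, restored):
    print(f"original={w:+.4f} restored={r:+.4f} error={abs(w - r):.4f}")
```

With one scale per tensor, the small weight 0.003 rounds to zero because it is less than half a quantization step. A single large outlier stretches the scale and erases small values, which is the problem that grouped and activation-aware schemes are designed to reduce.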

GPU Monitoring and Observability

python
# Key GPU metrics for AI inference production monitoring
# Use with Prometheus DCGM exporter or vLLM metrics endpoint
GPU_METRICS_TO_MONITOR = {
    # GPU hardware metrics (via NVIDIA DCGM)
    "DCGM_FI_DEV_GPU_UTIL": {
        "description": "GPU utilization percentage",
        "target": ">70% during peak",
        "alert_threshold": "<30% sustained — may indicate batching misconfiguration",
    },
    "DCGM_FI_DEV_FB_USED": {
        "description": "GPU framebuffer (VRAM) used",
        "target": "<85% of total VRAM",
        "alert_threshold": ">90% — risk of OOM errors",
    },
    "DCGM_FI_DEV_SM_CLOCK": {
        "description": "GPU SM clock frequency",
        "note": "Drops on thermal throttle — check cooling",
    },
    
    # vLLM inference metrics (from /metrics endpoint)
    "vllm:gpu_cache_usage_perc": {
        "description": "KV cache utilization percentage",
        "target": "40-80% during steady state",
        "alert_threshold": ">90% — requests will be queued; scale GPU fleet",
    },
    "vllm:num_requests_waiting": {  # verify metric names against your vLLM version's /metrics output
        "description": "Requests waiting in queue for GPU capacity",
        "target": "<5 during normal operation",
        "alert_threshold": ">20 sustained — GPU capacity insufficient for load",
    },
    "vllm:avg_generation_throughput_toks_per_s": {
        "description": "Token generation throughput",
        "note": "Track against a baseline; degradation indicates a model or hardware issue",
    },
    "vllm:time_to_first_token_seconds": {
        "description": "Latency to first token (TTFT)",
        "target": "<2 seconds p95 for interactive use",
        "alert_threshold": ">5 seconds p95",
    },
}
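
In practice these thresholds live in Prometheus alerting rules, but the logic is easy to prototype against a raw scrape. The sketch below parses Prometheus text-format output and checks two of the thresholds listed above. It is deliberately simplified: it ignores labels (the last sample per metric name wins), assumes no trailing timestamps, and evaluates a single scrape, so it cannot express "sustained" conditions. It also assumes the KV cache gauge is reported as a fraction between 0 and 1, which should be confirmed for the vLLM version in use. All function and variable names are hypothetical.

```python
def parse_prometheus_gauges(text: str) -> dict[str, float]:
    """Extract metric name -> value from Prometheus text exposition format."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the label set
        try:
            values[name] = float(raw_value)
        except ValueError:
            continue  # not a numeric sample; ignore
    return values


# (metric name, comparison, threshold, message)
ALERT_RULES = [
    ("vllm:gpu_cache_usage_perc", "gt", 0.90, "KV cache above 90%"),
    ("DCGM_FI_DEV_GPU_UTIL", "lt", 30.0, "GPU utilization below 30%"),
]


def evaluate_alerts(metrics: dict[str, float], rules=ALERT_RULES) -> list[str]:
    """Return a message for every rule whose threshold is breached."""
    alerts = []
    for name, op, threshold, message in rules:
        value = metrics.get(name)
        if value is None:
            continue  # metric missing from this scrape
        breached = value > threshold if op == "gt" else value < threshold
        if breached:
            alerts.append(f"{name}={value}: {message}")
    return alerts


sample_scrape = """
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
vllm:gpu_cache_usage_perc{model_name="clinical-llm"} 0.93
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 82
"""
print(evaluate_alerts(parse_prometheus_gauges(sample_scrape)))
```

A prototype like this is useful for testing thresholds against recorded scrapes before committing them to alerting rules.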

Healthcare and Data Residency

For a Reference Healthcare Organization with data residency or PHI transmission restrictions, self-hosted GPU inference offers a compliance posture that managed cloud AI APIs may not: all PHI remains within the organization's network perimeter and is processed only on infrastructure the organization controls.

Educational Example — Illustrative Workflow. Not intended for clinical decision making.

Key operational requirements for PHI-handling inference infrastructure:

  • Disable request/response logging in the inference server (vLLM --disable-log-requests) unless PHI-safe audit logging is implemented separately
  • Encrypt all data at rest (model weights, plus any KV cache or swap data written to disk) and in transit (TLS for API)
  • Implement role-based authentication on the inference API: clinical users authenticate through the EHR SSO; the inference API validates tokens
  • Maintain a HIPAA-compliant audit log of which users accessed which clinical AI features, without capturing PHI in log bodies
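
The last requirement, an audit log that records who used which feature without capturing PHI, might look like the following sketch. The field names and the `build_audit_record` helper are hypothetical. The key design choice is that the function never receives the prompt or the response, so request content cannot leak into the log by accident; only metadata such as token counts is recorded. Whether these fields satisfy a given organization's audit requirements is a compliance question this example does not answer.

```python
import json
import uuid
from datetime import datetime, timezone


def build_audit_record(
    user_id: str,
    feature: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
) -> str:
    """Build one JSON audit line from request metadata only (no prompt or response text)."""
    record = {
        "event_id": str(uuid.uuid4()),  # random ID, not derived from content
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,              # identity from the SSO token
        "feature": feature,              # which clinical AI feature was used
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "finish_reason": finish_reason,
    }
    return json.dumps(record, sort_keys=True)


line = build_audit_record(
    user_id="clinician-0042",        # made-up identifier
    feature="discharge-summary",     # made-up feature name
    prompt_tokens=812,
    completion_tokens=240,
    finish_reason="stop",
)
print(line)
```

The token counts and finish reason map directly onto the fields returned by the inference call, so the audit record can be written from the response metadata without touching the message content.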

Common Mistakes

1. Not accounting for KV cache in VRAM planning. Loading a 70B model in INT8 requires ~75 GB, but the KV cache for a long-context request adds significantly more VRAM. Planning only for model weights and ignoring KV cache leads to OOM errors in production.

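2. Setting gpu_memory_utilization too high. This setting is the fraction of each GPU's memory that vLLM may claim for weights, activations, and its pre-allocated KV cache. A value of 0.95 leaves almost no headroom for anything outside that budget, such as the CUDA context, communication buffers, or other processes on the GPU. Production deployments should start at 0.85 to reduce the risk of OOM errors, and raise it only after load testing.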

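3. Enabling request logging on PHI-handling infrastructure. Depending on version and log level, vLLM's API server can log request details, including prompt text. On a PHI-processing server, this writes patient data to disk. Always pass --disable-log-requests for clinical inference, and confirm the logging behavior of the version you deploy.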

4. Choosing quantization without evaluating on the target task. INT8 and 4-bit quantization behave differently on different tasks. A 4-bit quantized model may perform well on simple question-answering and poorly on complex clinical reasoning. Always evaluate quantized models against your specific use case.
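
The evaluation step in mistake 4 can be turned into a simple release gate: run the baseline and the quantized model over the same task-specific cases and block deployment if accuracy drops by more than an agreed margin. The sketch below assumes each model is exposed as a callable from prompt to answer and uses exact-match scoring as a stand-in; real clinical tasks usually need a more careful metric. The function names, the 2-point margin, and the toy cases are all hypothetical.

```python
from typing import Callable, Sequence

Case = tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Fraction of cases where the model's answer exactly matches the expected one."""
    correct = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return correct / len(cases)


def passes_quantization_gate(
    baseline: Callable[[str], str],
    quantized: Callable[[str], str],
    cases: Sequence[Case],
    max_drop: float = 0.02,  # assumed tolerance; set per task
) -> bool:
    """True if the quantized model loses no more than max_drop accuracy vs. baseline."""
    return accuracy(baseline, cases) - accuracy(quantized, cases) <= max_drop


# Toy stand-ins for real model clients.
cases = [("q1", "a"), ("q2", "b"), ("q3", "c"), ("q4", "d")]
answers = dict(cases)
baseline_model = lambda prompt: answers[prompt]
quantized_model = lambda prompt: "x" if prompt == "q4" else answers[prompt]

print(passes_quantization_gate(baseline_model, quantized_model, cases))  # False
```

In this toy run the quantized stand-in misses one of four cases, a 25-point drop, so the gate fails. The evaluation set should be drawn from the target task, since a model that passes on simple question answering may still fail on complex reasoning.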

Best Practices

  • Plan VRAM requirements to include model weights, KV cache, CUDA overhead, and headroom — not model weights alone
  • Set gpu_memory_utilization to 0.85 in vLLM for production deployments
  • Disable request logging on PHI-handling inference servers
  • Evaluate quantized models against your specific clinical task before deploying to production
  • Monitor KV cache utilization as the leading indicator of GPU capacity pressure
  • Use tensor parallelism (NVLink) for models that exceed single-GPU VRAM; use pipeline parallelism only as a fallback

Trade-offs

Deployment | VRAM | Throughput | Latency | Cost | PHI Suitability
Managed API (Anthropic, Azure) | None (provider) | High | Low-Medium | Per-token | Depends on BAA
Cloud GPU Instance (on-demand) | Provisioned | High | Low | Per-hour | Depends on region
Cloud GPU Instance (spot) | Provisioned | High | Low | ~70% discount | Same as on-demand
On-premises GPU | Capital expense | High | Low | CapEx + OpEx | Full control

Interview Questions

Q: A healthcare organization wants to deploy a self-hosted 70B-parameter LLM for clinical decision support. Estimate the GPU infrastructure required and describe the key configuration decisions.

Category: System Design | Difficulty: Principal | Role: AI Architect

Answer Framework:

A 70B-parameter model in FP16 requires approximately 140 GB of VRAM for model weights alone (70 billion parameters × 2 bytes; illustrative, verify with the specific model and framework). KV cache for a 4K context length at batch size 16 adds roughly 30–50 GB, assuming the model uses grouped-query attention; with full multi-head attention the figure is several times larger. Total VRAM requirement: approximately 170–200 GB for a production serving configuration.

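This requires either 4× GPUs with 80 GB VRAM each (with NVLink) in FP16, or 2× with INT8 quantization (~75 GB for weights). Three 80 GB GPUs would cover the FP16 total on paper, but the tensor parallel size normally has to divide the attention head count evenly, so the practical choice is 4. I would recommend 4× 80 GB GPUs with NVLink in FP16 for clinical use because INT8 quantization requires empirical validation of quality impact on the specific clinical tasks being supported.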

Configuration decisions: tensor parallel size = 4 (spreads model across all 4 GPUs), gpu_memory_utilization = 0.85, enable prefix caching for shared system prompt and retrieved context, disable request logging (PHI protection), implement TLS and API key authentication.

Operational requirements: NVIDIA DCGM for GPU monitoring, alert on KV cache utilization >90%, alert on TTFT p95 >5 seconds (target: <2 seconds), maintain a separate HIPAA audit log of user-level access.

Key Points to Hit:

  • VRAM calculation: model weights + KV cache + overhead
  • Tensor parallelism (NVLink) for multi-GPU scaling
  • PHI-specific configuration: disable request logging
  • Monitoring: KV cache utilization as the leading indicator
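
The GPU-count step in the answer can be checked with a short calculation. The sketch below takes a total VRAM estimate, applies a usable fraction per GPU, and rounds up to the next GPU count that divides the attention head count evenly, since serving frameworks such as vLLM generally require that for tensor parallelism. The function name is hypothetical, and the 0.85 usable fraction and 64-head count are assumptions carried over from the examples in this chapter.

```python
import math


def gpus_for_tensor_parallel(
    total_vram_gb: float,
    vram_per_gpu_gb: float,
    num_attention_heads: int,
    usable_fraction: float = 0.85,  # planning margin per GPU (assumed)
) -> int:
    """Smallest GPU count that fits the estimate and divides the head count evenly."""
    usable_per_gpu = vram_per_gpu_gb * usable_fraction
    minimum = max(1, math.ceil(total_vram_gb / usable_per_gpu))
    for candidate in range(minimum, num_attention_heads + 1):
        if num_attention_heads % candidate == 0:
            return candidate
    raise ValueError("no valid tensor parallel size for this head count")


# FP16 estimate from the answer: 170-200 GB total on 80 GB GPUs, 64 attention heads.
for total_gb in (170, 200):
    n = gpus_for_tensor_parallel(total_gb, vram_per_gpu_gb=80, num_attention_heads=64)
    print(f"{total_gb} GB total -> tensor parallel size {n}")
```

Both ends of the FP16 range fit in three GPUs by capacity alone, but 3 does not divide 64, so the calculation lands on 4, matching the recommendation in the answer.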

Key Takeaways

  • GPU VRAM planning must include model weights, KV cache, CUDA overhead, and 15% headroom — not model weights alone
  • vLLM's PagedAttention and continuous batching are the core throughput optimizations for production inference
  • INT8 quantization approximately halves VRAM requirements; 4-bit (GPTQ/AWQ) reduces VRAM by ~75% with meaningful quality trade-offs — always evaluate against your specific task
  • KV cache utilization is the leading indicator of GPU capacity pressure; monitor it as a production SLI
  • PHI-handling inference servers require request logging to be disabled; implement a separate PHI-safe audit log

Further Reading