AI Infrastructure — Quick Reference

One-Line Definition

The AI infrastructure layer is the compute, storage, networking, caching, and orchestration substrate that transforms LLM API calls and vector operations into reliable, scalable, cost-managed enterprise services.


Section Map

| Chapter | Core Topic | Key Decision |
|---|---|---|
| 01 | Vector Databases | HNSW vs. IVF; hosted vs. self-managed |
| 02 | LLM Serving | vLLM vs. managed API; quantization level |
| 03 | Cloud AI Platforms | AWS Bedrock vs. Azure OpenAI vs. Vertex AI |
| 04 | Data Pipelines | Chunking strategy; incremental vs. full re-index |
| 05 | Orchestration | Airflow vs. Temporal; durable workflows |
| 06 | Caching | Semantic cache threshold; TTL by document type |
| 07 | GPU Infrastructure | VRAM planning; quantization vs. quality trade-off |
| 08 | API Gateway | Token rate limits; circuit breaker; PHI logging policy |

Vector Database — Quick Reference

ANN Index Selection

| Index | Best For | Parameters | Trade-off |
|---|---|---|---|
| HNSW | Low-latency similarity search | M (graph edges), ef_construction (build quality), ef (query beam) | High memory; best recall |
| IVF | Large-scale filtered search | nlists (clusters), nprobe (lists to search) | Coarser recall; supports pre-filtering |
| Flat | Ground truth / evaluation | None | Exact (no approximation); doesn't scale |

HNSW Parameter Rule of Thumb

text
M = 16–64          (higher = better recall, more memory)
ef_construction = 200–400   (build time quality; set once)
ef = 50–200        (query time beam; increase for higher recall at cost of speed)
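The memory cost of raising M can be estimated roughly. The sketch below assumes the commonly cited approximation of (d × 4 + M × 2 × 4) bytes per float32 vector; the function name and the 10M × 768 corpus are illustrative, and real indexes add overhead for upper graph layers and metadata.

```python
def hnsw_memory_gb(num_vectors: int, dim: int, m: int) -> float:
    """Approximate HNSW index size in GB (treat as a lower bound).

    Assumes float32 vectors (4 bytes per dimension) and about 2*M
    4-byte neighbor links per vector.
    """
    bytes_per_vector = dim * 4 + m * 2 * 4
    return num_vectors * bytes_per_vector / 1e9


# Illustrative corpus: 10 million 768-dimensional embeddings.
for m in (16, 32, 64):
    print(f"M={m}: ~{hnsw_memory_gb(10_000_000, 768, m):.1f} GB")
```

At this dimensionality the raw vectors dominate, so M mostly matters for memory when embeddings are small or the corpus is very large.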

Metadata Filtering

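  • Pre-filter (IVF): apply the metadata filter first, then run ANN search over the matching subset; still returns the requested k when the filter leaves only a sparse set of candidates
  • Post-filter (HNSW): run ANN search first, then apply the filter; risks returning fewer results than requested when the filter is selective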
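The difference between the two filtering orders shows up even with exact search on a toy corpus. This is a minimal sketch using brute-force cosine similarity in place of a real ANN index; the documents, the `dept` field, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query, docs, k):
    """Exact (flat) search: rank docs by cosine similarity, keep the best k."""
    return sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]


def pre_filter_search(query, docs, k, predicate):
    # Filter first, then search only the matching documents.
    return top_k(query, [d for d in docs if predicate(d)], k)


def post_filter_search(query, docs, k, predicate):
    # Search first, then drop results that fail the filter.
    return [d for d in top_k(query, docs, k) if predicate(d)]


docs = [
    {"id": "a", "vec": [1.0, 0.0], "dept": "cardiology"},
    {"id": "b", "vec": [0.9, 0.1], "dept": "cardiology"},
    {"id": "c", "vec": [0.8, 0.2], "dept": "oncology"},
    {"id": "d", "vec": [0.1, 0.9], "dept": "oncology"},
]


def is_oncology(doc):
    return doc["dept"] == "oncology"


query = [1.0, 0.0]
pre = pre_filter_search(query, docs, 2, is_oncology)
post = post_filter_search(query, docs, 2, is_oncology)
print([d["id"] for d in pre])   # two oncology documents
print([d["id"] for d in post])  # empty: the top 2 overall were both cardiology
```

A common mitigation for post-filtering is to over-fetch (search for several times k) before applying the filter.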

LLM Serving — Quick Reference

vLLM Production Command

bash
# --disable-log-requests keeps prompt text out of server logs (PHI safety)
vllm serve <model-id> \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --disable-log-requests \
    --api-key "${VLLM_API_KEY}"

Quantization Selection

| Precision | VRAM Reduction vs. FP16 | Quality Impact | Use When |
|---|---|---|---|
| FP16 | None (baseline) | None | Default; VRAM available |
| INT8 | ~50% | Near-identical; evaluate per task | First quantization choice |
| GPTQ/AWQ (4-bit) | ~75% | Noticeable on complex reasoning | VRAM constrained; less accuracy-critical |

Key vLLM Metrics

text
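vllm:gpu_cache_usage_perc          → alert >0.90 (reported as a 0–1 fraction; capacity pressure)
vllm:num_requests_waiting          → alert >20 sustained (scale out)
vllm:time_to_first_token_seconds   → alert p95 >2s (histogram; latency SLO)
(exact metric names vary by vLLM version; confirm against your server's /metrics output)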
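The thresholds above can be expressed as a simple check over a metrics snapshot. This is a sketch only: the dictionary keys are hypothetical names for values already scraped and aggregated, and the "sustained" condition is assumed to be handled by the alerting system (for example a Prometheus `for:` clause) rather than by this function.

```python
THRESHOLDS = {
    "kv_cache_usage": 0.90,    # fraction of KV cache in use (0-1)
    "waiting_requests": 20,    # queued requests
    "ttft_p95_seconds": 2.0,   # p95 time to first token
}


def firing_alerts(snapshot: dict) -> list:
    """Return the names of metrics whose value exceeds its threshold."""
    return [name for name, limit in THRESHOLDS.items() if snapshot.get(name, 0) > limit]


snapshot = {"kv_cache_usage": 0.93, "waiting_requests": 4, "ttft_p95_seconds": 2.4}
print(firing_alerts(snapshot))
```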

Cloud AI Platform — Quick Reference

| Dimension | AWS Bedrock | Azure OpenAI | Google Vertex AI |
|---|---|---|---|
| Auth | IAM / boto3 | Entra ID / AzureOpenAI | Service Account / vertexai SDK |
| Private networking | VPC endpoint | Private Link + VNet | Private Service Connect |
| HIPAA BAA | Available | Available | Available |
| Training data opt-out | Yes | Yes (PTU) | Yes (Vertex agreements) |
| Model selection | Anthropic, Meta, Mistral, Titan | OpenAI models | Google Gemini + tuned models |
| Unique strength | AWS ecosystem; Guardrails | Microsoft enterprise agreements | BigQuery / Analytics integration |

LiteLLM Multi-Provider Abstraction

python
from litellm import completion

# Identical call syntax across providers
response = completion(
    model="anthropic/claude-opus-4-8",     # or "azure/gpt-4o" or "vertex_ai/gemini-pro"
    messages=[{"role": "user", "content": "..."}]
)

Data Pipeline — Quick Reference

Chunking Strategy by Document Type

| Document Type | Preferred Strategy | Rationale |
|---|---|---|
| Clinical guidelines | Section-boundary | Keeps recommendations intact |
| Unstructured clinical notes | Fixed-size with overlap | No reliable structure markers |
| Drug formulary (table data) | Row-per-chunk | Each row is a retrieval unit |
| Policy documents | Paragraph-boundary | Semantic units align with paragraphs |
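Fixed-size chunking with overlap, the strategy listed for unstructured notes, can be sketched as a sliding window. Whitespace splitting stands in for a real tokenizer here, and the function name and the size/overlap values are illustrative assumptions.

```python
def chunk_with_overlap(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


words = "pt reports chest pain radiating to left arm since this morning".split()
chunks = chunk_with_overlap(words, size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap means a phrase cut at one window boundary appears whole in the neighboring window, at the cost of indexing some tokens twice.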

Incremental Pipeline Pattern

python
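# Skip unchanged documents using content hash
import hashlib

def is_unchanged(content: bytes, existing_hash: str) -> bool:
    """Return True when the document can be skipped (content hash matches)."""
    content_hash = hashlib.sha256(content).hexdigest()  # sha256 needs bytes, not str
    return existing_hash == content_hash  # equal hashes: no change detected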
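Extended to a whole batch, the hash check becomes a planning step that returns only the documents needing re-indexing. In this sketch the in-memory `stored_hashes` dict stands in for a persistent hash store, and the document IDs and contents are made up; a production pipeline would likely record the new hash only after indexing succeeds.

```python
import hashlib


def plan_reindex(documents, stored_hashes):
    """Return IDs of documents that are new or whose content changed.

    Updates stored_hashes in place so the next run skips them.
    """
    changed = []
    for doc_id, content in documents.items():
        content_hash = hashlib.sha256(content).hexdigest()
        if stored_hashes.get(doc_id) == content_hash:
            continue  # unchanged: skip re-chunking and re-embedding
        stored_hashes[doc_id] = content_hash
        changed.append(doc_id)
    return changed


stored = {}
docs = {"formulary.csv": b"drug,tier\nmetformin,1\n", "protocol.md": b"# Sepsis bundle\n"}
print(plan_reindex(docs, stored))  # first run: both documents are new
docs["formulary.csv"] = b"drug,tier\nmetformin,2\n"
print(plan_reindex(docs, stored))  # second run: only the edited document
```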

Update Cadence by Document Type (Illustrative)

| Document Type | Update Cadence | Rationale |
|---|---|---|
| Drug interaction database | Every 4 hours | Continuously updated |
| Hospital formulary | Daily | 5-business-day SLA for changes |
| Clinical protocols | Every 3 days | Safety-critical; low change rate |
| Clinical guidelines | Weekly | Quarterly publication cycle |

Caching — Quick Reference

Three Cache Layers

| Layer | Where | What It Caches | Cost Reduction |
|---|---|---|---|
| Semantic Cache | Application | Query → response pairs | High (no LLM call) |
| Prompt Cache | API provider | System prompt + context KV | Medium (token discount) |
| KV Cache | Model server | Shared-prefix attention KV | None directly; reduces latency |

Semantic Cache Thresholds

text
Clinical AI:    ≥ 0.95 cosine similarity   (high stakes; low tolerance for mismatch)
Customer support: ≥ 0.85                  (lower stakes; broader matching acceptable)
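The effect of the threshold can be seen with a toy semantic cache. This sketch uses a linear scan over 2-dimensional vectors in place of a real embedding model and vector index; the class and its methods are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        """Return the most similar cached response at or above the threshold, else None."""
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None


clinical = SemanticCache(threshold=0.95)
support = SemanticCache(threshold=0.85)
for cache in (clinical, support):
    cache.put([1.0, 0.0], "cached answer")

near_query = [0.9, 0.4]  # cosine similarity to [1.0, 0.0] is about 0.914
print(clinical.get(near_query))  # miss: below 0.95, so the query goes to the LLM
print(support.get(near_query))   # hit: above 0.85
```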

Cache TTL by Document Type

python
CACHE_TTL = {
    "drug_interaction":   4 * 3600,     # 4 hours
    "formulary":          86400,        # 1 day
    "hospital_protocol":  3 * 86400,    # 3 days
    "clinical_guideline": 7 * 86400,    # 7 days
}
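A lookup can consult this table before serving a cached entry. In the sketch below, `is_fresh` is a hypothetical helper, and treating unknown document types as never cacheable (TTL of 0) is an assumption chosen to fail safe.

```python
import time
from typing import Optional

CACHE_TTL = {
    "drug_interaction":   4 * 3600,     # 4 hours
    "formulary":          86400,        # 1 day
    "hospital_protocol":  3 * 86400,    # 3 days
    "clinical_guideline": 7 * 86400,    # 7 days
}


def is_fresh(doc_type: str, cached_at: float, now: Optional[float] = None) -> bool:
    """True while a cache entry is younger than the TTL for its document type."""
    now = time.time() if now is None else now
    ttl = CACHE_TTL.get(doc_type, 0)  # unknown types are never served from cache
    return now - cached_at < ttl


five_hours = 5 * 3600
print(is_fresh("drug_interaction", cached_at=0, now=five_hours))    # expired after 4 hours
print(is_fresh("clinical_guideline", cached_at=0, now=five_hours))  # still within 7 days
```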

Never Cache

  • Patient-specific queries (PHI in responses)
  • Queries with implicit temporal context ("today's on-call")
  • Queries that trigger EHR writes or notifications

GPU Infrastructure — Quick Reference

VRAM Planning Formula

text
Total VRAM = Model Weights + KV Cache + CUDA Overhead + 15% Headroom
                                                         ↑
                              Set gpu_memory_utilization = 0.85 in vLLM
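The formula can be turned into a small calculator. This is a sketch under stated assumptions: the KV cache is sized as 2 (K and V) × layers × KV heads × head dimension × bytes per value per token, the 2 GB CUDA overhead is a placeholder, and the 7B-class architecture (32 layers, 32 KV heads, head dimension 128) is illustrative. Dividing by the utilization target leaves the 15% headroom.

```python
def vram_required_gb(params_billion, bytes_per_param, layers, kv_heads, head_dim,
                     kv_bytes, tokens_in_flight, overhead_gb=2.0, utilization=0.85):
    """Estimate total GPU memory needed, including headroom, in GB."""
    weights_gb = params_billion * bytes_per_param          # 1e9 params * bytes / 1e9
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V
    kv_gb = kv_bytes_per_token * tokens_in_flight / 1e9
    return (weights_gb + kv_gb + overhead_gb) / utilization


# Illustrative 7B-class model in FP16 with 8,192 tokens of KV cache in flight.
estimate = vram_required_gb(
    params_billion=7, bytes_per_param=2,
    layers=32, kv_heads=32, head_dim=128, kv_bytes=2,
    tokens_in_flight=8192,
)
print(f"~{estimate:.1f} GB")
```

The KV term grows linearly with concurrent tokens, so the same model can need far more memory under high concurrency than its weight size alone suggests.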

Illustrative GPU Sizing (verify with current vendor specs)

| Model Size | Precision | Min VRAM | Topology |
|---|---|---|---|
| 7B | FP16 | ~16 GB | 1× mid-range GPU |
| 13B | FP16 | ~30 GB | 1× high-VRAM GPU |
| 70B | FP16 | ~145 GB | 4× 80 GB GPUs (NVLink) |
| 70B | INT8 | ~75 GB | 2× 80 GB GPUs (NVLink) |

Illustrative — verify with official model documentation and serving framework VRAM calculator.


API Gateway — Quick Reference

Rate Limiting Unit

text
✅ Tokens per minute per team     (budget-meaningful)
❌ Requests per minute per team   (doesn't reflect actual cost)
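A tokens-per-minute limit can be sketched as a token bucket whose unit is LLM tokens rather than requests. The class name, team names, and budgets below are hypothetical; the injectable `clock` exists only to make the behavior testable. In practice the prompt token count is known before the call, while completion tokens would be charged afterward.

```python
import time


class TokenBudget:
    """Token bucket measured in LLM tokens; capacity equals the per-minute budget."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.clock = clock
        self.last = clock()

    def try_consume(self, tokens):
        now = self.clock()
        # Refill continuously at capacity/60 tokens per second, capped at capacity.
        refill = (now - self.last) * self.capacity / 60.0
        self.available = min(self.capacity, self.available + refill)
        self.last = now
        if tokens > self.available:
            return False  # caller should reject (for example HTTP 429) or queue
        self.available -= tokens
        return True


budgets = {"clinical": TokenBudget(10_000), "support": TokenBudget(5_000)}
print(budgets["clinical"].try_consume(4_000))  # long-context RAG request
print(budgets["clinical"].try_consume(4_000))
print(budgets["clinical"].try_consume(4_000))  # rejected: budget nearly spent
```

Under a request-based limit all three calls would count the same as three 200-token lookups, which is the mismatch the token-based unit avoids.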

PHI Logging Policy

python
# ✅ Safe to log
{"team_id": "clinical", "model": "claude-opus-4-8", "prompt_tokens": 450, "completion_tokens": 210}

# ❌ Never log for PHI teams
{"messages": "Patient John Doe, DOB 04/15/1960, MRN 123456, presents with..."}
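One way to enforce this policy is an allowlist applied before anything reaches the logger, so new fields are excluded by default. The helper name and the event shape are assumptions for illustration.

```python
import json

SAFE_FIELDS = ("team_id", "model", "prompt_tokens", "completion_tokens")


def redact_for_logging(event: dict) -> dict:
    """Keep only allowlisted metadata; message content never reaches the log."""
    return {key: event[key] for key in SAFE_FIELDS if key in event}


event = {
    "team_id": "clinical",
    "model": "claude-opus-4-8",
    "prompt_tokens": 450,
    "completion_tokens": 210,
    "messages": "Patient John Doe, DOB 04/15/1960, MRN 123456, presents with...",
}
record = redact_for_logging(event)
print(json.dumps(record))
```

An allowlist fails safe: a field added to the event later stays out of the logs until someone explicitly approves it, which a blocklist would not guarantee.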

Circuit Breaker Defaults

text
failure_threshold: 5 failures within 60 seconds → open circuit
cooldown_timeout: 60 seconds in open state → try half-open
recovery: first successful request closes circuit
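These defaults map onto a small state machine. The sketch below is a simplified, single-threaded illustration with hypothetical names; it lets every request through while half-open, whereas production breakers usually admit a single probe.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_seconds=60.0,
                 cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window_seconds
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.failures = []     # timestamps of recent failures
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def allow_request(self):
        return self.state != "open"

    def record_success(self):
        # The first success, including a half-open probe, closes the circuit.
        self.opened_at = None
        self.failures.clear()

    def record_failure(self):
        now = self.clock()
        if self.state == "half-open":
            self.opened_at = now  # probe failed: reopen and restart the cooldown
            return
        # Count only failures inside the rolling window.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now


breaker = CircuitBreaker()
for _ in range(5):
    breaker.record_failure()
print(breaker.state, breaker.allow_request())
```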

Common Interview Questions

Q: What chunking strategy would you use for clinical guidelines and why?
A: Section-boundary chunking. Clinical guidelines are structured as numbered recommendations, and each recommendation is the unit of clinical utility. Splitting at fixed token boundaries can cut a recommendation in half, producing chunks that lack context.

Q: What is the difference between prompt caching and semantic caching?
A: Prompt caching is provider-side: it caches the KV computation for repeated prompt prefixes, reducing cost and TTFT. Semantic caching is application-side: it stores previous query-response pairs indexed by embedding and returns a cached response for a semantically similar new query without any LLM call.

Q: Why should clinical AI rate limits be token-based, not request-based?
A: A single clinical RAG request may consume 4,000 tokens (long context with retrieved chunks), while a simple lookup consumes 200. Request-level limits reflect neither actual API cost nor provider rate-limit consumption. Token-per-minute limits provide accurate budget control.

Q: What GPU metrics indicate that the inference fleet needs to scale out?
A: KV cache utilization above 90% and a sustained waiting-queue depth above 20 are the leading indicators. GPU utilization alone is insufficient: a GPU can report full utilization while its compute units stall waiting on slow memory access. KV cache fill is the primary capacity metric for LLM inference.


See Also