AI Infrastructure — Quick Reference

One-Line Definition

The AI infrastructure layer is the compute, storage, networking, caching, and orchestration substrate that transforms LLM API calls and vector operations into reliable, scalable, cost-managed enterprise services.

Section Map

Chapter	Core Topic	Key Decision
01	Vector Databases	HNSW vs. IVF; hosted vs. self-managed
02	LLM Serving	vLLM vs. managed API; quantization level
03	Cloud AI Platforms	AWS Bedrock vs. Azure OpenAI vs. Vertex AI
04	Data Pipelines	Chunking strategy; incremental vs. full re-index
05	Orchestration	Airflow vs. Temporal; durable workflows
06	Caching	Semantic cache threshold; TTL by document type
07	GPU Infrastructure	VRAM planning; quantization vs. quality trade-off
08	API Gateway	Token rate limits; circuit breaker; PHI logging policy

Vector Database — Quick Reference

ANN Index Selection

Index	Best For	Parameters	Trade-off
HNSW	Low-latency similarity search	M (graph edges), ef_construction (build quality), ef (query beam)	High memory; best recall
IVF	Large-scale filtered search	nlists (clusters), nprobe (lists to search)	Coarser recall; supports pre-filtering
Flat	Ground truth / evaluation	None	Exact — no approximation; doesn't scale

HNSW Parameter Rule of Thumb

text

M = 16–64          (higher = better recall, more memory)
ef_construction = 200–400   (build time quality; set once)
ef = 50–200        (query time beam; increase for higher recall at cost of speed)

Metadata Filtering

Pre-filter (IVF): filter on metadata before ANN search — correct for sparse results after filter
Post-filter (HNSW): ANN search then filter — risks returning fewer results than requested

LLM Serving — Quick Reference

vLLM Production Command

bash

vllm serve <model-id> \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --disable-log-requests \   # PHI safety
    --api-key ${VLLM_API_KEY}

Quantization Selection

Precision	VRAM	Quality Impact	Use When
FP16	Baseline	None	Default; VRAM available
INT8	~50%	Near-identical; evaluate per task	First quantization choice
GPTQ/AWQ (4-bit)	~75%	Noticeable on complex reasoning	VRAM constrained; less accuracy-critical

Key vLLM Metrics

text

vllm:gpu_cache_usage_perc    → alert >90% (capacity pressure)
vllm:num_waiting_seqs        → alert >20 sustained (scale out)
vllm:time_to_first_token     → alert p95 >2s (latency SLO)

Cloud AI Platform — Quick Reference

Dimension	AWS Bedrock	Azure OpenAI	Google Vertex AI
Auth	IAM / boto3	Entra ID / AzureOpenAI	Service Account / vertexai SDK
Private networking	VPC endpoint	Private Link + VNet	Private Service Connect
HIPAA BAA	Available	Available	Available
Training data opt-out	Yes	Yes (PTU)	Yes (Vertex agreements)
Model selection	Anthropic, Meta, Mistral, Titan	OpenAI models	Google Gemini + tuned models
Unique strength	AWS ecosystem; Guardrails	Microsoft enterprise agreements	BigQuery / Analytics integration

LiteLLM Multi-Provider Abstraction

python

from litellm import completion

# Identical call syntax across providers
response = completion(
    model="anthropic/claude-opus-4-8",     # or "azure/gpt-4o" or "vertex_ai/gemini-pro"
    messages=[{"role": "user", "content": "..."}]
)

Data Pipeline — Quick Reference

Chunking Strategy by Document Type

Document Type	Preferred Strategy	Rationale
Clinical guidelines	Section-boundary	Keeps recommendations intact
Unstructured clinical notes	Fixed-size with overlap	No reliable structure markers
Drug formulary (table data)	Row-per-chunk	Each row is a retrieval unit
Policy documents	Paragraph-boundary	Semantic units align with paragraphs

Incremental Pipeline Pattern

python

# Skip unchanged documents using content hash
content_hash = hashlib.sha256(content).hexdigest()
if existing_hash == content_hash:
    skip()   # No change detected

Update Cadence by Document Type (Illustrative)

Document Type	Update Cadence	Rationale
Drug interaction database	Every 4 hours	Continuously updated
Hospital formulary	Daily	5-business-day SLA for changes
Clinical protocols	Every 3 days	Safety-critical; low change rate
Clinical guidelines	Weekly	Quarterly publication cycle

Caching — Quick Reference

Three Cache Layers

Layer	Where	What it Caches	Cost Reduction
Semantic Cache	Application	Query → Response pairs	High (no LLM call)
Prompt Cache	API provider	System prompt + context KV	Medium (token discount)
KV Cache	Model server	Shared prefix attention KV	No cost reduction; latency

Semantic Cache Thresholds

text

Clinical AI:    ≥ 0.95 cosine similarity   (high stakes; low tolerance for mismatch)
Customer support: ≥ 0.85                  (lower stakes; broader matching acceptable)

Cache TTL by Document Type

python

CACHE_TTL = {
    "drug_interaction":   4 * 3600,     # 4 hours
    "formulary":          86400,        # 1 day
    "hospital_protocol":  3 * 86400,    # 3 days
    "clinical_guideline": 7 * 86400,    # 7 days
}

Never Cache

Patient-specific queries (PHI in responses)
Queries with implicit temporal context ("today's on-call")
Queries that trigger EHR writes or notifications

GPU Infrastructure — Quick Reference

VRAM Planning Formula

text

Total VRAM = Model Weights + KV Cache + CUDA Overhead + 15% Headroom
                                                         ↑
                              Set gpu_memory_utilization = 0.85 in vLLM

Illustrative GPU Sizing (verify with current vendor specs)

Model Size	Precision	Min VRAM	Topology
7B	FP16	~16 GB	1× mid-range GPU
13B	FP16	~30 GB	1× high-VRAM GPU
70B	FP16	~145 GB	4× 80 GB GPUs (NVLink)
70B	INT8	~75 GB	2× 80 GB GPUs (NVLink)

Illustrative — verify with official model documentation and serving framework VRAM calculator.

API Gateway — Quick Reference

Rate Limiting Unit

text

✅ Tokens per minute per team     (budget-meaningful)
❌ Requests per minute per team   (doesn't reflect actual cost)

PHI Logging Policy

python

# ✅ Safe to log
{"team_id": "clinical", "model": "claude-opus-4-8", "prompt_tokens": 450, "completion_tokens": 210}

# ❌ Never log for PHI teams
{"messages": "Patient John Doe, DOB 04/15/1960, MRN 123456, presents with..."}

Circuit Breaker Defaults

text

failure_threshold: 5 failures within 60 seconds → open circuit
cooldown_timeout: 60 seconds in open state → try half-open
recovery: first successful request closes circuit

Common Interview Questions

Q: What chunking strategy would you use for clinical guidelines and why? Section-boundary chunking — clinical guidelines are structured with numbered recommendations that are the unit of clinical utility. Splitting at token boundaries cuts recommendations in half, creating chunks that lack context.

Q: What is the difference between prompt caching and semantic caching? Prompt caching is provider-side: it caches the KV computation for repeated prompt prefixes, reducing cost and TTFT. Semantic caching is application-side: it stores previous query-response pairs indexed by embedding, returning cached responses for semantically similar new queries without any LLM call.

Q: Why should clinical AI rate limits be token-based, not request-based? A single clinical RAG request may consume 4,000 tokens (long context with retrieved chunks), while a simple lookup consumes 200. Request-level limits do not reflect actual API cost or provider rate limit consumption. Token-per-minute limits provide accurate budget control.

Q: What GPU metrics indicate that the inference fleet needs to scale out? KV cache utilization >90% and num<em>waiting</em>seqs >20 sustained are the leading indicators. GPU utilization alone is insufficient — the GPU may be fully utilized but all compute is wasted waiting for slow memory access. KV cache fill is the primary capacity metric for LLM inference.

AI Infrastructure — Quick Reference#

One-Line Definition#

Section Map#

Vector Database — Quick Reference#

ANN Index Selection#

HNSW Parameter Rule of Thumb#

Metadata Filtering#

LLM Serving — Quick Reference#

vLLM Production Command#

Quantization Selection#

Key vLLM Metrics#

Cloud AI Platform — Quick Reference#

LiteLLM Multi-Provider Abstraction#

Data Pipeline — Quick Reference#

Chunking Strategy by Document Type#

Incremental Pipeline Pattern#

Update Cadence by Document Type (Illustrative)#

Caching — Quick Reference#

Three Cache Layers#

Semantic Cache Thresholds#

Cache TTL by Document Type#

Never Cache#

GPU Infrastructure — Quick Reference#

VRAM Planning Formula#

Illustrative GPU Sizing (verify with current vendor specs)#

API Gateway — Quick Reference#

Rate Limiting Unit#

PHI Logging Policy#

Circuit Breaker Defaults#

Common Interview Questions#

See Also#

AI Infrastructure — Quick Reference

One-Line Definition

Section Map

Vector Database — Quick Reference

ANN Index Selection

HNSW Parameter Rule of Thumb

Metadata Filtering

LLM Serving — Quick Reference

vLLM Production Command

Quantization Selection

Key vLLM Metrics

Cloud AI Platform — Quick Reference

LiteLLM Multi-Provider Abstraction

Data Pipeline — Quick Reference

Chunking Strategy by Document Type

Incremental Pipeline Pattern

Update Cadence by Document Type (Illustrative)

Caching — Quick Reference

Three Cache Layers

Semantic Cache Thresholds

Cache TTL by Document Type

Never Cache

GPU Infrastructure — Quick Reference

VRAM Planning Formula

Illustrative GPU Sizing (verify with current vendor specs)

API Gateway — Quick Reference

Rate Limiting Unit

PHI Logging Policy

Circuit Breaker Defaults

Common Interview Questions

See Also