AI Infrastructure — Quick Reference#
One-Line Definition#
The AI infrastructure layer is the compute, storage, networking, caching, and orchestration substrate that transforms LLM API calls and vector operations into reliable, scalable, cost-managed enterprise services.
Section Map#
| Chapter | Core Topic | Key Decision |
|---|---|---|
| 01 | Vector Databases | HNSW vs. IVF; hosted vs. self-managed |
| 02 | LLM Serving | vLLM vs. managed API; quantization level |
| 03 | Cloud AI Platforms | AWS Bedrock vs. Azure OpenAI vs. Vertex AI |
| 04 | Data Pipelines | Chunking strategy; incremental vs. full re-index |
| 05 | Orchestration | Airflow vs. Temporal; durable workflows |
| 06 | Caching | Semantic cache threshold; TTL by document type |
| 07 | GPU Infrastructure | VRAM planning; quantization vs. quality trade-off |
| 08 | API Gateway | Token rate limits; circuit breaker; PHI logging policy |
Vector Database — Quick Reference#
ANN Index Selection#
| Index | Best For | Parameters | Trade-off |
|---|---|---|---|
| HNSW | Low-latency similarity search | M (graph edges), ef_construction (build quality), ef (query beam) | High memory; best recall |
| IVF | Large-scale filtered search | nlists (clusters), nprobe (lists to search) | Coarser recall; supports pre-filtering |
| Flat | Ground truth / evaluation | None | Exact (no approximation); doesn't scale |
HNSW Parameter Rule of Thumb#
M = 16–64 (higher = better recall, more memory)
ef_construction = 200–400 (build time quality; set once)
ef = 50–200 (query time beam; increase for higher recall at cost of speed)
Metadata Filtering#
- Pre-filter (IVF): apply the metadata filter before the ANN search; the right choice when only a sparse subset of vectors survives the filter
- Post-filter (HNSW): run the ANN search first, then filter; risks returning fewer results than requested
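The difference between the two filtering orders can be seen in a small sketch. It uses exact cosine search as a stand-in for a real ANN index, and the names `DOCS`, `pre_filter_search`, and `post_filter_search` are hypothetical, introduced only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, items, k):
    """Exact nearest neighbours by cosine similarity (stand-in for an ANN index)."""
    return sorted(items, key=lambda it: cosine(query, it["vec"]), reverse=True)[:k]

def pre_filter_search(query, items, k, predicate):
    """Filter on metadata first, then search only the surviving vectors."""
    return top_k(query, [it for it in items if predicate(it)], k)

def post_filter_search(query, items, k, predicate):
    """Search first, then drop results that fail the metadata filter."""
    return [it for it in top_k(query, items, k) if predicate(it)]

# Hypothetical toy corpus: the two closest vectors belong to another department.
DOCS = [
    {"id": "a", "dept": "cardiology", "vec": [1.0, 0.0]},
    {"id": "b", "dept": "cardiology", "vec": [0.9, 0.1]},
    {"id": "c", "dept": "oncology", "vec": [0.8, 0.2]},
    {"id": "d", "dept": "oncology", "vec": [0.1, 0.9]},
]

def is_oncology(item):
    return item["dept"] == "oncology"

query = [1.0, 0.0]
pre = pre_filter_search(query, DOCS, 2, is_oncology)
post = post_filter_search(query, DOCS, 2, is_oncology)
```

With this data the pre-filter path returns two oncology documents, while the post-filter path returns none because both of the top-2 neighbours fail the filter. Production systems usually mitigate this by over-fetching before the filter.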
LLM Serving — Quick Reference#
vLLM Production Command#
# --disable-log-requests keeps prompt text out of server logs (PHI safety)
vllm serve <model-id> \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --disable-log-requests \
  --api-key "${VLLM_API_KEY}"
Quantization Selection#
| Precision | VRAM Reduction vs. FP16 | Quality Impact | Use When |
|---|---|---|---|
| FP16 | Baseline | None | Default; VRAM available |
| INT8 | ~50% | Near-identical; evaluate per task | First quantization choice |
| GPTQ/AWQ (4-bit) | ~75% | Noticeable on complex reasoning | VRAM constrained; less accuracy-critical |
Key vLLM Metrics#
vllm:gpu_cache_usage_perc → alert >90% (capacity pressure)
vllm:num_waiting_seqs → alert >20 sustained (scale out)
vllm:time_to_first_token → alert p95 >2s (latency SLO)
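A minimal sketch of how these three thresholds might be evaluated, assuming the metric values have already been scraped into plain Python numbers. The function and alert names are hypothetical, cache usage is assumed to be a fraction between 0 and 1, and "sustained" is interpreted here as every sample in the window exceeding the limit.

```python
def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]

def evaluate_alerts(cache_usage, waiting_window, ttft_seconds):
    """Return the names of alerts that should fire for one evaluation window.

    cache_usage    -- latest KV cache usage as a fraction (assumed 0.0 to 1.0)
    waiting_window -- recent samples of the waiting-sequence count
    ttft_seconds   -- recent time-to-first-token observations, in seconds
    """
    alerts = []
    if cache_usage > 0.90:
        alerts.append("kv_cache_pressure")
    # "Sustained" is treated as: every sample in the window is above 20.
    if waiting_window and all(w > 20 for w in waiting_window):
        alerts.append("scale_out")
    if ttft_seconds and p95(ttft_seconds) > 2.0:
        alerts.append("ttft_slo")
    return alerts

firing = evaluate_alerts(0.93, [25, 30, 22], [0.5] * 8 + [2.5, 3.0])
quiet = evaluate_alerts(0.50, [25, 5, 30], [0.5])
```

In practice the same rules would more likely live in the monitoring system's alerting configuration; the sketch only makes the threshold logic explicit.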
Cloud AI Platform Comparison#
| Dimension | AWS Bedrock | Azure OpenAI | Google Vertex AI |
|---|---|---|---|
| Auth | IAM / boto3 | Entra ID / AzureOpenAI | Service Account / vertexai SDK |
| Private networking | VPC endpoint | Private Link + VNet | Private Service Connect |
| HIPAA BAA | Available | Available | Available |
| Training data opt-out | Yes | Yes (PTU) | Yes (Vertex agreements) |
| Model selection | Anthropic, Meta, Mistral, Titan | OpenAI models | Google Gemini + tuned models |
| Unique strength | AWS ecosystem; Guardrails | Microsoft enterprise agreements | BigQuery / Analytics integration |
LiteLLM Multi-Provider Abstraction#
from litellm import completion

# Identical call syntax across providers
response = completion(
    model="anthropic/claude-opus-4-8",  # or "azure/gpt-4o" or "vertex_ai/gemini-pro"
    messages=[{"role": "user", "content": "..."}],
)
Data Pipeline — Quick Reference#
Chunking Strategy by Document Type#
| Document Type | Preferred Strategy | Rationale |
|---|---|---|
| Clinical guidelines | Section-boundary | Keeps recommendations intact |
| Unstructured clinical notes | Fixed-size with overlap | No reliable structure markers |
| Drug formulary (table data) | Row-per-chunk | Each row is a retrieval unit |
| Policy documents | Paragraph-boundary | Semantic units align with paragraphs |
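Two of the strategies in the table can be sketched in a few lines. This is an illustration under stated assumptions, not a production chunker: `chunk_by_section` assumes every section starts on a line such as "1. " (a hypothetical numbering format), and `chunk_with_overlap` works on a pre-tokenized list.

```python
import re

def chunk_by_section(text):
    """Section-boundary chunking: split before each line that starts with 'N. '."""
    parts = re.split(r"(?m)^(?=\d+\.\s)", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_with_overlap(tokens, size, overlap):
    """Fixed-size chunking where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks

guideline = "1. Give aspirin.\nDose: 81 mg.\n2. Monitor BP."
sections = chunk_by_section(guideline)
windows = chunk_with_overlap(list(range(10)), size=4, overlap=1)
```

The section splitter keeps each numbered recommendation together with its continuation lines, which is the property the table calls out for clinical guidelines.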
Incremental Pipeline Pattern#
import hashlib

# Skip unchanged documents using a content hash.
# Assumes `content` is bytes and `existing_hash` was stored at the last run.
content_hash = hashlib.sha256(content).hexdigest()
if existing_hash == content_hash:
    skip()  # No change detected
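The same check can be extended into a small planning step for a whole corpus. The sketch below assumes documents and stored hashes are held in plain dictionaries keyed by document ID; `plan_reindex` and its return shape are hypothetical.

```python
import hashlib

def content_hash(text):
    """SHA-256 hex digest of the document text (encoded as UTF-8)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(documents, stored_hashes):
    """Decide which documents need work in an incremental run.

    documents     -- {doc_id: current text}
    stored_hashes -- {doc_id: hash recorded at the previous run}
    Returns (to_index, to_delete): new or changed IDs, and IDs that vanished.
    """
    to_index = [
        doc_id
        for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in documents]
    return to_index, to_delete

previous = {
    "a": content_hash("unchanged"),
    "b": content_hash("old text"),
    "gone": content_hash("removed"),
}
current = {"a": "unchanged", "b": "new text", "c": "brand new"}
to_index, to_delete = plan_reindex(current, previous)
```

Tracking deletions alongside changes keeps stale chunks from lingering in the vector store after a source document is removed.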
Update Cadence by Document Type (Illustrative)#
| Document Type | Update Cadence | Rationale |
|---|---|---|
| Drug interaction database | Every 4 hours | Continuously updated |
| Hospital formulary | Daily | 5-business-day SLA for changes |
| Clinical protocols | Every 3 days | Safety-critical; low change rate |
| Clinical guidelines | Weekly | Quarterly publication cycle |
Caching — Quick Reference#
Three Cache Layers#
| Layer | Where | What it Caches | Cost Reduction |
|---|---|---|---|
| Semantic Cache | Application | Query → Response pairs | High (no LLM call) |
| Prompt Cache | API provider | System prompt + context KV | Medium (token discount) |
| KV Cache | Model server | Shared prefix attention KV | None (latency benefit only) |
Semantic Cache Thresholds#
Clinical AI: ≥ 0.95 cosine similarity (high stakes; low tolerance for mismatch)
Customer support: ≥ 0.85 (lower stakes; broader matching acceptable)
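A minimal sketch of a threshold-gated semantic cache lookup, assuming query embeddings are already computed elsewhere. The `SemanticCache` class is hypothetical and uses a linear scan; a real deployment would typically back this with a vector index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """Stores (embedding, response) pairs; returns a hit only above the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine(embedding, vec)
            if score > best_score:
                best_score, best_response = score, response
        # A near miss below the threshold is treated as a cache miss.
        return best_response if best_score >= self.threshold else None

clinical = SemanticCache(threshold=0.95)
support = SemanticCache(threshold=0.85)
for cache in (clinical, support):
    cache.put([1.0, 0.0], "cached answer")

similar_query = [0.9, 0.44]  # cosine similarity of roughly 0.90 to the cached entry
clinical_hit = clinical.get(similar_query)
support_hit = support.get(similar_query)
```

The same query misses the clinical cache and hits the customer support cache, which is the behavior the two thresholds are meant to produce.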
Cache TTL by Document Type#
CACHE_TTL = {
    "drug_interaction": 4 * 3600,       # 4 hours
    "formulary": 86400,                 # 1 day
    "hospital_protocol": 3 * 86400,     # 3 days
    "clinical_guideline": 7 * 86400,    # 7 days
}
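One way the TTL table might be applied at read time is sketched below. The `is_fresh` helper is hypothetical, and treating unknown document types as never cacheable is an assumption chosen for safety, not something the table itself specifies.

```python
import time

CACHE_TTL = {
    "drug_interaction": 4 * 3600,       # 4 hours
    "formulary": 86400,                 # 1 day
    "hospital_protocol": 3 * 86400,     # 3 days
    "clinical_guideline": 7 * 86400,    # 7 days
}

def is_fresh(doc_type, cached_at, now=None):
    """Return True if an entry cached at `cached_at` (epoch seconds) is still valid."""
    now = time.time() if now is None else now
    ttl = CACHE_TTL.get(doc_type)
    if ttl is None:
        return False  # assumption: unknown document types are never served from cache
    return now - cached_at < ttl
```

Passing `now` explicitly makes the check easy to test without waiting for real time to pass.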
Never Cache#
- Patient-specific queries (PHI in responses)
- Queries with implicit temporal context ("today's on-call")
- Queries that trigger EHR writes or notifications
GPU Infrastructure — Quick Reference#
Total VRAM = Model Weights + KV Cache + CUDA Overhead + 15% Headroom
The 15% headroom corresponds to setting gpu_memory_utilization = 0.85 in vLLM.
Illustrative GPU Sizing (verify with current vendor specs)#
| Model Size | Precision | Min VRAM | Topology |
|---|---|---|---|
| 7B | FP16 | ~16 GB | 1× mid-range GPU |
| 13B | FP16 | ~30 GB | 1× high-VRAM GPU |
| 70B | FP16 | ~145 GB | 4× 80 GB GPUs (NVLink) |
| 70B | INT8 | ~75 GB | 2× 80 GB GPUs (NVLink) |
Illustrative — verify with official model documentation and serving framework VRAM calculator.
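The arithmetic behind these estimates can be sketched as follows. The helper names, the default `overhead_gb` of 2.0, and the example model shape (32 layers, 32 KV heads, head dimension 128) are assumptions for illustration only. Dividing by the utilization fraction is one interpretation of the headroom term.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_billion, precision):
    """Weight memory in GB: one billion parameters at 1 byte each is 1 GB."""
    return params_billion * BYTES_PER_PARAM[precision]

def kv_cache_gb(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """KV cache size in GB for `tokens` cached tokens in total.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

def required_vram_gb(weights, kv, overhead_gb=2.0, utilization=0.85):
    """Total GPU memory needed so that weights + KV + overhead fit in the usable share."""
    return (weights + kv + overhead_gb) / utilization

w_7b = weights_gb(7, "fp16")
w_70b_fp16 = weights_gb(70, "fp16")
w_70b_int8 = weights_gb(70, "int8")
kv_8k = kv_cache_gb(layers=32, kv_heads=32, head_dim=128, tokens=8192)
total_7b = required_vram_gb(w_7b, kv_8k)
```

Weights alone account for most of the table's figures (14 GB for 7B at FP16, 140 GB for 70B at FP16, 70 GB for 70B at INT8); the KV cache term grows with context length and concurrency, so it should be sized from the expected workload.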
API Gateway — Quick Reference#
Rate Limiting Unit#
✅ Tokens per minute per team (budget-meaningful)
❌ Requests per minute per team (doesn't reflect actual cost)
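A token-per-minute limit per team can be sketched as a token bucket whose capacity is the per-minute budget. The class name `TokenBudgetLimiter` and the injectable `clock` parameter are hypothetical. The sketch keeps state in process memory, whereas a real gateway would usually need shared storage.

```python
import time

class TokenBudgetLimiter:
    """Per-team token bucket: capacity equals the tokens-per-minute budget."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_sec = tokens_per_minute / 60.0
        self.clock = clock
        self.buckets = {}  # team_id -> (available_tokens, last_update_time)

    def allow(self, team_id, tokens):
        """Return True and deduct `tokens` if the team has enough budget left."""
        now = self.clock()
        available, last = self.buckets.get(team_id, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        available = min(self.capacity, available + (now - last) * self.refill_per_sec)
        if tokens > available:
            self.buckets[team_id] = (available, now)
            return False
        self.buckets[team_id] = (available - tokens, now)
        return True

# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
limiter = TokenBudgetLimiter(6000, clock=lambda: fake_now[0])
first = limiter.allow("clinical", 4000)   # large RAG request fits
second = limiter.allow("clinical", 4000)  # a second one exceeds the budget
small = limiter.allow("clinical", 200)    # a small lookup still fits
fake_now[0] = 30.0                        # half a minute later: 3000 tokens refilled
later = limiter.allow("clinical", 4000)
other_team = limiter.allow("support", 4000)
```

Because the budget is counted in tokens, one large RAG request and twenty small lookups are charged in proportion to what they actually consume.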
PHI Logging Policy#
# ✅ Safe to log
{"team_id": "clinical", "model": "claude-opus-4-8", "prompt_tokens": 450, "completion_tokens": 210}
# ❌ Never log for PHI teams
{"messages": "Patient John Doe, DOB 04/15/1960, MRN 123456, presents with..."}
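One way to enforce this policy is an allow-list applied before anything reaches the logger, so that new fields are dropped by default. The `SAFE_FIELDS` tuple and `to_log_record` helper below are hypothetical names for this sketch.

```python
import json

# Only metadata fields that cannot carry PHI are allowed through.
SAFE_FIELDS = ("team_id", "model", "prompt_tokens", "completion_tokens")

def to_log_record(request_meta):
    """Keep allow-listed fields only; prompts and completions are dropped."""
    return {key: request_meta[key] for key in SAFE_FIELDS if key in request_meta}

raw = {
    "team_id": "clinical",
    "model": "claude-opus-4-8",
    "prompt_tokens": 450,
    "completion_tokens": 210,
    "messages": "Patient John Doe, DOB 04/15/1960, MRN 123456, presents with...",
}
record = to_log_record(raw)
log_line = json.dumps(record)
```

An allow-list fails safe: a field added to the request later stays out of the logs until someone deliberately adds it to `SAFE_FIELDS`.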
Circuit Breaker Defaults#
failure_threshold: 5 failures within 60 seconds → open circuit
cooldown_timeout: 60 seconds in open state → try half-open
recovery: first successful request in half-open state closes circuit; a failure reopens it
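These defaults map onto a small state machine. The sketch below is one possible reading: the class and method names are hypothetical, failures are counted in a sliding 60-second window, and the half-open state is simplified to let every request through until the first outcome is recorded.

```python
import time
from collections import deque

class CircuitBreaker:
    """Closed -> open after N failures in a window; open -> half-open after cooldown."""

    def __init__(self, failure_threshold=5, window_s=60.0, cooldown_s=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None
        self.state = "closed"

    def allow_request(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"  # cooldown elapsed: allow a trial request
                return True
            return False
        return True

    def record_success(self):
        if self.state == "half_open":
            self.state = "closed"  # first success after cooldown closes the circuit
            self.failures.clear()

    def record_failure(self):
        now = self.clock()
        if self.state == "half_open":
            self._open(now)  # trial request failed: reopen immediately
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()  # drop failures outside the window
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```

The caller is expected to check `allow_request()` before each upstream call and then report the outcome with `record_success()` or `record_failure()`.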
Common Interview Questions#
Q: What chunking strategy would you use for clinical guidelines and why?
A: Section-boundary chunking. Clinical guidelines are structured as numbered recommendations, and each recommendation is the unit of clinical utility. Splitting at fixed token counts can cut a recommendation in half, producing chunks that lack context.

Q: What is the difference between prompt caching and semantic caching?
A: Prompt caching is provider-side: it caches the KV computation for repeated prompt prefixes, reducing cost and TTFT. Semantic caching is application-side: it stores previous query-response pairs indexed by embedding and returns the cached response for a semantically similar new query without any LLM call.

Q: Why should clinical AI rate limits be token-based, not request-based?
A: A single clinical RAG request may consume 4,000 tokens (long context with retrieved chunks), while a simple lookup consumes 200. Request-level limits reflect neither actual API cost nor provider rate limit consumption. Token-per-minute limits provide accurate budget control.

Q: What GPU metrics indicate that the inference fleet needs to scale out?
A: KV cache utilization >90% and num_waiting_seqs >20 sustained are the leading indicators. GPU utilization alone is insufficient: a GPU can report full utilization while most of that time is spent waiting on slow memory access instead of doing useful compute. KV cache fill is the primary capacity metric for LLM inference.
See Also#