AI Infrastructure — Quick Reference
Common Interview Questions
Q: What chunking strategy would you use for clinical guidelines and why? Section-boundary chunking: clinical guidelines are structured as numbered recommendations, and each recommendation is the unit of clinical utility. Splitting at fixed token counts can cut a recommendation in half, producing chunks that lack the context needed to interpret them.
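A minimal sketch of section-boundary chunking, under the assumption that each recommendation begins on its own line with a dotted number such as "1.1" or "2.3.1". The pattern and the function name `chunk_by_section` are hypothetical; real guidelines vary in layout and would need a pattern tuned to the source document.

```python
import re

# Assumption: a recommendation starts at the beginning of a line with a
# dotted number ("1.1", "2.3.1") followed by whitespace.
_HEADING = re.compile(r"^\d+(?:\.\d+)+\s", re.MULTILINE)

def chunk_by_section(text: str) -> list[str]:
    """Split guideline text so each numbered recommendation stays whole."""
    starts = [m.start() for m in _HEADING.finditer(text)]
    # Chunk boundaries: document start, every heading, document end.
    bounds = [0] + [s for s in starts if s != 0] + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]  # drop empty chunks
```

Text before the first numbered heading is kept as its own chunk, and continuation lines stay attached to the recommendation they follow.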
Q: What is the difference between prompt caching and semantic caching? Prompt caching is provider-side: it caches the KV computation for repeated prompt prefixes, reducing cost and TTFT. Semantic caching is application-side: it stores previous query-response pairs indexed by embedding, returning cached responses for semantically similar new queries without any LLM call.
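The application-side half of that answer can be sketched as follows. This is illustrative only: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the class name and the 0.9 threshold are assumptions rather than values from any particular library.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold   # minimum similarity to count as a hit
        self.entries = []            # list of (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))

    def get(self, query: str):
        """Return the cached response of the most similar query, or None."""
        q = toy_embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

On a hit, the application returns the stored response and skips the LLM call; on a miss, it calls the model and stores the new pair with `put`.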
Q: Why should clinical AI rate limits be token-based, not request-based? A single clinical RAG request may consume 4,000 tokens (long context with retrieved chunks), while a simple lookup consumes 200. Request-level limits do not reflect actual API cost or provider rate limit consumption. Token-per-minute limits provide accurate budget control.
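One way to enforce a token-per-minute budget is a token bucket that charges each request its token count. The sketch below assumes the caller can estimate tokens before sending; the class name, the `clock` parameter, and the 6,000 TPM budget in the usage note are hypothetical.

```python
import time

class TokenRateLimiter:
    """Token bucket that refills continuously at tokens_per_minute."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_sec = tokens_per_minute / 60.0
        self.available = self.capacity
        self.clock = clock            # injectable so tests can control time
        self.last = clock()

    def try_consume(self, tokens: int) -> bool:
        """Charge `tokens` against the budget; return False if over limit."""
        now = self.clock()
        elapsed = now - self.last
        self.available = min(self.capacity,
                             self.available + elapsed * self.refill_per_sec)
        self.last = now
        if tokens <= self.available:
            self.available -= tokens
            return True
        return False
```

With a 6,000 TPM budget, one 4,000-token RAG request leaves room for ten 200-token lookups but not for a second RAG request, which a request-count limit would not distinguish.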
Q: What GPU metrics indicate that the inference fleet needs to scale out? Sustained KV cache utilization >90% and num_waiting_seqs >20 are the leading indicators. GPU utilization alone is insufficient: a GPU can report full utilization while its compute units are stalled waiting on slow memory access. KV cache fill is the primary capacity metric for LLM inference.
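A scale-out check based on those two signals might look like the sketch below. It assumes metrics are sampled at a fixed interval and treats "sustained" as both thresholds being exceeded in every sample of the window; the `Sample` type, `should_scale_out`, and the five-sample minimum are hypothetical choices, and some teams may prefer to trigger on either signal alone.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    kv_cache_util: float     # fraction of KV cache in use, 0.0 to 1.0
    num_waiting_seqs: int    # sequences queued waiting for a slot

def should_scale_out(window, kv_threshold=0.90, waiting_threshold=20,
                     min_samples=5) -> bool:
    """True when both signals exceed their thresholds across the window."""
    if len(window) < min_samples:
        return False  # not enough history to call the load sustained
    return all(
        s.kv_cache_util > kv_threshold
        and s.num_waiting_seqs > waiting_threshold
        for s in window
    )
```

Requiring every sample in the window to breach the thresholds keeps a brief spike from triggering a scale-out.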