AI Infrastructure — Quick Reference

Common Interview Questions

Q: What chunking strategy would you use for clinical guidelines and why? Section-boundary chunking — clinical guidelines are structured with numbered recommendations that are the unit of clinical utility. Splitting at token boundaries cuts recommendations in half, creating chunks that lack context.
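A minimal sketch of section-boundary chunking, assuming each recommendation starts on its own line with a number followed by a period (for example "1." or "2.3."). The pattern and the function name `chunk_by_section` are illustrative assumptions; real guidelines would need a pattern tuned to their own numbering scheme.

```python
import re

# Assumed heading convention: a line that starts with "1. ", "2.3. ", etc.
# The trailing period is required so dose lines like "5 mg daily" do not match.
SECTION_START = re.compile(r"^\d+(?:\.\d+)*\.\s+", re.MULTILINE)


def chunk_by_section(text: str) -> list[str]:
    """Split text so that each chunk begins at a numbered recommendation."""
    starts = [m.start() for m in SECTION_START.finditer(text)]
    if not starts:
        # No numbered items found: fall back to a single chunk.
        return [text.strip()] if text.strip() else []

    chunks = []
    # Keep any preamble before the first numbered item as its own chunk.
    preamble = text[: starts[0]].strip()
    if preamble:
        chunks.append(preamble)

    # Each chunk runs from one section start to the next (or to end of text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunks.append(text[begin:end].strip())
    return chunks
```

Because the split points are section starts rather than token counts, a recommendation that spans several lines stays in one chunk. Very long sections would still need a secondary split, which this sketch does not attempt.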

Q: What is the difference between prompt caching and semantic caching? Prompt caching is provider-side: it caches the KV computation for repeated prompt prefixes, reducing cost and TTFT. Semantic caching is application-side: it stores previous query-response pairs indexed by embedding, returning cached responses for semantically similar new queries without any LLM call.
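The application-side half can be sketched as follows. This is a toy illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the 0.8 similarity threshold is an assumed value that would need tuning.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # assumed cutoff, tune per workload
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the cached response for the most similar stored query, if close enough."""
        q = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: the caller falls through to the LLM
```

On a hit, no LLM call is made at all, which is the key contrast with prompt caching, where the provider still runs the model but skips recomputing the shared prefix. In a clinical setting the threshold choice matters, since a loose threshold can return an answer to a different question.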

Q: Why should clinical AI rate limits be token-based, not request-based? A single clinical RAG request may consume 4,000 tokens (long context with retrieved chunks), while a simple lookup consumes 200. Request-level limits do not reflect actual API cost or provider rate limit consumption. Token-per-minute limits provide accurate budget control.
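One way to sketch a token-per-minute limit is a sliding 60-second window over token usage. The class name `TokenPerMinuteLimiter`, the injectable `clock` parameter, and the 5,000-token budget in the usage note are assumptions for illustration, not a specific provider's mechanism.

```python
import time
from collections import deque


class TokenPerMinuteLimiter:
    """Sliding-window limiter that budgets tokens, not requests."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.window = deque()  # (timestamp, tokens) pairs, oldest first
        self.used = 0  # tokens consumed inside the current window

    def allow(self, tokens: int) -> bool:
        """Record the request and return True if it fits the remaining budget."""
        now = self.clock()
        # Drop usage that has aged out of the 60-second window.
        while self.window and now - self.window[0][0] >= 60:
            self.used -= self.window.popleft()[1]
        if self.used + tokens > self.limit:
            return False
        self.window.append((now, tokens))
        self.used += tokens
        return True
```

With an assumed budget of 5,000 tokens per minute, one 4,000-token RAG request and a handful of 200-token lookups fit, but a second RAG request in the same minute is rejected. A request-count limit would treat all of those calls as equal.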

Q: What GPU metrics indicate that the inference fleet needs to scale out? KV cache utilization >90% and num_waiting_seqs >20, both sustained, are the leading indicators. GPU utilization alone is insufficient: a GPU can report near-full utilization while decoding is bound by memory bandwidth, so the figure says little about how much more load the server can take. KV cache fill is the primary capacity metric for LLM inference.
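The scale-out rule can be sketched as a check over recent metric samples. The `Sample` type, the field names, and the choice of three consecutive samples as "sustained" are assumptions for illustration, as is requiring both thresholds to be breached together; the 90% and 20 thresholds come from the answer above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    kv_cache_utilization: float  # fraction in [0, 1]
    num_waiting_seqs: int  # sequences queued, not yet scheduled


def should_scale_out(
    samples: list,
    kv_threshold: float = 0.90,
    waiting_threshold: int = 20,
    sustained: int = 3,  # assumed number of consecutive samples
) -> bool:
    """True when the last `sustained` samples all breach both thresholds."""
    if len(samples) < sustained:
        return False
    recent = samples[-sustained:]
    return all(
        s.kv_cache_utilization > kv_threshold
        and s.num_waiting_seqs > waiting_threshold
        for s in recent
    )
```

Requiring consecutive breaches keeps a single traffic spike from triggering a scale-out. GPU utilization is deliberately absent from the rule, in line with the point that it is not a reliable capacity signal on its own.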