AI System Design Problems

How to Use This Document

Each problem is presented in interview format: a scenario statement, scaling and constraint parameters, and then a complete structured solution. Work through each problem yourself before reading the solution. Time yourself: most system design interviews give 45–60 minutes per problem.

Active preparation approach:

  1. Read the scenario and constraints (2 min)
  2. Write your own approach without reading the solution (35–40 min)
  3. Compare your approach to the provided framework
  4. Identify what you missed or would phrase differently
  5. Repeat until the structural patterns feel automatic

Problem 1: Design a Clinical RAG System for a Hospital

Scenario: A 500-bed acute care hospital wants an AI system that allows hospitalist physicians to query clinical knowledge bases (guidelines, protocols, formulary) during rounds from a mobile device. The system must respond in under 3 seconds and must cite its sources.

Scale parameters:

  • 300 concurrent clinical users peak
  • 50,000 queries per day
  • Knowledge base: 10,000 documents (guidelines, protocols, 50,000 pages total)
  • Compliance: HIPAA; FHIR integration not required for this feature (knowledge queries only)

Solution Framework:

Requirements clarification:

  • Query types: "What is the first-line antibiotic for CAP?" (no patient context needed)
  • Response format: plain text answer + 2–3 citations with document title, section, effective date
  • Update cadence: knowledge base refreshed weekly as guidelines are updated
  • Mobile constraints: low-latency (≤3s) response, small response payload

Architecture components:

text
Knowledge Base Pipeline (offline):
PDF/DOCX Sources → Document Extractor → Section-Boundary Chunker
→ Embedding Service (batch) → Quality Validation → pgvector (Supabase/RDS)
→ Document Store (metadata + source text) → Scheduled: nightly refresh

Query Path (online):
Mobile Client → AI Gateway (auth, rate limit) → Semantic Cache lookup
→ [Cache hit] Return cached response
→ [Cache miss] Embed query → ANN search pgvector (top-5 chunks)
→ Build prompt (system + chunks + query) with prompt caching → Claude (Sonnet tier)
→ Output validation → Store in semantic cache → Return response + citations

Critical design decisions:

  1. Chunking strategy: Section-boundary chunking for clinical guidelines, because a recommendation is the unit of clinical utility. Fixed-size chunking splits recommendations across chunks.
  2. Semantic cache: Threshold 0.95 cosine similarity. Cache TTL by document type: formulary 1 day, guidelines 7 days. Invalidate cache on knowledge base update (event-driven: pipeline publishes document.updated event → cache invalidation handler).
  3. Citation metadata: Every chunk stores {source_organization, document_title, section, effective_date}. Returned in every response. Non-negotiable for clinical use.
  4. No patient data in this system: These are population-level knowledge queries. No PHI in prompt, no BAA complexity for knowledge queries.
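  5. Prompt caching: The static system prompt (instructions, output format, citation rules) is cached at the API provider layer (Anthropic prompt caching). The retrieved chunks and the query change per request, so the saving applies to the shared prefix only; it is still a meaningful cost reduction when the system prompt is long.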
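The semantic cache decision can be sketched as follows. This is a minimal in-memory illustration, assuming embeddings are plain lists of floats; `CacheEntry`, `lookup`, and the `doc_type` labels are hypothetical names, and a production cache would sit in a vector-capable store rather than a Python list.

```python
import math
import time
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.95  # from the design decision above
TTL_SECONDS = {"formulary": 24 * 3600, "guideline": 7 * 24 * 3600}


@dataclass
class CacheEntry:
    embedding: list
    response: str
    doc_type: str      # hypothetical label: "formulary" or "guideline"
    stored_at: float   # epoch seconds


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(entries, query_embedding, now=None):
    """Return the most similar non-expired entry at or above the threshold, else None."""
    now = time.time() if now is None else now
    best, best_score = None, SIMILARITY_THRESHOLD
    for entry in entries:
        if now - entry.stored_at > TTL_SECONDS[entry.doc_type]:
            continue  # expired for its document type
        score = cosine_similarity(query_embedding, entry.embedding)
        if score >= best_score:
            best, best_score = entry, score
    return best
```

A linear scan is only for readability; the same threshold and TTL checks apply when the lookup is an ANN query.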

Failure modes and mitigations:

  • LLM timeout: semantic cache returns stale response with cached: true, cached_at: timestamp
  • Knowledge base update failure: stale documents served with effective_date visible to user; alert on pipeline failure
  • High load: rate limit per user (60 requests/min); queue excess requests

Scaling to 10x:

  • 500,000 queries/day: semantic cache hit rate rises with volume, which reduces LLM calls
  • Add read replicas to pgvector for ANN search throughput
  • Shard knowledge base by department (cardiology, pharmacy, oncology) to reduce ANN search latency

Non-functional requirements:

  • No PHI → minimal HIPAA surface; standard TLS + authentication sufficient
  • Observability: latency_p95, cache_hit_rate, citation_count, disclaimer_present_rate in quality log
  • Cost: estimate 3,000 tokens average per query × 50,000 queries/day × model price (verify current pricing)
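The cost estimate is simple arithmetic, sketched below. The price argument is a placeholder, not a real rate, and a single blended per-token price is an assumption (providers typically price input and output tokens differently).

```python
def daily_llm_cost(queries_per_day, tokens_per_query, usd_per_million_tokens):
    """Return (tokens per day, USD per day) for a blended per-token price."""
    daily_tokens = queries_per_day * tokens_per_query
    return daily_tokens, daily_tokens / 1_000_000 * usd_per_million_tokens


# 50,000 queries/day at 3,000 tokens each; the 1.0 USD price is a placeholder.
tokens, cost = daily_llm_cost(50_000, 3_000, usd_per_million_tokens=1.0)
print(f"{tokens:,} tokens/day")
```

At this volume the system processes 150 million tokens per day before any cache hits, so the cache hit rate is the main lever on cost.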

Problem 2: Design a Prior Authorization AI Agent

Scenario: A health system processes 2,000 prior authorization (PA) requests per day. Currently, clinical staff spend 15 minutes per PA manually pulling patient records and writing justifications. Design an AI agent system that automates the clinical evidence gathering and draft justification generation.

Scale parameters:

  • 2,000 PAs per day (peak: 300 in 2-hour morning window)
  • 10 clinical staff currently doing PA work
  • Average clinical evidence: 3–5 FHIR resources per PA
  • PA response must include: clinical justification letter + supporting evidence citations
  • Compliance: HIPAA, PHI in all requests

Solution Framework:

Requirements clarification:

  • Human-in-loop: clinical staff must review and approve every AI-generated PA; AI produces draft
  • Latency: PA drafts expected within 5 minutes of order entry
  • Integration: Epic FHIR R4 for patient context; payer portal APIs for submission (out of scope for AI)
  • Accuracy: PA denials currently at 15%; target reduction to <10% with AI assistance

Architecture (event-driven + durable workflow):

text
Order Entry Event:
EHR → Kafka (clinical-orders topic, partition by patient_id) 
→ PA AI Consumer (consumer group: prior-auth-ai)

Per PA workflow (Temporal durable workflow):
1. extract_patient_context (FHIR: Patient, Encounter, Condition, MedicationRequest, Procedure)
2. retrieve_clinical_evidence (RAG: clinical guidelines matching diagnosis + medication)
3. generate_pa_justification (LLM: Claude Opus, for the strongest clinical reasoning)
4. create_ehr_task (FHIR DocumentReference: "preliminary" draft + PA review task in Epic)
5. notify_clinical_staff (EHR In-Basket notification)
6. await_staff_review (Temporal timer: poll for task completion, timeout 8h with escalation)
7. finalize_and_submit (if approved: update docStatus "final"; submit to payer)

Why Temporal instead of Airflow: The workflow includes a human approval step that may take minutes to hours. Temporal persists the workflow state during the wait without blocking compute resources. Airflow workers would be held indefinitely. Full execution history also provides the audit trail for HIPAA.

PHI handling:

  • FHIR context contains PHI: confirm BAA with LLM provider before deployment
  • All FHIR reads emit HIPAA audit events (patient_id, user=service account, action=read)
  • Temporal workflow history contains PHI references, so the Temporal cluster must be deployed in the PHI data zone
  • AI-generated PA justification stored as FHIR DocumentReference (PHI store)

LLM selection: Claude Opus tier. PA justification requires complex clinical reasoning drawing on diagnosis codes, clinical guidelines, and payer clinical criteria. Quality over speed for this use case.

Failure modes:

  • FHIR context retrieval fails: retry with exponential backoff (3 attempts); if exhausted, escalate to clinical staff with partial context
  • LLM generation fails: retry 2×; route to clinical staff with pre-filled template
  • Staff does not review within 8h: escalate to supervisor; send reminder at 4h
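The retry policy in the failure modes above can be sketched as a small helper. This is an illustration of the policy only; `retry_with_backoff` and its parameters are hypothetical, and in a Temporal deployment the same behavior would normally be declared as an activity retry policy rather than hand-written.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # a real worker would catch only transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    # Attempts exhausted: the caller escalates to clinical staff with partial context.
    raise last_error
```

The `sleep` parameter is injectable so the policy can be tested without real waiting.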

Scale to 10x (20,000 PAs/day):

  • Kafka partition count: increase to 24 (currently 6)
  • Temporal worker pool: scale horizontally
  • LLM cost: 10x increase; evaluate Sonnet tier for simpler PAs (routine medication refills) and keep Opus for complex criteria

Problem 3: Design an AI API Gateway for 100 Engineering Teams

Scenario: A large enterprise has 100 product teams who want to use AI (LLM APIs) in their applications. The AI platform team needs to design a centralized AI gateway that provides governance, cost control, observability, and failover without requiring each team to implement these independently.

Scale parameters:

  • 100 teams, 500 developers
  • Estimated peak load: 1M tokens/minute across all teams
  • Mix of use cases: customer support, code generation, document summarization, internal search
  • Providers: Anthropic and Azure OpenAI (dual-provider for resilience)
  • Compliance: SOC 2, not healthcare

Solution Framework:

Requirements clarification:

  • Per-team token budgets with enforcement (not just monitoring)
  • Model tier routing: teams should specify capability level (Frontier/Standard/Efficient), not model names
  • Circuit breaking: if Anthropic is degraded, automatically route to Azure OpenAI
  • Cost attribution: monthly cost report per team for budget showback

Gateway architecture:

text
Client Teams → AI Gateway (FastAPI, 3 replicas, Kubernetes)
  ↓
Auth Layer: JWT validation (team service account, team_id extracted from token)
  ↓
AuthZ Layer: Team config lookup (Redis: team_id → {budget, allowed_tiers, phi_access})
  ↓
Rate Limit Layer: Token sliding window (Redis: team_id → [(timestamp, tokens)] last 60s)
  ↓
Router Layer: Select provider based on {requested_tier, circuit_breaker_state, cost_policy}
  ↓
Circuit Breaker: Redis-backed state per provider (failure_count, is_open, half_open_until)
  ↓
Provider Call: Anthropic API / Azure OpenAI (TLS 1.3, timeout 30s)
  ↓
Metrics Layer: Emit {team_id, provider, model, use_case, prompt_tokens, completion_tokens}
  ↓
Return Response to Client

Rate limiting design:

  • Unit: tokens per minute (not requests per minute)
  • Implementation: sliding window in Redis with sorted set (ZADD by timestamp)
  • Enforcement: check before routing; reject with 429 + Retry-After if exceeded
  • Budget: monthly token cap per team; alert at 80%, auto-downgrade to Efficient tier at 90%, block at 100%
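A minimal sketch of the token-based sliding window and the budget thresholds follows. It uses an in-memory deque as a stand-in for the Redis sorted set, and the class, function, and action names are hypothetical.

```python
from collections import deque


class TokenRateLimiter:
    """In-memory stand-in for the Redis sorted set of (timestamp, tokens) per team."""

    def __init__(self, tokens_per_minute, window_seconds=60):
        self.limit = tokens_per_minute
        self.window = window_seconds
        self.events = {}  # team_id -> deque of (timestamp, tokens)

    def allow(self, team_id, tokens, now):
        """Return (allowed, retry_after_seconds) for a request of `tokens` tokens."""
        q = self.events.setdefault(team_id, deque())
        while q and q[0][0] <= now - self.window:
            q.popleft()  # entry has left the sliding window
        used = sum(t for _, t in q)
        if used + tokens > self.limit:
            # Retry once the oldest entry expires; the gateway returns 429 + Retry-After.
            retry_after = q[0][0] + self.window - now if q else 0
            return False, retry_after
        q.append((now, tokens))
        return True, 0


def budget_action(used_tokens, monthly_cap):
    """Map monthly usage to the policy above: alert at 80%, downgrade at 90%, block at 100%."""
    ratio = used_tokens / monthly_cap
    if ratio >= 1.0:
        return "block"
    if ratio >= 0.9:
        return "downgrade_to_efficient"
    if ratio >= 0.8:
        return "alert"
    return "ok"
```

With three gateway replicas the window state has to live in shared storage; the deque only illustrates the bookkeeping.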

Circuit breaker design:

  • Per-provider state in Redis (shared across all 3 gateway replicas)
  • Open after 5 failures in 60 seconds
  • Half-open after 60 seconds: allow 1 test request
  • Success resets counter; failure extends open period
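The circuit breaker rules above map to a small state machine. The sketch below keeps state in memory for readability, whereas the design stores it in Redis so all replicas agree; the class and method names are hypothetical.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_seconds=60, open_seconds=60):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.failures = []          # timestamps of recent failures while closed
        self.open_until = None      # None means the circuit is closed
        self.probe_in_flight = False

    def allow_request(self, now):
        if self.open_until is None:
            return True             # closed: all traffic flows
        if now < self.open_until:
            return False            # open: route to the fallback provider
        if self.probe_in_flight:
            return False            # half-open: only one test request at a time
        self.probe_in_flight = True
        return True

    def record_success(self, now):
        # Success resets the counter and closes the circuit.
        self.failures.clear()
        self.open_until = None
        self.probe_in_flight = False

    def record_failure(self, now):
        if self.open_until is not None:
            # A failed probe extends the open period.
            self.open_until = now + self.open_seconds
            self.probe_in_flight = False
            return
        self.failures = [t for t in self.failures if t > now - self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.open_until = now + self.open_seconds
```

The router checks `allow_request` for the primary provider and falls through to the fallback when it returns False.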

Model tier routing table:

Capability Tier | Primary Provider | Fallback Provider
Frontier | Anthropic (claude-opus-4-8) | Azure OpenAI (gpt-4o)
Standard | Anthropic (claude-sonnet-4-6) | Azure OpenAI (gpt-4o-mini)
Efficient | Anthropic (claude-haiku-4-5) | Azure OpenAI (gpt-3.5-turbo)

Cost attribution pipeline:

  • Every gateway response: emit to Kafka (ai-cost-events topic)
  • Kafka consumer: aggregate by team_id + model + day → DynamoDB/BigQuery cost table
  • Monthly: auto-generate cost report per team; surface in internal developer portal

Scale to 2M tokens/minute (2x):

  • Scale gateway replicas from 3 to 6
  • Redis rate limiter: use Redis Cluster (sharded by team_id)
  • Kafka cost events: add partitions to cost-events topic

Non-functional requirements:

  • Gateway latency overhead: target <10ms added latency (authentication, rate limit check)
  • Availability: 99.9% (Anthropic as primary with Azure OpenAI as fallback, switched by the circuit breaker)

Problem 4: Design a Clinical Documentation AI System

Scenario: Design an AI system for ambient clinical documentation at a hospital. Physicians wear a badge microphone during patient encounters. The system listens, generates structured clinical notes, and submits them to the EHR for physician review and signature.

Scale parameters:

  • 200 physicians, average 10 encounters/day
  • Encounter duration: 8–25 minutes
  • Note required within 60 minutes of encounter completion
  • Target: physician reviews and signs note, reducing documentation time from 15 min to 3 min
  • Compliance: HIPAA (audio + transcript + note = PHI)

Solution Framework:

Requirements clarification:

  • Audio capture: edge device (badge) or in-room device? → Badge device, continuous capture
  • Speaker diarization: physician voice vs. patient voice must be distinguished
  • Note structure: SOAP format (Subjective, Objective, Assessment, Plan) matching specialty template
  • Output channel: FHIR DocumentReference (preliminary) → EHR physician review task

Pipeline architecture:

text
Audio Capture: Badge device → encrypted upload (TLS) → Audio Processing Service
  ↓
Speech-to-Text: Whisper / Azure Speech (medical vocabulary fine-tuned)
  + Speaker Diarization (physician vs. patient)
  ↓
Transcript Processing:
  - Remove identifying patient name (replace with [PATIENT])
  - De-identify audio transcript before LLM processing (PHI minimization)
  ↓
Note Generation (LLM: Claude Opus):
  System prompt: specialty-specific SOAP template
  Input: de-identified transcript
  Output: structured SOAP note draft with identified gaps
  ↓
Note Post-Processing:
  - Re-insert patient name, DOB from EHR context (maintained separately)
  - ICD-10 code suggestion (secondary classification model)
  - PHI validation (confirm note is complete and accurate)
  ↓
EHR Write (FHIR DocumentReference):
  - docStatus: "preliminary" (never "final"; physician must sign)
  - Extension: {ai-generated: true, model-version: X, encounter-id: Y}
  ↓
Physician Review (EHR workflow):
  - EHR shows AI draft; physician reviews, edits, signs
  - On signature: docStatus → "final"
  - Physician edit distance tracked (quality metric)

PHI architecture:

  • Audio is PHI: badge device encrypts before transmission; audio stored in PHI data zone
  • Transcript is PHI: processed in PHI data zone; PHI minimized before LLM call
  • LLM provider must have BAA: Azure OpenAI (BAA available) or self-hosted inference
  • All processing steps logged in HIPAA audit trail

Quality monitoring:

  • Physician edit distance: percentage of words changed before signing (track per physician, per specialty)
  • Signing latency: time from note delivery to physician signature
  • Note rejection rate: physician deletes AI draft and writes from scratch (quality signal)
  • Alert: if edit distance >30% for a physician consistently → model is not working for that specialty
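One way to compute the edit-distance metric is a word-level diff between the AI draft and the signed note. The sketch below uses the standard library's `difflib`; the function name and the choice of denominator (the longer of the two notes) are assumptions, not a fixed definition.

```python
from difflib import SequenceMatcher

ALERT_THRESHOLD = 0.30  # from the alert rule above


def word_edit_fraction(ai_draft, signed_note):
    """Fraction of words that differ between the AI draft and the signed note."""
    draft_words = ai_draft.split()
    signed_words = signed_note.split()
    if not draft_words and not signed_words:
        return 0.0
    matcher = SequenceMatcher(None, draft_words, signed_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / max(len(draft_words), len(signed_words))


fraction = word_edit_fraction("patient denies chest pain", "patient reports chest pain")
print(f"{fraction:.0%} of words changed")
```

Averaging this value per physician and per specialty over a rolling window gives the trend the alert rule needs; a single note above 30% is not a signal on its own.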

Scale to 10x (2,000 physicians):

  • Speech-to-text: scale transcription workers horizontally
  • Note generation: batch if within SLA; parallel processing by encounter
  • EHR write: FHIR write throughput may be the bottleneck; add a FHIR write queue

Problem 5: Design an AI Observability Platform

Scenario: Your company runs 15 AI-powered features across multiple products. Quality has degraded twice in the past year without being detected until users complained. Design an AI observability platform that detects quality regressions before users notice.

Scale parameters:

  • 15 AI features across 3 products
  • 500K AI requests per day total
  • 5 model versions in active use across features
  • Need: detect regressions within 1 hour of deployment

Solution Framework:

Requirements clarification:

  • What defines "quality"? → Accuracy on golden query set, user feedback signals, content quality heuristics
  • Detection latency: 1 hour from deployment to alert
  • Response: alert on regression, ability to rollback model version

Three-layer observability:

text
Layer 1: Operational Metrics (infrastructure health)
  - Latency p50/p95/p99 per feature per model version
  - Error rate, timeout rate, content filter rate
  - Token usage (cost proxy), cache hit rate
  → Prometheus + Grafana; alert on significant deviation from baseline

Layer 2: Quality Heuristics (output quality signals, no ground truth needed)
  - Disclaimer present rate (clinical features)
  - Citation count per response
  - Response length distribution
  - Finish reason distribution (stop vs. max_tokens)
  → Quality log → Athena/BigQuery; daily trend analysis

Layer 3: Golden Query Evaluation (ground truth accuracy)
  - 50-100 golden queries per feature with expected outputs
  - Run automatically after each deployment
  - Score: semantic similarity to expected outputs (embedding cosine similarity)
  - Alert: if score drops >10% from baseline
  → Evaluation pipeline (triggered by deployment); alert to Slack/PagerDuty

Golden query evaluation pipeline:

python
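# Triggered on deployment or on schedule (daily minimum).
# load_golden_queries, call_ai_feature, semantic_similarity, alert, and
# trigger_rollback_review are platform helpers assumed to exist elsewhere.
from statistics import mean

REGRESSION_THRESHOLD = 0.90  # alert when the score drops more than 10% below baseline

for feature in ai_features:
    golden_set = load_golden_queries(feature.id)
    scores = []
    for query in golden_set:
        response = call_ai_feature(feature, query.input)
        # Embedding cosine similarity between actual and expected output
        scores.append(semantic_similarity(response.output, query.expected_output))
    if not scores:
        continue  # no golden queries for this feature; nothing to score
    mean_score = mean(scores)  # mean similarity, not MRR (no ranking is involved)
    if mean_score < feature.baseline_score * REGRESSION_THRESHOLD:
        alert(f"Quality regression detected in {feature.name}: "
              f"mean similarity {mean_score:.3f} vs baseline {feature.baseline_score:.3f}")
        trigger_rollback_review(feature)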

User feedback integration:

  • Thumbs up/down embedded in AI response UI
  • Flag button: "This response was incorrect"
  • Track feedback rate per feature per model version
  • Alert: if flag rate >2% sustained for 30 minutes

Rollback mechanism:

  • AI gateway routes by model tier (not model version) → update tier→model mapping in Redis
  • Rollback: update Redis config to route Frontier tier → previous model version
  • No code deployment required

Problem 6: Design a Secure AI Platform for Healthcare (Zero Trust)

Scenario: A healthcare system wants to deploy an AI platform that serves clinical applications across 5 hospitals, ensuring PHI never leaves each hospital's data residency boundary and all AI access is auditable for HIPAA compliance.

Scale parameters:

  • 5 hospitals, 1,000 clinical users total
  • PHI must stay within each hospital's data center (on-premises preference) or in designated cloud regions
  • Audit trail required: every AI-assisted clinical decision must be traceable
  • LLM inference: self-hosted preference to avoid external PHI transmission

Solution Framework:

Architecture (per hospital):

text
Zero Trust Network Segmentation (per hospital deployment):

External Zone: Clinician browser (VPN required for PHI features)
  ↓ (mTLS)
DMZ (AI Gateway Zone): Zero Trust Gateway (auth, authz, rate limiting, PHI access control)
  ↓ (JWT + mTLS)
Internal (AI Processing Zone): RAG Service, Agent Orchestrator (no external network access)
  ↓ (service account, mTLS)
Protected (PHI Data Zone): FHIR Proxy, HIPAA Audit Log (no egress)
  ↓ (internal only)
GPU Inference Zone: Self-hosted vLLM (on-premises GPU); PHI never leaves hospital network

PHI data residency enforcement:

  • Each hospital: dedicated AI gateway instance + dedicated vLLM server (on-prem GPU)
  • No cross-hospital PHI sharing: clinical context from Hospital A is never sent to Hospital B's inference server
  • Audit log per hospital: immutable CloudWatch log (hospital's own AWS account / on-premises ELK)

Why self-hosted inference:

  • PHI in prompt → any external LLM API requires BAA AND data transmission to cloud
  • On-premises vLLM: PHI stays on hospital network; no cloud transmission; BAA not required for inference
  • Trade-off: capital cost (GPU servers), operational burden (inference team), model update cadence

HIPAA audit trail design:

  • Every FHIR read by AI: emit {event_type: phi_access, user_id, patient_id, resource_type, timestamp, request_id}
  • Every AI inference: emit {event_type: ai_inference_with_phi, model_version, use_case, request_id}
  • Audit log: write-once (WORM) storage; 6-year retention; access restricted to compliance team
  • Audit events correlated by request_id across FHIR and inference logs

mTLS everywhere:

  • Gateway ↔ RAG service, Gateway ↔ FHIR proxy, RAG ↔ vLLM: all mTLS
  • Certificate authority: HashiCorp Vault PKI (on-premises CA)
  • Certificate lifetime: 90 days; auto-rotation via cert-manager

Problem 7: Design a Real-Time Clinical Alert System

Scenario: A hospital wants an AI system that monitors incoming laboratory results and generates real-time alerts for clinicians when results indicate a patient may be deteriorating. Alert must reach the responsible physician within 60 seconds of lab result posting.

Scale parameters:

  • 5,000 lab results per day
  • 200 critical results per day (requiring immediate alert)
  • Physician response acknowledgment required within 15 minutes
  • Compliance: HIPAA, PHI in all lab data

Solution Framework:

Event-driven architecture:

text
Lab IS → HL7 v2 ORU message (result)
  ↓ (HL7 v2 TCP/MLLP)
HL7 Bridge: Parse ORU → structured JSON → Kafka topic (lab-results, partition by patient_id)
  ↓
AI Alert Consumer (consumer group: ai-critical-value-alert):
  1. Deserialize lab result event
  2. Retrieve patient context (FHIR: Patient, active conditions, current medications)
  3. AI assessment (fast LLM call: Claude Haiku or Sonnet): 
     - Is this value clinically critical given this patient's context?
     - What is the likely diagnosis? What is the urgency level?
  4. If urgency = "critical": generate alert card
  5. Deliver: 
     - EHR In-Basket notification (FHIR Task)
     - Push notification to physician mobile (if critical)
     - Escalate to charge nurse if not acknowledged in 15 minutes
  6. Commit Kafka offset after successful delivery

Why event-driven (not polling):

  • 5,000 lab results/day = 3.5/minute average, 50/minute peak
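  • Polling every 30 seconds adds up to 30 seconds of delay before processing even starts, which consumes half of the 60-second SLA and leaves little margin at peak
  • Kafka events are published the moment the lab result is posted → processing starts almost immediately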

AI assessment design:

  • Efficient model (Claude Haiku): fast, low-cost for the high-volume assessment step
  • Two-step classification: (1) is this result outside normal range? (2) given this patient's context, is this clinically significant?
  • Clinical context: patient's baseline values (important: a creatinine of 2.0 is critical for some patients, normal for others with CKD)
  • Output: {urgency: critical|high|low, clinical_narrative: string, suggested_action: string}

Idempotency:

  • Lab result events may be delivered more than once (HL7 redelivery, Kafka at-least-once)
  • Consumer dedup cache: check processed:{lab_result_id} before processing
  • Alert dedup: don't send duplicate alert if same lab result ID already alerted
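The dedup check can be sketched as below, with an in-memory dictionary standing in for a shared cache with key expiry. `DedupCache`, `handle_event`, and the 24-hour TTL are hypothetical. The key is marked only after delivery succeeds, so a crash before delivery leads to a retry rather than a lost alert.

```python
class DedupCache:
    """In-memory stand-in for a shared cache keyed by processed:{lab_result_id}."""

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self.expiry = {}  # key -> expiry timestamp

    def seen(self, lab_result_id, now):
        expires = self.expiry.get(f"processed:{lab_result_id}")
        return expires is not None and expires > now

    def mark(self, lab_result_id, now):
        self.expiry[f"processed:{lab_result_id}"] = now + self.ttl


def handle_event(event, cache, send_alert, now):
    """Process a lab result event at most once per lab_result_id."""
    if cache.seen(event["lab_result_id"], now):
        return "duplicate_skipped"
    send_alert(event)                         # may raise; the key is not yet marked
    cache.mark(event["lab_result_id"], now)   # mark only after successful delivery
    return "alert_sent"
```

Two consumers racing on the same event could still both send; closing that gap needs an atomic check-and-set in the shared cache, which this sketch does not model.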

Escalation workflow:

  • Critical alert delivered to physician (EHR + push)
  • If no acknowledgment in 15 minutes: escalate to charge nurse
  • If no acknowledgment in 30 minutes: escalate to department chief
  • All escalations logged for HIPAA audit and quality review
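The escalation ladder reduces to a lookup on minutes since the alert. A minimal sketch, with hypothetical role labels and function name:

```python
# (minutes since alert, role to notify), taken from the workflow above
ESCALATION_STEPS = [
    (0, "physician"),
    (15, "charge_nurse"),
    (30, "department_chief"),
]


def current_escalation_target(minutes_since_alert, acknowledged):
    """Return the highest role that should have been notified, or None once acknowledged."""
    if acknowledged:
        return None
    target = None
    for threshold, role in ESCALATION_STEPS:
        if minutes_since_alert >= threshold:
            target = role
    return target
```

In practice a durable timer or scheduler would drive these transitions; the table only captures the policy.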

Problem 8: Design a Multi-Tenant AI SaaS Platform

Scenario: You are building a B2B AI platform that will be sold to enterprise customers. Each customer gets their own AI features, their own knowledge base, and their own usage budget. Customers must be strictly isolated: one customer's data cannot affect another customer's AI outputs.

Scale parameters:

  • 50 customers at launch, 500 within 18 months
  • Per customer: 1,000–100,000 documents in knowledge base
  • Mix of industries: healthcare, financial services, retail
  • Some healthcare customers require PHI isolation and HIPAA compliance

Solution Framework:

Multi-tenancy architecture decision: tenant isolation levels

Level | Implementation | Cost | Isolation
Silo | Separate stack per customer | High | Complete
Pool (shared, logic-isolated) | Single stack, customer_id column filters | Low | Software-enforced
Bridge (tiered) | Shared compute, dedicated storage | Medium | Data isolated; compute shared

Recommendation: Bridge model, with a shared AI gateway and inference layer and a dedicated vector store and document store per tenant.

text
Shared (all tenants):
  AI Gateway → LLM Providers (Anthropic, Azure)
  Embedding Service (shared pool)
  Kubernetes cluster (namespaced per customer tier)

Per-Tenant (isolated):
  Vector Store: Dedicated Weaviate class (or dedicated pgvector schema)
  Document Store: S3 prefix isolated per tenant (s3://ai-platform/{tenant_id}/docs/)
  Audit Log: Per-tenant CloudWatch log group
  API keys: Per-tenant secrets in Secrets Manager

PHI tenant isolation for healthcare customers:

  • Healthcare customers flagged in tenant config: phi_customer: true
  • PHI customers: dedicated namespace in Kubernetes; no shared pods with non-PHI tenants
  • Inference: PHI customers use self-hosted vLLM or Azure OpenAI (BAA in place)
  • Vector store: dedicated pgvector database instance (not just schema) for PHI isolation

Tenant ID enforcement in all queries:

  • Every vector store query includes AND tenant_id = '{tenant_id}' filter
  • Every document store access uses tenant-prefixed S3 path
  • AI gateway validates JWT tenant_id claim matches the resource being accessed
  • Rate limiting keyed by tenant_id (not just user_id)
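A sketch of tenant enforcement at the query layer follows. The table and column names (`chunks`, `embedding`, `content`) and the function name are hypothetical; `<=>` is pgvector's cosine distance operator. The tenant filter is bound as a parameter rather than interpolated into the SQL string, which also avoids injection through a crafted tenant ID.

```python
def build_chunk_query(jwt_claims, requested_tenant_id, query_embedding, top_k=5):
    """Return (sql, params) for a pgvector search scoped to the caller's tenant."""
    token_tenant = jwt_claims.get("tenant_id")
    if not token_tenant or token_tenant != requested_tenant_id:
        raise PermissionError("tenant_id claim does not match the requested resource")
    sql = (
        "SELECT chunk_id, content FROM chunks "
        "WHERE tenant_id = %(tenant_id)s "
        "ORDER BY embedding <=> %(query_embedding)s::vector "
        "LIMIT %(top_k)s"
    )
    params = {
        "tenant_id": token_tenant,  # always taken from the validated token
        "query_embedding": "[" + ",".join(str(x) for x in query_embedding) + "]",
        "top_k": top_k,
    }
    return sql, params
```

Taking the tenant ID from the token, never from the request body, means a caller cannot widen the filter even if the gateway check is bypassed.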

Knowledge base isolation:

  • Embedding computed at upload time and stored in tenant's isolated vector store
  • No cross-tenant retrieval possible: query always filtered by tenant_id
  • Document ingestion pipeline: tenant_id stamped on every chunk at creation

Problem 9: Design a Fine-Tuning Pipeline for a Clinical LLM

Scenario: A hospital wants to fine-tune a general-purpose LLM on clinical documentation from their EHR to improve clinical note generation quality. Design the end-to-end fine-tuning pipeline from data preparation through deployment.

Scale parameters:

  • 500,000 clinical notes (de-identified) available for training
  • Target: improve clinical note format adherence and terminology accuracy
  • Deployment: self-hosted (data residency; PHI-adjacent training data)
  • Re-training: quarterly or on significant dataset expansion

Solution Framework:

Fine-tuning pipeline stages:

text
Stage 1: Data Preparation
  Raw clinical notes (EHR export)
  → PHI De-identification (BERT-based NER model, not just regex)
  → Quality filtering (remove too-short, incomplete, or obviously erroneous notes)
  → Deduplication: remove near-duplicate notes (embedding similarity > 0.97)
  → Format conversion: {prompt: "[SOAP note request based on encounter context]", completion: "[clinical note]"}
  → Dataset split: 90% train / 5% validation / 5% test (split after deduplication so near-duplicates cannot leak across splits)
  
Stage 2: Fine-Tuning
  Base model: llama-3-8B-Instruct or similar (proprietary base = licensing complexity)
  Method: LoRA (Low-Rank Adaptation), which is efficient and does not modify base weights
  Framework: Hugging Face TRL (SFTTrainer) + Accelerate
  Hardware: 4× A100 80GB (data parallel via Accelerate; an 8B model with LoRA fits on one 80GB GPU, so tensor parallelism is not needed)
  Privacy: DP-SGD (Opacus), epsilon ≤ 8 for training data protection
  Duration: ~10 hours per training run
  
Stage 3: Evaluation
  Automated: ROUGE-L vs. physician-written reference notes; BERTScore; ICD code accuracy
  Human: 50 blind comparisons (base model vs. fine-tuned) by clinical informatics physicians
  Red-teaming: test for memorization (training data extraction)
  Gate: must pass memorization audit before promotion to staging
  
Stage 4: Deployment
  Model Registry: MLflow (model_id, version, training_date, phi_status, eval_scores, approvals)
  Deployment: vLLM serving (INT8 quantization roughly halves weight memory versus FP16, which reduces the GPU count needed for serving)
  Traffic: shadow mode 1 week (log outputs, don't serve to users) → 10% canary → 100%
  Rollback: previous model version remains in registry; re-route vLLM to previous version

PHI handling in fine-tuning:

  • Training data is PHI-adjacent: even "de-identified" data may re-identify via rare combinations
  • Expert determination de-identification (not just Safe Harbor) + DP-SGD training
  • Fine-tuned model classified as "restricted" in model registry: API-only, no weight export, memorization audit required
  • IRB approval may be required for use of clinical notes in model training (consult legal)
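For the Stage 1 data preparation, deduplicating before splitting keeps near-duplicate notes from landing in both the training and test sets. The sketch below is a greedy illustration with a caller-supplied similarity function; `dedup_then_split` is a hypothetical name, and at 500,000 notes the pairwise comparison would need to be replaced by an approximate nearest-neighbor index.

```python
import random


def dedup_then_split(notes, similarity, threshold=0.97, seed=0):
    """Drop near-duplicates, then split 90/5/5 into train, validation, and test."""
    kept = []
    for note in notes:
        # Keep a note only if it is not too similar to anything already kept.
        if all(similarity(note, other) <= threshold for other in kept):
            kept.append(note)
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    rng.shuffle(kept)
    n = len(kept)
    n_train = n * 90 // 100
    n_val = n * 5 // 100
    train = kept[:n_train]
    val = kept[n_train:n_train + n_val]
    test = kept[n_train + n_val:]
    return train, val, test
```

Splitting by patient rather than by note is a further safeguard worth considering, since notes from one patient often share long passages.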

Problem 10: Design an AI-Powered Drug Interaction Check System

Scenario: A hospital pharmacy wants to deploy AI to detect potentially dangerous drug interactions at the point of medication order entry. The system must intercept medication orders, check for interactions, and alert the ordering physician before the order is confirmed.

Scale parameters:

  • 5,000 medication orders per day
  • Peak: 500 orders in a 2-hour window (morning orders)
  • Alert latency: must respond within 3 seconds (integrated into EHR order workflow)
  • Accuracy requirement: zero false negatives for Class A (life-threatening) interactions

Solution Framework:

Architecture: hybrid rule-based + AI

Key insight: This is NOT a pure AI problem. Drug interaction checking has an established reference standard: curated, licensed drug interaction databases (DrugBank, First Databank, Micromedex). The AI layer adds context-awareness (patient-specific risk assessment) on top of the authoritative database check.

text
Medication Order Entry (EHR):
  ↓
CDS Hooks: order-select hook fires
  ↓
Drug Interaction Service (must respond in ≤3s):
  
  Step 1: Deterministic Rule Check (≤100ms)
    Query drug interaction database (licensed, curated data, updated daily)
    Return all Class A (life-threatening) and Class B (serious) interactions
    → If Class A found: immediately return CDS card with STOP alert (no AI needed)
    
  Step 2: AI Contextual Assessment (≤2.5s, only if no Class A interaction)
    Retrieve: active medications, allergies, diagnoses, labs (FHIR, parallel)
    AI prompt (Claude Sonnet): 
      "Given this patient's clinical context [context], the following potential interactions
      were identified: [Class B/C interactions]. Assess clinical significance given this
      specific patient profile and recommend action."
    β†’ Return CDS card with AI-generated contextual guidance and citations

Why NOT use pure AI for Class A interactions:

  • AI can hallucinate or miss interactions
  • Class A interactions (e.g., MAOIs + serotonergic agents) require zero false negatives
  • Rule-based database is the authoritative source; AI adds value for nuanced Class B/C assessment

Latency budget for 3-second SLA:

  • DB query: ≤100ms
  • Parallel FHIR reads (6 resources): ≤500ms with asyncio.gather
  • AI inference (Sonnet, streaming): ≤2s (with 3s total budget)
  • Response assembly + delivery: ≤100ms
  • Total: ≤2.7s → 0.3s buffer
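The budget arithmetic and the parallel FHIR reads can be sketched together. The FHIR call is simulated with `asyncio.sleep`; `read_resource` and `fetch_context` are hypothetical names, and a real client would make HTTP requests.

```python
import asyncio

SLA_MS = 3000
BUDGET_MS = {"db_rule_check": 100, "fhir_reads": 500, "ai_inference": 2000, "assembly": 100}


async def read_resource(name, delay_s):
    await asyncio.sleep(delay_s)  # simulated network round trip
    return name, {"resourceType": name}


async def fetch_context(resources, timeout_s=0.5):
    """Read all resources concurrently; the whole batch must finish within timeout_s."""
    tasks = [read_resource(name, delay) for name, delay in resources.items()]
    results = await asyncio.wait_for(asyncio.gather(*tasks), timeout=timeout_s)
    return dict(results)


total_ms = sum(BUDGET_MS.values())
buffer_ms = SLA_MS - total_ms
context = asyncio.run(fetch_context(
    {"MedicationRequest": 0.02, "AllergyIntolerance": 0.01, "Condition": 0.01}
))
```

Because the reads run concurrently, the FHIR step costs roughly the slowest single read rather than the sum, which is what makes the 500 ms allowance realistic.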

Idempotency:

  • CDS Hooks may fire multiple times for the same order (clinician modifies order details)
  • Cache interaction check results for (patient_id, medications_hash) with 5-minute TTL
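A possible construction of the cache key is shown below. The `ddi:` prefix, the function name, and the truncated SHA-256 digest are assumptions. Sorting the medication codes makes the key independent of order, so a repeated hook call for the same medication set hits the cache.

```python
import hashlib

CACHE_TTL_SECONDS = 5 * 60  # 5-minute TTL from the design above


def interaction_cache_key(patient_id, medication_codes):
    """Build an order-independent key for one patient's medication set."""
    normalized = ",".join(sorted(set(medication_codes)))
    medications_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"ddi:{patient_id}:{medications_hash}"
```

The key contains a patient identifier, so the cache itself belongs inside the PHI data zone.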

Problem 11: Design a Population Health Risk Scoring System

Scenario: A health system wants to identify patients at risk of 30-day hospital readmission to allow proactive outreach. Score all discharged patients nightly, prioritize for care coordination follow-up.

Scale parameters:

  • 300 discharges per day
  • Score must be available within 2 hours of discharge
  • Historical dataset: 500K encounters for model training
  • Integration: care coordination team reviews top-risk patients in care management platform

Solution Framework:

text
Daily Discharge Trigger:
  EHR → Kafka (discharge-events topic) → Discharge Risk Worker
  
Per-Discharge Scoring:
  1. Retrieve structured features (FHIR: Patient, Conditions, Medications, Labs, Procedures)
  2. Feature engineering: age, comorbidity index, LOS, prior admissions, discharge disposition
  3. Risk model inference: ML model (gradient boosting such as XGBoost, preferred over an LLM for tabular data)
  4. High-risk patients (score >0.7): LLM-generated narrative explanation
     "Patient is high-risk due to: [top 3 factors]. Recommended interventions: [care plan]"
  5. Write to care management platform: patient_id, risk_score, risk_tier, key_factors, ai_narrative, model_version

Model approach:
  - Structured tabular features → gradient boosting (AUROC target: >0.80)
  - LLM used for EXPLANATION of predictions, not for prediction itself
  - Why: LLMs are not reliable for structured tabular risk scoring; gradient boosting is superior
  
Batch pipeline (Airflow DAG, nightly):
  - Extract 24h discharges from Snowflake
  - Batch score all patients (gradient boosting model, fast)
  - For top 30% by score: generate LLM narrative explanation
  - Write scores to care management platform
  - Quality check: score distribution within expected range; alert if distribution shifts

Model update cadence:

  • Quarterly retraining on rolling 12-month discharge cohort
  • Run the new model in shadow mode (scoring alongside production without affecting outreach) before it replaces the production model
  • Track: AUROC on holdout set; correlation between risk scores and observed 30-day readmission rates
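
As a rough sketch of the shadow-mode comparison, AUROC can be computed directly from its definition: the probability that a randomly chosen readmitted patient is scored above a randomly chosen non-readmitted one. The function names and the `min_gain` promotion margin below are illustrative assumptions. The pairwise loop is quadratic, so a production pipeline would more likely use a library routine.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def shadow_beats_production(labels: Sequence[int],
                            prod_scores: Sequence[float],
                            shadow_scores: Sequence[float],
                            min_gain: float = 0.01) -> bool:
    """True if the shadow model's AUROC exceeds production by at least min_gain (assumed margin)."""
    return auroc(labels, shadow_scores) >= auroc(labels, prod_scores) + min_gain
```

Both models are scored on the same discharges and the same observed 30-day outcomes, so the comparison cannot run until the readmission window has closed for the cohort.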

Problems 12–20 (Condensed Frameworks)

These additional problems use a condensed format: the key insight and critical design decision for each.

Problem 12 – AI Evaluation Framework for Clinical AI

Key insight: Clinical AI evaluation requires three distinct evaluation types: (1) technical quality (MRR, ROUGE, BERTScore against reference outputs), (2) clinical accuracy (physician expert review of random sample), and (3) safety evaluation (demographic bias across patient subgroups, failure mode analysis). Most teams implement only type 1 and miss the most important clinical safety issues.

Critical decision: Evaluate across demographic subgroups (age, race, insurance status, diagnosis code frequency) because AI models frequently underperform for underrepresented populations.
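
A subgroup check of this kind might look like the sketch below. It assumes each evaluated output has already been labeled correct or incorrect (for example by physician review) and tagged with a subgroup. The function names and the 0.05 maximum gap are hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_subgroup(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """records: (subgroup_label, was_output_correct) pairs from a reviewed sample."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += int(ok)
    return {g: correct[g] / totals[g] for g in totals}


def flag_underperforming(per_group: Dict[str, float],
                         overall: float,
                         max_gap: float = 0.05) -> List[str]:
    """Subgroups whose accuracy trails the overall figure by more than max_gap (assumed)."""
    return sorted(g for g, acc in per_group.items() if overall - acc > max_gap)
```

Small subgroups produce noisy accuracy estimates, so a flagged group is a prompt for closer review rather than proof of bias.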


Problem 13 – Semantic Caching Layer for High-Volume RAG

Key insight: At 100K queries/day, semantic caching reduces LLM costs by 40–60% for enterprise knowledge bases where many users ask semantically similar questions. The critical design decision is threshold selection: a threshold that is too low returns wrong cached responses, while one that is too high misses savings.

Critical decision: For clinical use, threshold ≥ 0.95. Invalidate cache on knowledge base update (event-driven, not TTL-only). Never cache patient-specific queries.
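
The three rules in this decision fit into a small in-memory sketch. It assumes cosine similarity over query embeddings (the source does not name the metric) and a linear scan instead of a vector index. `SemanticCache` and its methods are hypothetical names.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, query_vec: Sequence[float], patient_specific: bool = False) -> Optional[str]:
        if patient_specific:  # never serve patient-specific queries from cache
            return None
        best_answer, best_sim = None, -1.0
        for vec, answer in self._entries:
            sim = cosine(query_vec, vec)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer if best_sim >= self.threshold else None

    def put(self, query_vec: Sequence[float], answer: str, patient_specific: bool = False) -> None:
        if patient_specific:  # never store patient-specific answers
            return
        self._entries.append((list(query_vec), answer))

    def invalidate_all(self) -> None:
        """Call from the knowledge-base update event handler."""
        self._entries.clear()
```

Clearing the whole cache on every update is the bluntest form of event-driven invalidation. Tagging entries with their source document IDs would allow narrower eviction at the cost of more bookkeeping.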


Problem 14 – AI-Powered ICD-10 Coding from Clinical Notes

Key insight: ICD-10 coding from clinical notes is a classification task, not a generation task. The AI reads the clinical note and must assign up to 25 ICD-10 codes per encounter. Hybrid approach: AI generates candidate codes, medical coder reviews and finalizes.

Critical decision: Coder-in-the-loop is required because AI coding accuracy (even at 95%) is not sufficient for claims submission without human review. The AI reduces coder time from 20 minutes to 5 minutes per encounter.
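
Framed as multi-label classification, the AI step reduces to ranking per-code probabilities and handing the coder a capped candidate list. The sketch below assumes a classifier that has already produced a probability per code. `candidate_codes` and the 0.5 threshold are illustrative, and only the limit of 25 codes comes from the text.

```python
from typing import Dict, List


def candidate_codes(probabilities: Dict[str, float],
                    threshold: float = 0.5,
                    max_codes: int = 25) -> List[str]:
    """Return up to max_codes ICD-10 candidates, highest probability first, for coder review."""
    kept = [(p, code) for code, p in probabilities.items() if p >= threshold]
    kept.sort(key=lambda pc: (-pc[0], pc[1]))  # probability descending, then code for stable ties
    return [code for _, code in kept[:max_codes]]
```

The output is a worklist for the medical coder, not a claim. Nothing in this step submits codes.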


Problem 15 – Enterprise LLM Inference at Scale (Self-Hosted)

Key insight: Self-hosted inference at scale requires understanding the GPU capacity model: model weights + KV cache + overhead. The most common capacity mistake is planning for model weights only.

Critical decision: vLLM with continuous batching and prefix caching; tensor parallelism (NVLink) for models >7B; monitor vllm:gpu_cache_usage_perc as the primary capacity metric (not GPU utilization).
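
The capacity model can be worked through with simple arithmetic. For a standard transformer, each cached token stores one key and one value vector per layer, so the KV cache costs 2 × layers × KV heads × head dimension × bytes per element. The model shape and the 10% overhead in the usage note are illustrative assumptions, not figures from this document.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: key + value, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes: float, weight_bytes: float,
                      overhead_fraction: float, per_token_bytes: int) -> int:
    """Tokens that fit in the memory left after weights and a reserved overhead share."""
    usable = gpu_bytes * (1 - overhead_fraction) - weight_bytes
    return max(0, int(usable // per_token_bytes))
```

For an assumed dense 7B-parameter model in 16-bit precision (32 layers, 32 KV heads, head dimension 128), each token costs 524,288 bytes of cache. On an 80 GB GPU with 14 GB of weights and 10% reserved for overhead, that leaves room for roughly 110K cached tokens shared across all concurrent requests. Planning from weights alone would miss that ceiling entirely.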


Problem 16 – AI Agent for Insurance Prior Authorization (End-to-End Automation)

Key insight: Full prior authorization automation (no human) requires AI accuracy high enough that the error rate is acceptable to the clinical and compliance teams. For most organizations, human-in-loop (AI draft + clinical staff approval) is the right first deployment. Full automation can follow as a V2 after 12 months of quality data.

Critical decision: Physician-in-loop for approval; never auto-submit to payer without clinical review. Track PA denial rate before/after AI to measure quality.


Problem 17 – Multi-Modal Clinical AI (Radiology)

Key insight: Radiology AI (chest X-ray reading) is FDA-regulated as a medical device (SaMD). Deploying radiology AI without FDA clearance is a regulatory violation.

Critical decision: Confirm FDA clearance status before deployment. Cleared radiology AI (510(k)) should be treated as a decision support tool: the radiologist reviews AI findings alongside the images and does not rely on AI alone.


Problem 18 – AI for Patient-Facing Healthcare (Patient Portal)

Key insight: Patient-facing healthcare AI has additional constraints beyond clinical AI: health literacy variation (AI must communicate at 6th–8th grade level), language access requirements (Title VI for healthcare organizations serving LEP populations), and escalation protocols (patient distress detection and escalation to human staff).

Critical decision: AI must be configured to escalate to human support for: suicidal ideation, expressions of distress, medical emergency description, requests for urgent clinical advice. Never rely on AI alone for patient safety escalation detection.


Problem 19 – Enterprise AI Cost Management

Key insight: At $100K+/month in AI API spend, token cost optimization is a priority. The largest cost levers: (1) semantic caching (eliminate duplicate LLM calls), (2) model tier routing (use Efficient for simple tasks, Frontier only when needed), (3) prompt compression (reduce context window), (4) output streaming (detect early stop, avoid max_tokens waste).

Critical decision: Instrument every LLM call with token counts before implementing optimizations. Cost optimization without measurement is guesswork.
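
Instrumentation can start as a small ledger that every LLM call reports into. The sketch below is one possible shape. `TokenLedger`, the tier names, and the prices in the usage note are hypothetical, and real token counts would come from the provider's response metadata.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class TokenLedger:
    """Accumulates token counts per (feature, model) and prices them."""

    def __init__(self, prices_per_million: Dict[str, Tuple[float, float]]) -> None:
        # model -> (USD per 1M input tokens, USD per 1M output tokens); values are assumptions
        self.prices = prices_per_million
        self.usage: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])

    def record(self, feature: str, model: str, input_tokens: int, output_tokens: int) -> None:
        entry = self.usage[(feature, model)]
        entry[0] += input_tokens
        entry[1] += output_tokens

    def cost_by_feature(self) -> Dict[str, float]:
        totals: Dict[str, float] = defaultdict(float)
        for (feature, model), (inp, out) in self.usage.items():
            price_in, price_out = self.prices[model]
            totals[feature] += inp / 1e6 * price_in + out / 1e6 * price_out
        return dict(totals)
```

With `TokenLedger({"efficient": (1.0, 2.0)})`, recording 1,000,000 input tokens and 500,000 output tokens for one feature yields a cost of 2.0 USD for that feature. Per-feature totals like these show which of the four levers is worth pulling first.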


Problem 20 – AI Infrastructure Disaster Recovery

Key insight: AI-specific disaster recovery requires: (1) vector store replica with point-in-time restore (knowledge base can be rebuilt but takes hours), (2) model weights backup in S3 with versioning, (3) failover to managed API if self-hosted inference fails. The RTO for clinical AI must be aligned with the clinical workflow dependency. CDS Hooks degradation must be graceful: return empty cards, not 500 errors.

Critical decision: Define RTO and RPO per clinical AI feature before disaster recovery design. A CDS feature with 5-second timeout and empty-card fallback has effective RTO of 0 from the EHR's perspective. A discharge summary service with a 2-hour SLA can tolerate longer recovery.
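
The empty-card fallback can be sketched as a wrapper around whatever produces the cards. `cds_response` and `generate_cards` are hypothetical names, the thread pool is one of several ways to enforce a deadline, and the 5-second default mirrors the timeout mentioned above. The `{"cards": []}` shape is the CDS Hooks response body for "no advice".

```python
import concurrent.futures as cf
from typing import Any, Callable, Dict, List

# Shared worker pool; the size is an arbitrary assumption for this sketch.
_executor = cf.ThreadPoolExecutor(max_workers=4)


def cds_response(generate_cards: Callable[[], List[Dict[str, Any]]],
                 timeout_s: float = 5.0) -> Dict[str, Any]:
    """Return the generated cards, or an empty card list on timeout or any failure."""
    future = _executor.submit(generate_cards)
    try:
        return {"cards": future.result(timeout=timeout_s)}
    except Exception:
        # Covers the timeout and any error raised inside generate_cards.
        # The EHR receives a valid, empty response instead of a 500.
        return {"cards": []}
```

A timed-out worker thread keeps running in the background until it finishes, so the wrapped call should also carry its own request timeout toward the inference backend.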


Quick Reference: Common System Design Failure Modes

When reviewing any AI system design, check for these anti-patterns:

| Anti-pattern | Consequence | Fix |
| --- | --- | --- |
| Synchronous LLM call in critical path, no timeout | One slow LLM call blocks entire user flow | Timeout + fallback response |
| No semantic caching | 10x higher LLM cost than necessary | Semantic cache with appropriate threshold |
| No circuit breaker on LLM API | LLM provider outage = product outage | Circuit breaker + secondary provider |
| No model version pinning | Upgrade breaks prompt format | Pin model version; test upgrades before rollout |
| No evaluation pipeline | Quality regressions go undetected | Golden query evaluation on every deployment |
| PHI in LLM prompt without BAA | HIPAA violation at deployment | Confirm BAA or use self-hosted inference |
| Over-broad FHIR scopes | HIPAA minimum necessary violation | Request only specific resource types needed |
| Logging request bodies for PHI AI | Logging system becomes PHI store | Log metadata only; never request bodies |
| No fallback for AI agent tool failure | Agent crashes on first tool error | Retry with backoff; graceful degradation |
| Fixed-size chunking for clinical guidelines | Recommendations split across chunks | Section-boundary chunking |
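
The circuit-breaker fix for provider outages can be sketched as follows. This is a simplified variant under stated assumptions: it counts consecutive failures, opens for a fixed cool-down, and then lets traffic through again without a separate half-open probe state. The class name, thresholds, and injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Try the primary provider unless the breaker is open; otherwise use the fallback."""
        if self._clock() < self._open_until:
            return fallback()  # breaker open: skip the failing provider entirely
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._open_until = self._clock() + self.reset_after_s
                self._failures = 0
            return fallback()
        self._failures = 0  # any success resets the consecutive-failure count
        return result
```

Here `primary` would wrap the main LLM provider call with its own timeout, and `fallback` would call the secondary provider or return a canned response.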

Further Reading