Multimodal AI
Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate
Trade-offs and Considerations
Multimodal Failure Modes
| Failure Mode | Description | Mitigation |
|---|---|---|
| OCR errors on handwriting | Handwritten clinical text is misread | Have the model mark uncertain text with [?]; require human review for critical values |
| Image resolution degradation | Low-resolution scans produce hallucinated text | Require scans of at least 300 DPI for clinical documents |
| PHI in image metadata | DICOM/EXIF metadata contains PHI | Strip metadata before sending to any API |
| Context token consumption | Large images consume 1,500–2,000 tokens each | Limit to 1–3 images per call; resize if possible |
| Missing clinical context | Image interpreted without clinical context | Always include relevant patient/clinical context in the prompt |
| Confidence calibration | Model may express confidence in hallucinated OCR | Always validate extracted structured data against known formats |
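As a minimal sketch of the first mitigation in the table, the snippet below scans extracted fields for `[?]` markers and decides whether a record needs human review. It assumes the extraction prompt tells the model to append `[?]` to anything it could not read confidently; the field names and the `CRITICAL_FIELDS` set are hypothetical placeholders, not a clinical recommendation.

```python
import re

# Assumed convention: the prompt asks the model to append "[?]" to any
# token it could not read with confidence.
UNCERTAIN = re.compile(r"(\S+)\s*\[\?\]")

# Hypothetical set of fields treated as critical; the real list is a
# clinical decision, not something this sketch can define.
CRITICAL_FIELDS = {"dose", "potassium", "inr"}


def review_flags(fields: dict[str, str]) -> dict[str, list[str]]:
    """Return, per field, the tokens the model marked as uncertain."""
    flags = {}
    for name, text in fields.items():
        hits = UNCERTAIN.findall(text)
        if hits:
            flags[name] = hits
    return flags


def needs_human_review(fields: dict[str, str]) -> bool:
    """True when any critical field contains an uncertain token."""
    return any(name in CRITICAL_FIELDS for name in review_flags(fields))


extracted = {"dose": "Metoprolol 25[?] mg BID", "notes": "Patient stable"}
print(review_flags(extracted), needs_human_review(extracted))
```

Non-critical fields with markers are still reported by `review_flags`, so they can be logged or sampled for quality checks without blocking ingestion.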
Cost of Vision vs. Text
Vision inputs usually cost more than the same content supplied as text, because a page image consumes far more tokens than the words printed on it. Image tokens are billed at the normal input token rate: for Claude, Anthropic's published estimate is roughly width × height / 750 tokens, so a 1,024×1,024 image comes to about 1,400 tokens, while the text of a short lab report may be only a few hundred. For high-volume document processing (thousands of lab reports per day), evaluate whether OCR pre-processing (Tesseract, AWS Textract, Azure Document Intelligence) followed by text-only LLM processing is more cost-effective than direct vision processing.
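The comparison can be roughed out with a few lines of arithmetic. The sketch below is illustrative only: every number in it (token price, OCR price per page, tokens per document, daily volume) is a placeholder assumption, and it ignores prompt and output tokens. The 1,500 tokens per image comes from the low end of the range in the failure-modes table.

```python
def daily_input_cost_usd(
    docs_per_day: int,
    input_tokens_per_doc: int,
    usd_per_million_input_tokens: float,
    extra_usd_per_doc: float = 0.0,
) -> float:
    """Daily input-side cost: LLM input tokens plus any per-document fee (e.g. OCR)."""
    llm_cost = input_tokens_per_doc * usd_per_million_input_tokens / 1_000_000
    return docs_per_day * (llm_cost + extra_usd_per_doc)


# All values below are placeholder assumptions, not real price quotes.
DOCS_PER_DAY = 5_000
PRICE_PER_MTOK = 3.00        # hypothetical LLM input price, USD per million tokens
OCR_FEE_PER_PAGE = 0.0015    # hypothetical OCR price, USD per page
TOKENS_PER_IMAGE = 1_500     # low end of the range quoted in the table above
TOKENS_PER_OCR_TEXT = 400    # hypothetical token count of the OCR'd text

vision = daily_input_cost_usd(DOCS_PER_DAY, TOKENS_PER_IMAGE, PRICE_PER_MTOK)
ocr_then_text = daily_input_cost_usd(
    DOCS_PER_DAY, TOKENS_PER_OCR_TEXT, PRICE_PER_MTOK, OCR_FEE_PER_PAGE
)
print(f"vision: ${vision:.2f}/day, OCR + text: ${ocr_then_text:.2f}/day")
```

Which path comes out cheaper depends entirely on the values plugged in, so the useful output is the break-even point for a given deployment rather than the specific figures here.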
Comparison Table
| Approach | Document Types | Accuracy | Cost | Latency | PHI Risk |
|---|---|---|---|---|---|
| Vision LLM (Claude) | All image formats | High for printed, medium for handwritten | Medium | 2–5s | Image is sent to the API; BAA required |
| OCR + text LLM | Printed documents | Very high for printed | Low (separate OCR step) | 3–8s | Two API calls; each vendor needs a BAA |
| AWS Textract | Forms, tables, PDFs | Very High for structured | Medium | 1–3s | AWS HIPAA eligible |
| Azure Document Intelligence | Complex forms, custom models | Very High | Medium | 2–4s | Azure HIPAA eligible |
| Vision LLM (direct) | Handwriting | Medium | Medium | 2–5s | BAA required |
Interview Questions
Q1: A hospital wants to use AI to process scanned physician referral letters. What architecture would you recommend?
Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect / FDE
Answer Framework:
The architecture depends on two key properties of these referral letters: their format variability (typed vs. handwritten, structured vs. unstructured) and their volume (how many per day). As an illustrative example, assume a mix of typed and handwritten letters at moderate daily volume.
For typed, structured referrals: a traditional OCR pipeline (AWS Textract for structured form extraction) followed by a text LLM for structured data extraction is more cost-effective and reliable than direct vision LLM processing.
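For handwritten and unstructured referrals: vision LLMs (Claude vision) often outperform traditional OCR because they interpret the page in clinical context rather than character by character. Tesseract can at best return the characters of a handwritten "CC: f/u 2wk"; a vision LLM can expand it to "chief complaint: follow-up in 2 weeks." Because "cc" can also mean "carbon copy" in a letter, ambiguous abbreviations should still be checked against the surrounding text.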
Recommended architecture: an intelligent router that detects whether an incoming referral is typed/structured (→ Textract + text LLM) or handwritten/unstructured (→ vision LLM directly). This hybrid approach minimizes cost while maintaining coverage.
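The routing decision can be sketched as a small pure function. This is a minimal illustration, assuming an upstream classifier already produces a per-document profile; `ReferralProfile`, its fields, and the 0.2 handwriting threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    OCR_THEN_TEXT_LLM = "textract + text LLM"
    VISION_LLM = "vision LLM"


@dataclass
class ReferralProfile:
    """Hypothetical output of a cheap upstream document classifier."""
    handwritten_fraction: float  # share of page area judged handwritten, 0.0 to 1.0
    has_form_layout: bool        # True when the page matches a known form layout


def choose_route(profile: ReferralProfile, handwriting_threshold: float = 0.2) -> Route:
    """Send handwritten or unstructured pages to the vision LLM, the rest to OCR."""
    if profile.handwritten_fraction > handwriting_threshold or not profile.has_form_layout:
        return Route.VISION_LLM
    return Route.OCR_THEN_TEXT_LLM


print(choose_route(ReferralProfile(handwritten_fraction=0.0, has_form_layout=True)))
print(choose_route(ReferralProfile(handwritten_fraction=0.6, has_form_layout=True)))
```

Keeping the router free of I/O makes the threshold easy to tune against a labeled sample of referrals before any pipeline code depends on it.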
In all cases: PHI handling is the primary architectural constraint. These documents contain patient names, DOBs, and MRNs, all of which are PHI. Every API call must go through a service covered by a HIPAA BAA inside the hospital's own cloud tenant (for example, Amazon Bedrock or Textract on AWS, or Azure Document Intelligence on Azure), and image content must not be logged persistently without de-identification.
Output validation: extracted data is validated against known formats (dates, provider NPI numbers, ICD codes) before ingestion into Epic. Extractions with a confidence score below 0.85 are routed to a human review queue rather than ingested automatically.
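A minimal sketch of that validation gate follows. It assumes dates arrive in ISO format, uses the standard NPI check-digit rule (Luhn over the digits prefixed with 80840), and checks only the shape of an ICD-10 code, not whether the code exists. The record field names and the `disposition` labels are hypothetical.

```python
import re
from datetime import date

REVIEW_THRESHOLD = 0.85  # threshold quoted in the answer above

# Shape check only (assumption): letter, digit, alphanumeric, optional ".xxxx".
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def valid_iso_date(value: str) -> bool:
    """True when the value parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def valid_npi(npi: str) -> bool:
    """Luhn check over '80840' + the 10-digit NPI."""
    if not re.fullmatch(r"\d{10}", npi):
        return False
    total = 0
    for i, ch in enumerate(reversed("80840" + npi)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def disposition(record: dict[str, str], confidence: float) -> str:
    """Return 'auto_ingest' only when every check passes and confidence is high."""
    checks = [
        valid_iso_date(record["referral_date"]),
        valid_npi(record["provider_npi"]),
        bool(ICD10_SHAPE.match(record["icd10"])),
    ]
    if confidence < REVIEW_THRESHOLD or not all(checks):
        return "human_review"
    return "auto_ingest"


record = {"referral_date": "2024-03-01", "provider_npi": "1234567893", "icd10": "I10"}
print(disposition(record, 0.93), disposition(record, 0.80))
```

Format failures and low confidence land in the same review queue here; a production system might separate them so reviewers can see why a record was held.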