Multimodal AI

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate

Trade-offs and Considerations

Multimodal Failure Modes

Failure Mode	Description	Mitigation
OCR errors on handwriting	Handwritten clinical text misread	Instruct the model to mark uncertain text with [?]; require human review for critical values
Image resolution degradation	Low-resolution scans produce hallucinated text	Require a minimum scan resolution (300 DPI for clinical documents)
PHI in image metadata	DICOM/EXIF metadata contains PHI	Strip metadata before sending to any API
Context token consumption	Large images consume 1,500–2,000 tokens each	Limit to 1–3 images per call; resize if possible
Missing clinical context	Image interpreted without clinical context	Always include relevant patient/clinical context in the prompt
Confidence calibration	Model may present hallucinated OCR output with high confidence	Always validate extracted structured data against known formats
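
The first mitigation in the table (uncertainty markers plus human review for critical values) can be sketched as a small post-processing step. This is a minimal illustration, not a prescribed implementation: it assumes the prompt tells the model to write `[?]` wherever text is illegible, and the field names and the `CRITICAL_FIELDS` set are hypothetical.

```python
# Assumption: the extraction prompt instructs the model to emit "[?]"
# in place of any character or word it cannot read with confidence.
UNCERTAIN_MARKER = "[?]"

# Hypothetical set of fields treated as critical in this sketch.
CRITICAL_FIELDS = {"medication_dose", "lab_value", "allergy"}


def review_reasons(fields: dict[str, str]) -> list[str]:
    """Return the reasons an extraction needs human review (empty if none)."""
    reasons = []
    for name, value in fields.items():
        # Any field containing the marker was not read reliably.
        if UNCERTAIN_MARKER in value:
            reasons.append(f"{name}: uncertain transcription")
        # Critical values are always reviewed, even when they look clean.
        if name in CRITICAL_FIELDS:
            reasons.append(f"{name}: critical value")
    return reasons


extracted = {
    "patient_name": "Jane D[?]e",
    "medication_dose": "5 mg",
    "clinic": "Cardiology",
}
for reason in review_reasons(extracted):
    print(reason)
```

An empty list means the record can continue to format validation; anything else goes to a reviewer together with the listed reasons.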

Cost of Vision vs. Text

Vision inputs usually cost more than sending the same content as text. Image tokens are billed at the standard input token rate, so the extra cost comes from token volume: a 1,024×1,024 image consumes roughly 1,400 tokens under Anthropic's published approximation (width × height ÷ 750), however little text the page contains, while the extracted text of the same page is often shorter. For high-volume document processing (thousands of lab reports per day), evaluate whether OCR pre-processing (Tesseract, AWS Textract, Azure Document Intelligence) followed by text-only LLM processing is more cost-effective than direct vision processing.
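
A back-of-the-envelope comparison makes the trade-off concrete. The sketch below assumes the approximation tokens ≈ width × height ÷ 750 from Anthropic's vision documentation (actual counts depend on the model and on automatic resizing). The per-token price, per-page OCR fee, document volume, and text token count are placeholder values, not real pricing.

```python
import math


def image_tokens(width_px: int, height_px: int) -> int:
    """Approximate image token count (assumed formula: pixels / 750)."""
    return math.ceil(width_px * height_px / 750)


def daily_cost_usd(docs_per_day: int, tokens_per_doc: int,
                   usd_per_million_tokens: float,
                   extra_usd_per_doc: float = 0.0) -> float:
    """Daily input cost: LLM tokens plus any per-document fee (e.g. OCR)."""
    token_cost = docs_per_day * tokens_per_doc * usd_per_million_tokens / 1_000_000
    return token_cost + docs_per_day * extra_usd_per_doc


# Placeholder inputs for illustration only.
DOCS_PER_DAY = 5_000
PRICE_PER_MTOK = 3.0   # hypothetical USD per million input tokens
OCR_FEE = 0.0015       # hypothetical USD per page for the OCR service
TEXT_TOKENS = 400      # hypothetical token count of the OCR'd page text

vision = daily_cost_usd(DOCS_PER_DAY, image_tokens(1024, 1024), PRICE_PER_MTOK)
ocr_text = daily_cost_usd(DOCS_PER_DAY, TEXT_TOKENS, PRICE_PER_MTOK, OCR_FEE)
print(f"vision: ${vision:.2f}/day, OCR + text: ${ocr_text:.2f}/day")
```

Only input cost is modeled here. Output tokens, retries, and the engineering cost of running a separate OCR step would need to be added for a real comparison.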

Comparison Table

Approach	Document Types	Accuracy	Cost	Latency	PHI Risk
Vision LLM (Claude)	All image formats	High for printed, Medium for handwritten	Medium	2–5s	Image is sent to an external API; BAA required
OCR + text LLM	Printed documents	Very High for printed	Low (separate OCR step)	3–8s	Two API calls, so PHI passes through two services
AWS Textract	Forms, tables, PDFs	Very High for structured	Medium	1–3s	AWS HIPAA eligible
Azure Document Intelligence	Complex forms, custom models	Very High	Medium	2–4s	Azure HIPAA eligible
Vision LLM (direct)	Handwriting	Medium	Medium	2–5s	BAA required

Interview Questions

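Q1: A hospital wants to use AI to process scanned physician referral letters. What architecture would you recommend?

Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect / FDE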

Answer Framework:

The architecture depends on two key properties of these referral letters: their format variability (typed vs. handwritten, structured vs. unstructured) and their volume (how many per day). As an illustrative example, assume a mix of typed and handwritten letters at moderate daily volume.

For typed, structured referrals: a traditional OCR pipeline (AWS Textract for structured form extraction) followed by a text LLM for structured data extraction is more cost-effective and reliable than direct vision LLM processing.

For handwritten and unstructured referrals: vision LLMs (Claude vision) generally outperform traditional OCR because they interpret clinical context, not just characters. Tesseract can at best transcribe a handwritten "cc: f/u 2wk" character by character; Claude can expand it to "chief complaint: follow-up in 2 weeks."

Recommended architecture: an intelligent router that detects whether an incoming referral is typed/structured (→ Textract + text LLM) or handwritten/unstructured (→ vision LLM directly). This hybrid approach minimizes cost while maintaining coverage.
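
The routing decision can be sketched as follows. The two signals (`handwriting_ratio`, `has_form_layout`) and the 0.2 threshold are hypothetical; in practice they would come from a lightweight classifier or layout detector that runs before either pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    OCR_TEXT_LLM = "textract + text LLM"
    VISION_LLM = "vision LLM"


@dataclass
class ReferralScan:
    doc_id: str
    handwriting_ratio: float  # hypothetical: share of the page detected as handwriting
    has_form_layout: bool     # hypothetical: layout detector found form fields or tables


def route(scan: ReferralScan, handwriting_threshold: float = 0.2) -> Route:
    """Send typed, structured scans to OCR; everything else to the vision LLM."""
    if scan.handwriting_ratio >= handwriting_threshold or not scan.has_form_layout:
        return Route.VISION_LLM
    return Route.OCR_TEXT_LLM


typed_form = ReferralScan("ref-001", handwriting_ratio=0.02, has_form_layout=True)
handwritten_note = ReferralScan("ref-002", handwriting_ratio=0.75, has_form_layout=False)
print(typed_form.doc_id, "->", route(typed_form).value)
print(handwritten_note.doc_id, "->", route(handwritten_note).value)
```

Defaulting ambiguous documents to the vision path trades some cost for coverage. A deployment that is more cost-sensitive could invert that default and rely on the downstream review queue instead.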

In all cases, PHI handling is the primary architectural constraint. These documents contain patient names, DOBs, and MRNs, all of which are PHI. The API call must be made within the hospital's AWS or Azure tenant (for example, Amazon Bedrock or Azure Document Intelligence) where a HIPAA BAA is in place, and image content must not be logged persistently without de-identification.

Output validation: extracted data is validated against known formats (dates, provider NPI numbers, ICD codes) before ingestion into Epic. Extractions with a confidence score below 0.85 go to a human review queue instead of being ingested automatically.
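
A minimal sketch of this validation gate is shown below. The record field names and the source of the confidence score are assumptions, and the ICD-10 pattern checks only the general shape of a code, not whether the code exists. The NPI check uses the standard Luhn check digit computed over the prefix 80840 followed by the 10-digit NPI.

```python
import re
from datetime import datetime

REVIEW_THRESHOLD = 0.85  # threshold stated in the text above

# Simplified shape check only: letter, digit, alphanumeric, optional suffix.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def is_valid_npi(npi: str) -> bool:
    """Luhn check over '80840' + NPI, as used for NPI check digits."""
    if not re.fullmatch(r"[0-9]{10}", npi):
        return False
    total = 0
    for i, ch in enumerate(reversed("80840" + npi)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def is_valid_date(value: str) -> bool:
    """Accept only real calendar dates in ISO format (assumed output format)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def disposition(record: dict, confidence: float) -> str:
    """Return 'auto_ingest' or 'human_review' for one extracted referral."""
    checks = [
        is_valid_date(record.get("referral_date", "")),
        is_valid_npi(record.get("provider_npi", "")),
        all(ICD10_SHAPE.match(code) for code in record.get("icd10_codes", [])),
    ]
    if confidence < REVIEW_THRESHOLD or not all(checks):
        return "human_review"
    return "auto_ingest"


record = {
    "referral_date": "2026-06-30",
    "provider_npi": "1234567893",
    "icd10_codes": ["I10", "E11.9"],
}
print(disposition(record, confidence=0.93))
```

Running format checks in addition to the confidence threshold matters because, as noted in the failure-mode table, a model can report high confidence in hallucinated values.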

