Multimodal AI

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate


Executive Summary

Multimodal AI extends LLMs beyond text to images, documents, audio, and video. These capabilities unlock significant value in healthcare, where a large fraction of clinically relevant information exists as images (radiology, pathology, dermatology), scanned documents (medical records, consent forms), and structured PDFs (lab reports, discharge summaries). Understanding multimodal capabilities, their limitations, and their implications for enterprise architecture is increasingly essential for AI architects working in document-heavy and image-rich domains.


Trade-offs and Considerations

Multimodal Failure Modes

| Failure Mode | Description | Mitigation |
|---|---|---|
| OCR errors on handwriting | Handwritten clinical text is misread | Flag uncertain text with [?] markers; require human review for critical values |
| Image resolution degradation | Low-resolution scans produce hallucinated text | Specify a minimum scan resolution (300 DPI for clinical documents) |
| PHI in image metadata | DICOM/EXIF metadata contains PHI | Strip metadata before sending to any API |
| Context token consumption | Large images consume 1,500–2,000 tokens each | Limit to 1–3 images per call; resize if possible |
| Missing clinical context | Image is interpreted without clinical context | Always include relevant patient/clinical context in the prompt |
| Confidence calibration | Model may express confidence in hallucinated OCR | Always validate extracted structured data against known formats |
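Several of these mitigations can be enforced as a pre-flight check before any image leaves the hospital network. The following is a minimal sketch under stated assumptions: `ScanInfo` and `preflight` are hypothetical names, and the DPI and metadata flags are assumed to be populated by an upstream scanning or ingestion step that is not shown.

```python
from dataclasses import dataclass

MIN_DPI = 300            # minimum scan resolution from the table above
MAX_IMAGES_PER_CALL = 3  # upper end of the "1-3 images per call" guidance


@dataclass
class ScanInfo:
    """Hypothetical descriptor for one scanned page (not a real library type)."""
    name: str
    dpi: int
    has_metadata: bool  # True if DICOM/EXIF metadata has NOT been stripped yet


def preflight(scans: list[ScanInfo]) -> list[str]:
    """Return a list of problems; an empty list means the batch may be sent."""
    problems = []
    if len(scans) > MAX_IMAGES_PER_CALL:
        problems.append(
            f"{len(scans)} images exceeds the limit of {MAX_IMAGES_PER_CALL} per call"
        )
    for scan in scans:
        if scan.dpi < MIN_DPI:
            problems.append(f"{scan.name}: {scan.dpi} DPI is below {MIN_DPI}")
        if scan.has_metadata:
            problems.append(f"{scan.name}: metadata must be stripped before upload")
    return problems


if __name__ == "__main__":
    batch = [ScanInfo("referral_p1.png", 300, False), ScanInfo("referral_p2.png", 150, True)]
    for problem in preflight(batch):
        print(problem)
```

Returning the full list of problems, rather than failing on the first one, lets an operator fix a rejected batch in a single pass.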

Cost of Vision vs. Text

Vision inputs usually cost more than the equivalent text, not because of a higher per-token rate but because an image consumes far more tokens than the text it contains. A 1,024×1,024 image costs on the order of 1,400 to 1,600 tokens (the exact count depends on the provider's image tokenization) and is billed at text input token rates, whereas the text of a one-page lab report may be only a few hundred tokens. For high-volume document processing (thousands of lab reports per day), evaluate whether OCR pre-processing (Tesseract, AWS Textract, Azure Document Intelligence) followed by text-only LLM processing is more cost-effective than direct vision processing.
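The comparison comes down to simple arithmetic once the inputs are known. The sketch below is illustrative only: the function names are hypothetical, every price and token count in the usage example is a made-up placeholder rather than a quoted vendor rate, and output-token costs are ignored for simplicity.

```python
def daily_vision_cost(docs: int, images_per_doc: int, tokens_per_image: int,
                      usd_per_mtok: float) -> float:
    """Input-token cost of sending every page image directly to a vision LLM."""
    return docs * images_per_doc * tokens_per_image * usd_per_mtok / 1_000_000


def daily_ocr_text_cost(docs: int, pages_per_doc: int, usd_per_ocr_page: float,
                        text_tokens_per_doc: int, usd_per_mtok: float) -> float:
    """Cost of OCR per page plus the input-token cost of the extracted text."""
    ocr = docs * pages_per_doc * usd_per_ocr_page
    llm = docs * text_tokens_per_doc * usd_per_mtok / 1_000_000
    return ocr + llm


if __name__ == "__main__":
    # Placeholder inputs: replace with measured volumes and current vendor pricing.
    vision = daily_vision_cost(docs=5000, images_per_doc=1,
                               tokens_per_image=1568, usd_per_mtok=3.0)
    hybrid = daily_ocr_text_cost(docs=5000, pages_per_doc=1, usd_per_ocr_page=0.0015,
                                 text_tokens_per_doc=400, usd_per_mtok=3.0)
    print(f"vision: ${vision:.2f}/day, OCR + text: ${hybrid:.2f}/day")
```

Which path is cheaper depends entirely on the inputs, and cost is only one axis: accuracy on handwriting and the need for human review also affect the decision.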


Interview Questions

Q1: A hospital wants to use AI to process scanned physician referral letters. What architecture would you recommend?

Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect / FDE

Answer Framework:

The architecture depends on two key properties of these referral letters: their format variability (typed vs. handwritten, structured vs. unstructured) and their volume (how many per day). As an illustrative example, assume a mix of typed and handwritten letters at moderate daily volume.

For typed, structured referrals: a traditional OCR pipeline (AWS Textract for structured form extraction) followed by a text LLM for structured data extraction is more cost-effective and reliable than direct vision LLM processing.

For handwritten and unstructured referrals: vision LLMs (such as Claude vision) tend to outperform traditional OCR because they interpret characters in clinical context rather than in isolation. Tesseract can at best transcribe a handwritten "cc: f/u 2wk" character by character, whereas a vision LLM can expand it to "chief complaint: follow-up in 2 weeks" (in clinical shorthand, "cc" usually means chief complaint, not carbon copy).

Recommended architecture: an intelligent router that detects whether an incoming referral is typed/structured (→ Textract + text LLM) or handwritten/unstructured (→ vision LLM directly). This hybrid approach minimizes cost while maintaining coverage.
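The routing rule itself is small once document traits are known. The sketch below assumes a hypothetical upstream classifier has already produced the two boolean traits; `ReferralTraits`, `Pipeline`, and `route` are illustrative names, and sending typed-but-unstructured letters to the vision path is an assumption rather than something the text above specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Pipeline(Enum):
    OCR_THEN_TEXT_LLM = "ocr_then_text_llm"  # e.g. Textract followed by a text LLM
    VISION_LLM = "vision_llm"                # page image sent directly to a vision LLM


@dataclass
class ReferralTraits:
    """Hypothetical output of an upstream document classifier (not shown)."""
    is_handwritten: bool
    is_structured_form: bool


def route(traits: ReferralTraits) -> Pipeline:
    """Send only typed, structured referrals down the cheaper OCR path."""
    if not traits.is_handwritten and traits.is_structured_form:
        return Pipeline.OCR_THEN_TEXT_LLM
    # Assumption: anything handwritten or unstructured falls back to vision.
    return Pipeline.VISION_LLM


if __name__ == "__main__":
    print(route(ReferralTraits(is_handwritten=False, is_structured_form=True)))
    print(route(ReferralTraits(is_handwritten=True, is_structured_form=False)))
```

Defaulting ambiguous documents to the vision path trades some cost for coverage; a deployment with tight budgets might make the opposite choice and escalate only on OCR failure.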

In all cases, PHI handling is the primary architectural constraint. These documents contain patient names, DOBs, and MRNs, all of which are PHI. API calls must stay within the hospital's AWS or Azure tenant (for example, Amazon Bedrock or Azure Document Intelligence), where a HIPAA BAA is in place, and image content must not be logged persistently unless it has been de-identified first.

Output validation: extracted data is validated against known formats (dates, provider NPI numbers, ICD codes) before ingestion into Epic. Extractions with a confidence score below 0.85 are routed to a human review queue rather than ingested automatically.
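A minimal sketch of this validation gate follows. It rests on several assumptions: the field names (`referral_date`, `npi`, `icd10`) are hypothetical, dates are assumed to be normalized to ISO format upstream, the ICD-10 check only tests the general shape of a code (it does not confirm the code exists), and the confidence score is assumed to come from the extraction step, which is not shown. The NPI check uses the standard Luhn check digit computed over the NPI with the 80840 prefix.

```python
import re
from datetime import date

REVIEW_THRESHOLD = 0.85  # threshold from the text; tune on labeled samples
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def is_valid_npi(npi: str) -> bool:
    """Luhn check over '80840' + NPI, the check-digit scheme used for NPIs."""
    if not re.fullmatch(r"[0-9]{10}", npi):
        return False
    total = 0
    for i, ch in enumerate(reversed("80840" + npi)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def is_iso_date(value: str) -> bool:
    """True if value parses as a real calendar date in ISO format."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def disposition(fields: dict, confidence: float) -> str:
    """Return 'auto_ingest' only if every format check and the threshold pass."""
    checks = (
        is_iso_date(fields.get("referral_date", "")),
        is_valid_npi(fields.get("npi", "")),
        bool(ICD10_SHAPE.match(fields.get("icd10", ""))),
    )
    if all(checks) and confidence >= REVIEW_THRESHOLD:
        return "auto_ingest"
    return "human_review"
```

Format failures go to human review regardless of the confidence score, because a model can report high confidence in hallucinated text.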


Further Reading

In This Repository: