Multimodal AI
Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate
Executive Summary
Multimodal AI extends LLMs beyond text to process images, documents, audio, and video — capabilities that unlock significant value in healthcare, where a large fraction of clinically relevant information exists as images (radiology, pathology, dermatology), scanned documents (medical records, consent forms), and structured PDFs (lab reports, discharge summaries). Understanding multimodal capabilities, their limitations, and their enterprise architectural implications is increasingly essential for AI architects working in document-heavy and image-rich domains.
Prerequisites
- LLM Fundamentals — Baseline model understanding
- Prompt Engineering — Prompting patterns apply to multimodal inputs
Core Concepts
Why Multimodal Matters for Enterprise AI
Enterprise data is not primarily text. An approximate breakdown of a hospital system's information landscape:
- Clinical notes and documentation: ~30% (text)
- Radiology images (CT, MRI, X-ray, ultrasound): ~25%
- Pathology slides: ~15%
- Scanned paper documents (consent forms, referral letters): ~15%
- Structured PDFs (lab reports, prior auth forms): ~10%
- Other (photos, procedural recordings): ~5%
A purely text-based AI system can only address 30% of the clinical information landscape. Multimodal AI extends coverage to scanned documents, structured PDFs, lab reports, and — when combined with specialized medical imaging AI — radiology and pathology content.
The Three Categories of Multimodal in Enterprise
Category 1: Document Understanding Processing PDFs, scanned documents, and structured forms — the most immediately deployable multimodal capability in enterprise settings. Use cases: extracting structured data from insurance EOB documents, processing scanned referral letters, parsing lab report PDFs, reading prior authorization forms.
Category 2: Medical Image Analysis Analyzing clinical images (chest X-rays, CT scans, pathology slides, dermatology photos). This category has specific FDA regulatory implications — AI that influences diagnostic decisions for medical images may be regulated as Software as a Medical Device (SaMD).
Category 3: Document + Context Fusion Combining visual document understanding with retrieved textual knowledge — e.g., processing a patient's radiology image alongside retrieved clinical guidelines to provide contextually grounded interpretation. This is the most architecturally complex and highest-value multimodal application.
Architecture
Multimodal Input Architecture
How Vision LLMs Process Images
Vision LLMs process images by encoding them into a sequence of visual tokens using a vision encoder (typically a ViT — Vision Transformer). These visual tokens are concatenated with the text tokens in the model's context and processed jointly through the transformer's attention layers.
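As a rough illustration of where visual tokens come from, the sketch below counts the patch tokens a plain Vision Transformer would produce and adds them to a text prompt's length. The 16-pixel patch size matches the original ViT-B/16 configuration and is an assumption here; production vision encoders resize, tile, or pool images in ways this does not model, and the function names are hypothetical.

```python
def vit_patch_token_count(width: int, height: int, patch_size: int = 16) -> int:
    """Patch tokens for a plain ViT (no class token, no pooling, no resizing)."""
    if width % patch_size or height % patch_size:
        raise ValueError("image dimensions must be multiples of the patch size")
    return (width // patch_size) * (height // patch_size)


def joint_sequence_length(image_tokens: int, text_tokens: int) -> int:
    """Visual and text tokens share one sequence, so their lengths simply add."""
    return image_tokens + text_tokens


image_tokens = vit_patch_token_count(224, 224)  # 14 x 14 patches
print(image_tokens)                             # 196
print(joint_sequence_length(image_tokens, 50))  # 246
```

The point of the arithmetic is that the image side of the sequence grows with pixel area, not with how much information the image actually carries.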
Key implication: images consume context window tokens. A 1024×1024 image encoded at standard resolution typically consumes 1,568–2,048 tokens (the exact count varies by model and resolution). At that rate a 200K context window could hold roughly 100 to 125 such images, but doing so would consume the entire context with no room for retrieved knowledge or conversation history. In practice, multimodal clinical applications should limit image inclusion to 1–3 images per call and use targeted retrieval for any additional context.
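The budgeting arithmetic can be made explicit with a small helper. This is a minimal sketch: the per-image token figures are the planning numbers quoted in the text rather than measured values, and `max_images_in_context` is a hypothetical name.

```python
def max_images_in_context(
    context_window: int,
    tokens_per_image: int,
    reserved_tokens: int = 0,
) -> int:
    """How many images fit once tokens for prompts, retrieval, and output are set aside."""
    if tokens_per_image <= 0:
        raise ValueError("tokens_per_image must be positive")
    available = context_window - reserved_tokens
    return max(available // tokens_per_image, 0)


print(max_images_in_context(200_000, 2_048))  # 97
print(max_images_in_context(200_000, 1_568))  # 127
# Reserving 150K tokens for guidelines, history, and the response:
print(max_images_in_context(200_000, 2_048, reserved_tokens=150_000))  # 24
```

Even with most of the window reserved, the ceiling is far above the 1–3 images recommended per call; the practical limit is attention quality and cost, not raw capacity.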
Implementation Patterns
Pattern 1: Processing a Clinical Lab Report PDF
"""
Extract structured data from a clinical lab report PDF.

Context: HMS scenario. An external reference lab sends PDF reports rather
than HL7/FHIR messages; the extracted structured data is ingested into the
Epic EHR as manual results.

Dependencies:
    anthropic>=0.49.0
    pillow>=10.2.0
    python-dotenv>=1.0.0
    pdf2image>=1.16.0 (requires poppler)
"""
import base64
import io
import json
from pathlib import Path
from typing import Any

import anthropic


def pdf_page_to_base64(pdf_path: Path, page_number: int = 0) -> str:
    """Convert one PDF page (0-indexed) to a base64-encoded PNG for LLM processing."""
    try:
        from pdf2image import convert_from_path
    except ImportError as exc:
        raise ImportError(
            "Install pdf2image: pip install pdf2image (requires poppler)"
        ) from exc

    # pdf2image numbers pages from 1, so shift the 0-indexed argument.
    images = convert_from_path(
        str(pdf_path), first_page=page_number + 1, last_page=page_number + 1
    )
    if not images:
        raise ValueError(f"No page {page_number} in {pdf_path}")

    buffer = io.BytesIO()
    images[0].save(buffer, format="PNG")
    return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")


def extract_lab_results(
    pdf_path: Path,
    client: anthropic.Anthropic,
) -> dict[str, Any]:
    """
    Extract structured lab results from a PDF lab report.
    Returns structured data suitable for EHR ingestion.

    Note: This function processes PHI (lab results are PHI).
    Ensure the Anthropic API call is made within the HIPAA BAA boundary.
    PHI is NOT logged. Results are not cached.
    """
    # Only the first page is sent; multi-page reports need one call per page.
    image_data = pdf_page_to_base64(pdf_path, page_number=0)
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=2048,
        system=(
            "Extract laboratory results from medical lab report images. "
            "Return ONLY valid JSON with this structure:\n"
            "{\n"
            '  "patient_name": "string",\n'
            '  "dob": "YYYY-MM-DD",\n'
            '  "collection_date": "YYYY-MM-DD",\n'
            '  "ordering_provider": "string",\n'
            '  "lab_results": [\n'
            '    {"test_name": "...", "result": "...", "unit": "...", '
            '"reference_range": "...", "flag": "H|L|C|null"}\n'
            "  ],\n"
            '  "critical_values": ["list of tests with critical flags"]\n'
            "}\n"
            "Flag meanings: H=High, L=Low, C=Critical. "
            "If a field is not visible, use null."
        ),
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all laboratory results from this lab report.",
                },
            ],
        }],
    )
    # json.loads raises json.JSONDecodeError if the model returns anything other
    # than bare JSON; callers should send that case to human review.
    raw_output = response.content[0].text.strip()
    return json.loads(raw_output)

Pattern 2: Clinical Document OCR and Summarization
"""
Process a scanned clinical document (e.g., a handwritten referral letter).

Context: HMS receives scanned referral letters from community physicians.
This pipeline converts the scan to structured text and extracts key clinical data.
"""
import base64
import json
from dataclasses import dataclass
from pathlib import Path

import anthropic


@dataclass
class ReferralExtract:
    referring_provider: str | None
    referring_facility: str | None
    reason_for_referral: str | None
    relevant_history: str | None
    current_medications: list[str]
    urgency: str  # "routine" | "urgent" | "emergent"
    requested_specialty: str | None
    patient_summary: str


def process_scanned_referral(
    image_path: Path,
    client: anthropic.Anthropic,
) -> ReferralExtract:
    """
    Extract structured data from a scanned physician referral letter.
    Handles handwritten and typed formats.
    """
    image_bytes = image_path.read_bytes()
    image_data = base64.standard_b64encode(image_bytes).decode("utf-8")

    # Map the file extension to a media type. Unknown extensions (for example
    # TIFF scans) are rejected instead of being mislabeled as JPEG.
    media_type_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}
    suffix = image_path.suffix.lower()
    if suffix not in media_type_map:
        raise ValueError(f"Unsupported image type {suffix!r}; convert to PNG or JPEG first")
    media_type = media_type_map[suffix]

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=(
            "You are processing a scanned physician referral letter. "
            "Extract all clinically relevant information. "
            "If handwriting is unclear, indicate uncertainty with [?]. "
            "Return JSON matching the ReferralExtract schema: "
            '{"referring_provider": "...", "referring_facility": "...", '
            '"reason_for_referral": "...", "relevant_history": "...", '
            '"current_medications": ["..."], '
            '"urgency": "routine|urgent|emergent", '
            '"requested_specialty": "...", '
            '"patient_summary": "2-3 sentence clinical summary"}'
        ),
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": media_type, "data": image_data},
                },
                {"type": "text", "text": "Extract the referral information from this document."},
            ],
        }],
    )
    # Raises json.JSONDecodeError on non-JSON output and TypeError if the keys
    # do not match the dataclass fields; treat both as "needs human review".
    data = json.loads(response.content[0].text.strip())
    return ReferralExtract(**data)

Pattern 3: Multimodal RAG (Image + Retrieved Text)
"""
Multimodal RAG: combine a clinical image with retrieved clinical guidelines.

Context: HMS radiology workflow. A chest X-ray is flagged by radiology AI
as potentially showing pneumonia. The clinical AI combines the image with
retrieved clinical guidelines to provide the ordering physician with
context-aware clinical decision support.

NOTE: This is NOT a diagnostic AI. It is a clinical knowledge support tool.
The image interpretation provided to the LLM is from a separate, cleared
radiology AI. The LLM provides clinical knowledge context, not diagnosis.
"""
import base64
from pathlib import Path

import anthropic


def clinical_image_support(
    image_path: Path,
    radiology_ai_findings: str,  # Output from FDA-cleared radiology AI
    clinical_question: str,
    retrieved_guidelines: list[str],  # From RAG pipeline
    client: anthropic.Anthropic,
) -> str:
    """
    Combine radiology AI findings, the CXR image, and retrieved guidelines
    to provide clinical knowledge context to the ordering physician.

    IMPORTANT: This tool does NOT provide a diagnosis. It provides:
    - Context from clinical guidelines relevant to the AI findings
    - Suggested clinical questions to consider
    - Relevant clinical decision support information
    All clinical decisions remain with the ordering physician.
    """
    image_data = base64.standard_b64encode(image_path.read_bytes()).decode("utf-8")
    guidelines_context = "\n\n---\n\n".join(
        f"Guideline {i+1}:\n{g}" for i, g in enumerate(retrieved_guidelines)
    )
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=(
            "You are a clinical knowledge assistant. You provide relevant clinical "
            "guideline information to support physician clinical decision-making. "
            "You do NOT diagnose. You do NOT interpret radiological images. "
            "A separate FDA-cleared radiology AI has provided preliminary findings. "
            "Your role is to provide relevant clinical guideline context based on "
            "those findings. Always defer to the ordering physician's clinical judgment. "
            "Always include: 'Clinical decision-making remains with the ordering physician.'"
        ),
        messages=[{
            "role": "user",
            "content": [
                # The image is attached for reference only; the prompt forbids
                # interpreting it. If it adds nothing downstream, omit it to
                # save tokens and reduce PHI exposure.
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",  # assumes image_path is a PNG
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": (
                        f"Radiology AI preliminary findings: {radiology_ai_findings}\n\n"
                        f"Clinical question: {clinical_question}\n\n"
                        f"Retrieved Clinical Guidelines:\n\n{guidelines_context}\n\n"
                        "Please provide relevant clinical guideline context based on "
                        "the radiology AI findings. Do not interpret the image directly."
                    ),
                },
            ],
        }],
    )
    return response.content[0].text

Healthcare Context: HMS Scenario
HMS Multimodal Use Cases and Regulatory Status
| Use Case | Modality | FDA Regulatory Status | HMS Implementation Approach |
|---|---|---|---|
| Lab report PDF extraction | Documents | Not regulated (data processing) | Direct implementation via pattern 1 |
| Scanned referral processing | Documents | Not regulated | Direct implementation via pattern 2 |
| Radiology AI (chest X-ray) | Medical imaging | SaMD — requires 510(k) clearance | Purchase cleared third-party AI (Nuance PowerScribe AI, Aidoc) |
| Pathology slide analysis | Medical imaging | SaMD — requires clearance | Purchase cleared third-party AI |
| Dermatology photo review | Medical imaging | SaMD — potentially regulated | Pilot with cleared partner only |
| Clinical documentation photos | Documents | Not regulated | Direct implementation |
Critical architectural principle: Do not build your own medical image diagnostic AI. It is an FDA-regulated medical device. Use FDA-cleared commercial products for radiology, pathology, and dermatology AI. Use Claude and other frontier LLMs for document processing, text extraction, and knowledge support. These uses generally fall outside the SaMD definition, although any feature that starts to drive a diagnostic or treatment decision should be reviewed against FDA's clinical decision support guidance.
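One way to keep this boundary from eroding over time is to encode it as an explicit gate in the intake path. The sketch below is an assumption about how such a gate might look, not a regulatory determination: the modality labels mirror the table above, and `Route` and `route_use_case` are hypothetical names.

```python
from enum import Enum


class Route(Enum):
    LLM_DIRECT = "llm_direct"          # document processing, extraction, knowledge support
    CLEARED_VENDOR = "cleared_vendor"  # FDA-cleared third-party product


# Imaging modalities the table above treats as SaMD territory.
_DIAGNOSTIC_IMAGING = {"radiology", "pathology", "dermatology"}


def route_use_case(modality: str, influences_diagnosis: bool) -> Route:
    """Send anything diagnostic to a cleared vendor; everything else may use the LLM."""
    if modality.lower() in _DIAGNOSTIC_IMAGING or influences_diagnosis:
        return Route.CLEARED_VENDOR
    return Route.LLM_DIRECT


print(route_use_case("documents", influences_diagnosis=False).value)  # llm_direct
print(route_use_case("radiology", influences_diagnosis=False).value)  # cleared_vendor
```

Defaulting to the cleared-vendor route whenever `influences_diagnosis` is set keeps the conservative choice as the fallback; the final classification still belongs with regulatory and compliance staff.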
DICOM to LLM Pipeline
Medical images are stored in DICOM format. Sending DICOM images to an LLM requires conversion:
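from pathlib import Path


def dicom_to_png_for_llm(
    dicom_path: Path,
    window_center: int = 40,  # Soft-tissue CT window, in Hounsfield units (HU)
    window_width: int = 400,
) -> bytes:
    """
    Convert a DICOM image to PNG for LLM processing.
    Applies windowing for appropriate tissue visualization. The defaults are a
    soft-tissue CT window; radiographs such as CXR are not in HU, so pass a
    window suited to the modality.

    IMPORTANT: DICOM metadata includes patient name, DOB, MRN, and other
    identifiers. Only pixel data is written to the PNG, so header tags are not
    carried over, but PHI burned into the pixels (annotations, overlays)
    survives and must be removed before sending to any external API.
    """
    import io

    import numpy as np
    import pydicom
    from PIL import Image

    ds = pydicom.dcmread(str(dicom_path))
    pixel_array = ds.pixel_array.astype(float)

    # Convert stored values to HU using the rescale tags (identity if absent).
    slope = float(ds.get("RescaleSlope", 1.0))
    intercept = float(ds.get("RescaleIntercept", 0.0))
    pixel_array = pixel_array * slope + intercept

    # Apply windowing: clip to the window, then scale to 0-255.
    min_val = window_center - window_width / 2
    max_val = window_center + window_width / 2
    pixel_array = np.clip(pixel_array, min_val, max_val)
    pixel_array = ((pixel_array - min_val) / window_width * 255).astype(np.uint8)

    # MONOCHROME1 stores inverted grayscale (high values are dark).
    if ds.get("PhotometricInterpretation") == "MONOCHROME1":
        pixel_array = 255 - pixel_array

    image = Image.fromarray(pixel_array)
    if image.mode != "RGB":
        image = image.convert("RGB")
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()

Trade-offs and Considerations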
Multimodal Failure Modes
| Failure Mode | Description | Mitigation |
|---|---|---|
| OCR errors on handwriting | Handwritten clinical text misread | Flag [?] markers; require human review for critical values |
| Image resolution degradation | Low-resolution scans produce hallucinated text | Specify minimum scan DPI (300 DPI minimum for clinical docs) |
| PHI in image metadata | DICOM/EXIF metadata contains PHI | Strip metadata before sending to any API |
| Context token consumption | Large images consume 1,500–2,000 tokens each | Limit to 1–3 images per call; resize if possible |
| Missing clinical context | Image interpreted without clinical context | Always include relevant patient/clinical context in the prompt |
| Confidence calibration | Model may express confidence in hallucinated OCR | Always validate extracted structured data against known formats |
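The last mitigation, validating extracted data against known formats, can be sketched with the standard library alone. The field names follow the lab-report JSON schema used in Pattern 1; the helper names (`parse_model_json`, `validate_lab_payload`) are hypothetical, and the checks are deliberately minimal rather than a complete clinical validator.

```python
import json
import re
from datetime import date

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap JSON in
_ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
_ALLOWED_FLAGS = ("H", "L", "C", None)


def parse_model_json(raw: str) -> dict:
    """Parse model output as JSON, tolerating a surrounding Markdown fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with optional language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


def _is_iso_date(value: object) -> bool:
    if not isinstance(value, str) or not _ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 1980-02-30
    except ValueError:
        return False
    return True


def validate_lab_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic format checks passed."""
    problems = []
    for field in ("dob", "collection_date"):
        value = payload.get(field)
        if value is not None and not _is_iso_date(value):
            problems.append(f"{field} is not a valid YYYY-MM-DD date")
    for i, row in enumerate(payload.get("lab_results") or []):
        if not row.get("test_name"):
            problems.append(f"lab_results[{i}] has no test_name")
        if row.get("flag") not in _ALLOWED_FLAGS:
            problems.append(f"lab_results[{i}] has unknown flag {row.get('flag')!r}")
    return problems
```

A non-empty problem list is a reasonable trigger for the human-review path; passing these checks shows only that the output is well-formed, not that the values were read correctly from the page.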
Cost of Vision vs. Text
Image tokens are billed at the same input rate as text tokens, so vision is not more expensive per token. The cost difference comes from token volume: a 1,024×1,024 page image costs roughly 1,568 input tokens no matter how little text it contains, while the OCR text of a sparse lab report can be much shorter. For high-volume document processing (thousands of lab reports per day), evaluate whether OCR pre-processing (Tesseract, AWS Textract, Azure Document Intelligence) followed by text-only LLM processing is more cost-effective than direct vision processing, counting the OCR service's own per-page fee.
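A back-of-the-envelope comparison makes the evaluation concrete. Every number in this sketch is a placeholder chosen for illustration (page volume, tokens per page, token price, OCR fee); substitute measured token counts and current vendor pricing before drawing conclusions. Output tokens are left out because they are similar for both pipelines.

```python
def daily_cost_usd(
    pages_per_day: int,
    input_tokens_per_page: int,
    usd_per_million_input_tokens: float,
    per_page_fee_usd: float = 0.0,
) -> float:
    """Daily input-side cost: LLM input tokens plus any per-page preprocessing fee."""
    token_cost = (
        pages_per_day * input_tokens_per_page / 1_000_000 * usd_per_million_input_tokens
    )
    return token_cost + pages_per_day * per_page_fee_usd


# Placeholder inputs, not real prices or measurements.
PAGES_PER_DAY = 5_000
PRICE_PER_MTOK = 3.00  # hypothetical USD per million input tokens

vision = daily_cost_usd(PAGES_PER_DAY, 1_600, PRICE_PER_MTOK)
ocr_text = daily_cost_usd(PAGES_PER_DAY, 500, PRICE_PER_MTOK, per_page_fee_usd=0.0015)
print(f"vision: ${vision:.2f}/day, OCR + text: ${ocr_text:.2f}/day")
```

With these placeholder inputs the OCR path comes out cheaper, but the result flips easily: a dense page that OCRs to 1,500 tokens, or a higher per-page OCR fee, erases the gap.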
Comparison Table
| Approach | Document Types | Accuracy | Cost | Latency | PHI Risk |
|---|---|---|---|---|---|
| Vision LLM (Claude) | All image formats | High for printed, Medium for handwritten | Medium | 2–5s | API sends image — BAA required |
| OCR + text LLM | Printed documents | Very High for printed | Low (separate OCR step) | 3–8s | Two services handle PHI; both need BAA coverage |
| AWS Textract | Forms, tables, PDFs | Very High for structured | Medium | 1–3s | AWS HIPAA eligible |
| Azure Document Intelligence | Complex forms, custom models | Very High | Medium | 2–4s | Azure HIPAA eligible |
| Vision LLM (direct) | Handwriting | Medium | Medium | 2–5s | BAA required |
Interview Questions
Q1: A hospital wants to use AI to process scanned physician referral letters. What architecture would you recommend?
Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect / FDE
Answer Framework:
The architecture depends on two key properties of these referral letters: their format variability (typed vs. handwritten, structured vs. unstructured) and their volume (how many per day). As an illustrative example, assume a mix of typed and handwritten letters at moderate daily volume.
For typed, structured referrals: a traditional OCR pipeline (AWS Textract for structured form extraction) followed by a text LLM for structured data extraction is more cost-effective and reliable than direct vision LLM processing.
For handwritten and unstructured referrals: vision LLMs (Claude vision) typically outperform traditional OCR because they use clinical context, not just character shapes. To Tesseract, a handwritten "CC: f/u 2wk" is at best a string of characters; a vision LLM can expand it to "chief complaint: follow-up in 2 weeks."
Recommended architecture: an intelligent router that detects whether an incoming referral is typed/structured (→ Textract + text LLM) or handwritten/unstructured (→ vision LLM directly). This hybrid approach minimizes cost while maintaining coverage.
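The routing decision itself can be very small. The sketch below assumes an upstream classifier (not shown, and hypothetical) that estimates how much of the page is handwritten and whether it has recognizable form structure; the 0.2 cutoff is an arbitrary starting point to tune on real referrals.

```python
from dataclasses import dataclass
from enum import Enum


class Pipeline(Enum):
    OCR_THEN_TEXT_LLM = "ocr_then_text_llm"  # e.g. Textract followed by a text LLM
    VISION_LLM = "vision_llm"                # image sent directly to a vision LLM


@dataclass(frozen=True)
class ScanTraits:
    handwriting_fraction: float  # share of the page judged handwritten, 0.0 to 1.0
    has_form_structure: bool     # recognizable fields, tables, or a known template


def choose_pipeline(traits: ScanTraits, handwriting_cutoff: float = 0.2) -> Pipeline:
    """Typed and structured pages take the cheaper OCR path; everything else uses vision."""
    if traits.handwriting_fraction > handwriting_cutoff or not traits.has_form_structure:
        return Pipeline.VISION_LLM
    return Pipeline.OCR_THEN_TEXT_LLM


typed_form = ScanTraits(handwriting_fraction=0.05, has_form_structure=True)
handwritten_note = ScanTraits(handwriting_fraction=0.9, has_form_structure=False)
print(choose_pipeline(typed_form).value)        # ocr_then_text_llm
print(choose_pipeline(handwritten_note).value)  # vision_llm
```

Falling back to the vision path whenever either signal is doubtful trades some cost for coverage, which suits a workflow where a missed referral is worse than an expensive one.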
In all cases: PHI handling is the primary architectural constraint. These documents contain patient names, DOBs, and MRNs, all of which are PHI. Every service that touches the document must be covered by a HIPAA BAA, for example by calling the LLM through the hospital's AWS tenant (Bedrock) and running OCR through Textract or Azure Document Intelligence, and image content must not be logged persistently without de-identification.
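Output validation: extracted data is validated against known formats (dates, provider NPI numbers, ICD codes) before ingestion into Epic. Extractions with a confidence score below 0.85 go to a human review queue rather than automatic ingestion. OCR services report such scores directly; for a vision LLM the score has to come from a separate signal, such as [?] uncertainty markers or failed format checks.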
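A minimal sketch of that gate follows, assuming the confidence value is supplied by whichever component produced the extraction. The NPI check uses the Luhn algorithm over the NPI prefixed with 80840, which is how the NPI check digit is defined; `needs_human_review` and the treatment of [?] markers are illustrative choices, not a prescribed design.

```python
REVIEW_THRESHOLD = 0.85  # threshold quoted in the answer above


def npi_is_valid(npi: str) -> bool:
    """Check a 10-digit NPI with the Luhn algorithm over '80840' + NPI."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    total = 0
    for offset, char in enumerate(reversed("80840" + npi)):
        digit = int(char)
        if offset % 2 == 1:  # double every second digit, moving left from the check digit
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def _has_uncertainty_marker(value: object) -> bool:
    """True if a string, or any string in a list, carries the [?] marker."""
    if isinstance(value, str):
        return "[?]" in value
    if isinstance(value, (list, tuple)):
        return any(_has_uncertainty_marker(item) for item in value)
    return False


def needs_human_review(
    fields: dict,
    confidence: float,
    threshold: float = REVIEW_THRESHOLD,
) -> bool:
    """Route to review on low confidence or any field flagged as uncertain."""
    if confidence < threshold:
        return True
    return any(_has_uncertainty_marker(value) for value in fields.values())


print(npi_is_valid("1234567893"))  # True
print(needs_human_review({"reason_for_referral": "chest pain [?]"}, confidence=0.95))  # True
```

Treating any [?] marker as a review trigger regardless of the numeric score keeps the model's own expressed uncertainty from being averaged away.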
Further Reading
In This Repository:
- Healthcare AI: Clinical Documentation AI — [PLANNED] Multimodal patterns in clinical documentation
- Security: Data Privacy — [PLANNED] PHI in image data
- examples/healthcare/ — Working healthcare AI examples including document processing
Summary
Key Takeaways:
- Enterprise and clinical data is heavily image and document-based — multimodal AI extends coverage to the 70% of information that is not plain text
- Document understanding (lab reports, referral letters, consent forms) is the highest-ROI multimodal use case in the near term — it is implementable today without regulatory barriers
- Medical image diagnostic AI (radiology, pathology) is FDA-regulated — always use cleared commercial products, not LLMs, for diagnostic AI
- Images consume substantial context window tokens (1,500–2,000 per high-res image) — budget them carefully
- PHI in image metadata (DICOM headers, EXIF data) must be stripped before sending to any external API
- For high-volume document processing, evaluate OCR pre-processing + text LLM against direct vision LLM — the hybrid approach often wins on cost
The one thing to remember: Multimodal AI extends coverage to where clinical data actually lives, which is not just typed notes but also scanned forms, lab PDFs, and clinical images. The regulatory boundary is the key architectural constraint: document processing is generally not regulated as a medical device (HIPAA still applies), whereas diagnostic image analysis is SaMD and must use FDA-cleared products.
Next: Agentic AI | Previous: Context Window Management