Multimodal AI

Section: 01-AI-Foundations | Status: COMPLETE | Last Updated: 2026-06-30 | Difficulty: Intermediate

Executive Summary

Multimodal AI extends LLMs beyond text to process images, documents, audio, and video — capabilities that unlock significant value in healthcare, where a large fraction of clinically relevant information exists as images (radiology, pathology, dermatology), scanned documents (medical records, consent forms), and structured PDFs (lab reports, discharge summaries). Understanding multimodal capabilities, their limitations, and their enterprise architectural implications is increasingly essential for AI architects working in document-heavy and image-rich domains.

Prerequisites

LLM Fundamentals — Baseline model understanding
Prompt Engineering — Prompting patterns apply to multimodal inputs

Core Concepts

Why Multimodal Matters for Enterprise AI

Enterprise data is not primarily text. A hospital system's information landscape:

Clinical notes and documentation: ~30% (text)
Radiology images (CT, MRI, X-ray, ultrasound): ~25%
Pathology slides: ~15%
Scanned paper documents (consent forms, referral letters): ~15%
Structured PDFs (lab reports, prior auth forms): ~10%
Other (photos, procedural recordings): ~5%

A purely text-based AI system can only address 30% of the clinical information landscape. Multimodal AI extends coverage to scanned documents, structured PDFs, lab reports, and — when combined with specialized medical imaging AI — radiology and pathology content.

The Three Categories of Multimodal in Enterprise

Category 1: Document Understanding. Processing PDFs, scanned documents, and structured forms. This is the most immediately deployable multimodal capability in enterprise settings. Use cases: extracting structured data from insurance EOB documents, processing scanned referral letters, parsing lab report PDFs, reading prior authorization forms.

Category 2: Medical Image Analysis. Analyzing clinical images (chest X-rays, CT scans, pathology slides, dermatology photos). This category has specific FDA regulatory implications: AI that influences diagnostic decisions for medical images may be regulated as Software as a Medical Device (SaMD).

Category 3: Document + Context Fusion. Combining visual document understanding with retrieved textual knowledge, for example processing a patient's radiology image alongside retrieved clinical guidelines to provide contextually grounded interpretation. This is the most architecturally complex and highest-value multimodal application.

Architecture

Multimodal Input Architecture

graph TD
    %% Title: Multimodal Input Processing Pipeline
    subgraph "Input Sources"
        I1["Clinical Image\n(JPG/PNG/DICOM)"]
        I2["Scanned Document\n(PDF/JPG)"]
        I3["Structured PDF\n(Lab report)"]
        I4["Text Input\n(Clinical question)"]
    end
    subgraph "Pre-processing"
        P1["DICOM → PNG\nconverter"]
        P2["OCR fallback\n(if needed)"]
        P3["PDF → Image\nrenderer"]
        P4["PHI Redaction\n(before sending to LLM)"]
    end
    subgraph "Vision LLM"
        V1["Image Encoding\n(vision encoder)"]
        V2["Text Encoding\n(token embeddings)"]
        V3["Cross-attention\n(image + text fusion)"]
        V4["Output Generation"]
    end
    I1 --> P1 --> P4 --> V1
    I2 --> P2 --> P4 --> V1
    I3 --> P3 --> P4 --> V1
    I4 --> V2
    V1 & V2 --> V3 --> V4
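
The routing in the diagram can be sketched as a small lookup table. This is a minimal sketch, assuming each input has already been classified into one of the four source types; `InputKind`, `PIPELINE_STEPS`, and `plan_steps` are hypothetical names, and the step strings are labels copied from the diagram rather than real functions.

```python
from enum import Enum


class InputKind(Enum):
    CLINICAL_IMAGE = "clinical_image"      # JPG/PNG/DICOM
    SCANNED_DOCUMENT = "scanned_document"  # PDF/JPG
    STRUCTURED_PDF = "structured_pdf"      # e.g. a lab report
    TEXT = "text"                          # clinical question


# Step names are labels taken from the diagram, not real functions.
PIPELINE_STEPS: dict[InputKind, list[str]] = {
    InputKind.CLINICAL_IMAGE: ["dicom_to_png", "phi_redaction", "image_encoding"],
    InputKind.SCANNED_DOCUMENT: ["ocr_fallback", "phi_redaction", "image_encoding"],
    InputKind.STRUCTURED_PDF: ["pdf_to_image", "phi_redaction", "image_encoding"],
    InputKind.TEXT: ["text_encoding"],
}


def plan_steps(kind: InputKind) -> list[str]:
    """Return the ordered pre-processing and encoding steps for one input."""
    return list(PIPELINE_STEPS[kind])
```

Keeping the plan as data makes one property of the diagram easy to check: every visual input passes through PHI redaction before it reaches the vision encoder.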

How Vision LLMs Process Images

Vision LLMs process images by encoding them into a sequence of visual tokens using a vision encoder (typically a ViT — Vision Transformer). These visual tokens are concatenated with the text tokens in the model's context and processed jointly through the transformer's attention layers.
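
To make "a sequence of visual tokens" concrete, the sketch below counts the patches a plain ViT would produce. It assumes the textbook ViT scheme (non-overlapping square patches, one token per patch); production vision LLMs often resize, tile, or compress patches, so their real token counts can differ. `vit_token_count` is a hypothetical helper name.

```python
def vit_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Number of patch tokens for an image split into square, non-overlapping patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


# A 224x224 image with 16-pixel patches gives a 14 x 14 grid of tokens.
print(vit_token_count(224, 224))
```

Token count grows with image area, which is why resolution drives context consumption.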

Key implication: images consume context window tokens. A 1024×1024 image encoded at standard resolution typically consumes 1,568–2,048 tokens (Claude's token count varies by resolution). At that rate a 200K context window could hold roughly 100 to 125 high-resolution images, but doing so would consume the entire context with no room for retrieved knowledge or conversation history. In practice, multimodal clinical applications should limit image inclusion to 1–3 images per call and use targeted retrieval for any additional context.
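
The budgeting arithmetic can be sketched as follows. This is a back-of-envelope helper, assuming the per-image token figures quoted above; `max_images` and the `reserved_tokens` parameter (room set aside for the prompt, retrieved text, and the response) are hypothetical.

```python
def max_images(context_tokens: int, tokens_per_image: int, reserved_tokens: int = 0) -> int:
    """How many images fit once tokens for everything else are set aside."""
    available = context_tokens - reserved_tokens
    if available <= 0:
        return 0
    return available // tokens_per_image


# Filling the whole window with images, at the high and low per-image estimates.
print(max_images(200_000, 2_048), max_images(200_000, 1_568))

# Reserving 150K tokens for guidelines, history, and output leaves far less room.
print(max_images(200_000, 2_048, reserved_tokens=150_000))
```

Running the reservation case first is usually the more useful check, since the text budget is rarely zero in a clinical workflow.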

Implementation Patterns

Pattern 1: Processing a Clinical Lab Report PDF

python

"""
Extract structured data from a clinical lab report PDF.
Context: HMS scenario — processing inbound lab results from an external
reference lab that sends PDF reports rather than HL7/FHIR interfaces.
The extracted structured data is ingested into the Epic EHR as manual results.

Dependencies:
    anthropic>=0.49.0
    pillow>=10.2.0
    python-dotenv>=1.0.0
    pdf2image>=1.16.0 (requires poppler)
"""
import anthropic
import base64
import io
import json
from pathlib import Path
from typing import Any


def pdf_page_to_base64(pdf_path: Path, page_number: int = 0) -> str:
    """Convert one PDF page (zero-based index) to a base64-encoded PNG for LLM processing."""
    try:
        from pdf2image import convert_from_path
    except ImportError as exc:
        raise ImportError("Install pdf2image: pip install pdf2image (requires poppler)") from exc

    # pdf2image page numbers are one-based, hence the +1.
    images = convert_from_path(str(pdf_path), first_page=page_number + 1, last_page=page_number + 1)
    if not images:
        raise ValueError(f"No page {page_number} in {pdf_path}")

    buffer = io.BytesIO()
    images[0].save(buffer, format="PNG")
    return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")


def extract_lab_results(
    pdf_path: Path,
    client: anthropic.Anthropic,
) -> dict[str, Any]:
    """
    Extract structured lab results from a PDF lab report.
    Returns structured data suitable for EHR ingestion.

    Note: This function processes PHI (lab results are PHI).
    Ensure the Anthropic API call is made within the HIPAA BAA boundary.
    PHI is NOT logged. Results are not cached.
    """
    image_data = pdf_page_to_base64(pdf_path, page_number=0)

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=2048,
        system=(
            "Extract laboratory results from medical lab report images. "
            "Return ONLY valid JSON with this structure:\n"
            "{\n"
            '  "patient_name": "string",\n'
            '  "dob": "YYYY-MM-DD",\n'
            '  "collection_date": "YYYY-MM-DD",\n'
            '  "ordering_provider": "string",\n'
            '  "lab_results": [\n'
            '    {"test_name": "...", "result": "...", "unit": "...", '
            '"reference_range": "...", "flag": "H|L|C|null"}\n'
            "  ],\n"
            '  "critical_values": ["list of tests with critical flags"]\n'
            "}\n"
            "Flag meanings: H=High, L=Low, C=Critical. "
            "If a field is not visible, use null."
        ),
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all laboratory results from this lab report.",
                },
            ],
        }],
    )

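    raw_output = response.content[0].text.strip()
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        # Keep raw_output out of the error message and traceback: it contains PHI.
        raise ValueError("Model response was not valid JSON") from None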
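
Because the prompt only asks for JSON and cannot guarantee it, a defensive parsing and checking step is worth having before EHR ingestion. The sketch below is one possible shape, assuming the JSON structure requested in the system prompt above; `parse_model_json` and `lab_payload_problems` are hypothetical helpers, and the checks are illustrative rather than complete.

```python
import json
from typing import Any

FENCE = "`" * 3                       # markdown code-fence delimiter
ALLOWED_FLAGS = ("H", "L", "C", None)  # flag values requested in the prompt


def parse_model_json(raw: str) -> dict[str, Any]:
    """Parse model output, tolerating a surrounding markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, then everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


def lab_payload_problems(payload: dict[str, Any]) -> list[str]:
    """Return a list of structural problems; an empty list means the checks passed."""
    results = payload.get("lab_results")
    if not isinstance(results, list) or not results:
        return ["lab_results missing or empty"]
    problems = []
    for i, row in enumerate(results):
        # Report row indexes only, so problem messages never carry result values.
        if not row.get("test_name"):
            problems.append(f"row {i}: missing test_name")
        if row.get("flag") not in ALLOWED_FLAGS:
            problems.append(f"row {i}: unexpected flag")
    flagged = {r.get("test_name") for r in results if r.get("flag") == "C"}
    listed = set(payload.get("critical_values") or [])
    if flagged != listed:
        problems.append("critical_values does not match rows flagged C")
    return problems
```

Any non-empty problem list is a reasonable trigger for human review, particularly the critical-value mismatch.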

Pattern 2: Clinical Document OCR and Summarization

python

"""
Process a scanned clinical document (e.g., a handwritten referral letter).
Context: HMS receives scanned referral letters from community physicians.
This pipeline converts the scan to structured text and extracts key clinical data.
"""
import anthropic
import base64
from pathlib import Path
from dataclasses import dataclass


@dataclass
class ReferralExtract:
    referring_provider: str | None
    referring_facility: str | None
    reason_for_referral: str | None
    relevant_history: str | None
    current_medications: list[str]
    urgency: str  # "routine" | "urgent" | "emergent"
    requested_specialty: str | None
    patient_summary: str


def process_scanned_referral(
    image_path: Path,
    client: anthropic.Anthropic,
) -> ReferralExtract:
    """
    Extract structured data from a scanned physician referral letter.
    Handles handwritten and typed formats.
    """
    image_bytes = image_path.read_bytes()
    image_data = base64.standard_b64encode(image_bytes).decode("utf-8")

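    # Determine media type from extension; reject formats the mapping does not cover
    suffix = image_path.suffix.lower()
    media_type_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}
    if suffix not in media_type_map:
        raise ValueError(f"Unsupported image type: {suffix!r}")
    media_type = media_type_map[suffix]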

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=(
            "You are processing a scanned physician referral letter. "
            "Extract all clinically relevant information. "
            "If handwriting is unclear, indicate uncertainty with [?]. "
            "Return JSON matching the ReferralExtract schema: "
            '{"referring_provider": "...", "referring_facility": "...", '
            '"reason_for_referral": "...", "relevant_history": "...", '
            '"current_medications": ["..."], '
            '"urgency": "routine|urgent|emergent", '
            '"requested_specialty": "...", '
            '"patient_summary": "2-3 sentence clinical summary"}'
        ),
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": media_type, "data": image_data},
                },
                {"type": "text", "text": "Extract the referral information from this document."},
            ],
        }],
    )

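    import json
    from dataclasses import fields

    data = json.loads(response.content[0].text.strip())
    # Drop keys the schema does not define so an extra field cannot raise TypeError.
    known = {f.name for f in fields(ReferralExtract)}
    return ReferralExtract(**{k: v for k, v in data.items() if k in known})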
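
The system prompt asks the model to mark unclear handwriting with [?]. A downstream check can use those markers to decide which extractions need a human look. This is a minimal sketch, assuming the record is passed as a plain dict (for example via `dataclasses.asdict`); `uncertain_fields` is a hypothetical helper.

```python
UNCERTAIN = "[?]"


def uncertain_fields(record: dict[str, object]) -> list[str]:
    """Names of fields whose text (or any list item) contains the [?] marker."""
    flagged = []
    for name, value in record.items():
        values = value if isinstance(value, list) else [value]
        if any(isinstance(v, str) and UNCERTAIN in v for v in values):
            flagged.append(name)
    return flagged


example = {
    "referring_provider": "Dr. Sm[?]th",
    "current_medications": ["metformin", "lisin[?]"],
    "urgency": "routine",
}
print(uncertain_fields(example))
```

A non-empty result is a natural signal to hold the referral for manual review instead of filing it automatically.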

Pattern 3: Multimodal RAG — Image + Retrieved Text

python

"""
Multimodal RAG: combine a clinical image with retrieved clinical guidelines.
Context: HMS radiology workflow — a chest X-ray is flagged by radiology AI
as potentially showing pneumonia. The clinical AI combines the image with
retrieved clinical guidelines to provide the ordering physician with
context-aware clinical decision support.

NOTE: This is NOT a diagnostic AI. It is a clinical knowledge support tool.
The image interpretation provided to the LLM is from a separate, cleared
radiology AI. The LLM provides clinical knowledge context, not diagnosis.
"""
import anthropic
import base64
from pathlib import Path


def clinical_image_support(
    image_path: Path,
    radiology_ai_findings: str,  # Output from FDA-cleared radiology AI
    clinical_question: str,
    retrieved_guidelines: list[str],  # From RAG pipeline
    client: anthropic.Anthropic,
) -> str:
    """
    Combine radiology AI findings, the CXR image, and retrieved guidelines
    to provide clinical knowledge context to the ordering physician.

    IMPORTANT: This tool does NOT provide a diagnosis. It provides:
    - Context from clinical guidelines relevant to the AI findings
    - Suggested clinical questions to consider
    - Relevant clinical decision support information

    All clinical decisions remain with the ordering physician.
    """
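    # The request below declares image/png, so image_path must point to a PNG
    # (for example, the output of a DICOM-to-PNG conversion step).
    image_data = base64.standard_b64encode(image_path.read_bytes()).decode("utf-8")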

    guidelines_context = "\n\n---\n\n".join(
        f"Guideline {i+1}:\n{g}" for i, g in enumerate(retrieved_guidelines)
    )

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=(
            "You are a clinical knowledge assistant. You provide relevant clinical "
            "guideline information to support physician clinical decision-making. "
            "You do NOT diagnose. You do NOT interpret radiological images. "
            "A separate FDA-cleared radiology AI has provided preliminary findings. "
            "Your role is to provide relevant clinical guideline context based on "
            "those findings. Always defer to the ordering physician's clinical judgment. "
            "Always include: 'Clinical decision-making remains with the ordering physician.'"
        ),
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": (
                        f"Radiology AI preliminary findings: {radiology_ai_findings}\n\n"
                        f"Clinical question: {clinical_question}\n\n"
                        f"Retrieved Clinical Guidelines:\n\n{guidelines_context}\n\n"
                        "Please provide relevant clinical guideline context based on "
                        "the radiology AI findings. Do not interpret the image directly."
                    ),
                },
            ],
        }],
    )

    return response.content[0].text
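
A system prompt instruction such as "Always include ..." is a request, not a guarantee. A deterministic post-check can enforce it. The sketch below assumes the required sentence from the system prompt above; `with_required_statement` is a hypothetical helper.

```python
REQUIRED_STATEMENT = "Clinical decision-making remains with the ordering physician."


def with_required_statement(response_text: str) -> str:
    """Append the required statement if the model's response omitted it."""
    if REQUIRED_STATEMENT.lower() in response_text.lower():
        return response_text
    return response_text.rstrip() + "\n\n" + REQUIRED_STATEMENT
```

Applying this to the return value of `clinical_image_support` keeps the statement present even when the model drifts from its instructions.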

Healthcare Context — HMS Scenario

HMS Multimodal Use Cases and Regulatory Status

Use Case	Modality	FDA Regulatory Status	HMS Implementation Approach
Lab report PDF extraction	Documents	Not regulated (data processing)	Direct implementation via pattern 1
Scanned referral processing	Documents	Not regulated	Direct implementation via pattern 2
Radiology AI (chest X-ray)	Medical imaging	SaMD: typically requires 510(k) clearance or De Novo authorization	Purchase cleared third-party AI (Nuance PowerScribe AI, Aidoc)
Pathology slide analysis	Medical imaging	SaMD: requires clearance	Purchase cleared third-party AI
Dermatology photo review	Medical imaging	SaMD: potentially regulated	Pilot with cleared partner only
Clinical documentation photos	Documents	Not regulated	Direct implementation

Critical architectural principle: Do not build your own medical image diagnostic AI. It is an FDA-regulated medical device. Use FDA-cleared commercial products for radiology, pathology, and dermatology AI. Use Claude and other frontier LLMs for document processing, text extraction, and knowledge support — these are not classified as SaMD.

DICOM to LLM Pipeline

Medical images are stored in DICOM format. Sending DICOM images to an LLM requires conversion:

python

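from pathlib import Path


def dicom_to_png_for_llm(
    dicom_path: Path,
    window_center: float = 40,   # CT soft-tissue window, in Hounsfield units
    window_width: float = 400,
) -> bytes:
    """
    Convert DICOM image to PNG for LLM processing.
    Applies windowing for appropriate tissue visualization.

    The defaults are a CT soft-tissue window. Plain radiographs (CXR) are not
    in Hounsfield units; pass a window suited to the image, such as the
    WindowCenter/WindowWidth values stored in the dataset.

    IMPORTANT: Strip DICOM metadata tags containing PHI before
    sending to any external API. DICOM metadata includes patient
    name, DOB, MRN, and other identifiers. The PNG written here carries
    pixel data only, but annotations burned into the pixels are not removed.
    """
    import pydicom
    import numpy as np
    from PIL import Image
    import io

    ds = pydicom.dcmread(str(dicom_path))
    pixel_array = ds.pixel_array.astype(float)

    # Map stored values to modality units (HU for CT) before windowing
    slope = float(getattr(ds, "RescaleSlope", 1))
    intercept = float(getattr(ds, "RescaleIntercept", 0))
    pixel_array = pixel_array * slope + intercept

    # Apply windowing: clip to the window, then scale to 0-255
    min_val = window_center - window_width / 2
    max_val = window_center + window_width / 2
    pixel_array = np.clip(pixel_array, min_val, max_val)
    pixel_array = ((pixel_array - min_val) / window_width * 255).astype(np.uint8)

    # MONOCHROME1 stores inverted grayscale (high values are dark)
    if getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1":
        pixel_array = 255 - pixel_array

    image = Image.fromarray(pixel_array)
    if image.mode != "RGB":
        image = image.convert("RGB")

    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()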

Trade-offs and Considerations

Multimodal Failure Modes

Failure Mode	Description	Mitigation
OCR errors on handwriting	Handwritten clinical text misread	Flag [?] markers; require human review for critical values
Image resolution degradation	Low-resolution scans produce hallucinated text	Specify minimum scan DPI (300 DPI minimum for clinical docs)
PHI in image metadata	DICOM/EXIF metadata contains PHI	Strip metadata before sending to any API
Context token consumption	Large images consume 1,500–2,000 tokens each	Limit to 1–3 images per call; resize if possible
Missing clinical context	Image interpreted without clinical context	Always include relevant patient/clinical context in the prompt
Confidence calibration	Model may express confidence in hallucinated OCR	Always validate extracted structured data against known formats
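
For the metadata row in the table above, the sketch below shows one way to drop EXIF and comment segments from a JPEG using only the standard library. It assumes a well-formed baseline JPEG and removes only APP1 (EXIF/XMP), APP13 (IPTC), and COM segments; it does not touch text burned into the pixels, and a production pipeline would more likely rely on a vetted de-identification tool. `strip_jpeg_metadata` is a hypothetical helper.

```python
SOI = b"\xff\xd8"                   # start-of-image marker
SOS = 0xDA                          # start of scan: compressed pixel data follows
DROP_MARKERS = {0xE1, 0xED, 0xFE}   # APP1 (EXIF/XMP), APP13 (IPTC), COM (comment)


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return the JPEG bytes with metadata-bearing header segments removed."""
    if not data.startswith(SOI):
        raise ValueError("not a JPEG stream")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a segment marker")
        marker = data[pos + 1]
        if marker == SOS:
            # Everything from here on is image data; copy it unchanged.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker bytes.
        length = int.from_bytes(data[pos + 2 : pos + 4], "big")
        if length < 2:
            raise ValueError("invalid segment length")
        end = pos + 2 + length
        if marker not in DROP_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no start-of-scan marker found")
```

Working at the byte level avoids re-encoding the image, so the pixel data passes through bit-for-bit.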

Cost of Vision vs. Text

Image tokens are billed at text input token rates, so a vision call costs more only when the image consumes more tokens than the text it contains. That is often the case for documents: a 1,024×1,024 page image at ~1,568 tokens can cost several times more than the same page's extracted text if that text is only a few hundred tokens. For high-volume document processing (thousands of lab reports per day), evaluate whether OCR pre-processing (Tesseract, AWS Textract, Azure Document Intelligence) followed by text-only LLM processing is more cost-effective than direct vision processing.
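
A back-of-envelope comparison of the two routes might look like the following. Every number here is a placeholder assumption (document volume, token counts, token price, and OCR fee), and output tokens are ignored; substitute measured values and current vendor pricing before drawing conclusions. `vision_route_cost` and `ocr_route_cost` are hypothetical helpers.

```python
def vision_route_cost(docs: int, image_tokens: int, usd_per_mtok: float) -> float:
    """Input-token cost of sending each page image straight to a vision LLM."""
    return docs * image_tokens * usd_per_mtok / 1_000_000


def ocr_route_cost(docs: int, text_tokens: int, usd_per_mtok: float, ocr_usd_per_page: float) -> float:
    """OCR fee per page plus the input-token cost of the extracted text."""
    return docs * (ocr_usd_per_page + text_tokens * usd_per_mtok / 1_000_000)


# All figures below are placeholders, not real prices or measured volumes.
DOCS_PER_DAY = 5_000
vision = vision_route_cost(DOCS_PER_DAY, image_tokens=1_568, usd_per_mtok=3.0)
ocr = ocr_route_cost(DOCS_PER_DAY, text_tokens=400, usd_per_mtok=3.0, ocr_usd_per_page=0.0015)
print(f"vision: ${vision:.2f}/day, OCR + text: ${ocr:.2f}/day")
```

The comparison flips if the OCR fee is high or the page text is long, which is why it should be rerun per document type.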

Comparison Table

Approach	Document Types	Accuracy	Cost	Latency	PHI Risk
Vision LLM (Claude)	All image formats	High for printed, Medium for handwritten	Medium	2–5s	API sends image — BAA required
OCR + text LLM	Printed documents	Very High for printed	Low (separate OCR step)	3–8s	Two processing steps handle PHI; BAA required for any external service
AWS Textract	Forms, tables, PDFs	Very High for structured	Medium	1–3s	AWS HIPAA eligible
Azure Document Intelligence	Complex forms, custom models	Very High	Medium	2–4s	Azure HIPAA eligible
Vision LLM (direct)	Handwriting	Medium	Medium	2–5s	BAA required

Interview Questions

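Q1: A hospital wants to use AI to process scanned physician referral letters. What architecture would you recommend?

Category: Architecture / System Design | Difficulty: Senior | Role: AI Architect / FDE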

Answer Framework:

The architecture depends on two key properties of these referral letters: their format variability (typed vs. handwritten, structured vs. unstructured) and their volume (how many per day). As an illustrative example, assume a mix of typed and handwritten letters at moderate daily volume.

For typed, structured referrals: a traditional OCR pipeline (AWS Textract for structured form extraction) followed by a text LLM for structured data extraction is more cost-effective and reliable than direct vision LLM processing.

For handwritten and unstructured referrals: vision LLMs (Claude vision) typically outperform traditional OCR because they use clinical context, not just character shapes. Tesseract can at best transcribe a handwritten "cc: f/u 2wk" character by character; a vision LLM can read it in context as "chief complaint: follow-up in 2 weeks."

Recommended architecture: an intelligent router that detects whether an incoming referral is typed/structured (→ Textract + text LLM) or handwritten/unstructured (→ vision LLM directly). This hybrid approach minimizes cost while maintaining coverage.
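
The text does not say how the router tells typed from handwritten pages. One possible approach, shown here as an assumption, is to run a cheap first-pass OCR and route on its mean word confidence. `Pipeline`, `choose_pipeline`, and the 0.90 threshold are all hypothetical and would need tuning against real referral samples.

```python
from enum import Enum


class Pipeline(Enum):
    OCR_THEN_TEXT_LLM = "ocr_then_text_llm"  # typed / structured referrals
    VISION_LLM = "vision_llm"                # handwritten / unstructured referrals


def choose_pipeline(mean_ocr_confidence: float, min_confidence: float = 0.90) -> Pipeline:
    """Route on first-pass OCR confidence (0.0 to 1.0); the threshold is a placeholder."""
    if not 0.0 <= mean_ocr_confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if mean_ocr_confidence >= min_confidence:
        return Pipeline.OCR_THEN_TEXT_LLM
    return Pipeline.VISION_LLM
```

Low OCR confidence is only a proxy for handwriting, so misrouted pages should still fall through to human review.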

In all cases: PHI handling is the primary architectural constraint. These documents contain patient names, DOBs, and MRNs — all PHI. The API call must be made within the hospital's AWS or Azure tenant (Bedrock or Azure Document Intelligence) where a HIPAA BAA is in place, and image content must not be logged persistently without de-identification.

Output validation: extracted data is validated against known formats (dates, provider NPI numbers, ICD codes) before ingestion into Epic. Extractions with a confidence score below 0.85 are routed to a human review queue rather than ingested automatically.
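
The validation and routing step can be sketched as below. The NPI check uses the standard Luhn check digit with the 80840 prefix (which contributes a constant 24 to the sum). The field names `referral_date` and `provider_npi`, the helper names, and the source of the confidence score are assumptions; the text does not specify how confidence is produced.

```python
import re
from datetime import date


def is_valid_npi(npi: object) -> bool:
    """10 ASCII digits whose Luhn sum, including the 80840 prefix, is a multiple of 10."""
    if not isinstance(npi, str) or not re.fullmatch(r"[0-9]{10}", npi):
        return False
    total = 24  # Luhn contribution of the fixed "80840" prefix
    for i, ch in enumerate(reversed(npi)):
        digit = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def is_iso_date(value: object) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def route_extraction(record: dict, confidence: float, threshold: float = 0.85) -> str:
    """Auto-ingest only when format checks pass and confidence meets the threshold."""
    formats_ok = is_iso_date(record.get("referral_date")) and is_valid_npi(record.get("provider_npi"))
    if formats_ok and confidence >= threshold:
        return "auto_ingest"
    return "human_review"
```

Format checks catch a different class of error than the confidence score, so both gates are applied before anything reaches the EHR.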

Summary

Key Takeaways:

Enterprise and clinical data is heavily image and document-based — multimodal AI extends coverage to the 70% of information that is not plain text
Document understanding (lab reports, referral letters, consent forms) is the highest-ROI multimodal use case in the near term — it is implementable today without regulatory barriers
Medical image diagnostic AI (radiology, pathology) is FDA-regulated — always use cleared commercial products, not LLMs, for diagnostic AI
Images consume substantial context window tokens (1,500–2,000 per high-res image) — budget them carefully
PHI in image metadata (DICOM headers, EXIF data) must be stripped before sending to any external API
For high-volume document processing, evaluate OCR pre-processing + text LLM against direct vision LLM — the hybrid approach often wins on cost

The one thing to remember: Multimodal extends AI to where clinical data actually lives — not just in typed notes, but in scanned forms, lab PDFs, and clinical images. The regulatory boundary is the key architectural constraint: document processing is unrestricted; diagnostic image analysis is SaMD and must use FDA-cleared products.

