POC to Production

Executive Summary

The proof-of-concept is a hypothesis test, not a mini-production system. Most enterprise AI POCs succeed technically and still fail to reach production — not because the technology didn't work, but because the POC was designed without a clear production path, success criteria were not agreed in writing before execution, or the gap between POC constraints and production constraints was not surfaced until it became a blocker. This chapter defines a rigorous POC design methodology — scoping, success criteria, gap analysis, production migration planning — that treats POC execution as the first step in a production journey, not an isolated experiment. Understanding where POCs go wrong is the most direct path to ensuring they go right.

Learning Objectives

Design a POC scope that is both feasible within the time constraint and production-representative
Define written success criteria before POC execution begins
Identify and plan for the production-POC gap: the constraints that differ between POC and production environments
Build a go/no-go decision framework that prevents both premature production recommendations and indefinite POC extension
Execute the POC-to-production migration as a structured engineering process

Business Problem

Enterprise AI POCs occupy an ambiguous organizational position. They are funded as experiments but expected to demonstrate production viability. They are staffed at POC intensity but must produce insights about production complexity. And they are evaluated against criteria that are often not defined until after the POC is complete — which means the evaluation is subjective and the path to production is unclear.

The failure mode is predictable: the POC succeeds by every measure the team chose, and production deployment stalls because those measures left out the factors that actually determine production viability. A POC that demonstrated AI output quality but did not measure physician adoption rate, integration latency under concurrent load, or security architecture compatibility has not demonstrated production readiness; it has demonstrated only that the AI works in isolation.

Why POC Design Matters

The design decisions made at the beginning of a POC determine what the POC can and cannot demonstrate. A POC against synthetic data demonstrates AI output quality but not data integration reliability. A POC against 20 encounters demonstrates output consistency but not statistical significance for quality metrics. A POC run by the FDE without client engineering involvement demonstrates the capability but not the client team's ability to own it.

Each of these design gaps creates a specific production risk that the POC was not positioned to surface. The POC design methodology exists to ensure that the risks that matter for production viability are included in POC scope from the beginning.
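
To make the sample-size point concrete, the sketch below computes a Wilson score interval for an observed edit rate. The choice of the Wilson interval, the function name `wilson_interval`, and the counts are assumptions for illustration; the source does not prescribe a statistical method.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))

# Hypothetical counts: the same 25% observed edit rate at two sample sizes.
small = wilson_interval(successes=5, n=20)     # 5 of 20 reviewed items needed major edits
large = wilson_interval(successes=100, n=400)  # 100 of 400 reviewed items needed major edits

print(f"n=20:  {small[0]:.1%} to {small[1]:.1%}")
print(f"n=400: {large[0]:.1%} to {large[1]:.1%}")
```

With 20 items the upper bound sits well above a 30% threshold even though the observed rate is 25%, so the result cannot separate "meets the threshold" from "misses it". With 400 items the upper bound falls below 30%.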

Conceptual Explanation

A well-designed POC has three properties that are often in tension:

Feasible: Can be executed with the available time, data, and people. A 6-week POC scope for a 4-week engagement is not feasible.

Representative: Produces evidence about the conditions that matter for production success. A POC against synthetic data is feasible but not representative of production data quality.

Decision-enabling: Produces a clear signal that allows a go/no-go decision at the end. A POC without defined success criteria cannot produce a clear go/no-go decision.

The POC design process is the discipline of finding the intersection of these three properties for a specific use case and client environment.
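
The three properties can be turned into a quick pre-flight check. This is a minimal sketch: `POCDesign`, its field names, and `design_issues` are hypothetical, and a real review would weigh far more than three flags.

```python
from dataclasses import dataclass

@dataclass
class POCDesign:
    scope_weeks: int                 # estimated effort for the proposed scope
    engagement_weeks: int            # time actually available
    production_representative_data: bool
    success_criteria_signed: bool

def design_issues(design: POCDesign) -> list[str]:
    """Return which of the three properties the design fails, if any."""
    issues = []
    if design.scope_weeks > design.engagement_weeks:
        issues.append("not feasible: scope exceeds the engagement window")
    if not design.production_representative_data:
        issues.append("not representative: data does not reflect production conditions")
    if not design.success_criteria_signed:
        issues.append("not decision-enabling: no signed success criteria")
    return issues

# The infeasible case from the text: a 6-week scope in a 4-week engagement.
print(design_issues(POCDesign(6, 4, True, True)))
```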

Core Architecture: The POC Design Process

Step 1 — Define the POC Hypothesis

Every POC tests a hypothesis. Making the hypothesis explicit is the first design step:

text

POC Hypothesis Template:

"We hypothesize that [AI system] can [capability claim] when operating on
 [data source] in [client environment], as measured by [success criteria],
 within a [time / resource] constraint."

Example:
"We hypothesize that discharge summary AI can generate clinically complete
 draft discharge summaries from FHIR R4 clinical data for inpatient pneumonia
 encounters at the Reference Healthcare Organization, achieving a physician
 edit rate below 30% and section completeness above 95% as evaluated by two
 hospitalist physicians, within a 4-week execution window using 50 randomly
 selected inpatient encounters from the Epic FHIR sandbox."

A hypothesis that cannot be stated in this form is not POC-ready. If the capability claim is vague ("AI generates good discharge summaries"), the success criteria will be unresolvable. If the data source is undefined ("our patient data"), the integration scope is unknown. Make the hypothesis concrete before writing a line of code.
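
One way to enforce the template is to refuse to render a hypothesis with empty slots. The sketch below assumes a hypothetical `POCHypothesis` dataclass whose fields mirror the bracketed slots in the template. It catches missing fields only; judging whether a claim is too vague still takes a human reviewer.

```python
from dataclasses import dataclass, fields

@dataclass
class POCHypothesis:
    ai_system: str
    capability_claim: str
    data_source: str
    client_environment: str
    success_criteria: str
    constraint: str

    def missing_fields(self) -> list[str]:
        """Names of template slots that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Fill the hypothesis template, refusing if any slot is empty."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"Hypothesis is not POC-ready; missing: {', '.join(missing)}")
        return (
            f"We hypothesize that {self.ai_system} can {self.capability_claim} "
            f"when operating on {self.data_source} in {self.client_environment}, "
            f"as measured by {self.success_criteria}, within a {self.constraint} constraint."
        )

draft = POCHypothesis(
    ai_system="discharge summary AI",
    capability_claim="generate clinically complete draft discharge summaries",
    data_source="",  # "our patient data" was never pinned down
    client_environment="the Reference Healthcare Organization",
    success_criteria="physician edit rate below 30%",
    constraint="4-week",
)
print(draft.missing_fields())
```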

Step 2 — Define Success Criteria in Writing

Success criteria must be defined and signed off by the client before POC execution begins. Criteria agreed only verbally tend to be renegotiated once the POC produces results.

python

from dataclasses import dataclass
from typing import Optional

@dataclass
class POCSuccessCriteria:
    """
    Written success criteria for an AI POC.
    Must be signed off by FDE and client stakeholder before execution.
    
    Educational example — not for clinical use.
    """
    primary_metric: str
    primary_threshold: float
    primary_measurement_method: str
    
    secondary_metrics: list[dict]  # List of {metric, threshold, method}
    
    # POC scope definition
    data_source: str
    sample_size: int
    evaluation_period_days: int
    evaluators: list[str]  # Who evaluates: roles, not names
    
    # Go/no-go decision
    go_criteria: str       # "All primary and ≥ 2 of 3 secondary criteria met"
    no_go_criteria: str    # "Primary criterion missed by > 10 percentage points"
    inconclusive_criteria: str  # "Primary met but secondary data insufficient"
    
    # Approvals
    fde_sign_off: Optional[str] = None
    client_technical_sign_off: Optional[str] = None
    client_clinical_sign_off: Optional[str] = None


# Example: Discharge Summary AI POC success criteria
discharge_summary_poc_criteria = POCSuccessCriteria(
    primary_metric="Physician edit rate",
    primary_threshold=0.30,  # Less than 30% of sections require substantial edits
    primary_measurement_method=(
        "Two hospitalist physicians independently review 25 AI-generated drafts each. "
        "For each draft, they rate each section as: Accept as-is / Minor edit / Major edit / Reject. "
        "Edit rate = (Major edit + Reject) / total sections evaluated."
    ),
    secondary_metrics=[
        {
            "metric": "Section completeness rate",
            "threshold": 0.95,
            "method": "AI-generated draft contains all required sections per HMS template ≥ 95% of encounters"
        },
        {
            "metric": "Medication accuracy",
            "threshold": 0.90,
            "method": "Discharge medications in AI draft match pharmacy-verified discharge med list ≥ 90%"
        },
        {
            "metric": "Generation latency",
            "threshold": 30.0,
            "method": "Draft generation completes within 30 seconds for ≥ 95% of encounters (p95 latency)"
        }
    ],
    data_source="Epic FHIR R4 sandbox (production-representative de-identified data)",
    sample_size=50,
    evaluation_period_days=14,
    evaluators=["Two hospitalist physicians", "One clinical informatics nurse"],
    go_criteria=(
        "Primary criterion (edit rate < 30%) met AND ≥ 2 of 3 secondary criteria met"
    ),
    no_go_criteria=(
        "Edit rate ≥ 40% (primary criterion missed by 10+ percentage points) OR medication accuracy < 80%"
    ),
    inconclusive_criteria=(
        "Primary criterion met but sample size insufficient for statistical confidence "
        "(evaluator bandwidth limitation — extend evaluation with larger sample)"
    )
)
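
The primary measurement method reduces to a small calculation over section ratings. The sketch below follows the formula in `primary_measurement_method`. The rating labels and the `edit_rate` helper are hypothetical names, and the sample ratings are invented for illustration.

```python
from collections import Counter

# Rating labels are assumptions that mirror the four categories in the rubric.
RATINGS = ("accept", "minor_edit", "major_edit", "reject")

def edit_rate(section_ratings: list[str]) -> float:
    """Edit rate = (major edit + reject) / total sections evaluated."""
    if not section_ratings:
        raise ValueError("no sections were evaluated")
    unknown = set(section_ratings) - set(RATINGS)
    if unknown:
        raise ValueError(f"unknown ratings: {sorted(unknown)}")
    counts = Counter(section_ratings)
    return (counts["major_edit"] + counts["reject"]) / len(section_ratings)

# Hypothetical ratings for the ten sections of one draft.
ratings = ["accept"] * 6 + ["minor_edit"] * 2 + ["major_edit", "reject"]
print(edit_rate(ratings))  # 2 of 10 sections -> 0.2
```

Minor edits do not count toward the edit rate under this definition, which is worth confirming with the clinical evaluators before sign-off.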

Step 3 — POC-to-Production Gap Analysis

The POC operates in a simplified environment. Production operates in a constrained environment. The gap between these two environments is the primary source of POC-to-production failures.

Map the gap explicitly before POC execution:

python

PRODUCTION_GAP_ANALYSIS = {
    "data_environment": {
        "poc": "Epic FHIR sandbox (de-identified data, representative schema)",
        "production": "Epic FHIR production (real PHI, live encounter data)",
        "gap": "PHI handling, BAA requirement, Minimum Necessary scope enforcement",
        "migration_requirement": "BAA signed; audit logging enabled; PHI minimization in prompts"
    },
    "concurrency": {
        "poc": "Single-user, sequential requests",
        "production": "50+ concurrent clinical users; burst at shift change",
        "gap": "LLM API rate limits; AI gateway queuing; response time under load",
        "migration_requirement": "Load test at 50 concurrent requests; queue depth monitoring"
    },
    "model_governance": {
        "poc": "Latest available model version (for best quality)",
        "production": "Approved model version per Model Review Board; change control for updates",
        "gap": "Version pinning; evaluation before model updates; PCCP if SaMD-adjacent",
        "migration_requirement": "Model registry entry; MRB approval; re-evaluation on update"
    },
    "prompt_management": {
        "poc": "Prompt in code (iterating rapidly)",
        "production": "Versioned prompt registry; clinical validation per version",
        "gap": "Prompt versioning; rollback capability; evaluation on prompt change",
        "migration_requirement": "Prompt Registry deployment; clinical sign-off process per version"
    },
    "error_handling": {
        "poc": "Errors visible to FDE; exceptions bubble up",
        "production": "Errors must not disrupt clinical workflow; graceful degradation required",
        "gap": "Exception handling; fallback behavior; alert thresholds",
        "migration_requirement": "Production exception handling; fallback to manual workflow"
    },
    "observability": {
        "poc": "FDE monitors logs directly; ad hoc investigation",
        "production": "Automated quality drift monitoring; alert on threshold breach",
        "gap": "Structured logging; evaluation pipeline; alert routing",
        "migration_requirement": "Observability stack; baseline metrics; on-call runbook"
    },
    "clinical_workflow_integration": {
        "poc": "Standalone application (not embedded in Epic)",
        "production": "Embedded SMART on FHIR application (App Orchard approved)",
        "gap": "Epic App Orchard review (6–12 weeks); SMART launch integration",
        "migration_requirement": "App Orchard submission; SMART launch parameters; Epic testing"
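    },
    # Production-side items mirror the infrastructure section of the migration checklist.
    "security": {
        "poc": "Sandbox access with de-identified data; security architecture not yet exercised",
        "production": "HIPAA-eligible VPC; TLS 1.2+ on all communication paths; append-only audit logging",
        "gap": "Network isolation; encryption in transit; audit retention; security architecture compatibility",
        "migration_requirement": "Security architecture review; TLS 1.2+ verified; audit logging with 6-year retention"
    }
}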
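
A gap analysis kept as a dictionary can be checked for completeness and flattened into a migration backlog. The sketch below is one possible approach: `validate_gap_analysis`, `migration_backlog`, and the deliberately incomplete `observability` entry are hypothetical, and splitting requirements on semicolons assumes the formatting used above.

```python
REQUIRED_KEYS = {"poc", "production", "gap", "migration_requirement"}

def validate_gap_analysis(gap_analysis: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return {dimension: missing keys} for every incomplete entry."""
    problems = {}
    for dimension, entry in gap_analysis.items():
        missing = {key for key in REQUIRED_KEYS if not entry.get(key, "").strip()}
        if missing:
            problems[dimension] = missing
    return problems

def migration_backlog(gap_analysis: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    """Split each semicolon-separated migration requirement into a backlog item."""
    backlog = []
    for dimension, entry in gap_analysis.items():
        for item in entry.get("migration_requirement", "").split(";"):
            if item.strip():
                backlog.append((dimension, item.strip()))
    return backlog

sample = {
    "concurrency": {
        "poc": "Single-user, sequential requests",
        "production": "50+ concurrent clinical users; burst at shift change",
        "gap": "LLM API rate limits; AI gateway queuing; response time under load",
        "migration_requirement": "Load test at 50 concurrent requests; queue depth monitoring",
    },
    # Incomplete on purpose: no migration requirement has been written yet.
    "observability": {
        "poc": "FDE monitors logs directly; ad hoc investigation",
        "production": "Automated quality drift monitoring; alert on threshold breach",
        "gap": "Structured logging; evaluation pipeline; alert routing",
    },
}

print(validate_gap_analysis(sample))
print(migration_backlog(sample))
```

Running the validation before execution starts makes an unmapped dimension visible in week 1 rather than at migration time.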

Step 4 — POC Execution Plan

markdown

# POC Execution Plan — Discharge Summary AI
# Duration: 4 weeks
# FDE + Client Engineering Team

## Week 1 — Environment and Integration

Day 1–2: AI gateway deployment (IT team)
Day 1–3: FHIR sandbox access credentials provisioned
Day 2–4: FHIR client integration — Patient, Condition, MedicationRequest, Observation
Day 3–5: Prompt development (initial version; not clinically validated yet)
Day 5: First end-to-end generation test — internal review only

Exit criteria: Generate a discharge summary draft from FHIR data for any test encounter

## Week 2 — Quality Iteration

Day 6–8: Prompt refinement based on internal quality review
Day 8–10: Generate drafts for 20 randomly selected encounters
Day 10: Internal quality review by FDE + clinical informatics
Day 10: Adjust prompt based on findings

Exit criteria: FDE + clinical informatics confident output quality is ready for physician review

## Week 3 — Clinical Evaluation

Day 11–13: Physician evaluator kickoff; evaluation rubric training
Day 11–20: Physicians review 25 unique encounters each, plus 5 overlapping encounters drawn from the other reviewer's set for inter-rater reliability
Day 20: Collect evaluation data

Exit criteria: 50 encounters evaluated; data collected; evaluation scores tabulated

## Week 4 — Analysis and Production Planning

Day 21–22: Analyze results against success criteria
Day 23: Go/No-Go determination meeting with client
Day 23–25: If Go — draft Production Architecture document
Day 25: POC readout presentation to executive sponsors
Day 28: POC Summary report delivered

## Roles and Responsibilities

| Activity | FDE | Client Engineering | Clinical Team | IT |
|----------|-----|-------------------|--------------|-----|
| FHIR integration | Lead | Support | — | Credentials |
| Prompt development | Lead | Review | Clinical review | — |
| Clinical evaluation | Facilitation | — | Lead | — |
| Gap analysis | Lead | Collaborate | — | Collaborate |
| Production planning | Lead | Collaborate | Inform | Collaborate |
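
The overlap encounters in Week 3 exist to measure inter-rater reliability. The plan does not name a statistic, so the sketch below assumes Cohen's kappa, a common choice for two raters assigning categorical labels. The rating labels and sample data are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of the same five sections by two physicians.
physician_a = ["accept", "accept", "major_edit", "minor_edit", "accept"]
physician_b = ["accept", "minor_edit", "major_edit", "minor_edit", "accept"]
print(round(cohens_kappa(physician_a, physician_b), 4))  # 0.6875
```

A low kappa on the overlap set suggests the rubric training needs another pass before the edit-rate numbers are treated as reliable.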

Step 5 — Go/No-Go Framework

python

from enum import Enum

# POCSuccessCriteria is the dataclass defined in Step 2.

class POCOutcome(Enum):
    GO = "go"
    NO_GO = "no_go"
    CONDITIONAL_GO = "conditional_go"
    EXTEND = "extend"
    REDESIGN = "redesign"

def evaluate_poc_outcome(
    results: dict,
    criteria: POCSuccessCriteria
) -> tuple[POCOutcome, str]:
    """
    Evaluate POC results against success criteria to produce a go/no-go recommendation.
    Branches are checked in order; the first match wins.
    Educational example — not for clinical use.
    """
    edit_rate = results["edit_rate"]
    primary_met = edit_rate < criteria.primary_threshold

    # Secondary thresholds mirror the signed criteria from Step 2.
    secondary_met_count = sum([
        results["section_completeness"] >= 0.95,
        results["medication_accuracy"] >= 0.90,
        results["p95_latency_seconds"] <= 30.0
    ])

    # No-go: edit rate at or above the no-go threshold
    if edit_rate >= 0.40:
        return (POCOutcome.NO_GO,
                f"Primary criterion failed: edit rate {edit_rate:.1%} "
                f"is at or above the no-go threshold of 40%")

    # No-go: medication accuracy safety concern
    if results["medication_accuracy"] < 0.80:
        return (POCOutcome.NO_GO,
                f"Medication accuracy {results['medication_accuracy']:.1%} "
                f"below minimum safe threshold of 80% (patient safety concern)")

    # Go: primary met and at least 2 of 3 secondary criteria met
    if primary_met and secondary_met_count >= 2:
        return (POCOutcome.GO,
                f"Go criteria met: edit rate {edit_rate:.1%}, "
                f"{secondary_met_count}/3 secondary criteria met")

    # Conditional go: primary met, exactly one secondary criterion met
    if primary_met and secondary_met_count == 1:
        return (POCOutcome.CONDITIONAL_GO,
                "Primary criterion met. Secondary gaps require mitigation plan before production.")

    # Redesign: primary met but no secondary criterion met
    if primary_met:
        return (POCOutcome.REDESIGN,
                f"Edit rate {edit_rate:.1%} meets the threshold, but no secondary "
                f"criteria were met. Redesign recommended before production planning.")

    # Extend: primary borderline, insufficient sample
    if 0.30 <= edit_rate <= 0.35 and results["sample_size"] < 50:
        return (POCOutcome.EXTEND,
                "Edit rate borderline. Extend with additional encounters to reach statistical confidence.")

    # Redesign: primary missed but improvable
    return (POCOutcome.REDESIGN,
            f"Edit rate {edit_rate:.1%} above threshold but improvable. "
            f"Prompt redesign recommended with new 30-day iteration cycle.")
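
The function above repeats the secondary thresholds as literals, so they can drift from the signed criteria. One alternative, sketched here under assumptions, is a table-driven check. The `key` and `passes` fields are hypothetical additions, since the source stores only metric, threshold, and method, and the results are invented for illustration.

```python
import operator

# Each check names the results key, the threshold, and the comparison direction.
SECONDARY_CHECKS = [
    {"key": "section_completeness", "threshold": 0.95, "passes": operator.ge},
    {"key": "medication_accuracy", "threshold": 0.90, "passes": operator.ge},
    {"key": "p95_latency_seconds", "threshold": 30.0, "passes": operator.le},
]

def count_secondary_met(results: dict[str, float], checks: list[dict]) -> tuple[int, list[str]]:
    """Return (number of checks passed, keys of the checks that were missed)."""
    missed = [
        check["key"]
        for check in checks
        if not check["passes"](results[check["key"]], check["threshold"])
    ]
    return len(checks) - len(missed), missed

# Hypothetical results for illustration only.
hypothetical_results = {
    "section_completeness": 0.96,
    "medication_accuracy": 0.87,
    "p95_latency_seconds": 24.0,
}
met, missed = count_secondary_met(hypothetical_results, SECONDARY_CHECKS)
print(met, missed)  # 2 ['medication_accuracy']
```

Returning the missed keys alongside the count gives the go/no-go conversation a named gap to assign an owner to.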

Architecture Diagram

mermaid

graph TD
    HYPO["Define POC Hypothesis\n(Capability + Data + Criteria + Constraint)"]
    CRIT["Define Success Criteria\n(Written sign-off before execution)"]
    GAP["POC-to-Production Gap Analysis\n(8 dimensions)"]
    PLAN["POC Execution Plan\n(4-week schedule + roles)"]

    subgraph "POC Execution"
        W1["Week 1: Environment + Integration"]
        W2["Week 2: Quality Iteration"]
        W3["Week 3: Clinical Evaluation"]
        W4["Week 4: Analysis + Planning"]
    end

    subgraph "Go/No-Go Decision"
        GO["GO — Proceed to Production Planning"]
        CGO["CONDITIONAL GO — Mitigation Plan"]
        EXT["EXTEND — More Data Needed"]
        NGO["NO-GO — Root Cause Analysis"]
        RDS["REDESIGN — New Iteration Cycle"]
    end

    PP["Production Planning\nMigration + Architecture"]
    LAUNCH["Production Launch"]

    HYPO --> CRIT --> GAP --> PLAN
    PLAN --> W1 --> W2 --> W3 --> W4
    W4 --> GO & CGO & EXT & NGO & RDS
    GO --> PP
    CGO --> PP
    EXT --> W3
    NGO --> HYPO
    RDS --> HYPO
    PP --> LAUNCH

Implementation Patterns

Production Migration Checklist

When the POC produces a Go or Conditional Go decision, the production migration begins:

python

PRODUCTION_MIGRATION_CHECKLIST = {
    "governance": [
        "BAA signed with all AI vendors (Anthropic, cloud provider)",
        "Model Review Board approval obtained (Tier 1 and 2 use cases)",
        "Prompt versions clinically validated and registered in Prompt Registry",
        "Model version pinned and registered in Model Registry",
        "Risk tier classification documented"
    ],
    "infrastructure": [
        "AI gateway deployed in production environment (HIPAA-eligible VPC)",
        "Audit logging enabled (append-only, 6-year retention, hashed patient IDs)",
        "PHI minimization enforced in production prompts",
        "TLS 1.2+ on all communication paths",
        "Rate limiting configured (protect LLM API budget and availability)"
    ],
    "integration": [
        "Epic App Orchard review submitted and approved",
        "Production FHIR credentials provisioned",
        "SMART on FHIR launch configured in Epic",
        "DocumentReference write-back tested in Epic staging",
        "HL7 ADT integration tested (if applicable)"
    ],
    "quality_assurance": [
        "Golden dataset evaluation completed (≥ 50 real encounters)",
        "Demographic bias evaluation (subgroup performance ≥ 90% of overall)",
        "Adversarial testing: prompt injection, edge cases",
        "Load testing at expected concurrent user count + 50% headroom",
        "Physician champion sign-off on production output quality"
    ],
    "operations": [
        "Observability stack deployed (structured logging, quality drift monitoring)",
        "Alert thresholds defined and tested",
        "On-call runbook written",
        "Rollback procedure defined and tested",
        "Safety event reporting channel established"
    ],
    "clinical_readiness": [
        "Physician training materials developed",
        "Clinical champion training completed",
        "Shadow mode period defined (minimum 48 hours)",
        "Canary deployment plan (5% → 25% → 100%)",
        "Feedback collection mechanism deployed"
    ]
}
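
A checklist in this shape can drive a simple readiness report. The sketch below is a hypothetical helper, not part of the source: `readiness_report` and `ready_for_launch` are assumed names, and the sample uses a small subset of the checklist items.

```python
def readiness_report(checklist: dict[str, list[str]], completed: set[str]) -> dict[str, dict]:
    """Summarize done/total counts and open items per checklist category."""
    report = {}
    for category, items in checklist.items():
        open_items = [item for item in items if item not in completed]
        report[category] = {
            "done": len(items) - len(open_items),
            "total": len(items),
            "open": open_items,
        }
    return report

def ready_for_launch(report: dict[str, dict]) -> bool:
    """Launch-ready only when no category has open items."""
    return all(not entry["open"] for entry in report.values())

# Small subset of the checklist, for illustration.
sample_checklist = {
    "operations": ["On-call runbook written", "Rollback procedure defined and tested"],
    "integration": ["Production FHIR credentials provisioned"],
}
completed_items = {"On-call runbook written", "Production FHIR credentials provisioned"}

report = readiness_report(sample_checklist, completed_items)
print(report["operations"]["open"])  # ['Rollback procedure defined and tested']
print(ready_for_launch(report))      # False
```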

Enterprise Considerations

POC portfolio management: FDE organizations with multiple concurrent POCs need portfolio visibility — which POCs are at what stage, which are at risk of stalling, and which are ready for production planning. A POC tracking system (even a simple spreadsheet) that captures POC stage, success criteria status, blocking issues, and estimated production date is essential.

POC cost transparency: POCs consume significant resources: FDE time, client engineering time, API costs, and organizational attention. The cost of a POC must be weighed against the expected value of the production deployment. POCs for use cases with uncertain production ROI should be designed to be shorter and cheaper — enough to validate the hypothesis before committing full resources.

Avoiding the perpetual POC: Some organizations run perpetual POCs — endless iterations that never produce a production decision. This is usually a symptom of unclear success criteria or organizational risk aversion. The go/no-go framework with a defined timeline prevents this pattern.
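
The tracking fields named above (stage, success criteria status, blocking issues, estimated production date) fit a small record type, and a dated go/no-go decision makes perpetual POCs easy to flag. This is a sketch under assumptions: `POCRecord`, `at_risk`, the stage labels, and the portfolio entries are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class POCRecord:
    name: str
    stage: str                       # e.g. "execution", "clinical_evaluation", "decision"
    criteria_signed: bool
    decision_date: date              # go/no-go date agreed before execution
    decided: bool = False
    blocking_issues: list[str] = field(default_factory=list)
    estimated_production_date: Optional[date] = None

def at_risk(record: POCRecord, today: date) -> list[str]:
    """Return the reasons a POC is at risk of stalling (empty list if none)."""
    reasons = []
    if not record.criteria_signed:
        reasons.append("success criteria not signed")
    if record.blocking_issues:
        reasons.append(f"{len(record.blocking_issues)} blocking issue(s)")
    if not record.decided and today > record.decision_date:
        reasons.append("decision date passed without a go/no-go decision")
    return reasons

# Hypothetical portfolio entries.
portfolio = [
    POCRecord("discharge-summary", "clinical_evaluation", True, date(2025, 3, 14)),
    POCRecord("prior-auth", "execution", False, date(2025, 2, 1),
              blocking_issues=["sandbox credentials pending"]),
]
today = date(2025, 3, 1)
flagged = {r.name: at_risk(r, today) for r in portfolio if at_risk(r, today)}
print(flagged)
```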

Healthcare Example

Educational Example — Illustrative POC. Not intended as clinical guidance.

A Reference Healthcare Organization discharge summary AI POC produces the following results at week 4:

Physician edit rate: 24% (below the 30% threshold — primary criterion met)
Section completeness: 97% (secondary criterion met)
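Medication accuracy: 88% (below the 90% threshold; secondary criterion not met, but above the 80% no-go floor)
P95 latency: 22 seconds (secondary criterion met)

Decision: GO (primary criterion met and 2 of 3 secondary criteria met). The medication accuracy shortfall still needs an owner and a mitigation plan during production planning.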

Production planning begins with a 12-week migration plan: App Orchard review (running in parallel, 8 weeks), AI gateway production deployment (week 1), BAA finalization (weeks 2–3), MRB approval (week 4), shadow mode pilot (weeks 5–6), canary to 10% of hospitalists (weeks 7–9), full deployment (week 12).

Common Mistakes

1. Starting POC execution before success criteria are signed off. When POC results are mixed, unsigned success criteria become negotiation fodder. Get written sign-off before the first line of code.

2. POC scope that does not represent production constraints. A POC against synthetic data in a local environment with a single user has told you almost nothing about production viability. At minimum, use production-representative data and test under concurrent load.

3. Not involving the client's engineering team in POC execution. An FDE who builds the entire POC alone creates a system the client cannot maintain. POC execution must include the client engineers who will own the system in production.

4. Missing the production gap for Epic App Orchard. App Orchard review is a 6–12 week process. POC designs that assume immediate Epic production access will create a production delay that was entirely foreseeable in week 1.

5. No-Go without root cause analysis. A No-Go POC is valuable information, not a failure. The output should include a root cause analysis: was the data quality insufficient? Was the use case mismatched to the AI capability? Was the prompt under-engineered? A structured No-Go enables the next iteration.
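
Mistake 2 calls for testing under concurrent load. The sketch below shows the shape of such a test with `asyncio`, using a stand-in coroutine in place of the real generation call. `fake_generate_draft`, the 10 ms delay, and the request counts are placeholders; a real test would call the AI gateway and use production-representative encounters.

```python
import asyncio
import time

async def fake_generate_draft(encounter_id: int) -> str:
    """Stand-in for the real draft-generation call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"draft-{encounter_id}"

async def load_test(concurrency: int, total_requests: int) -> list[float]:
    """Issue requests with at most `concurrency` in flight; return per-request latencies."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies: list[float] = []

    async def one_request(encounter_id: int) -> None:
        async with semaphore:
            start = time.perf_counter()
            await fake_generate_draft(encounter_id)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one_request(i) for i in range(total_requests)))
    return latencies

def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceil(0.95 * n)
    return ordered[rank - 1]

latencies = asyncio.run(load_test(concurrency=50, total_requests=100))
print(f"p95 latency: {p95(latencies):.3f}s over {len(latencies)} requests")
```

The p95 figure from a run like this is what gets compared against the 30-second latency criterion, rather than the latency of a single sequential request.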

Best Practices

Define the POC hypothesis before defining the scope
Get written sign-off on success criteria before starting execution
Map the production gap in week 1; don't discover it in week 8
Involve client engineering in POC execution from day 1
Include clinical evaluation in every healthcare AI POC — not just technical metrics
Define a clear Go/No-Go decision process with a date on the calendar before execution begins
Begin App Orchard submission immediately on POC start — it runs in parallel, not after

Trade-offs

Speed vs. rigor: A 2-week POC is faster but produces less evidence about production viability. A 6-week POC produces stronger evidence but delays the production decision. The right balance depends on the risk of a production failure vs. the cost of a longer POC.

Representation vs. access: Using production data in a POC produces the most representative results but requires BAA, PHI controls, and access provisioning. Using synthetic data avoids these requirements but produces weaker evidence. For healthcare AI, production-representative de-identified data in an Epic sandbox is the practical middle ground.

Interview Questions

Q: A POC produces mixed results — the primary success criterion is met but one secondary criterion is not. How do you structure the go/no-go conversation with the client?

Category: Behavioral | Difficulty: Principal | Role: FDE

Answer Framework:

A Conditional Go is a legitimate POC outcome — it means the core capability is validated but a production risk exists that requires a mitigation plan before launch. The go/no-go conversation should be structured, not improvisational.

First, present the results factually against the success criteria that were signed off before the POC. The signed criteria are the agreed evaluation framework; departure from them requires explicit justification. If the primary criterion is met, that is the most important finding.

Second, analyze the failed secondary criterion: is it a fundamental limitation (the AI cannot consistently achieve this metric), a data quality issue (the metric failed because of specific data gaps, not the AI capability), or an engineering gap (latency is too high because the demo gateway was under-provisioned, not because the AI is inherently slow)? Each has a different mitigation.

Third, propose a concrete mitigation plan for the gap — with a specific owner, timeline, and re-measurement mechanism. The production launch is conditioned on the mitigation being completed and validated.

The go/no-go decision is the client's, not the FDE's. The FDE presents the evidence and the recommendation; the client decides.

Key Points to Hit:

Present results against the signed success criteria — no surprises
Categorize the failure (fundamental / data quality / engineering)
Propose a specific mitigation plan with owner and timeline
Decision is the client's; recommendation is the FDE's

Red Flags:

Redefining success criteria post-POC to match results
Recommending GO without a mitigation plan for the secondary gap

Key Takeaways

A POC tests a hypothesis — make the hypothesis explicit before designing the scope
Written, signed success criteria before execution are non-negotiable
The production-POC gap must be mapped before execution begins, not after
Eight production gap dimensions: data environment, concurrency, model governance, prompt management, error handling, observability, clinical workflow integration, and security
A Go/No-Go decision framework with defined categories prevents ambiguous outcomes
App Orchard review takes 6–12 weeks — submit in parallel with POC execution
POC execution must include the client engineering team who will own the system in production

Glossary

POC Hypothesis: An explicit statement of what the POC will test, against what data, in what environment, measured by what criteria.

Production Gap: The set of constraints and requirements that differ between the POC environment and the production environment.

Conditional Go: A POC outcome where the primary criterion is met but secondary gaps require mitigation plans before production launch.

App Orchard: Epic's third-party application review program. Required before a SMART on FHIR application can access production Epic environments.
