POC to Production

Executive Summary

The proof-of-concept is a hypothesis test, not a mini-production system. Most enterprise AI POCs succeed technically and still fail to reach production — not because the technology didn't work, but because the POC was designed without a clear production path, success criteria were not agreed in writing before execution, or the gap between POC constraints and production constraints was not surfaced until it became a blocker. This chapter defines a rigorous POC design methodology — scoping, success criteria, gap analysis, production migration planning — that treats POC execution as the first step in a production journey, not an isolated experiment. Understanding where POCs go wrong is the most direct path to ensuring they go right.

Learning Objectives

  • Design a POC scope that is both feasible within the time constraint and production-representative
  • Define written success criteria before POC execution begins
  • Identify and plan for the production-POC gap: the constraints that differ between POC and production environments
  • Build a go/no-go decision framework that prevents both premature production recommendations and indefinite POC extension
  • Execute the POC-to-production migration as a structured engineering process

Business Problem

Enterprise AI POCs occupy an ambiguous organizational position. They are funded as experiments but expected to demonstrate production viability. They are staffed at POC intensity but must produce insights about production complexity. And they are evaluated against criteria that are often not defined until after the POC is complete — which means the evaluation is subjective and the path to production is unclear.

The failure mode is predictable: the POC succeeds by every measure the team chose, and production deployment stalls because those measures omitted the factors that actually determine production viability. A POC that demonstrated AI output quality but did not measure physician adoption rate, integration latency under concurrent load, or security architecture compatibility has not demonstrated production readiness; it has demonstrated only that the AI works in isolation.

Why POC Design Matters

The design decisions made at the beginning of a POC determine what the POC can and cannot demonstrate. A POC against synthetic data demonstrates AI output quality but not data integration reliability. A POC against 20 encounters demonstrates output consistency but not statistical significance for quality metrics. A POC run by the FDE without client engineering involvement demonstrates the capability but not the client team's ability to own it.

Each of these design gaps creates a specific production risk that the POC was not positioned to surface. The POC design methodology exists to ensure that the risks that matter for production viability are included in POC scope from the beginning.
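To make the sample-size point concrete, the sketch below computes a Wilson score interval for the same observed rate at two sample sizes. It is illustrative only: the function name `wilson_interval` and the example counts are assumptions, not part of the methodology, and the calculation treats every evaluated item as independent, which section-level ratings within a single encounter may not be.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))


# Hypothetical counts: the same 25% observed rate at two sample sizes.
for flagged, n in ((5, 20), (50, 200)):
    low, high = wilson_interval(flagged, n)
    print(f"n={n:>3}: observed {flagged / n:.0%}, interval {low:.1%} to {high:.1%}")
```

With these assumed counts, the 20-item interval runs from roughly 11% to 47%, so a quality threshold anywhere in that range cannot be confirmed or ruled out. The larger sample narrows the interval considerably, which is the practical meaning of "not statistically significant" here.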

Conceptual Explanation

A well-designed POC has three properties that are often in tension:

Feasible: Can be executed with the available time, data, and people. A 6-week POC scope for a 4-week engagement is not feasible.

Representative: Produces evidence about the conditions that matter for production success. A POC against synthetic data is feasible but not representative of production data quality.

Decision-enabling: Produces a clear signal that allows a go/no-go decision at the end. A POC without defined success criteria cannot produce a clear go/no-go decision.

The POC design process is the discipline of finding the intersection of these three properties for a specific use case and client environment.

Core Architecture: The POC Design Process

Step 1 — Define the POC Hypothesis

Every POC tests a hypothesis. Making the hypothesis explicit is the first design step:

text
POC Hypothesis Template:

"We hypothesize that [AI system] can [capability claim] when operating on
 [data source] in [client environment], as measured by [success criteria],
 within a [time / resource] constraint."

Example:
"We hypothesize that discharge summary AI can generate clinically complete
 draft discharge summaries from FHIR R4 clinical data for inpatient pneumonia
 encounters at the Reference Healthcare Organization, achieving a physician
 edit rate below 30% and section completeness above 95% as evaluated by two
 hospitalist physicians, within a 4-week execution window using 50 randomly
 selected inpatient encounters from the Epic FHIR sandbox."

A hypothesis that cannot be stated in this form is not POC-ready. If the capability claim is vague ("AI generates good discharge summaries"), the success criteria will be unresolvable. If the data source is undefined ("our patient data"), the integration scope is unknown. Make the hypothesis concrete before writing a line of code.
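One way to enforce the template is to capture each slot as a required field and refuse to render the hypothesis while any slot is empty. The sketch below is a minimal illustration under that assumption; the class `POCHypothesis` and its field names are hypothetical and simply mirror the bracketed slots in the template above.

```python
from dataclasses import dataclass, fields


@dataclass
class POCHypothesis:
    """One field per bracketed slot in the hypothesis template (names are illustrative)."""
    ai_system: str
    capability_claim: str
    data_source: str
    client_environment: str
    success_criteria: str
    constraint: str

    def missing_fields(self) -> list[str]:
        """Names of template slots left empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the hypothesis sentence, or fail if any slot is unfilled."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"Hypothesis is not POC-ready; missing: {', '.join(missing)}")
        return (
            f"We hypothesize that {self.ai_system} can {self.capability_claim} "
            f"when operating on {self.data_source} in {self.client_environment}, "
            f"as measured by {self.success_criteria}, within a {self.constraint} constraint."
        )


example = POCHypothesis(
    ai_system="discharge summary AI",
    capability_claim="generate clinically complete draft discharge summaries",
    data_source="FHIR R4 clinical data for inpatient pneumonia encounters",
    client_environment="the Epic FHIR sandbox",
    success_criteria="physician edit rate below 30% and section completeness above 95%",
    constraint="4-week",
)
print(example.render())
```

An empty-slot check catches only missing information. Whether a filled slot is specific enough ("good summaries" versus a measurable claim) still needs human review.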

Step 2 — Define Success Criteria in Writing

Success criteria must be defined and signed off by the client before POC execution begins. Verbal success criteria tend to be renegotiated once the POC produces results, so only written, signed criteria count.

python
from dataclasses import dataclass
from typing import Optional

@dataclass
class POCSuccessCriteria:
    """
    Written success criteria for an AI POC.
    Must be signed off by FDE and client stakeholder before execution.
    
    Educational example — not for clinical use.
    """
    primary_metric: str
    primary_threshold: float
    primary_measurement_method: str
    
    secondary_metrics: list[dict]  # List of {metric, threshold, method}
    
    # POC scope definition
    data_source: str
    sample_size: int
    evaluation_period_days: int
    evaluators: list[str]  # Who evaluates: roles, not names
    
    # Go/no-go decision
    go_criteria: str       # "All primary and ≥ 2 of 3 secondary criteria met"
    no_go_criteria: str    # "Primary criterion missed by > 10 percentage points"
    inconclusive_criteria: str  # "Primary met but secondary data insufficient"
    
    # Approvals
    fde_sign_off: Optional[str] = None
    client_technical_sign_off: Optional[str] = None
    client_clinical_sign_off: Optional[str] = None


# Example: Discharge Summary AI POC success criteria
discharge_summary_poc_criteria = POCSuccessCriteria(
    primary_metric="Physician edit rate",
    primary_threshold=0.30,  # Less than 30% of sections require substantial edits
    primary_measurement_method=(
        "Two hospitalist physicians independently review 25 AI-generated drafts each. "
        "For each draft, they rate each section as: Accept as-is / Minor edit / Major edit / Reject. "
        "Edit rate = (Major edit + Reject) / total sections evaluated."
    ),
    secondary_metrics=[
        {
            "metric": "Section completeness rate",
            "threshold": 0.95,
            "method": "AI-generated draft contains all required sections per HMS template ≥ 95% of encounters"
        },
        {
            "metric": "Medication accuracy",
            "threshold": 0.90,
            "method": "Discharge medications in AI draft match pharmacy-verified discharge med list ≥ 90%"
        },
        {
            "metric": "Generation latency",
            "threshold": 30.0,
            "method": "Draft generation completes within 30 seconds for ≥ 95% of encounters (p95 latency)"
        }
    ],
    data_source="Epic FHIR R4 sandbox (production-representative de-identified data)",
    sample_size=50,
    evaluation_period_days=14,
    evaluators=["Two hospitalist physicians", "One clinical informatics nurse"],
    go_criteria=(
        "Primary criterion (edit rate < 30%) met AND ≥ 2 of 3 secondary criteria met"
    ),
    no_go_criteria=(
        "Primary criterion missed by ≥ 10 percentage points (edit rate ≥ 40%) "
        "OR medication accuracy < 80%"
    ),
    inconclusive_criteria=(
        "Primary criterion met but sample size insufficient for statistical confidence "
        "(evaluator bandwidth limitation — extend evaluation with larger sample)"
    )
)
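The primary measurement method above defines edit rate as (Major edit + Reject) / total sections evaluated. A minimal sketch of that calculation follows, assuming ratings are recorded as one lowercase label per evaluated section; the label strings and the function name `edit_rate` are illustrative choices, not a prescribed format.

```python
from collections import Counter

# Assumed label vocabulary, mirroring the four rating options in the rubric.
RATINGS = ("accept", "minor_edit", "major_edit", "reject")


def edit_rate(section_ratings: list[str]) -> float:
    """Share of evaluated sections rated 'major_edit' or 'reject'."""
    if not section_ratings:
        raise ValueError("no sections evaluated")
    unknown = set(section_ratings) - set(RATINGS)
    if unknown:
        raise ValueError(f"unknown rating labels: {sorted(unknown)}")
    counts = Counter(section_ratings)
    return (counts["major_edit"] + counts["reject"]) / len(section_ratings)


# Hypothetical ratings for ten sections pooled across drafts.
ratings = ["accept"] * 6 + ["minor_edit"] * 2 + ["major_edit", "reject"]
print(f"Edit rate: {edit_rate(ratings):.0%}")
```

Minor edits deliberately do not count toward the rate, matching the written definition. Agreeing on that detail before execution is part of what the sign-off protects.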

Step 3 — POC-to-Production Gap Analysis

The POC operates in a simplified environment. Production operates in a constrained environment. The gap between these two environments is the primary source of POC-to-production failures.

Map the gap explicitly before POC execution:

python
PRODUCTION_GAP_ANALYSIS = {
    "data_environment": {
        "poc": "Epic FHIR sandbox (de-identified data, representative schema)",
        "production": "Epic FHIR production (real PHI, live encounter data)",
        "gap": "PHI handling, BAA requirement, Minimum Necessary scope enforcement",
        "migration_requirement": "BAA signed; audit logging enabled; PHI minimization in prompts"
    },
    "concurrency": {
        "poc": "Single-user, sequential requests",
        "production": "50+ concurrent clinical users; burst at shift change",
        "gap": "LLM API rate limits; AI gateway queuing; response time under load",
        "migration_requirement": "Load test at 50 concurrent requests; queue depth monitoring"
    },
    "model_governance": {
        "poc": "Latest available model version (for best quality)",
        "production": "Approved model version per Model Review Board; change control for updates",
        "gap": "Version pinning; evaluation before model updates; PCCP if SaMD-adjacent",
        "migration_requirement": "Model registry entry; MRB approval; re-evaluation on update"
    },
    "prompt_management": {
        "poc": "Prompt in code (iterating rapidly)",
        "production": "Versioned prompt registry; clinical validation per version",
        "gap": "Prompt versioning; rollback capability; evaluation on prompt change",
        "migration_requirement": "Prompt Registry deployment; clinical sign-off process per version"
    },
    "error_handling": {
        "poc": "Errors visible to FDE; exceptions bubble up",
        "production": "Errors must not disrupt clinical workflow; graceful degradation required",
        "gap": "Exception handling; fallback behavior; alert thresholds",
        "migration_requirement": "Production exception handling; fallback to manual workflow"
    },
    "observability": {
        "poc": "FDE monitors logs directly; ad hoc investigation",
        "production": "Automated quality drift monitoring; alert on threshold breach",
        "gap": "Structured logging; evaluation pipeline; alert routing",
        "migration_requirement": "Observability stack; baseline metrics; on-call runbook"
    },
    "clinical_workflow_integration": {
        "poc": "Standalone application (not embedded in Epic)",
        "production": "Embedded SMART on FHIR application (App Orchard approved)",
        "gap": "Epic App Orchard review (6–12 weeks); SMART launch integration",
        "migration_requirement": "App Orchard submission; SMART launch parameters; Epic testing"
    }
}
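A gap map in this shape can be checked mechanically before it is reviewed with the client. The sketch below assumes the four-key structure used above and that migration requirements are separated by semicolons; the helper names `validate_gap_analysis` and `migration_backlog` and the incomplete sample entry are hypothetical.

```python
REQUIRED_KEYS = {"poc", "production", "gap", "migration_requirement"}


def validate_gap_analysis(analysis: dict[str, dict[str, str]]) -> list[str]:
    """Return a list of problems; an empty list means every dimension is fully mapped."""
    problems = []
    for dimension, entry in analysis.items():
        for key in sorted(REQUIRED_KEYS):
            if not entry.get(key, "").strip():
                problems.append(f"{dimension}: missing or empty '{key}'")
    return problems


def migration_backlog(analysis: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    """Split each dimension's semicolon-separated requirements into backlog items."""
    backlog = []
    for dimension, entry in analysis.items():
        for item in entry.get("migration_requirement", "").split(";"):
            if item.strip():
                backlog.append((dimension, item.strip()))
    return backlog


sample = {
    "concurrency": {
        "poc": "Single-user, sequential requests",
        "production": "50+ concurrent clinical users; burst at shift change",
        "gap": "LLM API rate limits; AI gateway queuing; response time under load",
        "migration_requirement": "Load test at 50 concurrent requests; queue depth monitoring",
    },
    # Hypothetical, deliberately incomplete entry to show what the check reports.
    "example_incomplete_dimension": {"poc": "Sandbox credentials", "production": ""},
}

for problem in validate_gap_analysis(sample):
    print("GAP MAP PROBLEM:", problem)
for dimension, item in migration_backlog(sample):
    print(f"[{dimension}] {item}")
```

Turning each requirement into a backlog item early gives the production plan a starting work list instead of a narrative description of the gap.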

Step 4 — POC Execution Plan

markdown
# POC Execution Plan — Discharge Summary AI
# Duration: 4 weeks
# FDE + Client Engineering Team

## Week 1 — Environment and Integration

Day 1–2: AI gateway deployment (IT team)
Day 1–3: FHIR sandbox access credentials provisioned
Day 2–4: FHIR client integration — Patient, Condition, MedicationRequest, Observation
Day 3–5: Prompt development (initial version; not clinically validated yet)
Day 5: First end-to-end generation test — internal review only

Exit criteria: Generate a discharge summary draft from FHIR data for any test encounter

## Week 2 — Quality Iteration

Day 6–8: Prompt refinement based on internal quality review
Day 8–10: Generate drafts for 20 randomly selected encounters
Day 10: Internal quality review by FDE + clinical informatics
Day 10: Adjust prompt based on findings

Exit criteria: FDE + clinical informatics confident output quality is ready for physician review

## Week 3 — Clinical Evaluation

Day 11–13: Physician evaluator kickoff; evaluation rubric training
Day 11–20: Physicians review 25 unique encounters each, plus 5 overlap encounters for inter-rater reliability
Day 20: Collect evaluation data

Exit criteria: 50 encounters evaluated; data collected; evaluation scores tabulated

## Week 4 — Analysis and Production Planning

Day 21–22: Analyze results against success criteria
Day 23: Go/No-Go determination meeting with client
Day 23–25: If Go — draft Production Architecture document
Day 25: POC readout presentation to executive sponsors
Day 28: POC Summary report delivered

## Roles and Responsibilities

| Activity | FDE | Client Engineering | Clinical Team | IT |
|----------|-----|-------------------|--------------|-----|
| FHIR integration | Lead | Support | — | Credentials |
| Prompt development | Lead | Review | Clinical review | — |
| Clinical evaluation | Facilitation | — | Lead | — |
| Gap analysis | Lead | Collaborate | — | Collaborate |
| Production planning | Lead | Collaborate | Inform | Collaborate |
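The overlap encounters in week 3 exist so that agreement between the two physician evaluators can be quantified. One common statistic for two raters is Cohen's kappa; the plan above does not name a specific measure, so treat the choice of kappa, the function name `cohens_kappa`, and the sample ratings as assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical section ratings from two physicians on the same overlap drafts.
physician_1 = ["accept", "accept", "major_edit", "reject"]
physician_2 = ["accept", "minor_edit", "major_edit", "reject"]
print(f"kappa = {cohens_kappa(physician_1, physician_2):.2f}")
```

Low agreement on the overlap set suggests the rubric or evaluator training needs attention before the edit rate is interpreted, since the primary metric depends on consistent use of the rating scale.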

Step 5 — Go/No-Go Framework

python
from enum import Enum

class POCOutcome(Enum):
    GO = "go"
    NO_GO = "no_go"
    CONDITIONAL_GO = "conditional_go"
    EXTEND = "extend"
    REDESIGN = "redesign"

def evaluate_poc_outcome(
    results: dict,
    criteria: POCSuccessCriteria
) -> tuple[POCOutcome, str]:
    """
    Evaluate POC results against success criteria to produce a go/no-go recommendation.
    Educational example — not for clinical use.
    """
    primary_met = results["edit_rate"] < criteria.primary_threshold
    
    secondary_met_count = sum([
        results["section_completeness"] >= 0.95,
        results["medication_accuracy"] >= 0.90,
        results["p95_latency_seconds"] <= 30.0
    ])
    
    # No-go: primary failure
    if results["edit_rate"] >= 0.40:
        return (POCOutcome.NO_GO, 
                f"Primary criterion failed: edit rate {results['edit_rate']:.1%} "
                f"exceeds no-go threshold of 40%")
    
    # No-go: medication accuracy safety concern
    if results["medication_accuracy"] < 0.80:
        return (POCOutcome.NO_GO,
                f"Medication accuracy {results['medication_accuracy']:.1%} "
                f"below minimum safe threshold of 80% — patient safety concern")
    
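    # Go: primary criterion met and at least 2 of 3 secondary criteria met
    if primary_met and secondary_met_count >= 2:
        return (POCOutcome.GO,
                f"Go criteria met: edit rate {results['edit_rate']:.1%}, "
                f"{secondary_met_count}/3 secondary criteria met")

    # Conditional go: primary met, some secondary gaps
    if primary_met and secondary_met_count == 1:
        return (POCOutcome.CONDITIONAL_GO,
                "Primary criterion met. Secondary gaps require mitigation plan before production.")

    # Primary met but no secondary criterion met: the written criteria do not
    # cover this case, so it is routed to redesign with an accurate message
    if primary_met:
        return (POCOutcome.REDESIGN,
                f"Edit rate {results['edit_rate']:.1%} meets the primary threshold, "
                f"but 0/3 secondary criteria met. Redesign recommended before re-evaluation.")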
    
    # Extend: primary borderline, insufficient sample
    if 0.30 <= results["edit_rate"] <= 0.35 and results["sample_size"] < 50:
        return (POCOutcome.EXTEND,
                "Edit rate borderline. Extend with additional encounters to reach statistical confidence.")
    
    # Redesign: primary missed but improvable
    return (POCOutcome.REDESIGN,
            f"Edit rate {results['edit_rate']:.1%} above threshold but improvable. "
            f"Prompt redesign recommended with new 30-day iteration cycle.")
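The `results` dictionary consumed by the framework has to be assembled from raw evaluation data. The sketch below shows one possible aggregation under stated assumptions: the record type `EncounterResult` and its fields are hypothetical, medication accuracy is counted per encounter (the criteria above do not say whether it is per encounter or per medication), and p95 latency uses the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class EncounterResult:
    """Raw evaluation data for one encounter (field names are illustrative)."""
    sections_total: int
    sections_major_or_reject: int
    all_sections_present: bool
    meds_match: bool
    latency_seconds: float


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: list[EncounterResult]) -> dict:
    """Aggregate per-encounter records into the keys the go/no-go function reads."""
    if not records:
        raise ValueError("no encounters evaluated")
    n = len(records)
    total_sections = sum(r.sections_total for r in records)
    return {
        "edit_rate": sum(r.sections_major_or_reject for r in records) / total_sections,
        "section_completeness": sum(r.all_sections_present for r in records) / n,
        "medication_accuracy": sum(r.meds_match for r in records) / n,
        "p95_latency_seconds": nearest_rank_percentile(
            [r.latency_seconds for r in records], 95
        ),
        "sample_size": n,
    }


# Hypothetical records for four encounters.
records = [
    EncounterResult(10, 2, True, True, 10.0),
    EncounterResult(10, 3, True, False, 20.0),
    EncounterResult(10, 1, False, True, 30.0),
    EncounterResult(10, 2, True, True, 40.0),
]
print(summarize(records))
```

Keeping the aggregation in one function makes the measurement definitions auditable: anyone reviewing the go/no-go result can see exactly how each number was derived from the evaluation data.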

Implementation Patterns

Production Migration Checklist

When the POC produces a Go or Conditional Go decision, the production migration begins:

python
PRODUCTION_MIGRATION_CHECKLIST = {
    "governance": [
        "BAA signed with all AI vendors (Anthropic, cloud provider)",
        "Model Review Board approval obtained (Tier 1 and 2 use cases)",
        "Prompt versions clinically validated and registered in Prompt Registry",
        "Model version pinned and registered in Model Registry",
        "Risk tier classification documented"
    ],
    "infrastructure": [
        "AI gateway deployed in production environment (HIPAA-eligible VPC)",
        "Audit logging enabled (append-only, 6-year retention, hashed patient IDs)",
        "PHI minimization enforced in production prompts",
        "TLS 1.2+ on all communication paths",
        "Rate limiting configured (protect LLM API budget and availability)"
    ],
    "integration": [
        "Epic App Orchard review submitted and approved",
        "Production FHIR credentials provisioned",
        "SMART on FHIR launch configured in Epic",
        "DocumentReference write-back tested in Epic staging",
        "HL7 ADT integration tested (if applicable)"
    ],
    "quality_assurance": [
        "Golden dataset evaluation completed (≥ 50 real encounters)",
        "Demographic bias evaluation (subgroup performance ≥ 90% of overall)",
        "Adversarial testing: prompt injection, edge cases",
        "Load testing at expected concurrent user count + 50% headroom",
        "Physician champion sign-off on production output quality"
    ],
    "operations": [
        "Observability stack deployed (structured logging, quality drift monitoring)",
        "Alert thresholds defined and tested",
        "On-call runbook written",
        "Rollback procedure defined and tested",
        "Safety event reporting channel established"
    ],
    "clinical_readiness": [
        "Physician training materials developed",
        "Clinical champion training completed",
        "Shadow mode period defined (minimum 48 hours)",
        "Canary deployment plan (5% → 25% → 100%)",
        "Feedback collection mechanism deployed"
    ]
}
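A checklist in this form can drive a simple readiness report. The sketch below assumes completion is tracked as a set of finished item strings; the function names `checklist_status` and `ready_for_production` and the shortened sample checklist are illustrative, and a real deployment would more likely track items in a ticketing system.

```python
def checklist_status(
    checklist: dict[str, list[str]], completed: set[str]
) -> dict[str, dict]:
    """Per-category counts and the list of items still open."""
    status = {}
    for category, items in checklist.items():
        open_items = [item for item in items if item not in completed]
        status[category] = {
            "done": len(items) - len(open_items),
            "total": len(items),
            "open": open_items,
        }
    return status


def ready_for_production(status: dict[str, dict]) -> bool:
    """True only when no category has open items."""
    return all(not category["open"] for category in status.values())


# Shortened sample using items from two of the categories above.
sample_checklist = {
    "governance": [
        "BAA signed with all AI vendors",
        "Model version pinned and registered in Model Registry",
    ],
    "operations": [
        "On-call runbook written",
        "Rollback procedure defined and tested",
    ],
}
completed = {
    "BAA signed with all AI vendors",
    "On-call runbook written",
    "Rollback procedure defined and tested",
}

status = checklist_status(sample_checklist, completed)
for category, info in status.items():
    print(f"{category}: {info['done']}/{info['total']} done; open: {info['open']}")
print("Ready for production:", ready_for_production(status))
```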

Enterprise Considerations

POC portfolio management: FDE organizations with multiple concurrent POCs need portfolio visibility — which POCs are at what stage, which are at risk of stalling, and which are ready for production planning. A POC tracking system (even a simple spreadsheet) that captures POC stage, success criteria status, blocking issues, and estimated production date is essential.
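As a rough illustration of what such a tracker might capture, the sketch below models one record per POC and flags the ones at risk of stalling. Everything here is an assumption made for the example: the `POCRecord` fields, the stage names, and the 14-day stall threshold are hypothetical and should be replaced with whatever the organization actually tracks.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class POCRecord:
    """One row in a POC portfolio tracker (fields are illustrative)."""
    name: str
    stage: str  # e.g. "scoping", "execution", "evaluation", "production planning"
    criteria_signed_off: bool
    last_progress: date
    blocking_issues: list[str] = field(default_factory=list)
    estimated_production_date: Optional[date] = None


def at_risk(
    records: list[POCRecord], today: date, stall_days: int = 14
) -> dict[str, list[str]]:
    """Map each at-risk POC name to the reasons it was flagged."""
    flagged = {}
    for record in records:
        reasons = []
        if not record.criteria_signed_off and record.stage != "scoping":
            reasons.append("past scoping without signed success criteria")
        if record.blocking_issues:
            reasons.append(f"{len(record.blocking_issues)} blocking issue(s)")
        if (today - record.last_progress).days > stall_days:
            reasons.append(f"no progress for more than {stall_days} days")
        if reasons:
            flagged[record.name] = reasons
    return flagged


# Hypothetical portfolio with one healthy and one stalled POC.
portfolio = [
    POCRecord("Discharge Summary AI", "evaluation", True, date(2025, 3, 10)),
    POCRecord("Prior Auth Drafting", "execution", False, date(2025, 2, 1),
              blocking_issues=["FHIR sandbox credentials pending"]),
]
for name, reasons in at_risk(portfolio, today=date(2025, 3, 14)).items():
    print(f"{name}: {'; '.join(reasons)}")
```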

POC cost transparency: POCs consume significant resources: FDE time, client engineering time, API costs, and organizational attention. The cost of a POC must be weighed against the expected value of the production deployment. POCs for use cases with uncertain production ROI should be designed to be shorter and cheaper — enough to validate the hypothesis before committing full resources.

Avoiding the perpetual POC: Some organizations run perpetual POCs — endless iterations that never produce a production decision. This is usually a symptom of unclear success criteria or organizational risk aversion. The go/no-go framework with a defined timeline prevents this pattern.

Healthcare Example

Educational Example — Illustrative POC. Not intended as clinical guidance.

A Reference Healthcare Organization discharge summary AI POC produces the following results at week 4:

  • Physician edit rate: 24% (below the 30% threshold — primary criterion met)
  • Section completeness: 97% (secondary criterion met)
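  • Medication accuracy: 88% (below the 90% threshold, so this secondary criterion is not met; above the 80% no-go floor)
  • P95 latency: 22 seconds (secondary criterion met)

Decision: GO (primary criterion met and 2 of 3 secondary criteria met)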

Production planning begins with a 12-week migration plan: App Orchard review (running in parallel, 8 weeks), AI gateway production deployment (week 1), BAA finalization (weeks 2–3), MRB approval (week 4), shadow mode pilot (weeks 5–6), canary to 10% of hospitalists (weeks 7–9), full deployment (week 12).

Common Mistakes

1. Starting POC execution before success criteria are signed off. When POC results are mixed, unsigned success criteria become negotiation fodder. Get written sign-off before the first line of code.

2. POC scope that does not represent production constraints. A POC against synthetic data in a local environment with a single user has told you almost nothing about production viability. At minimum, use production-representative data and test under concurrent load.

3. Not involving the client's engineering team in POC execution. An FDE who builds the entire POC alone creates a system the client cannot maintain. POC execution must include the client engineers who will own the system in production.

4. Missing the production gap for Epic App Orchard. App Orchard review is a 6–12 week process. POC designs that assume immediate Epic production access will create a production delay that was entirely foreseeable in week 1.

5. No-Go without root cause analysis. A No-Go POC is valuable information, not a failure. The output should include a root cause analysis: was the data quality insufficient? Was the use case mismatched to the AI capability? Was the prompt under-engineered? A structured No-Go enables the next iteration.

Best Practices

  • Define the POC hypothesis before defining the scope
  • Get written sign-off on success criteria before starting execution
  • Map the production gap in week 1; don't discover it in week 8
  • Involve client engineering in POC execution from day 1
  • Include clinical evaluation in every healthcare AI POC — not just technical metrics
  • Define a clear Go/No-Go decision process with a date on the calendar before execution begins
  • Begin App Orchard submission immediately on POC start — it runs in parallel, not after

Trade-offs

Speed vs. rigor: A 2-week POC is faster but produces less evidence about production viability. A 6-week POC produces stronger evidence but delays the production decision. The right balance depends on the risk of a production failure vs. the cost of a longer POC.

Representation vs. access: Using production data in a POC produces the most representative results but requires BAA, PHI controls, and access provisioning. Using synthetic data avoids these requirements but produces weaker evidence. For healthcare AI, production-representative de-identified data in an Epic sandbox is the practical middle ground.

Interview Questions

Q: A POC produces mixed results — the primary success criterion is met but one secondary criterion is not. How do you structure the go/no-go conversation with the client?

Category: Behavioral | Difficulty: Principal | Role: FDE

Answer Framework:

A Conditional Go is a legitimate POC outcome — it means the core capability is validated but a production risk exists that requires a mitigation plan before launch. The go/no-go conversation should be structured, not improvisational.

First, present the results factually against the success criteria that were signed off before the POC. The signed criteria are the agreed evaluation framework; departure from them requires explicit justification. If the primary criterion is met, that is the most important finding.

Second, analyze the failed secondary criterion: is it a fundamental limitation (the AI cannot consistently achieve this metric), a data quality issue (the metric failed because of specific data gaps, not the AI capability), or an engineering gap (latency is too high because the demo gateway was under-provisioned, not because the AI is inherently slow)? Each has a different mitigation.

Third, propose a concrete mitigation plan for the gap — with a specific owner, timeline, and re-measurement mechanism. The production launch is conditioned on the mitigation being completed and validated.

The go/no-go decision is the client's, not the FDE's. The FDE presents the evidence and the recommendation; the client decides.

Key Points to Hit:

  • Present results against the signed success criteria — no surprises
  • Categorize the failure (fundamental / data quality / engineering)
  • Propose a specific mitigation plan with owner and timeline
  • Decision is the client's; recommendation is the FDE's

Red Flags:

  • Redefining success criteria post-POC to match results
  • Recommending GO without a mitigation plan for the secondary gap

Key Takeaways

  • A POC tests a hypothesis — make the hypothesis explicit before designing the scope
  • Written, signed success criteria before execution are non-negotiable
  • The production-POC gap must be mapped before execution begins, not after
  • Seven production gap dimensions: data environment, concurrency, model governance, prompt management, error handling, observability, and clinical workflow integration
  • A Go/No-Go decision framework with defined categories prevents ambiguous outcomes
  • App Orchard review takes 6–12 weeks — submit in parallel with POC execution
  • POC execution must include the client engineering team who will own the system in production

Glossary

POC Hypothesis: An explicit statement of what the POC will test, against what data, in what environment, measured by what criteria.

Production Gap: The set of constraints and requirements that differ between the POC environment and the production environment.

Conditional Go: A POC outcome where the primary criterion is met but secondary gaps require mitigation plans before production launch.

App Orchard: Epic's third-party application review program. Required before a SMART on FHIR application can access production Epic environments.

Further Reading