POC to Production

Conceptual Explanation

A well-designed POC has three properties that are often in tension:

Feasible: Can be executed with the available time, data, and people. A 6-week POC scope for a 4-week engagement is not feasible.

Representative: Produces evidence about the conditions that matter for production success. A POC against synthetic data is feasible but not representative of production data quality.

Decision-enabling: Produces a clear signal that allows a go/no-go decision at the end. A POC without defined success criteria cannot produce a clear go/no-go decision.

The POC design process is the discipline of finding the intersection of these three properties for a specific use case and client environment.

Core Architecture: The POC Design Process

Step 1 — Define the POC Hypothesis

Every POC tests a hypothesis. Making the hypothesis explicit is the first design step:

text
POC Hypothesis Template:

"We hypothesize that [AI system] can [capability claim] when operating on
 [data source] in [client environment], as measured by [success criteria],
 within a [time / resource] constraint."

Example:
"We hypothesize that discharge summary AI can generate clinically complete
 draft discharge summaries from FHIR R4 clinical data for inpatient pneumonia
 encounters at the Reference Healthcare Organization, achieving a physician
 edit rate below 30% and section completeness above 95% as evaluated by two
 hospitalist physicians, within a 4-week execution window using 50 randomly
 selected inpatient encounters from the Epic FHIR sandbox."

A hypothesis that cannot be stated in this form is not POC-ready. If the capability claim is vague ("AI generates good discharge summaries"), the success criteria will be unresolvable. If the data source is undefined ("our patient data"), the integration scope is unknown. Make the hypothesis concrete before writing a line of code.
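
One way to enforce the template is to treat each bracketed slot as a required field. The sketch below is illustrative only: the `POCHypothesis` class and its field names are assumptions, not part of any existing tooling, and it catches empty slots but cannot judge whether a filled slot is too vague.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class POCHypothesis:
    """Hypothetical container for the template slots; field names are illustrative."""
    ai_system: str
    capability_claim: str
    data_source: str
    client_environment: str
    success_criteria: str
    constraint: str

    def missing_slots(self) -> list[str]:
        """Return the names of slots that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Fill the template, refusing to render an incomplete hypothesis."""
        missing = self.missing_slots()
        if missing:
            raise ValueError(f"Hypothesis is not POC-ready; missing: {', '.join(missing)}")
        return (
            f"We hypothesize that {self.ai_system} can {self.capability_claim} "
            f"when operating on {self.data_source} in {self.client_environment}, "
            f"as measured by {self.success_criteria}, within a {self.constraint} constraint."
        )


hypothesis = POCHypothesis(
    ai_system="discharge summary AI",
    capability_claim="generate clinically complete draft discharge summaries",
    data_source="FHIR R4 clinical data for inpatient pneumonia encounters",
    client_environment="the Reference Healthcare Organization",
    success_criteria="physician edit rate below 30% and section completeness above 95%",
    constraint="4-week",
)
print(hypothesis.render())

# A vague hypothesis with an undefined data source fails the readiness check.
vague = POCHypothesis("discharge summary AI", "generate good summaries", "", "", "", "")
print(vague.missing_slots())
```

Reviewing the rendered sentence with the client is still necessary; the check only guarantees that no slot was skipped.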

Step 2 — Define Success Criteria in Writing

Success criteria must be defined in writing and signed off by the client before POC execution begins. Verbal success criteria tend to be renegotiated once the POC produces results.

python
from dataclasses import dataclass
from typing import Optional

@dataclass
class POCSuccessCriteria:
    """
    Written success criteria for an AI POC.
    Must be signed off by FDE and client stakeholder before execution.
    
    Educational example — not for clinical use.
    """
    primary_metric: str
    primary_threshold: float
    primary_measurement_method: str
    
    secondary_metrics: list[dict]  # List of {metric, threshold, method}
    
    # POC scope definition
    data_source: str
    sample_size: int
    evaluation_period_days: int
    evaluators: list[str]  # Who evaluates: roles, not names
    
    # Go/no-go decision
    go_criteria: str       # "All primary and ≥ 2 of 3 secondary criteria met"
    no_go_criteria: str    # "Primary criterion missed by > 10 percentage points"
    inconclusive_criteria: str  # "Primary met but secondary data insufficient"
    
    # Approvals
    fde_sign_off: Optional[str] = None
    client_technical_sign_off: Optional[str] = None
    client_clinical_sign_off: Optional[str] = None


# Example: Discharge Summary AI POC success criteria
discharge_summary_poc_criteria = POCSuccessCriteria(
    primary_metric="Physician edit rate",
    primary_threshold=0.30,  # Less than 30% of sections require substantial edits
    primary_measurement_method=(
        "Two hospitalist physicians independently review 25 AI-generated drafts each. "
        "For each draft, they rate each section as: Accept as-is / Minor edit / Major edit / Reject. "
        "Edit rate = (Major edit + Reject) / total sections evaluated."
    ),
    secondary_metrics=[
        {
            "metric": "Section completeness rate",
            "threshold": 0.95,
            "method": "AI-generated draft contains all required sections per HMS template ≥ 95% of encounters"
        },
        {
            "metric": "Medication accuracy",
            "threshold": 0.90,
            "method": "Discharge medications in AI draft match pharmacy-verified discharge med list ≥ 90%"
        },
        {
            "metric": "Generation latency",
            "threshold": 30.0,
            "method": "Draft generation completes within 30 seconds for ≥ 95% of encounters (p95 latency)"
        }
    ],
    data_source="Epic FHIR R4 sandbox (production-representative de-identified data)",
    sample_size=50,
    evaluation_period_days=14,
    evaluators=["Two hospitalist physicians", "One clinical informatics nurse"],
    go_criteria=(
        "Primary criterion (edit rate < 30%) met AND ≥ 2 of 3 secondary criteria met"
    ),
    no_go_criteria=(
        "Primary criterion missed by ≥ 10 percentage points (edit rate ≥ 40%) "
        "OR medication accuracy < 80%"
    ),
    inconclusive_criteria=(
        "Primary criterion met but sample size insufficient for statistical confidence "
        "(evaluator bandwidth limitation — extend evaluation with larger sample)"
    )
)
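
The measurement method and decision rules above can be turned into a small scoring helper. This is a minimal sketch under stated assumptions: the function names are hypothetical, "sample size insufficient" is approximated as fewer drafts reviewed than planned, and the `NEEDS DISCUSSION` outcome is an addition that covers results the written criteria do not address (for example, an edit rate between 30% and 40%).

```python
from collections import Counter

# Rating labels follow the measurement method above; helper names are illustrative.
VALID_RATINGS = {"Accept as-is", "Minor edit", "Major edit", "Reject"}
EDIT_RATINGS = {"Major edit", "Reject"}


def edit_rate(section_ratings: list[str]) -> float:
    """Edit rate = (Major edit + Reject) / total sections evaluated."""
    if not section_ratings:
        raise ValueError("No section ratings to evaluate")
    unknown = set(section_ratings) - VALID_RATINGS
    if unknown:
        raise ValueError(f"Unknown ratings: {sorted(unknown)}")
    counts = Counter(section_ratings)
    return sum(counts[r] for r in EDIT_RATINGS) / len(section_ratings)


def decide(rate: float, secondary_passed: int, medication_accuracy: float,
           drafts_reviewed: int, drafts_planned: int = 50) -> str:
    """Apply the example no-go, inconclusive, and go rules in a fixed order."""
    if rate >= 0.40 or medication_accuracy < 0.80:
        return "NO-GO"
    if rate < 0.30 and drafts_reviewed < drafts_planned:
        return "INCONCLUSIVE"  # assumption: fewer drafts than planned means too little data
    if rate < 0.30 and secondary_passed >= 2:
        return "GO"
    return "NEEDS DISCUSSION"  # outcome not covered by the written criteria


# Ten rated sections from a hypothetical review: two need major edits or rejection.
ratings = ["Accept as-is"] * 6 + ["Minor edit"] * 2 + ["Major edit", "Reject"]
rate = edit_rate(ratings)
print(decide(rate, secondary_passed=3, medication_accuracy=0.93, drafts_reviewed=50))
```

Writing the rules as code tends to expose gaps in the criteria themselves. Here, any result that lands between the go and no-go thresholds has no agreed outcome, which is worth resolving with the client before sign-off.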

Step 3 — POC-to-Production Gap Analysis

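The POC operates in a simplified environment: de-identified sandbox data, a single user, and errors visible only to the FDE. Production operates under constraints the POC never encounters, such as real PHI, concurrent clinical users, and governance controls. The gap between these two environments is the primary source of POC-to-production failures.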

Map the gap explicitly before POC execution:

python
PRODUCTION_GAP_ANALYSIS = {
    "data_environment": {
        "poc": "Epic FHIR sandbox (de-identified data, representative schema)",
        "production": "Epic FHIR production (real PHI, live encounter data)",
        "gap": "PHI handling, BAA requirement, Minimum Necessary scope enforcement",
        "migration_requirement": "BAA signed; audit logging enabled; PHI minimization in prompts"
    },
    "concurrency": {
        "poc": "Single-user, sequential requests",
        "production": "50+ concurrent clinical users; burst at shift change",
        "gap": "LLM API rate limits; AI gateway queuing; response time under load",
        "migration_requirement": "Load test at 50 concurrent requests; queue depth monitoring"
    },
    "model_governance": {
        "poc": "Latest available model version (for best quality)",
        "production": "Approved model version per Model Review Board; change control for updates",
        "gap": "Version pinning; evaluation before model updates; PCCP if SaMD-adjacent",
        "migration_requirement": "Model registry entry; MRB approval; re-evaluation on update"
    },
    "prompt_management": {
        "poc": "Prompt in code (iterating rapidly)",
        "production": "Versioned prompt registry; clinical validation per version",
        "gap": "Prompt versioning; rollback capability; evaluation on prompt change",
        "migration_requirement": "Prompt Registry deployment; clinical sign-off process per version"
    },
    "error_handling": {
        "poc": "Errors visible to FDE; exceptions bubble up",
        "production": "Errors must not disrupt clinical workflow; graceful degradation required",
        "gap": "Exception handling; fallback behavior; alert thresholds",
        "migration_requirement": "Production exception handling; fallback to manual workflow"
    },
    "observability": {
        "poc": "FDE monitors logs directly; ad hoc investigation",
        "production": "Automated quality drift monitoring; alert on threshold breach",
        "gap": "Structured logging; evaluation pipeline; alert routing",
        "migration_requirement": "Observability stack; baseline metrics; on-call runbook"
    },
    "clinical_workflow_integration": {
        "poc": "Standalone application (not embedded in Epic)",
        "production": "Embedded SMART on FHIR application (App Orchard approved)",
        "gap": "Epic App Orchard review (6–12 weeks); SMART launch integration",
        "migration_requirement": "App Orchard submission; SMART launch parameters; Epic testing"
    }
}
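
A gap analysis in this shape can be checked for completeness and flattened into a migration checklist. The sketch below assumes the four-key structure used above; `missing_keys` and `migration_checklist` are hypothetical helper names, and the excerpt repeats two entries so the example runs on its own.

```python
REQUIRED_KEYS = ("poc", "production", "gap", "migration_requirement")

# Two entries copied from the analysis above so the sketch is self-contained.
gap_analysis_excerpt = {
    "concurrency": {
        "poc": "Single-user, sequential requests",
        "production": "50+ concurrent clinical users; burst at shift change",
        "gap": "LLM API rate limits; AI gateway queuing; response time under load",
        "migration_requirement": "Load test at 50 concurrent requests; queue depth monitoring",
    },
    "error_handling": {
        "poc": "Errors visible to FDE; exceptions bubble up",
        "production": "Errors must not disrupt clinical workflow; graceful degradation required",
        "gap": "Exception handling; fallback behavior; alert thresholds",
        "migration_requirement": "Production exception handling; fallback to manual workflow",
    },
}


def missing_keys(analysis: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each incomplete dimension to the keys that are absent or blank."""
    problems = {}
    for dimension, entry in analysis.items():
        missing = [key for key in REQUIRED_KEYS if not entry.get(key, "").strip()]
        if missing:
            problems[dimension] = missing
    return problems


def migration_checklist(analysis: dict[str, dict[str, str]]) -> list[str]:
    """Split each semicolon-separated migration requirement into a checklist line."""
    items = []
    for dimension, entry in analysis.items():
        for requirement in entry.get("migration_requirement", "").split(";"):
            if requirement.strip():
                items.append(f"[ ] {dimension}: {requirement.strip()}")
    return items


print(missing_keys(gap_analysis_excerpt))
for line in migration_checklist(gap_analysis_excerpt):
    print(line)
```

Running the completeness check before execution starts makes an unmapped dimension visible while there is still time to plan for it.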

Step 4 — POC Execution Plan

markdown
# POC Execution Plan — Discharge Summary AI
# Duration: 4 weeks
# FDE + Client Engineering Team

Architecture Diagram

```mermaid
graph TD
    HYPO["Define POC Hypothesis\n(Capability + Data + Criteria + Constraint)"]
    CRIT["Define Success Criteria\n(Written sign-off before execution)"]
    GAP["POC-to-Production Gap Analysis\n(8 dimensions)"]
    PLAN["POC Execution Plan\n(4-week schedule + roles)"]

    subgraph "POC Execution"
        W1["Week 1: Environment + Integration"]
        W2["Week 2: Quality Iteration"]
        W3["Week 3: Clinical Evaluation"]
        W4["Week 4: Analysis + Planning"]
    end

    subgraph "Go/No-Go Decision"
        GO["GO — Proceed to Production Planning"]
        CGO["CONDITIONAL GO — Mitigation Plan"]
        EXT["EXTEND — More Data Needed"]
        NGO["NO-GO — Root Cause Analysis"]
        RDS["REDESIGN — New Iteration Cycle"]
    end

    PP["Production Planning\nMigration + Architecture"]
    LAUNCH["Production Launch"]

    HYPO --> CRIT --> GAP --> PLAN
    PLAN --> W1 --> W2 --> W3 --> W4
    W4 --> GO & CGO & EXT & NGO & RDS
    GO --> PP
    CGO --> PP
    EXT --> W3
    NGO --> HYPO
    RDS --> HYPO
    PP --> LAUNCH
```

Common Mistakes

1. Starting POC execution before success criteria are signed off. When POC results are mixed, unsigned success criteria become negotiation fodder. Get written sign-off before the first line of code.

2. POC scope that does not represent production constraints. A POC against synthetic data in a local environment with a single user tells you almost nothing about production viability. At minimum, use production-representative data and test under concurrent load.

3. Not involving the client's engineering team in POC execution. An FDE who builds the entire POC alone creates a system the client cannot maintain. POC execution must include the client engineers who will own the system in production.

4. Missing the production gap for Epic App Orchard. App Orchard review is a 6–12 week process. POC designs that assume immediate Epic production access will create a production delay that was entirely foreseeable in week 1.

5. No-Go without root cause analysis. A No-Go POC is valuable information, not a failure. The output should include a root cause analysis: was the data quality insufficient? Was the use case mismatched to the AI capability? Was the prompt under-engineered? A structured No-Go enables the next iteration.
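
For mistake 2, a concurrent load check does not need heavy tooling during the POC. The following is a rough sketch using `asyncio`: `fake_generate_draft` is a stand-in that only sleeps, so it must be replaced with the real draft-generation call before the numbers mean anything, and the nearest-rank percentile is one of several reasonable p95 definitions.

```python
import asyncio
import time


async def fake_generate_draft(encounter_id: int) -> str:
    """Stand-in for the real draft-generation call; replace with the actual client."""
    await asyncio.sleep(0.01)  # simulated service time
    return f"draft-{encounter_id}"


def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile, computed with integer arithmetic."""
    if not latencies:
        raise ValueError("No latencies recorded")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


async def run_load_test(concurrency: int, total_requests: int) -> list[float]:
    """Issue total_requests calls with at most `concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def timed_call(encounter_id: int) -> float:
        async with semaphore:
            start = time.perf_counter()
            await fake_generate_draft(encounter_id)
            return time.perf_counter() - start

    return await asyncio.gather(*(timed_call(i) for i in range(total_requests)))


latencies = asyncio.run(run_load_test(concurrency=50, total_requests=100))
print(f"p95 latency: {p95(latencies):.3f}s (example threshold: 30s)")
```

The timer starts after the semaphore is acquired, so it measures service time only. To include queueing delay, which is what clinical users would actually experience at shift change, start the timer before `async with semaphore`.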

Best Practices

  • Define the POC hypothesis before defining the scope
  • Get written sign-off on success criteria before starting execution
  • Map the production gap before execution begins, or in week 1 at the latest; don't discover it in week 8
  • Involve client engineering in POC execution from day 1
  • Include clinical evaluation in every healthcare AI POC — not just technical metrics
  • Define a clear Go/No-Go decision process with a date on the calendar before execution begins
  • Begin App Orchard submission immediately on POC start — it runs in parallel, not after
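
Putting the go/no-go date on the calendar can be as simple as deriving it from the kickoff date. This sketch assumes four consecutive one-week phases named after the weeks in the architecture diagram, with the decision meeting on the Friday of the final week; the kickoff date and both function names are hypothetical.

```python
from datetime import date, timedelta

# Phase names follow the four POC execution weeks in the architecture diagram.
PHASES = [
    "Environment + Integration",
    "Quality Iteration",
    "Clinical Evaluation",
    "Analysis + Planning",
]


def build_schedule(start: date) -> list[tuple[str, date, date]]:
    """Return (phase, first_day, last_working_day) for consecutive one-week phases."""
    schedule = []
    for index, phase in enumerate(PHASES):
        first_day = start + timedelta(weeks=index)
        schedule.append((phase, first_day, first_day + timedelta(days=4)))
    return schedule


def decision_date(start: date) -> date:
    """Go/no-go meeting on the last working day of the final phase (assumed convention)."""
    return build_schedule(start)[-1][2]


kickoff = date(2025, 3, 3)  # hypothetical kickoff date, a Monday
for phase, first_day, last_day in build_schedule(kickoff):
    print(f"{first_day} to {last_day}: {phase}")
print(f"Go/no-go decision: {decision_date(kickoff)}")
```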

Trade-offs

Speed vs. rigor: A 2-week POC is faster but produces less evidence about production viability. A 6-week POC produces stronger evidence but delays the production decision. The right balance depends on the risk of a production failure vs. the cost of a longer POC.
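
The evidence side of this trade-off can be made concrete with a confidence interval on the edit rate. The sketch below uses the Wilson score interval, a standard choice for proportions; it is an illustration rather than the evaluation method prescribed here, and the section counts are hypothetical. It also treats sections as independent observations, which overstates confidence when sections are clustered within drafts and evaluators.

```python
import math


def wilson_interval(events: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= events <= total:
        raise ValueError("Need total > 0 and 0 <= events <= total")
    p_hat = events / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    ) / denom
    return center - half_width, center + half_width


# The same observed edit rate (25%) at two hypothetical sample sizes.
for sections in (100, 400):
    low, high = wilson_interval(events=sections // 4, total=sections)
    verdict = "clearly below 30%" if high < 0.30 else "cannot rule out 30% or more"
    print(f"{sections} sections: interval {low:.3f} to {high:.3f} ({verdict})")
```

With the smaller sample, an observed 25% edit rate is still compatible with a true rate above the 30% threshold; the larger sample narrows the interval enough to support a decision. That difference is what the extra weeks of evaluation buy.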

Representation vs. access: Using production data in a POC produces the most representative results but requires BAA, PHI controls, and access provisioning. Using synthetic data avoids these requirements but produces weaker evidence. For healthcare AI, production-representative de-identified data in an Epic sandbox is the practical middle ground.

Interview Questions

Q: A POC produces mixed results — the primary success criterion is met but one secondary criterion is not. How do you structure the go/no-go conversation with the client?

Category: Behavioral | Difficulty: Principal | Role: FDE

Answer Framework:

A Conditional Go is a legitimate POC outcome — it means the core capability is validated but a production risk exists that requires a mitigation plan before launch. The go/no-go conversation should be structured, not improvisational.

First, present the results factually against the success criteria that were signed off before the POC. The signed criteria are the agreed evaluation framework; departure from them requires explicit justification. If the primary criterion is met, that is the most important finding.

Second, analyze the failed secondary criterion: is it a fundamental limitation (the AI cannot consistently achieve this metric), a data quality issue (the metric failed because of specific data gaps, not the AI capability), or an engineering gap (latency is too high because the demo gateway was under-provisioned, not because the AI is inherently slow)? Each has a different mitigation.

Third, propose a concrete mitigation plan for the gap — with a specific owner, timeline, and re-measurement mechanism. The production launch is conditioned on the mitigation being completed and validated.

The go/no-go decision is the client's, not the FDE's. The FDE presents the evidence and the recommendation; the client decides.
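
The framework above maps onto a small data structure: one mitigation record per failed secondary criterion, and a recommendation that stays short of a Conditional Go until every gap has an owner, a due date, and a re-measurement mechanism. This is a sketch only; the class names, the `recommend` function, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class FailureCategory(Enum):
    FUNDAMENTAL = "fundamental limitation"
    DATA_QUALITY = "data quality issue"
    ENGINEERING = "engineering gap"


@dataclass
class MitigationPlan:
    """Illustrative record for one failed secondary criterion."""
    criterion: str
    category: FailureCategory
    owner: str          # a role, not a name
    due: date
    remeasurement: str  # how the criterion will be re-measured

    def is_complete(self) -> bool:
        """A plan needs a criterion, an owner, and a re-measurement mechanism."""
        return all(s.strip() for s in (self.criterion, self.owner, self.remeasurement))


def recommend(primary_met: bool, failed_secondary: list[str],
              plans: list[MitigationPlan]) -> str:
    """FDE recommendation only; the decision remains the client's."""
    if not primary_met:
        return "PRIMARY NOT MET: apply the signed no-go criteria"
    if not failed_secondary:
        return "GO"
    covered = {plan.criterion for plan in plans if plan.is_complete()}
    uncovered = [c for c in failed_secondary if c not in covered]
    if uncovered:
        return f"NOT READY: no mitigation plan for {', '.join(uncovered)}"
    return "CONDITIONAL GO"


latency_plan = MitigationPlan(
    criterion="Generation latency",
    category=FailureCategory.ENGINEERING,
    owner="Client platform engineer",
    due=date(2025, 4, 18),  # hypothetical date
    remeasurement="Re-run the p95 latency test on the production-sized gateway",
)
print(recommend(True, ["Generation latency"], [latency_plan]))
```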

Key Points to Hit:

  • Present results against the signed success criteria — no surprises
  • Categorize the failure (fundamental / data quality / engineering)
  • Propose a specific mitigation plan with owner and timeline
  • Decision is the client's; recommendation is the FDE's

Red Flags:

  • Redefining success criteria post-POC to match results
  • Recommending GO without a mitigation plan for the secondary gap

Key Takeaways

  • A POC tests a hypothesis — make the hypothesis explicit before designing the scope
  • Written, signed success criteria before execution are non-negotiable
  • The production-POC gap must be mapped before execution begins, not after
  • Eight production gap dimensions: data environment, concurrency, model governance, prompt management, error handling, observability, clinical workflow integration, and security (the Step 3 example maps the first seven)
  • A Go/No-Go decision framework with defined categories prevents ambiguous outcomes
  • App Orchard review takes 6–12 weeks — submit in parallel with POC execution
  • POC execution must include the client engineering team who will own the system in production