Human-in-the-Loop (HITL) Design

Executive Summary

Human-in-the-loop (HITL) patterns are the architectural mechanisms by which agentic systems pause execution, surface their current state to a human operator, receive input or approval, and resume — preserving the auditability and oversight that regulated industries and high-stakes decisions require. HITL is not an afterthought; in enterprise agentic systems, it is a first-class architectural requirement that must be designed into the state machine, not bolted on after deployment. This chapter covers HITL taxonomy, the interrupt-resume pattern, approval workflow design, LangGraph implementation, and the governance framework for deciding when autonomous execution is safe. AI architects, system designers, and engineers building agentic systems for regulated industries should read this chapter.

Learning Objectives

Identify the four categories of HITL triggers and when to apply each
Implement the interrupt-resume pattern in a LangGraph state machine
Design an approval workflow schema for persistence across HITL pauses
Define governance criteria for determining when agent autonomy is safe
Explain the trust degradation and oversight erosion risks in HITL systems

Business Problem

An agentic system that executes a clinical prior authorization workflow autonomously — retrieving patient data, evaluating criteria, and submitting a determination to a payer — provides operational efficiency but creates an unacceptable oversight gap. Clinical determinations that affect patient care require physician review before submission. Regulatory and liability frameworks require documented human authorization for consequential decisions.

The engineering challenge is not whether to require human oversight in these contexts — that is a governance decision made long before code is written. The challenge is how to implement oversight reliably: the system must pause at the right point, persist its state durably (the physician may not review immediately), surface exactly the information the reviewer needs, and resume seamlessly after approval without re-executing completed steps.

Why This Technology Exists

The earliest LLM-based agent systems executed autonomously from trigger to completion. As agents acquired tools with real-world side effects (sending messages, modifying records, initiating transactions), the consequences of agent errors escalated. HITL mechanisms emerged as the answer to a core question: at which points in an agent workflow is human judgment required, and how do we integrate that judgment without losing the efficiency gains of automation?

HITL is also the architectural foundation of responsible AI deployment in regulated industries. HIPAA, FDA SaMD guidelines, and CMS conditions of participation all establish accountability requirements that cannot be satisfied by autonomous AI systems without documented human authorization at critical decision points.

Conceptual Explanation

HITL Trigger Categories

Confidence-based triggers: The agent's uncertainty exceeds a defined threshold. Relevant when agents produce confidence scores (via log-probabilities, self-assessment, or explicit uncertainty prompts) and low-confidence outputs warrant human review.

Risk-based triggers: The action being taken has side effects above a defined risk level. Sending a prior authorization determination is higher risk than reading patient data. All Write/Delete/External tool calls are candidates for risk-based HITL.

Policy-based triggers: Business rules require human authorization regardless of agent confidence. Certain drug combinations requiring pharmacist review, procedures above a cost threshold requiring utilization management review, or any determination affecting patient safety.

Anomaly-based triggers: The agent detects something unexpected in its reasoning path — conflicting guidelines, missing data, a patient profile that falls outside the training distribution of the evaluation criteria — and escalates proactively.
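
The four categories can be combined into one escalation check. The sketch below is illustrative only: `ProposedAction`, the thresholds, and the use of estimated cost as the policy rule are assumptions made for this example, not values from a real policy. It shows triggers evaluated independently and combined with OR logic, so any single trigger escalates.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProposedAction:
    """Hypothetical description of what the agent is about to do."""
    tool_name: str
    side_effect: str                      # "read" | "write" | "delete" | "external"
    confidence: Optional[float] = None    # 0.0-1.0, if the agent reports one
    estimated_cost: float = 0.0
    anomaly_flags: list = field(default_factory=list)


# Illustrative thresholds; real values come from governance policy.
CONFIDENCE_FLOOR = 0.8
COST_CEILING = 5_000.0
RISKY_SIDE_EFFECTS = {"write", "delete", "external"}


def fired_triggers(action: ProposedAction) -> list:
    """Return the name of every trigger category that fires for this action."""
    fired = []
    if action.confidence is not None and action.confidence < CONFIDENCE_FLOOR:
        fired.append("confidence")
    if action.side_effect in RISKY_SIDE_EFFECTS:
        fired.append("risk")
    if action.estimated_cost > COST_CEILING:
        fired.append("policy")
    if action.anomaly_flags:
        fired.append("anomaly")
    return fired


def requires_human_review(action: ProposedAction) -> bool:
    """OR logic: any single trigger is enough to escalate."""
    return bool(fired_triggers(action))
```

Returning the list of fired triggers, rather than a bare boolean, lets the routing layer and the audit trail record why a review was requested.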

The Interrupt-Resume Pattern

HITL requires the system to:

1. Interrupt: suspend execution at a defined node in the workflow
2. Persist: save the complete workflow state durably (database, not memory)
3. Notify: alert the appropriate reviewer(s) with the context they need
4. Await: accept asynchronous human input (approval, modification, rejection, escalation)
5. Resume: inject the human decision into the workflow state and continue execution from the interrupt point

The critical property is that state is persisted durably between steps 2 and 5. The reviewer may act seconds or days later. The system must be resumable regardless.
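
The five steps can be sketched with nothing but the standard library. This is a minimal illustration, not the LangGraph mechanism: the `workflows` table, the function names, and the `notify` callback are all assumptions, and SQLite stands in for the production state store. The property it demonstrates is ordering: the notification is sent only after the state has been committed.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the state store. Use a file path in practice; ':memory:' is for demos only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflows ("
        "workflow_id TEXT PRIMARY KEY, state TEXT NOT NULL, "
        "status TEXT NOT NULL, decision TEXT)"
    )
    return conn


def interrupt(conn, workflow_id: str, state: dict, notify) -> None:
    """Interrupt, persist, notify. The notification runs only after the commit succeeds."""
    with conn:  # commits on success, rolls back and re-raises on error
        conn.execute(
            "INSERT INTO workflows (workflow_id, state, status) "
            "VALUES (?, ?, 'awaiting_review')",
            (workflow_id, json.dumps(state)),
        )
    notify(workflow_id)


def record_decision(conn, workflow_id: str, decision: dict) -> None:
    """Await: the reviewer's asynchronous input lands in the same store."""
    with conn:
        cur = conn.execute(
            "UPDATE workflows SET decision = ?, status = 'decided' "
            "WHERE workflow_id = ? AND status = 'awaiting_review'",
            (json.dumps(decision), workflow_id),
        )
    if cur.rowcount != 1:
        raise LookupError(f"No workflow awaiting review: {workflow_id}")


def resume(conn, workflow_id: str) -> dict:
    """Resume: reload the persisted state and inject the human decision."""
    row = conn.execute(
        "SELECT state, decision FROM workflows "
        "WHERE workflow_id = ? AND status = 'decided'",
        (workflow_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"No decided workflow: {workflow_id}")
    state = json.loads(row[0])
    state["human_decision"] = json.loads(row[1])
    return state
```

Because `resume` reads everything from the store, it can run in a different process from the one that called `interrupt`, which is what makes the pause survive restarts.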

Core Architecture

sequenceDiagram
    participant Orchestrator as Agent Orchestrator
    participant StateStore as State Store (PostgreSQL)
    participant Notifier as Notification Service
    participant Reviewer as Clinical Reviewer
    participant UI as Review Interface
    Orchestrator->>StateStore: Persist state at interrupt point
    Orchestrator->>Notifier: POST /notify {workflow_id, reviewer_id, context_summary}
    Notifier->>Reviewer: Email / Push notification
    Reviewer->>UI: Open review interface
    UI->>StateStore: GET /workflow/{id}/review_context
    StateStore-->>UI: Return current state + agent reasoning
    Reviewer->>UI: Approve / Modify / Reject
    UI->>StateStore: PATCH /workflow/{id}/decision {decision, reviewer_id, timestamp}
    StateStore->>Orchestrator: Webhook: workflow {id} has decision
    Orchestrator->>StateStore: Load persisted state
    Orchestrator->>Orchestrator: Resume from interrupt point with decision injected

Components

Interrupt Design in LangGraph

LangGraph implements HITL via interrupt_before (pause before a node executes) or interrupt_after (pause after a node executes). The interrupt_before pattern is preferable for approval workflows: the agent prepares its recommendation, the HITL pause happens before any external action is taken, and the reviewer approves or modifies the recommendation before submission.

python

# From 05-langgraph-deep-dive.md — the HITL configuration
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

graph_builder = StateGraph(PriorAuthState)
# ... add nodes ...

# interrupt_before="physician_review" means:
# after draft_determination executes, execution pauses BEFORE physician_review
# the graph saves state to the checkpointer and returns control to the caller
# the caller can inspect the state and resume when the decision is available
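# In recent langgraph-checkpoint-postgres releases, from_conn_string is a context
# manager, so the graph is compiled and used inside the `with` block.
# PriorAuthState and POSTGRES_CONN are defined elsewhere in the application.
with PostgresSaver.from_conn_string(POSTGRES_CONN) as checkpointer:
    checkpointer.setup()  # Creates the checkpoint tables on first run
    graph = graph_builder.compile(
        checkpointer=checkpointer,
        interrupt_before=["physician_review"],
    )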

Review Context Schema

The review interface needs a structured representation of what the agent produced. Define this schema at workflow design time — it is as important as the interrupt node itself.

python

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


def _utc_now() -> datetime:
    """Timezone-aware UTC timestamp (datetime.utcnow is naive and deprecated since Python 3.12)."""
    return datetime.now(timezone.utc)


class HITLDecision(str, Enum):
    APPROVED = "approved"
    MODIFIED = "modified"
    REJECTED = "rejected"
    ESCALATED = "escalated"


@dataclass
class ReviewContext:
    """
    Structured representation of the agent's state at the interrupt point.
    Rendered in the reviewer's UI to support their decision.
    """
    workflow_id: str
    patient_id: str
    procedure_code: str
    agent_recommendation: str  # "APPROVE" or "DENY"
    clinical_rationale: str
    supporting_guidelines: list[str]
    patient_summary: dict
    agent_confidence: Optional[str]  # "HIGH" | "MEDIUM" | "LOW", or None if unavailable
    anomaly_flags: list[str] = field(default_factory=list)  # Anything the agent flagged
    created_at: datetime = field(default_factory=_utc_now)
    review_deadline: Optional[datetime] = None  # SLA deadline for this review


@dataclass
class ReviewDecision:
    """
    Reviewer's decision, injected back into the workflow state to resume execution.
    """
    workflow_id: str
    reviewer_id: str
    reviewer_role: str  # "attending_physician" | "utilization_manager" | etc.
    decision: HITLDecision
    modification_notes: Optional[str]  # Required when decision == MODIFIED
    rejection_reason: Optional[str]   # Required when decision == REJECTED
    escalation_target: Optional[str]  # Required when decision == ESCALATED
    reviewed_at: datetime = field(default_factory=_utc_now)
    decision_rationale: str = ""  # Narrative rationale for audit trail
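
The "Required when" comments in the schema are not enforced by the dataclass itself. One way to enforce them before a decision is injected into the workflow is a small validator. The function below is a hypothetical helper, not part of the schema above. It accepts any object that exposes the same attribute names, so it is shown here with `SimpleNamespace` stand-ins.

```python
from types import SimpleNamespace

# Which optional field each decision type requires, per the schema comments.
REQUIRED_FIELD = {
    "modified": "modification_notes",
    "rejected": "rejection_reason",
    "escalated": "escalation_target",
}


def validate_review_decision(decision) -> list:
    """Return a list of problems; an empty list means the decision is usable."""
    problems = []
    # Accept either an Enum member (with .value) or a plain string.
    kind = getattr(decision.decision, "value", decision.decision)
    if kind not in {"approved", *REQUIRED_FIELD}:
        problems.append(f"unknown decision '{kind}'")
    required = REQUIRED_FIELD.get(kind)
    if required and not (getattr(decision, required, None) or "").strip():
        problems.append(f"{required} is required when decision is '{kind}'")
    if not (getattr(decision, "reviewer_id", "") or "").strip():
        problems.append("reviewer_id is required")
    return problems


# Example: a MODIFIED decision with blank notes is rejected before resumption.
incomplete = SimpleNamespace(decision="modified", reviewer_id="dr-1", modification_notes=" ")
problems = validate_review_decision(incomplete)
```

Running the check at the API boundary, before the decision is persisted, keeps incomplete decisions out of both the workflow state and the audit trail.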

Two-Phase Invocation Pattern

python

"""
HITL two-phase invocation pattern.
Educational Example — Illustrative workflow pattern.
Not intended for clinical decision making.
"""
import json
from typing import Any


def phase_1_submit(patient_id: str, procedure_code: str, graph, postgres_saver) -> str:
    """Submit the prior auth request — runs until HITL interrupt."""
    thread_id = f"prior-auth-{patient_id}-{procedure_code}"
    config = {"configurable": {"thread_id": thread_id}}

    initial_state = {
        "patient_id": patient_id,
        "procedure_code": procedure_code,
        "messages": [],
        "patient_data": {},
        "clinical_evaluation": {},
        "draft_determination": "",
        "physician_decision": "",
        "final_determination": {},
    }

    # Execution pauses at interrupt_before=["physician_review"]
    result = graph.invoke(initial_state, config)

    # At this point, the graph has:
    # 1. Persisted state to PostgreSQL under thread_id
    # 2. Returned control because it hit the interrupt point
    # The draft determination is in the state, awaiting physician review.

    draft = result.get("draft_determination", "")
    print(f"[Phase 1 complete] Thread: {thread_id}")
    print(f"Agent draft: {draft[:200]}...")
    return thread_id


def phase_2_resume(thread_id: str, decision: ReviewDecision, graph) -> dict[str, Any]:
    """Resume execution after physician decision is recorded."""
    config = {"configurable": {"thread_id": thread_id}}

    # Inject the physician's decision into the state
    resume_state = {
        "physician_decision": decision.decision.value,
        "messages": [{
            "role": "user",
            "content": json.dumps({
                "physician_decision": decision.decision.value,
                "reviewer_id": decision.reviewer_id,
                "modification_notes": decision.modification_notes,
                "decision_rationale": decision.decision_rationale,
            })
        }],
    }

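    # Merge the decision into the checkpointed state, then resume by invoking
    # with None. Passing a new input dict to invoke() would be treated as fresh
    # input and start a new run from the entry point instead of continuing
    # from the interrupt; physician_review executes only on the None invocation.
    graph.update_state(config, resume_state)
    result = graph.invoke(None, config)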

    print(f"[Phase 2 complete] Thread: {thread_id}")
    print(f"Final status: {result.get('final_determination', {}).get('status')}")
    return result

Implementation Patterns

Pattern 1: Risk-Based Automatic Escalation

Some actions should trigger automatic escalation without the agent attempting the action. Implement this as a pre-action check in the tool itself:

python

"""
Risk-based escalation in a tool call.
Educational Example — Illustrative control pattern.
"""
from dataclasses import dataclass
from enum import Enum


class RiskLevel(str, Enum):
    LOW = "low"       # Read operations — autonomous execution permitted
    MEDIUM = "medium" # Write operations — log and notify; execute
    HIGH = "high"     # Consequential determinations — interrupt for human review
    CRITICAL = "critical"  # Irreversible actions — human authorization required


@dataclass
class ToolCallRisk:
    level: RiskLevel
    reason: str
    requires_authorization: bool


def assess_tool_risk(tool_name: str, tool_input: dict) -> ToolCallRisk:
    """Assess the risk level of a proposed tool call before execution."""
    read_only_tools = {"get_patient_summary", "search_clinical_guidelines", "get_lab_results"}
    write_tools = {"update_patient_record", "create_care_plan_draft"}
    high_risk_tools = {"submit_prior_auth_determination", "flag_adverse_event"}

    if tool_name in read_only_tools:
        return ToolCallRisk(RiskLevel.LOW, "Read-only operation", False)

    if tool_name in write_tools:
        return ToolCallRisk(RiskLevel.MEDIUM, "Write operation — audited", False)

    if tool_name in high_risk_tools:
        return ToolCallRisk(RiskLevel.HIGH, "Consequential determination — requires physician authorization", True)

    # Unknown tools are treated as high risk
    return ToolCallRisk(RiskLevel.HIGH, f"Unknown tool '{tool_name}' — requires review", True)


class HITLRequired(Exception):
    """Raised when a tool call requires human authorization before execution."""
    def __init__(self, tool_name: str, risk: ToolCallRisk, context: dict):
        self.tool_name = tool_name
        self.risk = risk
        self.context = context
        super().__init__(f"Human authorization required for tool '{tool_name}': {risk.reason}")


def authorized_tool_execute(tool_name: str, tool_input: dict) -> dict:
    """Execute a tool with risk-based HITL guard."""
    risk = assess_tool_risk(tool_name, tool_input)

    if risk.requires_authorization:
        # Do not execute — raise to be caught by the agent loop
        raise HITLRequired(
            tool_name=tool_name,
            risk=risk,
            context={"tool_input": tool_input}
        )

    return _execute_tool_internally(tool_name, tool_input)


def _execute_tool_internally(tool_name: str, tool_input: dict) -> dict:
    # Internal dispatch to actual tool implementations
    # Stub for illustration
    return {"status": "executed", "tool": tool_name}

Pattern 2: HITL in an Agent Loop

python

"""
Agent loop with HITL interrupt handling.
Educational Example — Illustrative control pattern.
"""
import anthropic
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentState:
    messages: list[dict] = field(default_factory=list)
    pending_hitl: dict | None = None
    terminated: bool = False


def run_agent_with_hitl(
    initial_message: str,
    client: anthropic.Anthropic,
    tools: list[dict],
    system: str,
    max_iterations: int = 10,
) -> AgentState:
    """
    Run an agent loop that pauses at HITL-required tool calls,
    records the pending action, and returns for human review.
    """
    state = AgentState(messages=[{"role": "user", "content": initial_message}])

    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-opus-4-8",  # Verify at docs.anthropic.com
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=state.messages,
        )

        state.messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            state.terminated = True
            break

        if response.stop_reason == "tool_use":
            tool_results = []

            for block in response.content:
                if block.type != "tool_use":
                    continue

                try:
                    result = authorized_tool_execute(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })

                except HITLRequired as e:
                    # Record the pending HITL action and return.
                    # The caller is responsible for persisting state and resuming.
                    state.pending_hitl = {
                        "tool_use_id": block.id,
                        "tool_name": e.tool_name,
                        "tool_input": e.context["tool_input"],
                        "risk_reason": e.risk.reason,
                        "iteration": iteration,
                        # Results from tool calls in this same turn that already ran;
                        # on resumption they must be sent alongside the reviewed call's result.
                        "completed_tool_results": tool_results,
                    }
                    # Return without appending tool results; the caller handles resumption
                    return state

            state.messages.append({"role": "user", "content": tool_results})

    return state
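
The loop above returns with `pending_hitl` set and leaves resumption to the caller. A minimal sketch of that step follows, assuming the paused call was the only unresolved tool call in the assistant turn. `resume_messages_after_hitl` and its parameters are hypothetical names. The function builds the `tool_result` message the model expects next: the real tool output when approved, or an error result when rejected.

```python
from typing import Optional


def resume_messages_after_hitl(
    messages: list,
    pending_hitl: dict,
    approved: bool,
    tool_output: str = "",
    prior_results: Optional[list] = None,
) -> list:
    """Return a new message list with a tool_result for the paused tool call appended.

    prior_results holds tool_result blocks for other tool calls from the same
    assistant turn that already ran, if the caller kept them.
    """
    result = {
        "type": "tool_result",
        "tool_use_id": pending_hitl["tool_use_id"],
        "content": tool_output,
    }
    if not approved:
        # Tell the model the action was not executed so it can respond accordingly.
        result["content"] = "Reviewer rejected this action; it was not executed."
        result["is_error"] = True
    # The input list is left untouched; the caller persists the returned copy.
    return messages + [{"role": "user", "content": (prior_results or []) + [result]}]
```

On approval, the caller would run the tool itself, outside the risk guard, and pass its output as `tool_output`. The agent loop can then continue from the returned message list.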

Enterprise Considerations

SLA-bound review. A HITL workflow that parks in a review queue indefinitely does not provide real-world value. Define review SLAs by trigger category: high-urgency clinical decisions (30 minutes), standard utilization management (24 hours), routine documentation review (72 hours). Build SLA monitoring into the persistence layer and alert when reviews are approaching deadline without action.
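
As a sketch of SLA monitoring, the windows named above can be held in a lookup table and each pending review classified against them. The category keys and the 80% warning threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta

# SLA windows from the text above, keyed by hypothetical category names.
REVIEW_SLA = {
    "high_urgency_clinical": timedelta(minutes=30),
    "standard_utilization_management": timedelta(hours=24),
    "routine_documentation": timedelta(hours=72),
}
WARN_FRACTION = 0.8  # Assumed: alert once 80% of the window has elapsed.


def review_deadline(category: str, created_at: datetime) -> datetime:
    """Deadline for a review created at `created_at`."""
    return created_at + REVIEW_SLA[category]


def sla_status(category: str, created_at: datetime, now: datetime) -> str:
    """Classify a pending review as 'on_track', 'approaching', or 'overdue'."""
    window = REVIEW_SLA[category]
    elapsed = now - created_at
    if elapsed >= window:
        return "overdue"
    if elapsed >= window * WARN_FRACTION:
        return "approaching"
    return "on_track"
```

A periodic job could run `sla_status` over every pending review and raise an alert for anything not `on_track`.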

Reviewer assignment. Not all reviewers are appropriate for all HITL triggers. Build a routing layer that maps trigger type to reviewer role (attending physician, pharmacist, utilization manager) and then to a specific available reviewer. Consider round-robin and load balancing when multiple qualified reviewers are available.
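
A routing layer of this kind can be sketched in two stages: trigger type to role, then role to the next available reviewer in round-robin order. The trigger names, role names, and `ReviewerRouter` class are illustrative assumptions.

```python
from collections import deque

# Hypothetical mapping from trigger type to the role allowed to review it.
ROLE_FOR_TRIGGER = {
    "clinical_determination": "attending_physician",
    "drug_interaction": "pharmacist",
    "cost_threshold": "utilization_manager",
}


class ReviewerRouter:
    """Round-robin assignment among the available reviewers for each role."""

    def __init__(self, reviewers_by_role: dict):
        self._queues = {role: deque(ids) for role, ids in reviewers_by_role.items()}

    def assign(self, trigger_type: str) -> str:
        role = ROLE_FOR_TRIGGER.get(trigger_type)
        if role is None:
            raise ValueError(f"No reviewer role mapped for trigger '{trigger_type}'")
        queue = self._queues.get(role)
        if not queue:
            raise LookupError(f"No available reviewer with role '{role}'")
        reviewer = queue[0]
        queue.rotate(-1)  # Move the chosen reviewer to the back of the line.
        return reviewer
```

Raising when no qualified reviewer exists, instead of falling back to any available person, keeps the role restriction intact. The caller can turn that error into an escalation.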

Audit trail. Every HITL action must be recorded: what the agent proposed, who reviewed it, what decision was made, when, and what rationale was given. This is a regulatory requirement in clinical contexts and a governance requirement in enterprise contexts. The ReviewDecision schema above should be persisted immutably.

Timeout handling. What happens when no reviewer acts before the SLA deadline? Define escalation paths: reassign to a backup reviewer, escalate to a supervisor, or trigger an automatic rejection with notification to the requester. The system must have a defined behavior for timeout; silently waiting indefinitely is not an acceptable design.
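
The escalation path can be written as a pure function so that every timeout outcome is explicit and testable. The `PendingReview` fields and the ordering (backup reviewer, then supervisor, then automatic rejection) are assumptions chosen to mirror the options listed above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PendingReview:
    """Hypothetical record of a review waiting in the queue."""
    workflow_id: str
    reviewer_id: str
    deadline: datetime
    backup_reviewer_id: Optional[str] = None
    supervisor_id: Optional[str] = None
    escalation_count: int = 0  # How many times this review has already been moved


def timeout_action(review: PendingReview, now: datetime) -> dict:
    """Decide what to do with a pending review; no branch waits silently past the deadline."""
    if now < review.deadline:
        return {"action": "wait"}
    if review.escalation_count == 0 and review.backup_reviewer_id:
        return {"action": "reassign", "to": review.backup_reviewer_id}
    if review.escalation_count <= 1 and review.supervisor_id:
        return {"action": "escalate", "to": review.supervisor_id}
    # Last resort: reject and tell the requester.
    return {"action": "auto_reject", "notify": "requester"}
```

The caller is expected to apply the returned action, increment `escalation_count`, and set a new deadline, so that repeated timeouts walk down the path and do not loop.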

Security Considerations

Trust the reviewer, not the review UI. The HITL interface receives a reviewer's decision and injects it into the workflow. The injection must be authenticated (reviewer identity verified) and authorized (reviewer has permission to approve this workflow type). A forged physician approval, whether it arrives as a spoofed request to the decision endpoint or as injected text that the agent mistakes for a reviewer's sign-off, is a critical vulnerability in clinical systems.
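
One way to sketch both checks is an HMAC signature over the decision payload plus a role permission table. This assumes the review backend signs the payload only after it has authenticated the reviewer, and that the orchestrator holds the same key. The key handling, `ALLOWED_WORKFLOWS`, and the function names are illustrative.

```python
import hashlib
import hmac
import json

# Hypothetical role-to-workflow permissions.
ALLOWED_WORKFLOWS = {
    "attending_physician": {"prior_auth"},
    "utilization_manager": {"prior_auth", "cost_review"},
}


def sign_decision(payload: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the decision payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def accept_decision(payload: dict, signature: str, key: bytes, workflow_type: str) -> bool:
    """Accept only if the signature verifies and the role may approve this workflow type."""
    expected = sign_decision(payload, key)
    if not hmac.compare_digest(expected, signature):
        return False  # Authentication failed: payload forged or altered.
    allowed = ALLOWED_WORKFLOWS.get(payload.get("reviewer_role"), set())
    return workflow_type in allowed  # Authorization check.
```

Because the orchestrator verifies the signature itself, text that merely claims an approval, wherever it appears in the agent's inputs, cannot resume the workflow.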

Immutable audit logs. Reviewer decisions must be written to an immutable or append-only audit log, not just the workflow state record. The ability to overwrite a prior approval destroys the audit trail's value.
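
Append-only behaviour is normally enforced by the storage layer (for example, database permissions or write-once storage). As a complementary illustration, a hash chain makes tampering detectable: each entry commits to the one before it. The `AuditLog` class below is an in-memory sketch under that assumption, not a production log.

```python
import hashlib
import json


class AuditLog:
    """In-memory append-only log where each entry commits to the previous one."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, record: dict) -> str:
        """Add a record and return its chained hash."""
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self._entries.append({"prev": prev, "record": body, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; an edited entry, or one removed from the middle, breaks it."""
        prev = self.GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev + entry["record"]).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```

Truncating the tail of the log is not caught by the chain alone. Publishing or separately storing the latest hash is one common way to cover that case.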

Separation of concerns. The reviewer should see the agent's recommendation and reasoning, but should form their own judgment — not simply rubber-stamp what the agent produced. HITL UI design should present the supporting evidence prominently, not just "Agent recommends: APPROVE. Approve? [Yes/No]."

Healthcare Example

Educational Example — Illustrative Workflow. Not intended for clinical decision making.

A Reference Healthcare Organization's prior authorization system implements three HITL trigger levels:

Trigger Level	Example	Reviewer	SLA
Policy-required	Any procedure above $5,000 estimated cost	Utilization Management	24 hours
Confidence-based	Agent confidence LOW on clinical criteria	Attending Physician	4 hours
Anomaly-based	Conflicting contraindications in patient data	Pharmacist + Attending	1 hour

The HITL interface for physician review displays: the patient summary, the requested procedure, the agent's criterion-by-criterion evaluation with guideline citations, and the draft determination. The physician can approve the draft, modify the rationale, or reject with a documented reason. All decisions are persisted to the compliance audit log.

Common Mistakes

Designing HITL as an afterthought. HITL requires durable state persistence, interrupt-capable execution, and asynchronous review flows. These are architectural requirements that must be designed from the start. Adding them to an already-built autonomous system typically requires significant refactoring.

In-memory state at interrupt points. If workflow state is held in memory at the interrupt point, a container restart between the interrupt and the reviewer's decision loses all state. Always persist to durable storage (PostgreSQL, not Redis unless Redis is configured for durability) before notifying the reviewer.

Undefined reviewer authority. A HITL system without explicit reviewer-role authorization mapping will route clinical approvals to whoever happens to be available. Define who is authorized to approve what, and enforce it in the routing and authentication layer.

No timeout policy. A HITL workflow without timeout handling will accumulate stale pending reviews. Define and implement SLA enforcement before go-live.

Best Practices

Design HITL into the state machine from day one — interrupt points are architectural, not operational
Persist state durably before notifying reviewers — never hold interrupt state in memory
Define explicit reviewer-role-to-trigger-type mappings and enforce them with authorization
Implement SLA monitoring with escalation paths for overdue reviews
Build immutable audit logs for all HITL decisions
HITL UI should present evidence, not just the agent's conclusion — reviewers should exercise judgment, not rubber-stamp
Test the resume path explicitly in integration tests — the interrupt is only half the pattern

Alternatives

Approach	When Appropriate	Trade-off
Interrupt-resume (LangGraph)	Complex stateful workflows	Requires checkpointing infrastructure
Pre-authorization gate	Simple approval before any agent execution	Cannot adapt to what the agent discovers
Post-execution review	Lower-risk workflows where rollback is possible	Consequential actions may have already occurred
Confidence threshold only	Low-stakes tasks with measurable confidence	Confidence scores are not reliable for all task types
Full human execution	Highest-risk tasks — AI assists only	Loses automation efficiency entirely

Interview Questions

Q1: What are the four HITL trigger categories, and how would you decide which applies to a clinical prior authorization workflow?

Category: Architecture | Difficulty: Senior | Role: AI Architect

Answer Framework:

The four trigger categories are: confidence-based (agent uncertainty exceeds threshold), risk-based (action side-effect level), policy-based (business rule requires human authorization regardless of confidence), and anomaly-based (agent detects unexpected state and proactively escalates).

For clinical prior authorization: policy-based triggers are non-negotiable — any determination that affects patient care requires documented physician authorization under CMS and Joint Commission standards. Risk-based triggers apply to the submission tool (External, consequential). Anomaly-based triggers apply when the agent encounters conflicting guidelines or missing data it cannot resolve. Confidence-based triggers are supplemental — if the evaluation model reports LOW confidence, escalate even if the other triggers do not fire.

In practice, for clinical workflows, policy-based triggers dominate: certain decision types always require human review, regardless of agent confidence. The other three categories add additional coverage.

Key Points to Hit:

Policy-based is often non-negotiable in regulated industries
Risk-based maps naturally to the tool side-effect classification (Read/Write/Delete/External)
Anomaly-based requires the agent to reason about its own uncertainty — harder to implement reliably
Multiple trigger types can co-apply; use OR logic (any trigger fires = escalate)

Q2: Why must HITL interrupt state be persisted to durable storage before the reviewer is notified, and what happens if it is not?

Category: System Design | Difficulty: Mid-level | Role: ML Engineer / AI Architect

Answer Framework:

The reviewer may act seconds after notification, or hours or days later. Container restarts, application deployments, and infrastructure failures can occur in that window. If interrupt state is held only in memory (in the agent process), any restart loses the state entirely: the workflow is lost, the reviewer's decision has nowhere to land, and the original request must be resubmitted from scratch.

With durable persistence (PostgreSQL with a checkpointer like LangGraph's PostgresSaver): the agent serializes complete workflow state before notifying the reviewer; the notification goes out after persistence succeeds; when the reviewer acts, their decision is written to the same store; when the agent resumes, it loads state from the store regardless of whether the original process is still running. The entire HITL pause is process-independent.

The architectural rule is: notification must come after persistence succeeds. If persistence fails, the notification must not be sent — there is no state to resume into.

Red Flags (What NOT to say):

"We can use Redis for this" without qualifying that Redis needs AOF or RDB persistence configured
"The reviewer will act quickly so in-memory should be fine"

Key Takeaways

HITL is an architectural requirement, not an operational feature — design it into the state machine from day one
The four trigger categories are: confidence-based, risk-based, policy-based, and anomaly-based; regulated industries often mandate policy-based triggers
The interrupt-resume pattern requires: interrupt → persist durably → notify → await → resume
State must be persisted to durable storage before the reviewer is notified — never hold interrupt state in memory
Define SLAs, reviewer authority, and timeout escalation paths before deployment
HITL audit logs must be immutable and include: agent proposal, reviewer identity, decision, rationale, and timestamp

Glossary

Term	Definition
HITL (Human-in-the-Loop)	Architectural pattern where agent execution pauses for human review and decision at defined points
Interrupt point	The specific node in a workflow graph where execution pauses for human input
Confidence-based trigger	HITL trigger based on the agent's measured or self-reported uncertainty
Risk-based trigger	HITL trigger based on the side-effect level of the action being attempted
Policy-based trigger	HITL trigger mandated by business or regulatory rules regardless of agent confidence
Anomaly-based trigger	HITL trigger where the agent proactively escalates due to unexpected state
Interrupt-resume pattern	The two-phase execution model: run to interrupt point, pause, persist, await decision, resume
SLA (Service Level Agreement)	The time limit within which a review action must be completed
