LangGraph Deep Dive

Conceptual Explanation

LangGraph models a workflow as a StateGraph: a directed graph where:

  • Nodes are Python functions (or agent invocations) that transform state
  • Edges define transitions between nodes (unconditional or conditional)
  • State is a typed object shared across all nodes — the single source of truth for the workflow
  • Checkpointer is a persistence layer that saves state after each node, enabling resumption

The critical insight: the state is not just "the current message." It is a complete, typed snapshot of the workflow's progress — including all data collected, decisions made, and tool results. Any node can read any part of the state; each node writes back only the fields it modifies.

text
StateGraph = Nodes + Edges + State + Checkpointer

Workflow execution:
  START → node_A (reads state, produces update) → state_updated
  → conditional_edge (evaluates state condition) → node_B or node_C
  → ... → END
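
The execution flow above can be sketched in a few lines of plain Python. This is a toy model for intuition only, not LangGraph's implementation: the run_workflow helper, the in-memory checkpoints list, and the sample nodes are all hypothetical.

```python
from typing import Callable, Union

State = dict
Edge = Union[str, Callable[[State], str]]  # fixed target, or a router that inspects state


def run_workflow(nodes: dict, edges: dict, state: State, start: str):
    """Toy executor: run a node, merge its update, checkpoint, follow the edge."""
    checkpoints = []  # stand-in for a real checkpointer
    current = start
    while current != "END":
        update = nodes[current](state)              # node reads state, returns a partial update
        state = {**state, **update}                 # merge the update into the shared state
        checkpoints.append((current, dict(state)))  # snapshot after each node
        edge = edges[current]
        # A callable edge is conditional; a string edge is unconditional.
        current = edge(state) if callable(edge) else edge
    return state, checkpoints


nodes = {
    "node_A": lambda s: {"score": len(s["request"])},
    "node_B": lambda s: {"result": "long request"},
    "node_C": lambda s: {"result": "short request"},
}
edges = {
    "node_A": lambda s: "node_B" if s["score"] > 10 else "node_C",
    "node_B": "END",
    "node_C": "END",
}

final_state, checkpoints = run_workflow(nodes, edges, {"request": "MRI knee"}, "node_A")
print(final_state["result"])
print([name for name, _ in checkpoints])
```

Each entry in checkpoints is a full copy of the state as it stood after one node, which is the property that makes resumption from an intermediate point possible.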

Core Architecture

State Definition

The state is the foundation. Define it first, before any nodes. Use TypedDict for simple states; use Annotated fields when multiple nodes need to append to the same field (list accumulation):

python
from typing import TypedDict, Annotated
import operator

class PriorAuthState(TypedDict):
    # Input fields (set at workflow start)
    patient_id: str
    procedure_code: str
    requesting_clinician: str

    # Accumulated fields (multiple nodes append to these)
    messages: Annotated[list, operator.add]  # message history accumulates

    # Decision fields (set by specific nodes)
    patient_summary: dict
    clinical_evaluation: dict
    payer_policy_match: dict
    determination_letter: str
    physician_decision: str  # "approved" | "modified" | "rejected"
    final_status: str
    error: str | None
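
To make the difference between accumulated and plain fields concrete, here is a minimal sketch of the merge rule in plain Python. It is a simplified model, not LangGraph's internals: DemoState and apply_update are hypothetical names, and the only assumption carried over from the schema above is that the second argument of Annotated is used as the reducer.

```python
import operator
from typing import Annotated, TypedDict, get_args, get_origin, get_type_hints


class DemoState(TypedDict, total=False):
    messages: Annotated[list, operator.add]  # accumulated: updates are appended
    final_status: str                        # plain: updates overwrite


def apply_update(schema, state: dict, update: dict) -> dict:
    """Merge a node's partial update into state, honoring reducer annotations."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        hint = hints[key]
        if get_origin(hint) is Annotated and key in merged:
            reducer = get_args(hint)[1]          # e.g. operator.add
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value                  # no reducer: last write wins
    return merged


state = {"messages": ["request received"], "final_status": "pending"}
state = apply_update(DemoState, state, {"messages": ["patient data fetched"]})
state = apply_update(DemoState, state, {"final_status": "in_review"})
print(state)
```

The messages list grows with each update while final_status is simply replaced, which is why fields written by several nodes need a reducer and fields owned by one node do not.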

Nodes

A node is a Python function that takes the current state and returns a state update (partial dict — only the fields being modified):

python
# ehr_client and clinical_agent are application-specific clients, assumed to be created elsewhere.
def fetch_patient_data(state: PriorAuthState) -> dict:
    """Node: retrieves patient EHR data."""
    patient_data = ehr_client.get_summary(state["patient_id"])
    return {"patient_summary": patient_data}

def evaluate_clinical_criteria(state: PriorAuthState) -> dict:
    """Node: evaluates clinical criteria using LLM + RAG."""
    evaluation = clinical_agent.evaluate(
        patient=state["patient_summary"],
        procedure=state["procedure_code"],
    )
    return {"clinical_evaluation": evaluation}
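
Because a node only reads state and returns a partial dict, it can be exercised without building a graph. The sketch below assumes the EHR client is injected rather than global; FakeEhrClient and make_fetch_patient_data are hypothetical names introduced for illustration, and the returned summary is made-up sample data.

```python
class FakeEhrClient:
    """Hypothetical stand-in for the real EHR client, for tests only."""

    def get_summary(self, patient_id: str) -> dict:
        return {"patient_id": patient_id, "source": "fake"}


def make_fetch_patient_data(ehr_client):
    """Build the node with its dependency injected, so tests can swap in a fake."""

    def fetch_patient_data(state: dict) -> dict:
        # Return only the field this node owns, not the whole state.
        return {"patient_summary": ehr_client.get_summary(state["patient_id"])}

    return fetch_patient_data


node = make_fetch_patient_data(FakeEhrClient())
update = node({"patient_id": "P123", "procedure_code": "73721"})
print(update)
```

The update contains only patient_summary; the input fields are left for the graph to carry forward.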

Edges and Conditional Routing

python
from langgraph.graph import StateGraph, END

def route_after_evaluation(state: PriorAuthState) -> str:
    """Conditional edge: routes based on clinical evaluation outcome."""
    if state.get("error"):
        return "handle_error"
    if state["clinical_evaluation"]["criteria_met"]:
        return "check_payer_policy"
    else:
        return "generate_denial_letter"

builder = StateGraph(PriorAuthState)
builder.add_node("fetch_patient_data", fetch_patient_data)
builder.add_node("evaluate_clinical_criteria", evaluate_clinical_criteria)
builder.add_node("check_payer_policy", check_payer_policy)
builder.add_node("generate_approval_letter", generate_approval_letter)
builder.add_node("generate_denial_letter", generate_denial_letter)
builder.add_node("physician_review", physician_review_node)
builder.add_node("handle_error", handle_error_node)

builder.add_edge("__start__", "fetch_patient_data")
builder.add_edge("fetch_patient_data", "evaluate_clinical_criteria")
builder.add_conditional_edges(
    "evaluate_clinical_criteria",
    route_after_evaluation,
    {
        "check_payer_policy": "check_payer_policy",
        "generate_denial_letter": "generate_denial_letter",
        "handle_error": "handle_error",
    }
)
builder.add_edge("check_payer_policy", "generate_approval_letter")
builder.add_edge("generate_approval_letter", "physician_review")
builder.add_edge("generate_denial_letter", "physician_review")
builder.add_edge("physician_review", END)
builder.add_edge("handle_error", END)
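
The routing function is plain Python, so each branch can be checked with hand-built state dicts before the graph is wired up. This sketch repeats route_after_evaluation from above (with a plain dict type hint so it runs on its own); the sample states are invented for illustration.

```python
def route_after_evaluation(state: dict) -> str:
    """Conditional edge: routes based on clinical evaluation outcome."""
    if state.get("error"):
        return "handle_error"
    if state["clinical_evaluation"]["criteria_met"]:
        return "check_payer_policy"
    return "generate_denial_letter"


# One sample state per branch; the error check runs first, so it wins
# even when a clinical evaluation is present.
cases = {
    "criteria met": {"clinical_evaluation": {"criteria_met": True}},
    "criteria not met": {"clinical_evaluation": {"criteria_met": False}},
    "error set": {"error": "EHR timeout", "clinical_evaluation": {"criteria_met": True}},
}

routes = {name: route_after_evaluation(state) for name, state in cases.items()}
for name, target in routes.items():
    print(f"{name} -> {target}")
```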

Architecture Diagram

Standalone diagram: architecture/mermaid/02-langgraph-state-machine.mmd

Common Mistakes

Using MemorySaver in production. State loss on pod restart in a Kubernetes deployment causes all in-progress workflows to disappear. Always use a persistent checkpointer in production.

Mutable state field naming. Two nodes that both set state["result"] overwrite each other. Use Annotated[list, operator.add] for fields where multiple nodes contribute, and unique field names for fields owned by specific nodes.

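Forgetting that interrupt_before is set at compile time. The interrupt_before list is normally fixed when the graph is compiled, so a workflow that is already running will not pause at a node you did not plan for. Recent LangGraph releases also offer a dynamic interrupt() call inside a node, but that call also has to be written into the node ahead of time. Design interrupt points at graph design time.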

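No error node. A workflow with no error-handling path halts on an unexpected exception and leaves the checkpointed state partway through the workflow. Always add a dedicated handle_error node with edges from all failure paths, and have each node that can fail record the failure in the state's error field so that a conditional edge can route to it.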
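
An error node can only be reached if a failure shows up in state. One possible way to arrange that, sketched here under the assumption that routing functions check the error field as route_after_evaluation does, is a small decorator that turns exceptions into state updates. The capture_errors name and the simulated failure are hypothetical.

```python
import functools
from typing import Callable


def capture_errors(node: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Wrap a node so an exception becomes an 'error' state update instead of a crash."""

    @functools.wraps(node)
    def wrapper(state: dict) -> dict:
        try:
            return node(state)
        except Exception as exc:
            # Record which node failed and why; a conditional edge can then
            # route to handle_error based on this field.
            return {"error": f"{node.__name__}: {exc}"}

    return wrapper


@capture_errors
def fetch_patient_data(state: dict) -> dict:
    # Simulated failure standing in for a real EHR call.
    raise TimeoutError("EHR service unavailable")


update = fetch_patient_data({"patient_id": "P123"})
print(update)
```

Catching every Exception is a deliberate simplification; a real workflow would likely distinguish retryable failures from fatal ones.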

Best Practices

  • Define the state schema first, before any node implementation — it is the contract between all nodes
  • Use Annotated[list, operator.add] for accumulated fields; use plain fields for owned fields
  • Always include a handle_error node with edges from every node that can fail
  • Use PostgresSaver in production; MemorySaver in tests only
  • Design thread IDs as: {workflow-type}-{entity-id}-{date} for observability
  • Set interrupt_before for any node that takes an irreversible external action
  • Enable LangSmith tracing in production for full workflow observability
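
The thread ID convention in the list above can be captured in a small helper. This is a sketch: make_thread_id is a hypothetical name, and the only LangGraph-specific part is the config shape, where the thread ID is passed under the "configurable" key.

```python
from datetime import date


def make_thread_id(workflow_type: str, entity_id: str, on: date) -> str:
    """Build a thread ID of the form {workflow-type}-{entity-id}-{date}."""
    return f"{workflow_type}-{entity_id}-{on.isoformat()}"


thread_id = make_thread_id("prior-auth", "P123", date(2026, 7, 2))

# Config dict in the shape a checkpointed graph expects for invoke/stream calls.
config = {"configurable": {"thread_id": thread_id}}
print(config)
```

Reusing the same thread ID on a later call is what ties that call to the saved checkpoint, so the helper should be deterministic for a given workflow instance.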

Alternatives

| Framework | Strengths | Weaknesses | Best For |
|---|---|---|---|
| LangGraph | Typed state, HITL, checkpointing, cycles | Learning curve; LangChain dependency | Production enterprise agentic workflows |
| CrewAI | Declarative, easy multi-agent setup | Less control over graph structure | Task-based multi-agent collaboration |
| Custom SDK loop | No dependencies; full control | No checkpointing, HITL, or typed state | Simple, short-lived single-agent workflows |
| Prefect / Airflow | Battle-tested orchestration | No native LLM/agent primitives | Pure data pipelines; not agent-native |
| Temporal | Durable execution, enterprise-grade | Complex setup; not LLM-specific | Long-running, highly reliable workflows |

Trade-offs

| Dimension | LangGraph Advantage | LangGraph Cost |
|---|---|---|
| State management | Typed, persistent, auditable | Schema design required upfront |
| HITL | First-class interrupt mechanism | Resume logic must be implemented |
| Fault tolerance | Automatic checkpoint resumption | PostgreSQL dependency in production |
| Observability | LangSmith integration | LangSmith account required |
| Learning curve | Explicit, structured graph model | Higher than raw SDK loop |

Interview Questions

Q1: Why use LangGraph over a raw agent loop for a clinical prior authorization workflow?

Category: Architecture | Difficulty: Senior | Role: AI Architect

Answer Framework:

A raw agent loop (a while loop calling the LLM) fails five operational requirements of a production prior auth workflow: (1) there is no state persistence, so if the process times out mid-workflow (for example, a Lambda function hitting its time limit), the prior auth is lost and must restart; (2) there is no human-in-the-loop mechanism, yet prior auth requires physician approval before submission; (3) there is no typed state, and passing clinical data as strings in messages is fragile and untestable; (4) there is no conditional routing, although different payers and procedures require different evaluation paths; (5) there is no observability, so you cannot see what happened when a prior auth determination is questioned.

LangGraph addresses all five: PostgreSQL checkpointer for persistence, interrupt_before for HITL, TypedDict state for typed contracts, add_conditional_edges for routing, and LangSmith integration for observability. The cost is learning curve and a PostgreSQL dependency, which is acceptable for a production clinical workflow.


Q2: How does LangGraph's checkpointing enable both fault tolerance and human-in-the-loop?

Category: Technical Depth | Difficulty: Senior | Role: AI Architect / ML Engineer

Answer Framework:

Both fault tolerance and HITL require the same capability: the ability to persist workflow state and resume from an intermediate point. The checkpointer provides this by saving the complete typed state after each node completes.

For fault tolerance: if the workflow crashes after node 5, the checkpointer has state persisted through node 5. When the workflow is retried (with the same thread ID), the graph loads the checkpoint and continues from node 6. The failed node is retried from its last persisted predecessor state, not from the beginning.

For HITL: when the graph is compiled with interrupt_before=["physician_review"], it runs nodes 1–N, checkpoints the state, and returns control to the caller before executing physician_review. The workflow is now paused at a known, persisted state. The physician reviews the draft letter and makes a decision. The caller records that decision with graph.update_state(config, {"physician_decision": "approved"}) and then resumes with graph.invoke(None, config=config); passing None as the input tells the graph to continue from the checkpoint rather than start a new run. The graph loads the checkpointed state, which now includes the physician decision, and continues from the physician_review node.

The critical design insight: HITL is not a special mechanism — it is fault tolerance applied to a planned pause.

Key Takeaways

  • LangGraph solves five operational requirements a raw agent loop cannot: persistence, HITL, typed state, conditional routing, and observability
  • State is the foundation — define the TypedDict schema first, before any nodes
  • Nodes behave like functions of state: read from state → return a state update (partial dict)
  • add_conditional_edges implements runtime routing based on state values
  • interrupt_before enables human-in-the-loop by pausing before a specified node and persisting state
  • Use a persistent checkpointer such as PostgresSaver in production; MemorySaver is for testing only
  • Thread IDs identify workflow instances; use semantically meaningful IDs for observability
  • LangSmith provides the observability layer; enable it in production
Enterprise AI Interview Companion — v1.0 Updated July 02, 2026