Orchestration and Workflow Automation for AI

Executive Summary

AI workloads differ from traditional software workloads in ways that fundamentally change how orchestration must be designed: job durations vary widely (seconds for batched inference, minutes to hours for fine-tuning); GPU resources must be reserved and then released promptly; calls to external services must be retried with backoff; and failure modes include silent quality degradation rather than hard crashes. This chapter covers the orchestration layer for AI systems: scheduling recurring data pipelines, managing multi-step AI workflows, scaling inference infrastructure, and operating LLM workflows in production without accumulating technical debt from ad-hoc job management.

Learning Objectives

  • Select the right orchestration approach (Kubernetes, Airflow, Temporal) for each class of AI workload
  • Design durable, resumable workflows for multi-step AI pipelines
  • Implement retry logic, timeout management, and failure handling appropriate for LLM API calls
  • Operate AI infrastructure at scale: autoscaling, resource scheduling, and cost containment

Business Problem

AI pipelines run at the intersection of three scheduling domains that traditional job schedulers handle poorly: data operations (run nightly at 2 AM, skip if source unchanged), AI inference operations (variable latency, rate-limited external APIs, non-deterministic outputs), and model lifecycle operations (evaluate, promote, deploy new model versions). Without purpose-built orchestration, these pipelines are typically implemented as shell scripts, cron jobs, or fragile chains of dependent Lambda functions. Such systems are difficult to monitor and impossible to retry reliably, and they produce no audit trail.

Why This Technology Exists

The first generation of AI pipelines at scale used Apache Airflow, designed for data warehousing batch jobs. Airflow works well for scheduled, partition-based pipelines but struggles with the durable execution requirements of agentic AI: a multi-hour LLM workflow that must survive process restarts, a human-in-the-loop approval step that pauses execution for days, or a branching workflow that dynamically routes to different agents based on document classification.

Temporal grew out of Cadence, a workflow engine built at Uber to coordinate long-running distributed workflows that must survive infrastructure failures. Its durable execution model, in which every completed step is recorded in an event history that a new worker can replay after a process crash, is a direct fit for AI pipelines with external API calls, human approval gates, and multi-step orchestration requirements.
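The replay idea can be illustrated with a toy model. The sketch below is not Temporal's API: the History class, its run_activity method, and the step names are hypothetical and exist only to show how recording each completed step lets a restarted worker skip work that has already been done.

```python
from typing import Callable, Dict, List


class History:
    """Toy event history: maps a step name to its recorded result."""

    def __init__(self) -> None:
        self.events: Dict[str, str] = {}

    def run_activity(self, name: str, fn: Callable[[], str]) -> str:
        # On replay, return the recorded result instead of re-executing.
        if name in self.events:
            return self.events[name]
        result = fn()
        self.events[name] = result  # a real engine would persist this durably
        return result


calls: List[str] = []  # tracks which side effects actually ran


def extract() -> str:
    calls.append("extract")
    return "raw text"


def summarize() -> str:
    calls.append("summarize")
    return "summary"


def workflow(history: History, crash_after_extract: bool = False) -> str:
    history.run_activity("extract", extract)
    if crash_after_extract:
        raise RuntimeError("simulated worker crash")
    return history.run_activity("summarize", summarize)


history = History()
try:
    workflow(history, crash_after_extract=True)
except RuntimeError:
    pass  # the worker died; the history survives

# A new worker runs the same workflow code against the same history.
final = workflow(history)
```

On the second run, the extract step is served from the history rather than executed again, which is why workflow code in a replay-based engine has to be deterministic.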

Core Architecture

Enterprise Considerations

Cost containment for GPU workloads: GPU compute is expensive; verify current cloud pricing before budgeting. GPU nodes should be provisioned on demand for batch inference jobs and released when complete. Use spot/preemptible instances for fault-tolerant batch workloads (embedding, evaluation) to reduce cost. Reserve on-demand capacity only for latency-sensitive inference serving.
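A batch job is only a good fit for spot capacity if a preemption costs little. One way to get there is to checkpoint progress, as in the minimal sketch below. The fake_embed function, the checkpoint file layout, and the stop_after parameter (which simulates the instance being reclaimed) are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import List, Optional


def fake_embed(text: str) -> List[float]:
    # Placeholder for a real embedding call (hypothetical).
    return [float(len(text))]


def run_batch(docs: List[str], checkpoint_path: str,
              stop_after: Optional[int] = None) -> int:
    """Embed docs in order, recording progress so a restart resumes mid-batch.

    Returns the number of documents embedded during this invocation.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    processed = 0
    for i in range(done, len(docs)):
        if stop_after is not None and processed >= stop_after:
            break  # simulates the spot instance being reclaimed
        fake_embed(docs[i])
        processed += 1
        with open(checkpoint_path, "w") as f:
            json.dump({"done": i + 1}, f)
    return processed


docs = ["a", "bb", "ccc", "dddd", "eeeee"]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first = run_batch(docs, path, stop_after=2)  # preempted after two documents
second = run_batch(docs, path)               # resumes with the remaining three
```

A production job would write the checkpoint atomically and to storage that outlives the node; the local file here only keeps the example self-contained.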

Workflow observability: Temporal provides complete execution history for every workflow run, which serves dual purposes: debugging failed workflows and providing an audit trail for regulated workloads (clinical AI requires audit logs for every AI-assisted clinical document). History for closed workflows is kept only for the namespace's configured retention period, so long-term audit requirements call for archival or export. Ensure the Temporal server is deployed in the same data residency region as clinical data.

Multi-tenancy in orchestration: In a Reference Healthcare Organization with multiple departments, isolate workloads using Kubernetes namespaces with resource quotas. Prevent one department's batch embedding job from starving clinical inference workloads during peak hours.
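As a rough illustration of the isolation described above, the sketch below builds two Kubernetes manifests as plain Python dictionaries: a ResourceQuota that caps the GPUs a namespace may claim, and a container spec that declares its GPU count under limits. The namespace name, image name, and GPU counts are hypothetical; the manifests would be serialized and applied with whatever tooling the cluster already uses.

```python
import json
from typing import Any, Dict


def gpu_quota(namespace: str, max_gpus: int) -> Dict[str, Any]:
    """ResourceQuota capping the total GPUs a namespace may request."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        # Quotas on extended resources use the "requests." prefix.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
    }


def gpu_container(name: str, image: str, gpus: int) -> Dict[str, Any]:
    """Container spec that claims whole GPUs via limits."""
    return {
        "name": name,
        "image": image,
        # For GPUs, the request defaults to the limit when omitted.
        "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
    }


# Hypothetical department namespace and image, for illustration only.
quota = gpu_quota("batch-embedding", max_gpus=4)
container = gpu_container("embedder", "example.invalid/embedder:latest", gpus=1)
manifest_json = json.dumps(quota, indent=2)
```

With a quota like this in place, a batch namespace that has reached its cap cannot schedule further GPU pods, which leaves the remaining capacity to other namespaces.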

Common Mistakes

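1. Using Airflow for long-running workflows. A running Airflow task occupies a worker slot for its entire duration and is subject to execution timeouts. A task that blocks while waiting for physician approval holds that slot for days. Deferrable operators and reschedule-mode sensors reduce slot pressure, but they do not give the workflow durable, resumable state. Use Temporal for workflows with human-in-the-loop steps.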

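2. Not setting GPU resource limits. Kubernetes allocates GPUs only to containers that declare nvidia.com/gpu under limits; a request, if specified, must equal the limit. Without a declared limit per workload and a GPU quota per namespace, a single team or job can claim all GPU capacity. Always declare nvidia.com/gpu limits explicitly and cap each namespace with a ResourceQuota.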

3. Over-broad retry policies on LLM activities. LLM calls are non-deterministic and billed per call, so retrying an LLM activity 10 times may produce 10 different outputs and consume significant API budget. Retry on transient network errors, retry rate-limit responses only with backoff, and fail fast on model errors (for example, an invalid request) that a retry cannot fix.

4. Scheduling embedding and inference jobs at the same time. Nightly knowledge base refresh (embedding-intensive) and peak clinical usage (inference-intensive) should not compete for the same GPU resources. Schedule batch embedding jobs during off-peak inference hours.
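The retry rule in mistake 3 can be sketched in plain Python. The three exception classes and call_with_retry are hypothetical names, not part of any SDK; they stand for whatever error types the LLM client in use actually raises. In Temporal, the equivalent control would normally live in a per-activity retry policy rather than in hand-written loops.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientNetworkError(Exception):
    """Connection reset or timeout: worth retrying."""


class RateLimitError(Exception):
    """Rate-limit response: retry, but only after backing off."""


class ModelError(Exception):
    """Invalid request or similar: a retry will not help."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only retryable errors with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ModelError:
            raise  # fail fast: retrying will not change the outcome
        except (TransientNetworkError, RateLimitError):
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise AssertionError("max_attempts must be at least 1")
```

The sleep parameter is injectable so the policy can be exercised in tests without waiting; the attempt cap bounds the API spend that a single activity can incur.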

Best Practices

  • Use Airflow for scheduled, partitioned data pipelines; use Temporal for durable, long-running AI workflows with human-in-the-loop steps
  • Set explicit nvidia.com/gpu limits for GPU workloads in Kubernetes (a request, if specified, must equal the limit)
  • Use Temporal's retry policy to control retry semantics per activity type; don't apply a single global retry policy
  • Schedule batch embedding jobs during off-peak inference hours to avoid GPU contention
  • Implement quality evaluation as the final step in every knowledge base refresh pipeline
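The final practice, a quality evaluation step that closes every refresh, might look like the sketch below. It assumes a golden set that maps each query to the ID of the document that should be retrieved, and a retrieve function supplied by the pipeline; the function names, the sample data, and the 0.9 threshold are illustrative assumptions, not values from this chapter.

```python
from typing import Callable, Dict, List


def golden_hit_rate(
    golden: Dict[str, str],
    retrieve: Callable[[str], List[str]],
) -> float:
    """Fraction of golden queries whose expected document ID is retrieved."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(1 for query, doc_id in golden.items() if doc_id in retrieve(query))
    return hits / len(golden)


def should_promote(hit_rate: float, threshold: float = 0.9) -> bool:
    """Gate: promote the refreshed index only if quality meets the threshold."""
    return hit_rate >= threshold


# Hypothetical golden set and a stand-in for the refreshed index.
golden = {
    "dosage for drug A": "doc-1",
    "discharge checklist": "doc-2",
    "allergy protocol": "doc-3",
    "fall risk policy": "doc-4",
}
fake_index = {
    "dosage for drug A": ["doc-1", "doc-9"],
    "discharge checklist": ["doc-2"],
    "allergy protocol": ["doc-3"],
    "fall risk policy": ["doc-7"],  # a regression: the expected doc is missing
}
rate = golden_hit_rate(golden, lambda q: fake_index.get(q, []))
promote = should_promote(rate)
```

Because the gate returns a plain boolean, the orchestrator can branch on it: keep serving the previous index and raise an alert when the refreshed one falls short.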

Trade-offs

Orchestrator         | Strengths                             | Weaknesses                                      | Best For
---------------------|---------------------------------------|-------------------------------------------------|------------------------------------------
Airflow              | Mature, widely deployed, rich UI      | Workers block during task execution             | Scheduled batch pipelines
Temporal             | Durable execution, fine-grained retry | Operational complexity, infrastructure required | Long-running, event-driven AI workflows
Kubernetes Jobs      | Native GPU support, simple for batch  | No workflow logic                               | One-shot batch inference, fine-tuning runs
Step Functions (AWS) | Managed, integrates with AWS services | Vendor-locked, limited duration                 | AWS-native AI pipelines

Interview Questions

Q: A clinical AI workflow has a step where a physician reviews an AI-generated document. The review may take anywhere from 10 minutes to 3 days. How would you design the orchestration for this workflow?

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

This is a durable workflow problem. The key constraint is that the workflow must remain paused for up to 3 days while consuming no compute resources — the physician review step is an external, human-triggered event, not a blocking computation.

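Wrong approach: Blocking inside an Airflow task with a sleep-and-poll loop, or scheduling a Lambda function to poll every minute for 3 days (a single Lambda invocation cannot run that long; its maximum timeout is 15 minutes). Both consume compute resources continuously, and neither produces an audit trail.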

Correct approach: Temporal's durable execution model is built for exactly this pattern. The workflow suspends after creating the physician review task, persisting its state in the Temporal service. When the physician completes the review, a webhook handler (or a polling activity) delivers the result to the workflow, typically as a signal, and the workflow resumes from its exact suspension point. The full execution history, including suspension time, resume time, and physician identity, is recorded in Temporal's history log.

Key Points to Hit:

  • Durable execution vs. stateful polling: the former suspends without consuming resources
  • Audit trail requirement for clinical AI compliance
  • Timeout policy: explicit escalation at 24h and 72h, not silent expiry at 7 days
  • Separation of concerns: physician review task is created in the EHR workflow system, not managed by Temporal
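The wait-with-escalation shape from the key points can be sketched with asyncio standing in for a durable engine. This is a simulation, not Temporal code: ReviewWorkflow and its methods are hypothetical, the state here lives in process memory, and the deadlines are fractions of a second standing in for 24h and 72h. In the Temporal Python SDK, the corresponding pieces would be a signal handler on the workflow class and a timed workflow.wait_condition call.

```python
import asyncio
from typing import List, Tuple


class ReviewWorkflow:
    def __init__(self, escalate_after: List[float]) -> None:
        self.escalate_after = escalate_after  # seconds from start, ascending
        self.approved = asyncio.Event()
        self.log: List[str] = []

    def signal_review_complete(self, physician_id: str) -> None:
        # Called by the outside world (for example, a webhook handler).
        self.log.append(f"reviewed by {physician_id}")
        self.approved.set()

    async def run(self) -> str:
        self.log.append("review task created")
        elapsed = 0.0
        for deadline in self.escalate_after:
            try:
                # Suspend until the signal arrives or the next deadline passes.
                await asyncio.wait_for(self.approved.wait(),
                                       timeout=deadline - elapsed)
                return "approved"
            except asyncio.TimeoutError:
                elapsed = deadline
                self.log.append(f"escalated at {deadline}")
        # Explicit terminal state instead of a silent expiry.
        return "escalated_unreviewed"


async def demo() -> Tuple[str, List[str]]:
    wf = ReviewWorkflow(escalate_after=[0.05, 0.5])  # stand-ins for 24h and 72h
    task = asyncio.create_task(wf.run())
    await asyncio.sleep(0.1)  # the physician responds after the first escalation
    wf.signal_review_complete("dr-example")
    return await task, wf.log


result, log = asyncio.run(demo())
```

The difference a durable engine makes is that the suspended state and the timers survive worker restarts, so nothing has to stay in memory for three days.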

Key Takeaways

  • Airflow is appropriate for scheduled, batch-oriented AI data pipelines; Temporal is appropriate for durable, event-driven AI workflows
  • GPU workloads require explicit resource limits in Kubernetes to prevent resource monopolization
  • Human-in-the-loop steps require durable workflow orchestration — polling loops in Airflow workers are an antipattern
  • Clinical AI workflows require complete audit trails; Temporal's execution history can satisfy this requirement when it is retained or archived for the required period
  • Quality evaluation (golden query evaluation) should be a first-class step in every knowledge base refresh pipeline

Further Reading