Orchestration and Workflow Automation for AI

Core Architecture

graph TD
    subgraph "Scheduling Layer"
        CRON["Cron / Event Trigger\n(Airflow / Cloud Scheduler)"]
        API["API-triggered Workflows\n(webhook, FHIR subscription)"]
    end
    subgraph "Orchestration Layer"
        AIRFLOW["Apache Airflow\nBatch data pipelines\nScheduled ETL / indexing"]
        TEMPORAL["Temporal\nDurable AI workflows\nLong-running, human-in-loop"]
        K8S_JOB["Kubernetes Jobs\nOne-shot batch inference\nFine-tuning runs"]
    end
    subgraph "Execution Layer"
        WORKERS["AI Workers\nEmbedding / inference tasks"]
        GPUS["GPU Node Pool\nInference / fine-tuning"]
        API_CLIENTS["External API Clients\nAnthropic / OpenAI / Vertex"]
    end
    subgraph "Infrastructure"
        K8S["Kubernetes Cluster\nAutoscaling, GPU scheduling"]
        QUEUE["Task Queue\n(Redis / RabbitMQ)"]
    end
    CRON --> AIRFLOW & K8S_JOB
    API --> TEMPORAL
    AIRFLOW --> WORKERS
    TEMPORAL --> WORKERS
    K8S_JOB --> GPUS
    WORKERS --> API_CLIENTS & QUEUE
    WORKERS & GPUS --> K8S

Common Mistakes

1. Using Airflow for long-running workflows. A standard Airflow task holds a worker slot for its entire duration and is subject to task timeouts, so a workflow waiting for physician approval can tie up a slot for days. Use Temporal for workflows with human-in-the-loop steps.

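2. Not setting GPU resource limits. Without explicit GPU limits in Kubernetes, a single workload can monopolize all GPU capacity. Declare nvidia.com/gpu under limits for every GPU container; if you also set requests, the value must equal the limit, because GPUs are whole-device extended resources that cannot be overcommitted.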

3. Over-broad retry policies on LLM activities. Retrying a non-idempotent LLM activity 10 times may produce 10 different outputs and consume significant API budget. Retry LLM activities on transient network errors, retry rate-limit responses only with backoff, and do not retry model errors.

4. Scheduling embedding and inference jobs at the same time. Nightly knowledge base refresh (embedding-intensive) and peak clinical usage (inference-intensive) should not compete for the same GPU resources. Schedule batch embedding jobs during off-peak inference hours.
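The retry rule in item 3 can be sketched as a small decision function. This is a minimal illustration, not the API of any particular SDK: the error category names, the attempt cap of 3, and the delay constants are all assumptions chosen for the example.

```python
import random

# Hypothetical error categories for an LLM API call. Real client libraries
# raise their own exception types, which would be mapped onto these names.
TRANSIENT = {"connection_reset", "timeout", "service_unavailable"}
RATE_LIMITED = {"rate_limited"}


def retry_delay(error_kind, attempt, max_attempts=3,
                base_seconds=1.0, cap_seconds=60.0, jitter=True):
    """Return seconds to wait before the next attempt, or None to stop.

    `attempt` is the number of attempts already made (1 after the first failure).
    """
    if attempt >= max_attempts:
        return None  # budget exhausted: surface the failure
    if error_kind in TRANSIENT:
        delay = base_seconds * 2 ** (attempt - 1)
    elif error_kind in RATE_LIMITED:
        # Rate limits get a slower schedule so retries do not add to the pressure.
        delay = 4 * base_seconds * 2 ** (attempt - 1)
    else:
        return None  # model errors, invalid requests: a retry will not help
    delay = min(cap_seconds, delay)
    if jitter:
        # Full jitter: spread retries out so many workers do not retry in lockstep.
        delay = random.uniform(0, delay)
    return delay
```

Temporal's per-activity retry policy exposes comparable controls (maximum attempts, backoff coefficient, non-retryable error types), so a table like this would typically be translated into one policy per activity type rather than called directly.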

Best Practices

Use Airflow for scheduled, partitioned data pipelines; use Temporal for durable, long-running AI workflows with human-in-the-loop steps
Set explicit resource requests and limits for GPU workloads in Kubernetes
Use Temporal's retry policy to control retry semantics per activity type; don't apply a single global retry policy
Schedule batch embedding jobs during off-peak inference hours to avoid GPU contention
Implement quality evaluation as the final step in every knowledge base refresh pipeline
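One way to make the final evaluation step concrete is a golden-query gate that runs after the refresh and blocks promotion of the new index if retrieval quality drops. The sketch below assumes recall@k as the metric, a threshold of 0.8, and a `search` callable that returns ranked document IDs; the field names `query` and `expected_ids` are likewise hypothetical.

```python
def recall_at_k(retrieved, expected, k=5):
    """Fraction of expected document IDs found in the top-k retrieved IDs."""
    if not expected:
        return 1.0
    top_k = set(retrieved[:k])
    return len(top_k & set(expected)) / len(expected)


def evaluate_refresh(golden_queries, search, k=5, threshold=0.8):
    """Score a refreshed index against golden queries and decide pass/fail.

    golden_queries: list of {"query": str, "expected_ids": [str, ...]}
    search: callable mapping a query string to a ranked list of document IDs
    """
    if not golden_queries:
        raise ValueError("at least one golden query is required")
    scores = [
        recall_at_k(search(item["query"]), item["expected_ids"], k)
        for item in golden_queries
    ]
    mean_recall = sum(scores) / len(scores)
    return {"mean_recall": mean_recall, "passed": mean_recall >= threshold}


if __name__ == "__main__":
    # Toy index standing in for the refreshed knowledge base.
    index = {"dosage": ["d1", "d9", "d2"], "allergy": ["d7", "d8"]}
    golden = [
        {"query": "dosage", "expected_ids": ["d1", "d2"]},
        {"query": "allergy", "expected_ids": ["d3"]},
    ]
    print(evaluate_refresh(golden, lambda q: index.get(q, [])))
```

In a pipeline, the `passed` flag would decide whether the refreshed index is promoted or the previous one is kept.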

Trade-offs

Orchestrator	Strengths	Weaknesses	Best For
Airflow	Mature, widely deployed, rich UI	Workers block during task execution	Scheduled batch pipelines
Temporal	Durable execution, fine-grained retry	Operational complexity, infrastructure required	Long-running, event-driven AI workflows
Kubernetes Jobs	Native GPU support, simple for batch	No workflow logic	One-shot batch inference, fine-tuning runs
Step Functions (AWS)	Managed, integrates with AWS services	Vendor-locked, capped execution duration (one year for Standard workflows, five minutes for Express)	AWS-native AI pipelines
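For the Kubernetes Jobs row, the following sketch builds a one-shot Job manifest with explicit GPU requests and limits. It is illustrative only: the job name, image reference, and the choice of `backoffLimit: 0` are assumptions, and the manifest is built as a Python dictionary so it can be serialized to JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def gpu_job_manifest(name, image, gpus=1):
    """Build a batch/v1 Job manifest that requests a fixed number of GPUs."""
    # For nvidia.com/gpu, requests (if present) must equal limits.
    gpu = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Assumption for this sketch: do not re-run a failed pod automatically.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {
                                "requests": dict(gpu),
                                "limits": dict(gpu),
                            },
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image name; replace with a real training image.
    manifest = gpu_job_manifest("finetune-run", "registry.example.com/trainer:latest", gpus=2)
    print(json.dumps(manifest, indent=2))
```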

Interview Questions

Q: A clinical AI workflow has a step where a physician reviews an AI-generated document. The review may take anywhere from 10 minutes to 3 days. How would you design the orchestration for this workflow?

Category: System Design | Difficulty: Senior | Role: AI Architect

Answer Framework:

This is a durable workflow problem. The key constraint is that the workflow must remain paused for up to 3 days while consuming no compute resources — the physician review step is an external, human-triggered event, not a blocking computation.

Wrong approach: Using a thread.sleep() loop in an Airflow task, or a Lambda function invoked every minute for 3 days to check whether the review is done. Both consume compute resources continuously and produce no audit trail.

Correct approach: Temporal's durable execution model is built for exactly this pattern. The workflow suspends after creating the physician review task, persisting its state in the Temporal service. When the physician completes the review, the completion event (delivered by a webhook or detected by a polling activity) is sent to the workflow as a signal, and the workflow resumes from its exact suspension point. The full execution history, including suspension time, resume time, and the physician identity carried in the signal payload, is recorded in the workflow's event history.

Key Points to Hit:

Durable execution vs. stateful polling: the former suspends without consuming resources
Audit trail requirement for clinical AI compliance
Timeout policy: explicit escalation at 24h and 72h, not silent expiry at 7 days
Separation of concerns: physician review task is created in the EHR workflow system, not managed by Temporal
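The wait-and-escalate behavior in the key points can be illustrated with plain asyncio. This is an in-process analogy only: it is not Temporal SDK code and it is not durable, since the state is lost if the process dies. In Temporal the same shape would be a signal handler plus timed waits inside the workflow. The action names and the time-scaling parameter are assumptions made for the demo.

```python
import asyncio

# Escalation deadlines in hours since the review task was created.
# The 24h and 72h values come from the text; the action names are hypothetical.
ESCALATIONS = [(24, "remind_physician"), (72, "escalate_to_supervisor")]


async def wait_for_review(signal, seconds_per_hour=3600.0):
    """Suspend until the review signal arrives, logging escalations that fire first."""
    log = []
    elapsed_hours = 0
    for deadline_hours, action in ESCALATIONS:
        timeout = (deadline_hours - elapsed_hours) * seconds_per_hour
        try:
            # The coroutine is parked here; nothing polls while it waits.
            await asyncio.wait_for(signal.wait(), timeout)
            log.append("review_completed")
            return log
        except asyncio.TimeoutError:
            log.append(action)
            elapsed_hours = deadline_hours
    # After the last escalation, keep waiting instead of expiring silently.
    await signal.wait()
    log.append("review_completed")
    return log


async def demo(review_after_hours, seconds_per_hour=0.005):
    """Simulate a physician completing the review after the given number of hours."""
    signal = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Stand-in for the webhook that reports review completion.
    loop.call_later(review_after_hours * seconds_per_hour, signal.set)
    return await wait_for_review(signal, seconds_per_hour)


if __name__ == "__main__":
    for hours in (1, 48, 100):
        print(hours, asyncio.run(demo(hours)))
```

Scaling one hour down to a few milliseconds keeps the demo fast; the escalation order is what matters, not the wall-clock values.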

Key Takeaways

Airflow is appropriate for scheduled, batch-oriented AI data pipelines; Temporal is appropriate for durable, event-driven AI workflows
GPU workloads require explicit resource limits in Kubernetes to prevent resource monopolization
Human-in-the-loop steps require durable workflow orchestration — polling loops in Airflow workers are an antipattern
Clinical AI workflows require complete audit trails; Temporal's event history records every workflow step and can serve as that trail, provided it is retained or exported for the required period
Quality evaluation (golden query evaluation) should be a first-class step in every knowledge base refresh pipeline
