When the user asks to review, optimize, simplify, or audit a workflow, walk this checklist and produce a structured report. Findings are graded:
- CRITICAL: a production incident waiting to happen. Recommend fixing before the next deploy.
- WARN: a smell or maintenance burden. Recommend fixing, but not blocking.
- INFO: an observation; no fix required.
Treat the checklist as guidance — not every item applies to every workflow. A 3-task batch job doesn't need a failureWorkflow. Use judgment.
- Load the workflow definition. Either:
  - User supplied a JSON file → read it.
  - User named a registered workflow → `conductor workflow get {name} --version {v}` (omit `--version` for the latest).
- For each `SIMPLE` task, load its task definition: `conductor taskDef get {name}`. Timeout/retry config lives there, not on the workflow task.
- (Optional, if the user asks about runtime behavior) Look at recent executions: `conductor workflow search -w {name} -s FAILED -c 20` and inspect a few with `get-execution`.
- Walk the checklist below, recording findings.
- Report grouped by severity. Offer to apply each fix. Don't apply silently.
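The "for each `SIMPLE` task" step is easy to get wrong when tasks are nested inside forks, loops, or switch branches. The following is a minimal sketch, not part of any Conductor SDK: `iter_tasks` and `simple_task_def_names` are hypothetical helpers, and the sample workflow is invented for illustration. It assumes nested tasks live under `forkTasks`, `loopOver`, `decisionCases`, and `defaultCase`.

```python
from typing import Iterator


def iter_tasks(tasks: list) -> Iterator[dict]:
    """Yield every task, descending into fork branches, loops, and switch cases."""
    for task in tasks:
        yield task
        for branch in task.get("forkTasks") or []:
            yield from iter_tasks(branch)
        yield from iter_tasks(task.get("loopOver") or [])
        for case_tasks in (task.get("decisionCases") or {}).values():
            yield from iter_tasks(case_tasks)
        yield from iter_tasks(task.get("defaultCase") or [])


def simple_task_def_names(workflow: dict) -> list:
    """Names of the task definitions to load for every SIMPLE task."""
    names = {
        task["name"]
        for task in iter_tasks(workflow.get("tasks") or [])
        if task.get("type") == "SIMPLE"
    }
    return sorted(names)


# Invented sample definition, for illustration only.
workflow = {
    "name": "order_processing",
    "tasks": [
        {"name": "validate_order", "taskReferenceName": "validate_order", "type": "SIMPLE"},
        {
            "name": "fan_out",
            "taskReferenceName": "fan_out",
            "type": "FORK_JOIN",
            "forkTasks": [
                [{"name": "charge_card", "taskReferenceName": "charge_card", "type": "SIMPLE"}],
                [{"name": "notify", "taskReferenceName": "notify_http", "type": "HTTP"}],
            ],
        },
    ],
}

for name in simple_task_def_names(workflow):
    print(f"conductor taskDef get {name}")
```

The printed commands are the ones from the step above; running them is left to the reviewer.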
- A1. Description present. `description` should explain what the workflow does and why. Empty descriptions force readers to reverse-engineer intent.
  - Severity: WARN if missing.
- A2. `ownerEmail` set. Routes alerts and identifies the on-call.
  - Severity: WARN if missing.
- A3. `schemaVersion: 2`. Older schemas use legacy semantics. New workflows should always be 2.
  - Severity: WARN if missing or 1.
- A4. Task count. Soft limit ~100 tasks per workflow definition. Beyond that, readability and observability degrade; extract logical chunks into `SUB_WORKFLOW`s, just like refactoring oversized functions.
  - Severity: WARN if `len(tasks) > 100`.
- A5. Descriptive `taskReferenceName`. Each ref name is unique workflow-wide and shows up in the UI/logs. Prefer `validate_order` over `task1`.
  - Severity: INFO/WARN.
- A6. Understand the three timeouts. Reference (no severity; purely educational). Each task definition has three timeout knobs and they catch different failure modes:
  - `pollTimeoutSeconds`: task sits in the queue this long without a worker picking it up → abandoned. Catches "no worker is polling for this type."
  - `responseTimeoutSeconds`: once a worker checks out the task, how long without a heartbeat before redelivery. Catches "worker crashed mid-execution."
  - `timeoutSeconds`: total wall clock from pickup to terminal status. Catches "worker is alive but the task takes too long."

  The severity ladder for missing/zero timeouts is B1 below.
- A7. Workflow versioning hygiene. Don't in-place update workflows that have running production executions: bump `version`, deploy callers pointing at the new version, deprecate the old when no executions remain. In-place updates can affect running executions in ways that vary by task type (especially around input expressions). New versions are free; the registry holds many.
  - Severity: WARN if a workflow with executions in the last 30 days has been edited in place.
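The mechanical parts of A1 through A4 can be checked directly against the parsed workflow JSON. This is a rough sketch under stated assumptions: `check_metadata` is a hypothetical helper, findings are plain `(check_id, severity, message)` tuples, and the sample definition is invented. A5 and A7 need judgment or execution history, so they are left out.

```python
TASK_SOFT_LIMIT = 100  # soft limit from A4


def check_metadata(workflow: dict) -> list:
    """Apply checks A1-A4 to a workflow definition parsed from JSON."""
    findings = []
    if not (workflow.get("description") or "").strip():
        findings.append(("A1", "WARN", "description is missing or empty"))
    if not workflow.get("ownerEmail"):
        findings.append(("A2", "WARN", "ownerEmail is not set"))
    if workflow.get("schemaVersion") != 2:
        findings.append(("A3", "WARN", "schemaVersion is missing or not 2"))
    task_count = len(workflow.get("tasks") or [])
    if task_count > TASK_SOFT_LIMIT:
        findings.append(("A4", "WARN", f"{task_count} tasks exceeds the ~100 soft limit"))
    return findings


# Invented sample definition: no description, empty ownerEmail.
example = {
    "name": "order_processing",
    "version": 3,
    "schemaVersion": 2,
    "ownerEmail": "",
    "tasks": [
        {"name": "validate_order", "taskReferenceName": "validate_order", "type": "SIMPLE"}
    ],
}

for check_id, severity, message in check_metadata(example):
    print(check_id, severity, message)
```

As in A4 itself, the count uses only the top-level `tasks` array.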
- B1. Task timeouts on every SIMPLE task. Each task definition needs `responseTimeoutSeconds`, `pollTimeoutSeconds`, and `timeoutSeconds`. See A6 for what each catches. Single severity ladder:
  - CRITICAL if any of the three is `0` or unset on a task def for a SIMPLE task in production use.
  - WARN if all three are set but one or more are clearly too low (e.g. `responseTimeoutSeconds: 1`).
  - INFO if all three are set with reasonable values.
- B2. Workflow-level timeout. `timeoutSeconds` + `timeoutPolicy` (`TIME_OUT_WF` or `ALERT_ONLY`). Without one, a stuck workflow can run forever.
  - Severity: WARN by default; only INFO if the workflow legitimately has no upper bound (long-lived state machines, event-driven loops). Confirm with the user.
- B3. Retry policy on SIMPLE tasks. `retryCount`, `retryLogic` (`FIXED` or `EXPONENTIAL_BACKOFF`), `retryDelaySeconds`. Transient errors are common; `retryCount: 0` exposes every blip.
  - Severity: WARN if `retryCount == 0` and the task isn't intrinsically non-retryable.
- B4. `failureWorkflow` for cleanup/alerting. Runs when the parent fails. Common pattern: send an alert, mark the entity failed in your DB, release reserved resources. Often missing.
  - Severity: WARN if absent on workflows that mutate external state.
- B5. DO_WHILE iteration cap. The `loopCondition` should always include a max-iteration guard (`$.loop_ref['iteration'] < N`) in addition to any result-driven exit. Without it, an unexpected output spins forever.
  - Severity: CRITICAL if unbounded.
- B6. `optional: true` on non-critical branches. A best-effort notification, audit log, or analytics push shouldn't fail the workflow. Mark them optional.
  - Severity: INFO; flag candidates, don't dictate.
- B7. Rate limits and concurrent-exec limits on task defs. Two related throttling levers, often both missing:
  - `rateLimitPerFrequency` + `rateLimitFrequencyInSeconds`: token-bucket rate limit. Use for tasks calling external APIs with quotas (Stripe, Slack, third-party LLMs). Without this, a spike in workflow starts blows your quota.
  - `concurrentExecLimit`: caps simultaneous executions of this task across all workflows. Use for resource-bound tasks: heavy DB writes, GPU-bound model calls, memory-hungry transforms.
  - Severity: WARN on tasks calling external rate-limited APIs without `rateLimitPerFrequency`. WARN on resource-bound tasks without `concurrentExecLimit`.
- B8. `jsonOutput: true` without "JSON" in the prompt. Conductor's `@Documented` on `jsonOutput` notes: "Depending on the model you MUST include JSON word as part of the prompt." Anthropic Claude in particular silently degrades to prose when this cue is missing.
  - Severity: WARN when an `LLM_CHAT_COMPLETE` sets `jsonOutput: true` and neither the system nor user messages contain the substring `JSON` (case-insensitive). Also recommend pairing with `outputSchema` for stricter contracts (Conductor retries on schema-validation failure).
- B9. `previousResponseId` provider lock-in / chain breakage. The OpenAI Responses-API chaining field is silently ignored on other providers, and a mid-chain provider switch breaks the chain.
  - Severity: WARN when any task uses `previousResponseId` and either (a) `llmProvider` is not `openai` / `azureopenai`, or (b) a chained task's provider differs from the task whose `responseId` it references.
  - For long-running workflows where chain lifetime exceeds OpenAI's ~30-day `responseId` retention, recommend the accumulated-messages fallback (../examples/ai-agent-loop.md) and downgrade to INFO when an explicit fallback path is present.
- B10. HTTP task hitting an LLM provider API: use the built-in LLM task instead. Conductor ships first-class LLM tasks (`LLM_CHAT_COMPLETE`, `LLM_GENERATE_EMBEDDINGS`, `LLM_GENERATE_IMAGE`, `LLM_GENERATE_TTS`, `LLM_GENERATE_VIDEO`, `LLM_SEARCH_INDEX`). Hand-rolling the same call as an `HTTP` task to `api.openai.com` / `api.anthropic.com` / `generativelanguage.googleapis.com` / Vertex / Bedrock / Azure-OpenAI / Cohere / Mistral / Grok / Perplexity / HuggingFace / Ollama loses everything the built-in tasks give you: auth wiring, retries, token accounting, the `{role, message}` schema, `webSearch` / `codeInterpreter` built-in tools, `previousResponseId` chaining, `tools[]` function-calling, structured-output parsing (`jsonOutput` + `outputSchema`-driven retry), and a uniform `output.result` shape that downstream tasks can consume.
  - Severity: CRITICAL when an `HTTP` task's `http_request.uri` matches a known LLM-provider host (`*.openai.com`, `*.anthropic.com`, `generativelanguage.googleapis.com`, `*-aiplatform.googleapis.com`, `bedrock-runtime.*.amazonaws.com`, `*.openai.azure.com`, `api.cohere.ai`, `api.mistral.ai`, `api.x.ai`, `api.perplexity.ai`, `api-inference.huggingface.co`, `*.ollama.ai`, or any `/v1/chat/completions`, `/v1/messages`, `/v1/embeddings`, `/v1/responses` path on a non-Conductor host).
  - Fix: replace the HTTP task with the matching `LLM_*` task. If the user says "the server doesn't have an Anthropic integration configured," the answer is to set `ANTHROPIC_API_KEY` (or the provider-equivalent env var) on the Conductor server, not to keep the HTTP task. Conductor auto-enables providers when the key is present.
  - Legitimate exceptions (downgrade to INFO with a one-line note): (a) the URL is a non-AI endpoint the provider happens to host (e.g. an admin/billing API); (b) the user has demonstrated a specific feature gap not yet exposed by the built-in task (name the missing field). Provider lock-in concerns ("we want to swap providers later") are not a reason for HTTP; that's exactly what `llmProvider` on the built-in task solves.
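Three of the checks above (B1, B5, B10) are mechanical enough to sketch in code. The snippet below is illustrative only: the function names are hypothetical, `MIN_REASONABLE_SECONDS` is an assumed cutoff for "clearly too low" (the checklist gives only `responseTimeoutSeconds: 1` as an example), and the loop-cap check is a heuristic that detects the presence of an iteration comparison, not whether it actually bounds the loop. The host globs and paths are copied from the B10 severity rule.

```python
import re
from fnmatch import fnmatchcase
from urllib.parse import urlparse

TIMEOUT_FIELDS = ("responseTimeoutSeconds", "pollTimeoutSeconds", "timeoutSeconds")
MIN_REASONABLE_SECONDS = 5  # assumed cutoff for "clearly too low"

LLM_HOSTS = (
    "*.openai.com", "*.anthropic.com", "generativelanguage.googleapis.com",
    "*-aiplatform.googleapis.com", "bedrock-runtime.*.amazonaws.com",
    "*.openai.azure.com", "api.cohere.ai", "api.mistral.ai", "api.x.ai",
    "api.perplexity.ai", "api-inference.huggingface.co", "*.ollama.ai",
)
LLM_PATHS = ("/v1/chat/completions", "/v1/messages", "/v1/embeddings", "/v1/responses")


def timeout_severity(task_def: dict) -> str:
    """B1 ladder: CRITICAL if any timeout is 0/unset, WARN if too low, else INFO."""
    values = [task_def.get(field) or 0 for field in TIMEOUT_FIELDS]
    if any(value <= 0 for value in values):
        return "CRITICAL"
    if any(value < MIN_REASONABLE_SECONDS for value in values):
        return "WARN"
    return "INFO"


def has_iteration_cap(do_while_task: dict) -> bool:
    """B5 heuristic: is there a `$.<ref>['iteration'] < N` comparison in loopCondition?"""
    ref = re.escape(do_while_task["taskReferenceName"])
    pattern = rf"\$\.{ref}\[['\"]iteration['\"]\]\s*<=?\s*\S+"
    return re.search(pattern, do_while_task.get("loopCondition") or "") is not None


def is_llm_endpoint(uri: str, conductor_hosts: tuple = ()) -> bool:
    """B10: does an HTTP task's http_request.uri point at an LLM provider?"""
    parsed = urlparse(uri)
    host = parsed.hostname or ""
    if any(fnmatchcase(host, glob) for glob in LLM_HOSTS):
        return True
    if not host or host in conductor_hosts:
        return False
    return any(parsed.path.endswith(path) for path in LLM_PATHS)


print(timeout_severity({"responseTimeoutSeconds": 0, "pollTimeoutSeconds": 60, "timeoutSeconds": 300}))
print(has_iteration_cap({"taskReferenceName": "retry_loop", "loopCondition": "$.check['done'] == false"}))
print(is_llm_endpoint("https://api.anthropic.com/v1/messages"))
```

A templated URI such as `${workflow.input.url}` has no parseable host and is reported as a non-match, so those tasks still need a manual look.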
- C1. INLINE/graaljs scope. JavaScript inline is for trivial validation, format conversion, simple computation. Anything with business logic (multi-step transforms, external dependencies, side effects) belongs in a worker.
  - Heuristic: INLINE script over ~15 lines, or one that's hard to follow at a glance, is a smell.
  - Severity: WARN.
- C2. Prefer `JSON_JQ_TRANSFORM` for data shaping. JQ is purpose-built and faster than INLINE for filter/map/aggregate. INLINE makes sense for control flow or arithmetic; JQ for shape transforms.
  - Severity: INFO.
- C3. Bounded fan-out. Static `FORK_JOIN` with > ~20 branches is a smell; switch to `FORK_JOIN_DYNAMIC`. Dynamic fork with thousands of branches needs batching (chunk inputs, run sub-workflows of size ~50).
  - Severity: WARN at high static counts; CRITICAL at unbounded dynamic counts without batching.
- C4. `asyncComplete: true` for long-running operations. Worker initiates external work, returns immediately, then signals completion later. Avoids holding worker threads for hours.
  - Severity: INFO.
- C5. SUB_WORKFLOW for reuse, not organization. Each sub-workflow has its own execution context, separate UI view, and orchestration overhead. Worth it when:
  - the same logic is reused across multiple parents, OR
  - the chunk is independently scheduled or testable.

  Don't extract a sub-workflow just to "organize" a long workflow into chapters; that's what naming and the description field are for. The cost is real: debugging a single failure now spans two execution views.
  - Severity: WARN if a SUB_WORKFLOW is used by exactly one parent and isn't independently scheduled.
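The C1 and C3 heuristics are simple counts, and the batching advice in C3 is a plain chunking step. A minimal sketch follows, assuming the INLINE script is stored under `inputParameters.expression` and static fork branches under `forkTasks`; the helper names and the sample script are hypothetical.

```python
INLINE_LINE_LIMIT = 15  # heuristic from C1
STATIC_FORK_LIMIT = 20  # heuristic from C3
BATCH_SIZE = 50         # suggested sub-workflow batch size from C3


def inline_script_lines(task: dict) -> int:
    """Count non-blank lines in an INLINE task's script."""
    script = (task.get("inputParameters") or {}).get("expression") or ""
    return sum(1 for line in script.splitlines() if line.strip())


def review_task(task: dict) -> list:
    """Return C1/C3 notes for a single workflow task."""
    notes = []
    if task.get("type") == "INLINE" and inline_script_lines(task) > INLINE_LINE_LIMIT:
        notes.append("C1 WARN: INLINE script exceeds ~15 lines; consider a worker")
    if task.get("type") == "FORK_JOIN" and len(task.get("forkTasks") or []) > STATIC_FORK_LIMIT:
        notes.append("C3 WARN: more than ~20 static branches; consider FORK_JOIN_DYNAMIC")
    return notes


def chunk(items: list, size: int = BATCH_SIZE) -> list:
    """Split dynamic-fork inputs into batches, one sub-workflow per batch."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Invented 20-line script to trigger the C1 heuristic.
long_script = "\n".join(f"var x{i} = {i};" for i in range(20))
inline_task = {
    "type": "INLINE",
    "inputParameters": {"evaluatorType": "graaljs", "expression": long_script},
}
print(review_task(inline_task))
print([len(batch) for batch in chunk(list(range(120)))])
```

Line count is only a proxy; the "hard to follow at a glance" half of C1 still needs a human read.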
- D1. No secrets in workflow input. Tokens, API keys, signing secrets must come from the secrets system (`${workflow.secrets.X}` on Orkes) or worker environment variables, never `${workflow.input.token}`. Workflow inputs are visible in the execution view.
  - Severity: CRITICAL if a real secret is being passed via input.
- D2. No hardcoded URLs / config in task definitions. Parameterize via `${workflow.input.x}` or `${workflow.variables.x}`; environment-specific URLs hardcoded into a definition mean a separate definition per environment.
  - Severity: WARN.
- D3. `outputParameters` is a public API. Other workflows, services, and dashboards depend on the workflow's output shape. Treat changes the way you'd treat function-signature changes: additions are usually safe, removals and renames are breaking. Bump `version` on breaking output changes; never reshape outputs in place.
  - Severity: WARN if a workflow with active consumers had outputs renamed or removed in place.
- D4. LLM outputs that route control flow need defensive handling. When a SWITCH branches on `${chat.output.result.action}` (or any LLM-emitted field), an unparseable or unexpected emission can silently flow into the wrong branch.
  - Severity: WARN when a SWITCH whose `expression` reads `output.result.<x>` from an `LLM_CHAT_COMPLETE` task has a non-empty `defaultCase` that performs business logic (writes, finalize, etc.). Recommend either an empty `defaultCase: []` or a sentinel/no-op handler. See template-resolution.md Pitfall 1.
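For D1, a first pass can scan the definition for `${workflow.input.<name>}` references whose names look like secrets. This is a sketch under assumptions: `SECRET_HINTS` is an invented, deliberately loose list of name fragments, `suspicious_input_refs` is a hypothetical helper, and the sample workflow is made up. A match is a candidate to confirm with the user, not proof of a leaked secret.

```python
import json
import re

# Assumed name fragments that suggest a secret; tune for your own naming.
SECRET_HINTS = ("token", "secret", "password", "credential", "key")
INPUT_REF = re.compile(r"\$\{workflow\.input\.([A-Za-z0-9_]+)")


def suspicious_input_refs(workflow: dict) -> list:
    """Return workflow.input names that look like secrets (D1 candidates)."""
    text = json.dumps(workflow)
    names = sorted(set(INPUT_REF.findall(text)))
    return [name for name in names if any(hint in name.lower() for hint in SECRET_HINTS)]


# Invented sample: the API key is passed through workflow input.
workflow = {
    "name": "order_processing",
    "tasks": [
        {
            "name": "charge",
            "taskReferenceName": "charge_card_http",
            "type": "HTTP",
            "inputParameters": {
                "http_request": {
                    "uri": "https://payments.example.com/charge",
                    "method": "POST",
                    "headers": {"Authorization": "Bearer ${workflow.input.stripeKey}"},
                    "body": {"orderId": "${workflow.input.orderId}"},
                }
            },
        }
    ],
}

print(suspicious_input_refs(workflow))
```

Name matching will produce false positives (for example an input called `sortKey`), which is why the checklist asks whether a real secret is being passed before grading it CRITICAL.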
Sometimes the right answer is not a workflow. Smell tests:
- E1. Sub-100ms latency-critical paths. Workflow start has measurable overhead (queue write, definition load, dispatch). If a user is waiting synchronously, prefer a direct call.
- E2. Single-task "workflows." A workflow with one HTTP task is a queue with extra steps. Use a queue, scheduled worker, or just a function call.
- E3. Large payloads in inputs/outputs. Conductor has practical limits: typically a few MB before perf degrades and the UI struggles. Push blobs (uploaded files, large model outputs, dataset rows) to object storage and let the workflow carry only references (`{ "bucket": "...", "key": "..." }`).
  - Severity: WARN/CRITICAL depending on actual payload size and frequency.
- E4. Reinventing a built-in task with HTTP / INLINE / a custom worker. Conductor ships dedicated system tasks for most common operations — using HTTP, INLINE, or a hand-rolled worker for any of them loses retries, schema validation, observability, and clean parameter swaps. B10 is the LLM-specific instance of this rule (CRITICAL); E4 covers everything else (WARN by default).
  - Common patterns to flag:

    | Smell | Use built-in |
    | --- | --- |
    | `HTTP` POST to a Kafka REST proxy, or worker that calls a Kafka producer | `KAFKA_PUBLISH` |
    | Worker that renders HTML/markdown to PDF | `GENERATE_PDF` |
    | `HTTP` POST to `pinecone.io`, `*.pinecone.io`, `api.pinecone.io`, `*.weaviate.network`, MongoDB Atlas Search, or worker that wraps a vector-DB client | `LLM_INDEX_TEXT` / `LLM_STORE_EMBEDDINGS` / `LLM_SEARCH_INDEX` / `LLM_SEARCH_EMBEDDINGS` / `LLM_GET_EMBEDDINGS` |
    | Worker that just sleeps / polls a deadline | `WAIT` (`duration` or `until`) |
    | Worker that waits on a human approval queue | `HUMAN` |
    | Worker that triggers another workflow via the REST API | `SUB_WORKFLOW` (synchronous) or `START_WORKFLOW` (fire-and-forget) |
    | `INLINE` script that just reshapes / filters / aggregates / stringifies JSON | `JSON_JQ_TRANSFORM` (also covered by C2; INFO) |
    | Worker that publishes to SQS / internal Conductor event sink | `EVENT` |
    | Worker that resolves "which task to run" at runtime from input | `DYNAMIC` / `FORK_JOIN_DYNAMIC` |
    | `HTTP` POST to OpenAI Images / Vertex Imagen, OpenAI TTS, OpenAI Sora / Vertex Veo | `GENERATE_IMAGE` / `GENERATE_AUDIO` / `GENERATE_VIDEO` (B10 CRITICAL: these are LLM-provider hosts) |
    | `HTTP` GET/POST to an MCP server | `LIST_MCP_TOOLS` / `CALL_MCP_TOOL` |

  - Severity: WARN by default; CRITICAL when the reinvented task is on the B10 list (LLM/media providers; auth/secret-leak risk is concentrated there).
  - Fix: replace the HTTP/INLINE/worker task with the matching built-in. If the user objects ("we want flexibility / we want to swap providers"), point out that flexibility is exactly what `llmProvider` on `LLM_*`, `vectorDB` on `LLM_INDEX_TEXT` / `LLM_SEARCH_INDEX`, and `subWorkflowParam` on `SUB_WORKFLOW` already give you.
  - Legitimate exceptions (downgrade to INFO with one-line reason): (a) the operation truly has no built-in (custom internal API, proprietary system); (b) the user has demonstrated a specific missing feature in the built-in (name it). In case (a), follow SKILL.md Rule 7 to scaffold a worker (ask language, WebFetch the SDK).
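For E3, payload size can be estimated by serializing a sample input or output and measuring it. The sketch below is illustrative: the two byte thresholds are assumptions (the checklist only says "a few MB"), the helper names are hypothetical, and the bucket and key values are placeholders.

```python
import json

# Assumed thresholds; adjust to what your deployment tolerates.
WARN_BYTES = 1 * 1024 * 1024
CRITICAL_BYTES = 5 * 1024 * 1024


def payload_size_bytes(payload) -> int:
    """Size of the payload as UTF-8 encoded JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def payload_severity(payload) -> str:
    """Grade a sample payload against the assumed thresholds."""
    size = payload_size_bytes(payload)
    if size >= CRITICAL_BYTES:
        return "CRITICAL"
    if size >= WARN_BYTES:
        return "WARN"
    return "OK"


def to_reference(bucket: str, key: str) -> dict:
    """Reference shape from E3; the blob itself stays in object storage."""
    return {"bucket": bucket, "key": key}


inline_rows = {"rows": ["x" * 1024] * 2048}  # about 2 MB once serialized
by_reference = {"rows": to_reference("my-bucket", "datasets/rows.json")}

print(payload_severity(inline_rows), payload_size_bytes(inline_rows))
print(payload_severity(by_reference), payload_size_bytes(by_reference))
```

Frequency matters as much as size, so a payload graded WARN here may still deserve CRITICAL if the workflow runs thousands of times an hour.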
Render findings like this:
Workflow: order_processing v3 (47 tasks)
CRITICAL (4)
✗ B1 SIMPLE task `charge_card`: responseTimeoutSeconds=0
→ Set responseTimeoutSeconds >= 30, pollTimeoutSeconds >= 60, timeoutSeconds = 300
✗ B5 DO_WHILE `retry_loop`: condition has no iteration cap
→ Add `$.retry_loop['iteration'] < 10 &&` to loopCondition
✗ B10 HTTP task `call_claude` posts to https://api.anthropic.com/v1/messages
→ Replace with an LLM_CHAT_COMPLETE task (llmProvider: anthropic). Set
ANTHROPIC_API_KEY on the server if the integration isn't configured yet.
✗ D1 Workflow input `stripeKey` looks like a secret
→ Move to ${workflow.secrets.STRIPE_KEY} or worker env
WARN (4)
⚠ A1 Description is empty
⚠ B2 No workflow timeout. Add timeoutSeconds + timeoutPolicy.
⚠ B3 SIMPLE task `send_email` has retryCount=0 (transient SMTP errors will fail the workflow)
⚠ C1 INLINE task `compute_pricing` has 60 lines of JS — extract to a worker
INFO (2)
• A4 47 tasks — well within the 100-task soft limit
• A5 Task names are descriptive
Recommended Changes (priority order)
[ ] task_def_charge_card.json set responseTimeoutSeconds=30, pollTimeoutSeconds=60, timeoutSeconds=300
[ ] order_processing.json:7 add `$.retry_loop['iteration'] < 10` clause to loopCondition
[ ] order_processing.json replace HTTP task `call_claude` with LLM_CHAT_COMPLETE (llmProvider: anthropic)
[ ] order_processing.json:2 move stripeKey to ${workflow.secrets.STRIPE_KEY}
[ ] order_processing.json:1 add description, timeoutSeconds, timeoutPolicy
[ ] task_def_send_email.json set retryCount=3, retryLogic=EXPONENTIAL_BACKOFF
[ ] compute_pricing INLINE extract to a Python worker
Then offer: "Want me to apply any of these? I can update the task definitions and re-register the workflow."
Always end with a Recommended Changes checklist even if the findings are split by severity above. The checklist is the actionable artifact the user takes away — one bullet per fix, file/path pointer first, then the change to make. Skip findings that are INFO-only.
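The report layout above can be produced from a flat list of findings. The following is a rough sketch, not a prescribed implementation: `Finding` and `render_report` are hypothetical names, and the markers and headings are copied from the sample report. INFO findings are listed in their own group but left out of the checklist, as described above.

```python
from collections import defaultdict
from typing import NamedTuple

SEVERITY_ORDER = ("CRITICAL", "WARN", "INFO")
MARKERS = {"CRITICAL": "✗", "WARN": "⚠", "INFO": "•"}


class Finding(NamedTuple):
    check_id: str
    severity: str
    message: str
    fix: str = ""     # change to make; empty for INFO-only findings
    target: str = ""  # file/path pointer shown first in the checklist


def render_report(title: str, findings: list) -> str:
    """Group findings by severity, then append the Recommended Changes checklist."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding.severity].append(finding)

    lines = [f"Workflow: {title}"]
    for severity in SEVERITY_ORDER:
        items = grouped.get(severity, [])
        if not items:
            continue
        lines.append(f"{severity} ({len(items)})")
        for item in items:
            lines.append(f"{MARKERS[severity]} {item.check_id} {item.message}")

    lines.append("Recommended Changes (priority order)")
    for severity in ("CRITICAL", "WARN"):  # INFO-only findings are skipped
        for item in grouped.get(severity, []):
            if item.fix:
                lines.append(f"[ ] {item.target} {item.fix}")
    return "\n".join(lines)


findings = [
    Finding("A4", "INFO", "47 tasks, well within the 100-task soft limit"),
    Finding("B3", "WARN", "SIMPLE task `send_email` has retryCount=0",
            "set retryCount=3, retryLogic=EXPONENTIAL_BACKOFF", "task_def_send_email.json"),
    Finding("B1", "CRITICAL", "SIMPLE task `charge_card`: responseTimeoutSeconds=0",
            "set responseTimeoutSeconds=30", "task_def_charge_card.json"),
]
print(render_report("order_processing v3 (47 tasks)", findings))
```

Sorting happens through the fixed severity order, so callers can append findings in whatever order the checklist walk produces them.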
A simpler workflow is one a new engineer can read in five minutes. The biggest levers:
- Extract sub-workflows. Group related tasks (validate-and-prep, fulfill, notify) into separate registered workflows.
- Replace INLINE business logic with workers. A worker has a name, version, tests, and a stack trace; INLINE has none of those.
- Flatten nested SWITCHes. Two-level decision trees are usually a sign that one level should be a sub-workflow.
- Name things. Every task ref name and variable should read as English.
Don't over-refactor. If the workflow is already small and readable, "simpler" might be a no-op — say so.