
Improve SUB_WORKFLOW reliability, recovery, and scalability#973

Open
rajeshwar-nu wants to merge 29 commits into conductor-oss:main from steeleye:improvments/dyn-fork-join-sw

Conversation

@rajeshwar-nu
Contributor

@rajeshwar-nu rajeshwar-nu commented Apr 4, 2026

Summary

This PR improves the reliability, recovery behavior, and scalability of SUB_WORKFLOW execution.

It addresses a failure mode where parent workflows can be left with sub-workflow tasks that are persisted but not cleanly attached or recoverable, especially under large fanout, nested sub-workflow creation, or transient queue/persistence instability.

It also reduces the amount of heavy child-workflow startup work performed on the critical parent execution path, which makes SUB_WORKFLOW orchestration behave better under load.

Why

SUB_WORKFLOW is an orchestration primitive, not a worker-polled task.

In the previous model:

  • child launch and parent attachment were too tightly coupled
  • retries did not have a stable child identity to reattach to
  • partially launched SUB_WORKFLOW tasks could remain in SCHEDULED without a reliable recovery path
  • parent/child attachment could lag behind child creation
  • revisit timing for active SUB_WORKFLOW tasks could be too slow for prompt repair
  • large dynamic fanout of nested sub-workflows put too much synchronous pressure on orchestration

For large dyn fork-join -> subworkflow -> dyn fork-join -> subworkflow workloads, that made the system more fragile than it needed to be.

This PR changes SUB_WORKFLOW launch semantics to better match its actual role:

  • durable parent-owned child identity
  • safe retry and reattach
  • faster parent attachment
  • quicker revisit for unresolved active sub-workflow tasks
  • less synchronous orchestration pressure on the parent path

What Changed

Idempotent child launch

  • reserve a stable child workflow id per owning parent workflow task
  • reuse the same child id across retries instead of risking duplicate child creation
  • use execution-store truth for child existence checks instead of index fallback
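
The reuse-or-create flow above can be sketched as follows. This is a minimal illustration, not Conductor's actual API: the execution store is simulated with a Map, and the names (IdempotentLaunch, launchChild, executionStore) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentLaunch {
    // Simulated execution store: childId -> status. In the real system this
    // would be the persistence layer, which is the sole source of truth for
    // existence checks (no index fallback).
    final Map<String, String> executionStore = new HashMap<>();

    // Launches the child under its reserved id, or reattaches if the child
    // already exists, so a retry after a partial failure never creates a
    // duplicate child workflow.
    String launchChild(String reservedChildId) {
        if (!executionStore.containsKey(reservedChildId)) {
            executionStore.put(reservedChildId, "RUNNING"); // create with reserved id
        }
        return reservedChildId; // same child id on every retry
    }
}
```

Because the id is stable per owning parent task, calling launchChild twice is a no-op the second time, which is exactly what makes the retry path safe.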

Faster and lighter parent attachment

  • treat SUB_WORKFLOW launch as an async orchestration step
  • create or reattach the child workflow and attach the parent task as soon as the child record exists
  • avoid waiting for the child workflow’s initial inline expansion before persisting subWorkflowId

This reduces parent-path blocking and improves scalability for nested fanout workloads.

Recovery behavior

  • allow SCHEDULED SUB_WORKFLOW tasks without subWorkflowId to retry launch instead of dead-ending
  • preserve reattach behavior after partial persistence failures
  • make launch failures explicit on the task instead of leaving an ambiguous blank scheduled state
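
The recovery rule above amounts to a simple predicate: a SCHEDULED SUB_WORKFLOW task with a blank subWorkflowId is retriable rather than dead-ended. A hedged sketch, with illustrative names rather than Conductor's actual API:

```java
public class LaunchRecovery {
    // A blank subWorkflowId on a SCHEDULED task means the launch never
    // completed; such tasks should retry the launch instead of sitting in
    // an ambiguous scheduled state forever.
    static boolean shouldRetryLaunch(String taskStatus, String subWorkflowId) {
        return "SCHEDULED".equals(taskStatus)
                && (subWorkflowId == null || subWorkflowId.isEmpty());
    }
}
```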

Reservation lifecycle management

  • add owned reservation cleanup for cancel/delete flows
  • support both single-task reservation removal and bulk workflow-owned cleanup
  • store Redis reservations in a workflow-owned hash for cheaper lookup and cleanup
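
The workflow-owned reservation layout can be pictured as one Redis hash per parent workflow, keyed by parent task id, so a single-task removal is an HDEL and bulk cleanup is a DEL of the whole hash. The sketch below simulates that with nested Maps; the key prefix and class names are illustrative assumptions, not the PR's actual identifiers.

```java
import java.util.HashMap;
import java.util.Map;

public class ReservationStore {
    // Simulated Redis: hash key -> (field -> value).
    final Map<String, Map<String, String>> hashes = new HashMap<>();

    String key(String parentWorkflowId) {
        return "subwf_reservations:" + parentWorkflowId; // one hash per workflow
    }

    void reserve(String parentWorkflowId, String parentTaskId, String childId) {
        hashes.computeIfAbsent(key(parentWorkflowId), k -> new HashMap<>())
              .put(parentTaskId, childId);               // HSET
    }

    void removeTask(String parentWorkflowId, String parentTaskId) {
        Map<String, String> h = hashes.get(key(parentWorkflowId));
        if (h != null) h.remove(parentTaskId);           // HDEL: single-task removal
    }

    void removeWorkflow(String parentWorkflowId) {
        hashes.remove(key(parentWorkflowId));            // DEL: bulk workflow cleanup
    }
}
```

Grouping reservations under a per-workflow key is what makes cancel/delete cleanup a single-key operation instead of a scan.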

Faster revisit for active SUB_WORKFLOW tasks

  • give active SUB_WORKFLOW tasks dedicated postpone behavior
  • use workflowOffsetTimeout for both SCHEDULED and IN_PROGRESS SUB_WORKFLOW tasks
  • avoid inheriting generic worker-oriented postpone behavior for orchestration tasks

This improves reliability by reducing the time unresolved sub-workflow tasks can sit before being revisited.
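
The dedicated postpone behavior can be sketched as a small selection rule: active SUB_WORKFLOW tasks (SCHEDULED or IN_PROGRESS) use workflowOffsetTimeout, while worker-polled tasks keep their generic postpone. The method and parameter names here are illustrative, not Conductor's actual API.

```java
public class PostponePolicy {
    // Orchestration tasks should be revisited on the workflow offset timeout,
    // not on the long worker-oriented postpone inherited by polled tasks.
    static long postponeSeconds(boolean isSubWorkflow, String status,
                                long workflowOffsetTimeout, long workerPostpone) {
        boolean active = "SCHEDULED".equals(status) || "IN_PROGRESS".equals(status);
        return (isSubWorkflow && active) ? workflowOffsetTimeout : workerPostpone;
    }
}
```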

Reliability and Scalability Impact

This PR improves reliability by:

  • making child launch retry-safe
  • making partial launch/attach failures recoverable
  • reducing ambiguous SCHEDULED states
  • revisiting unresolved SUB_WORKFLOW tasks sooner

This PR improves scalability by:

  • reducing heavy inline child startup work on the parent path
  • shortening the parent/child attachment gap
  • making nested fanout workloads less sensitive to transient backend or queue issues

Backward Compatibility

This PR intentionally changes SUB_WORKFLOW behavior:

  • SUB_WORKFLOW launch now follows the async/idempotent attach model implemented here
  • WorkflowExecutor.startWorkflowIdempotent(...) now returns a WorkflowModel
  • persistence implementations must support sub-workflow id reservation APIs

This is both a behavioral change and an SPI change, so mixed-version rollout should be treated carefully.

Tests

Added/updated coverage for:

  • idempotent create-vs-reattach behavior
  • execution-store-only existence checks
  • transient retry for SCHEDULED tasks without subWorkflowId
  • faster parent attach after child creation
  • reservation cleanup on task/workflow deletion
  • backend reservation behavior across persistence implementations
  • postpone behavior for active SUB_WORKFLOW tasks

@rajeshwar-nu rajeshwar-nu marked this pull request as ready for review April 4, 2026 07:25
@rajeshwar-nu rajeshwar-nu changed the title Make sub-workflow launch recoverable for dyn fork-join fanout Make sub-workflow launches durable for dyn fork-join fanouts Apr 4, 2026
@rajeshwar-nu rajeshwar-nu changed the title Make sub-workflow launches durable for dyn fork-join fanouts Improve SUB_WORKFLOW reliability, recovery, and scalability Apr 5, 2026
@v1r3n v1r3n requested a review from mp-orkes April 7, 2026 01:43
@rajeshwar-nu
Contributor Author

Hi @vishesh-orkes, @mp-orkes, @nthmost-orkes, can I please get some thoughts on this 🙏🏻
Thanks

@rajeshwar-nu rajeshwar-nu force-pushed the improvments/dyn-fork-join-sw branch 2 times, most recently from 99695ae to e0d362c Compare April 20, 2026 06:03
@v1r3n
Collaborator

v1r3n commented Apr 20, 2026

@rajeshwar-nu can you address the failing tests? Otherwise this PR LGTM

@rajeshwar-nu rajeshwar-nu force-pushed the improvments/dyn-fork-join-sw branch from e0d362c to 45c5dff Compare April 20, 2026 07:32
After making SUB_WORKFLOW an async system task, many integration
tests need an explicit pop+execute of the SUB_WORKFLOW queue and a
sweep() to advance parent or child workflows that were previously
driven inline by the decide loop.

- HierarchicalForkJoinSubworkflow{Rerun,Restart,Retry}Spec: sweep the
  mid-level workflow before polling its integration_task_2 so the
  task gets scheduled after the async SUB_WORKFLOW executes.
- DoWhileSpec: pop+execute the SUB_WORKFLOW spawned by the DoWhile
  iteration and sweep the resulting subworkflow so its first task
  is scheduled.
- ForkJoinSpec: pop+execute the SUB_WORKFLOW that a retry schedules;
  sweep the nested subworkflow before asserting on its first task.
- NestedForkJoinSubWorkflowSpec: pop+execute the SUB_WORKFLOW that
  restart/retry schedules on the parent workflow.
- SubWorkflow{Rerun,Restart,Retry}Spec: after rerun/restart/retry on
  the root or mid-level, pop+execute each newly scheduled
  SUB_WORKFLOW and sweep the corresponding child workflow so its
  first task is scheduled.
@rajeshwar-nu
Contributor Author

@v1r3n thanks for the review, tests are fixed now

@v1r3n
Collaborator

v1r3n commented Apr 21, 2026

I see the sub-workflow ids are pre-fetched. I think an alternative to it might be to accept workflow Id as part of the start workflow request. The ids can be generated using IdGenerator attached to the parent workflow and then start the sub-workflow. This avoids the id reservation logic and keeps things simpler. This also means as a user you can pass the id as part of the start workflow request. I would then go ahead and update IdGenerator to add a validate method that ensures that the user supplied id is in a valid format.

@rajeshwar-nu
Contributor Author

rajeshwar-nu commented Apr 21, 2026

I see the sub-workflow ids are pre-fetched. I think an alternative to it might be to accept workflow Id as part of the start workflow request. The ids can be generated using IdGenerator attached to the parent workflow and then start the sub-workflow. This avoids the id reservation logic and keeps things simpler. This also means as a user you can pass the id as part of the start workflow request. I would then go ahead and update IdGenerator to add a validate method that ensures that the user supplied id is in a valid format.

Generating subworkflow_id in the parent workflow does not scale well for dynamic fork-join with many forks (around 1k forks). The idea behind the reservation is to keep everything async, including generation of the subworkflow id, scheduling, and starting the subworkflow. The reservation can be created quickly, which allows huge dyn fork-join fan-outs to complete quickly (preventing the leased lock from timing out). Otherwise, a large number of forks executing sequentially within the parent's single lease cannot complete in time, leaving the workflow vulnerable to inconsistent updates from a subsequent decider call

@v1r3n
Collaborator

v1r3n commented Apr 26, 2026

I think we can avoid having the id reservation. Traditionally, ids are blocked and need to be accounted for; in Conductor's case each id is a random UUID, so instead of doing reservation we can just generate an id on demand and attach it to a workflow (we will need to update the workflow executor to accept an id as part of the start request, which should only be available internally for the sub-workflow flow and not for anything else).

This avoids the complexity, and if for some reason the system fails, the next attempt will generate a new id; we don't need to do reservation.

How about this approach?

@v1r3n
Collaborator

v1r3n commented May 2, 2026

I see the sub-workflow ids are pre-fetched. I think an alternative to it might be to accept workflow Id as part of the start workflow request. The ids can be generated using IdGenerator attached to the parent workflow and then start the sub-workflow. This avoids the id reservation logic and keeps things simpler. This also means as a user you can pass the id as part of the start workflow request. I would then go ahead and update IdGenerator to add a validate method that ensures that the user supplied id is in a valid format.

Generating subworkflow_id in the parent workflow does not scale well for dynamic fork-join with many forks (around 1k forks). The idea behind the reservation is to keep everything async, including generation of the subworkflow id, scheduling, and starting the subworkflow. The reservation can be created quickly, which allows huge dyn fork-join fan-outs to complete quickly (preventing the leased lock from timing out). Otherwise, a large number of forks executing sequentially within the parent's single lease cannot complete in time, leaving the workflow vulnerable to inconsistent updates from a subsequent decider call

What about generating them in parallel? We are talking about generating 10K UUIDs.

@rajeshwar-nu
Contributor Author

@v1r3n I changed the approach here. Instead of using a reservation store, the flow is now:

  1. When a SUB_WORKFLOW task is scheduled, Conductor persists it as SCHEDULED.
  2. The SUB_WORKFLOW task is pushed to the system task queue for async processing.
  3. The async system task worker picks up the task and starts the child workflow through an idempotent start path.
  4. The child workflow id is deterministic for the parent task attempt: parentWorkflowId + parentTaskId + retryCount. It is still formatted as a UUID, so storage/backend compatibility is preserved.
  5. Because the child id is deterministic, retries/restarts for the same parent task attempt target the same child workflow id. If the child already exists, the idempotent starter reuses it; if it does not exist, it creates it.
  6. The parent SUB_WORKFLOW task is only attached to the child workflow id after the child workflow record exists, avoiding a long UI/API window where the parent points to a non-existent child.
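
The deterministic child id in step 4 can be sketched with Java's name-based UUID support, which hashes a seed into a stable, UUID-formatted value. This is a hedged illustration: the seed layout and class name are assumptions, not necessarily the PR's exact implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class ChildWorkflowId {
    // Derives a stable, UUID-formatted child workflow id from the parent
    // workflow id, parent task id, and retry count, so every retry of the
    // same parent task attempt targets the same child workflow.
    static String of(String parentWorkflowId, String parentTaskId, int retryCount) {
        String seed = parentWorkflowId + ":" + parentTaskId + ":" + retryCount;
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

Because the id is a pure function of the parent task attempt, the idempotent starter can decide create-vs-reattach without any reservation store, while the UUID format keeps existing storage backends compatible.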
