## Describe the bug

### Description
When `conductor.workflow-execution-lock.type` is set to `postgres` (or any non-reentrant lock
implementation), workflows containing a `SUB_WORKFLOW` task stall permanently after the
sub-workflow completes. The parent workflow remains in `RUNNING` status indefinitely, and the
sweeper logs repeat every ~30 seconds:

```
INFO WorkflowSweeper : Running sweeper for workflow <id>
INFO WorkflowSweeper : can't get a lock on <id>, will try after 30 seconds
```
### Root Cause

`WorkflowSweeper.sweep()` acquires the execution lock at line 162, then calls
`workflowExecutor.decide(workflowId)`, which calls `executionLockService.acquireLock()` again
internally (`WorkflowExecutorOps.decide()`, line 1089). With a non-reentrant lock (Postgres
advisory locks are non-reentrant by design), the second acquisition always returns `false`.
`decide()` then returns `null`, which `sweep()` interprets as external lock contention and
re-queues the workflow with a backoff delay. This repeats forever.
```java
// WorkflowSweeper.sweep() — acquires lock here:
if (!executionLockService.acquireLock(workflowId)) { ... }

// Then calls decide(), which ALSO calls acquireLock() internally:
workflow = workflowExecutor.decide(workflowId); // always returns null with a non-reentrant lock

// null is misread as contention -> re-queue with backoff -> infinite loop
if (workflow == null) {
    queueDAO.push(DECIDER_QUEUE, workflowId, 0, backoffSeconds);
    return;
}
```
The `SUB_WORKFLOW` task is the most common trigger because its completion does not re-enqueue the
parent workflow directly; it relies on the next sweep cycle to detect the completed sub-workflow
and advance the parent. That sweep cycle then hits this deadlock.
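The failure mode can be reproduced outside Conductor entirely. The following toy model (not Conductor code; class and method names are invented for illustration) uses a one-permit `Semaphore` to stand in for a non-reentrant lock and mirrors the `sweep()` → `decide()` call chain described above:

```java
import java.util.concurrent.Semaphore;

// Toy model of the deadlock: a thread cannot re-acquire a
// non-reentrant lock it already holds, so the nested acquire
// inside decide() always fails when sweep() pre-acquired.
class NonReentrantLockDemo {

    // One permit = a non-reentrant lock: a second tryAcquire()
    // fails even for the holder, like the Postgres-backed lock.
    static final Semaphore lock = new Semaphore(1);

    // Mirrors decide(): takes the lock itself, returns null on failure.
    static String decide(String workflowId) {
        if (!lock.tryAcquire()) {
            return null; // indistinguishable from external contention
        }
        try {
            return "advanced";
        } finally {
            lock.release();
        }
    }

    // Mirrors the buggy sweep(): pre-acquires the lock, then calls decide().
    static String sweep(String workflowId) {
        if (!lock.tryAcquire()) {
            return "could-not-lock";
        }
        try {
            return decide(workflowId); // always null: we already hold the lock
        } finally {
            lock.release();
        }
    }

    public static void main(String[] args) {
        System.out.println(sweep("wf-1"));  // prints "null"
        System.out.println(decide("wf-1")); // prints "advanced" once the lock is free
    }
}
```

The nested call returns `null` every time, while a direct `decide()` call succeeds, which is exactly the asymmetry the sweeper misreads as contention.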
### Steps to Reproduce

- Configure Conductor with `conductor.workflow-execution-lock.type: postgres` and `conductor.app.workflow-execution-lock-enabled: true`
- Define a workflow containing a `SUB_WORKFLOW` task followed by any subsequent task
- Start the workflow and let the sub-workflow complete
- Observe that the parent workflow remains in `RUNNING` indefinitely
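For step 2, a minimal workflow definition of the required shape might look like the following (all names are illustrative, not from the report; `child_workflow` is any registered workflow):

```json
{
  "name": "parent_workflow",
  "version": 1,
  "schemaVersion": 2,
  "tasks": [
    {
      "name": "run_sub",
      "taskReferenceName": "run_sub_ref",
      "type": "SUB_WORKFLOW",
      "subWorkflowParam": { "name": "child_workflow", "version": 1 }
    },
    {
      "name": "after_sub",
      "taskReferenceName": "after_sub_ref",
      "type": "SIMPLE"
    }
  ]
}
```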
The following unit test reproduces the bug and can be added to `WorkflowSweeperTest`:
```java
/**
 * Reproduction for: sweep() acquires the execution lock, then calls decide(workflowId), which
 * also calls acquireLock() internally. With a non-reentrant lock (e.g. a Postgres advisory lock)
 * the second acquisition always fails, decide() returns null, and the sweeper re-queues the
 * workflow with a backoff delay, leaving it permanently stuck even though the sub-workflow has
 * already completed.
 *
 * <p>Expected (correct) behaviour: decide() is called and the workflow advances.
 * Actual (buggy) behaviour: decide() returns null because the sweeper already holds the lock;
 * the workflow is pushed back to the decider queue with a 30-second delay, indefinitely.
 */
@Test
public void sweepDoesNotStallWhenSubWorkflowCompletedAndLockIsNonReentrant() {
    TaskModel subWorkflowTask =
            newTask(
                    "sub-workflow-task",
                    TaskType.TASK_TYPE_SUB_WORKFLOW,
                    TaskModel.Status.IN_PROGRESS);
    subWorkflowTask.setSubWorkflowId("sub-workflow-id");
    WorkflowModel workflow = newWorkflow(List.of(subWorkflowTask));

    WorkflowModel subWorkflow = new WorkflowModel();
    subWorkflow.setStatus(WorkflowModel.Status.COMPLETED);
    subWorkflow.setOutput(Map.of("result", "ok"));

    // The sweeper acquires the lock successfully on the first call.
    // The second call simulates a non-reentrant lock: decide(workflowId) calls acquireLock()
    // internally and cannot acquire a lock that the sweeper already holds.
    when(executionLockService.acquireLock(WORKFLOW_ID))
            .thenReturn(true)   // sweeper's own pre-acquire
            .thenReturn(false); // decide()'s internal acquire: non-reentrant lock fails
    when(workflowExecutor.getWorkflow(WORKFLOW_ID, true)).thenReturn(workflow);

    // decide(workflowId) returns null when it cannot acquire the lock
    when(workflowExecutor.decide(WORKFLOW_ID)).thenReturn(null);

    workflowSweeper.sweep(WORKFLOW_ID);

    // Bug: the workflow is pushed back to the decider queue with a backoff delay,
    // causing it to stall permanently even though the sub-workflow is already done.
    verify(queueDAO, never()).push(anyString(), anyString(), anyInt(), anyLong());
}
```
This test fails on the current `main` branch, confirming the bug.
### Suggested Fix

Remove the redundant pre-acquisition of the lock from `WorkflowSweeper.sweep()`. The
`decide(workflowId)` method already handles its own locking; the sweeper's pre-acquire is not
only redundant but actively causes the deadlock with non-reentrant lock backends.
```java
// WorkflowSweeper.sweep() — REMOVE these lines:
if (!executionLockService.acquireLock(workflowId)) {
    log.error("Couldn't acquire lock to sweep workflow {}", workflowId);
    return;
}
// ...
} finally {
    executionLockService.releaseLock(workflowId); // also remove
}
```
With this change, `sweep()` calls `decide(workflowId)` directly, which acquires and releases the
lock itself. The `workflow == null` path remains correct for handling genuine external contention
(another thread holding the lock), since `decide()` still returns `null` in that case.

The `executionLockService` field and constructor parameter in `WorkflowSweeper` can then be
removed entirely, as they are no longer used.
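To illustrate the intended control flow after the fix, here is a self-contained sketch (again a toy model, not Conductor's actual classes or signatures): `sweep()` defers all locking to `decide()`, so a `null` result now only occurs on genuine external contention.

```java
import java.util.concurrent.Semaphore;

// Sketch of the fixed control flow: sweep() no longer pre-acquires
// the lock, so decide()'s own acquire can succeed.
class FixedSweeperSketch {

    // Non-reentrant lock stand-in (one permit), as in the bug description.
    static final Semaphore lock = new Semaphore(1);

    // Stand-in for queueDAO.push(): counts backoff re-queues.
    static int requeued = 0;

    // decide() owns the lock: acquire, advance the workflow, release.
    static String decide(String workflowId) {
        if (!lock.tryAcquire()) {
            return null; // genuine contention: another holder exists
        }
        try {
            return "advanced";
        } finally {
            lock.release();
        }
    }

    // Fixed sweep(): no pre-acquire, just delegate to decide().
    static void sweep(String workflowId) {
        String workflow = decide(workflowId);
        if (workflow == null) {
            requeued++; // re-queue with backoff, only on real contention now
        }
    }

    public static void main(String[] args) {
        sweep("wf-1");
        System.out.println(requeued); // prints 0: workflow advanced, no re-queue
    }
}
```

With the pre-acquire gone, an uncontended sweep advances the workflow without touching the backoff path, while a sweep that races another lock holder still backs off as before.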
### Environment

- Conductor version: 3.23.0 (also reproduced on current `main`)
- Lock type: `postgres` (`conductor.workflow-execution-lock.type: postgres`)
- Trigger: any workflow with a `SUB_WORKFLOW` task
### Workaround

Disable the execution lock for single-node deployments:

```yaml
conductor:
  app:
    workflow-execution-lock-enabled: false
  workflow-execution-lock:
    type: noop_lock
```