Skip to content

WorkflowSweeper self-deadlocks with non-reentrant locks (Postgres), permanently stalling workflows after SUB_WORKFLOW complete #1058

@NicolasBissig

Description

@NicolasBissig

Describe the bug

Description

When conductor.workflow-execution-lock.type is set to postgres (or any non-reentrant lock
implementation), workflows containing a SUB_WORKFLOW task stall permanently after the
sub-workflow completes. The parent workflow remains in RUNNING status indefinitely, and the
sweeper logs repeat every ~30 seconds:

INFO WorkflowSweeper : Running sweeper for workflow <id>
INFO WorkflowSweeper : can't get a lock on <id>, will try after 30 seconds

Root Cause

WorkflowSweeper.sweep() acquires the execution lock at line 162, then calls
workflowExecutor.decide(workflowId) which calls executionLockService.acquireLock() again
internally (WorkflowExecutorOps.decide(), line 1089). With a non-reentrant lock (Postgres
advisory locks are non-reentrant by design), the second acquisition always returns false.
decide() then returns null, which sweep() interprets as external lock contention and
re-queues the workflow with a backoff delay. This repeats forever.

// WorkflowSweeper.sweep() — acquires lock here:
if (!executionLockService.acquireLock(workflowId)) { ... }

// Then calls decide() which ALSO calls acquireLock() internally:
workflow = workflowExecutor.decide(workflowId);  // always returns null with non-reentrant lock

// null is misread as contention -> re-queue with backoff -> infinite loop
if (workflow == null) {
    queueDAO.push(DECIDER_QUEUE, workflowId, 0, backoffSeconds);
    return;
}

The SUB_WORKFLOW task is the most common trigger because its completion does not re-enqueue the
parent workflow directly — it relies on the next sweep cycle to detect the completed sub-workflow
and advance the parent. That sweep cycle then hits this deadlock.

Steps to Reproduce

  1. Configure Conductor with conductor.workflow-execution-lock.type: postgres and
    conductor.app.workflow-execution-lock-enabled: true
  2. Define a workflow containing a SUB_WORKFLOW task followed by any subsequent task
  3. Start the workflow and let the sub-workflow complete
  4. Observe the parent workflow remains in RUNNING indefinitely

The following unit test reproduces the bug and can be added to WorkflowSweeperTest:

/**
 * Reproduction for: sweep() acquires the execution lock, then calls decide(workflowId) which
 * also calls acquireLock() internally. With a non-reentrant lock (e.g. Postgres advisory lock)
 * the second acquisition always fails, decide() returns null, and the sweeper re-queues the
 * workflow with a backoff delay — leaving it permanently stuck even though the sub-workflow has
 * already completed.
 *
 * <p>Expected (correct) behaviour: decide() is called and the workflow advances.
 * Actual (buggy) behaviour: decide() returns null because the sweeper already holds the lock;
 * the workflow is pushed back to the decider queue with a 30-second delay indefinitely.
 */
@Test
public void sweepDoesNotStallWhenSubWorkflowCompletedAndLockIsNonReentrant() {
    TaskModel subWorkflowTask =
            newTask(
                    "sub-workflow-task",
                    TaskType.TASK_TYPE_SUB_WORKFLOW,
                    TaskModel.Status.IN_PROGRESS);
    subWorkflowTask.setSubWorkflowId("sub-workflow-id");
    WorkflowModel workflow = newWorkflow(List.of(subWorkflowTask));

    WorkflowModel subWorkflow = new WorkflowModel();
    subWorkflow.setStatus(WorkflowModel.Status.COMPLETED);
    subWorkflow.setOutput(Map.of("result", "ok"));

    // The sweeper acquires the lock successfully on the first call.
    // The second call simulates a non-reentrant lock: decide(workflowId) calls acquireLock()
    // internally and cannot acquire a lock that the sweeper already holds.
    when(executionLockService.acquireLock(WORKFLOW_ID))
            .thenReturn(true)   // sweeper's own pre-acquire
            .thenReturn(false); // decide()'s internal acquire — non-reentrant lock fails

    when(workflowExecutor.getWorkflow(WORKFLOW_ID, true)).thenReturn(workflow);
    // decide(workflowId) returns null when it cannot acquire the lock
    when(workflowExecutor.decide(WORKFLOW_ID)).thenReturn(null);

    workflowSweeper.sweep(WORKFLOW_ID);

    // Bug: the workflow is pushed back to the decider queue with a backoff delay,
    // causing it to stall permanently even though the sub-workflow is already done.
    verify(queueDAO, never()).push(anyString(), anyString(), anyInt(), anyLong());
}

This test fails on the current main branch, confirming the bug.

Suggested Fix

Remove the redundant pre-acquisition of the lock from WorkflowSweeper.sweep(). The
decide(workflowId) method already handles its own locking — the sweeper's pre-acquire is not
only redundant but actively causes the deadlock with non-reentrant lock backends.

// WorkflowSweeper.sweep() — REMOVE these lines:
if (!executionLockService.acquireLock(workflowId)) {
    log.error("Couldn't acquire lock to sweep workflow {}", workflowId);
    return;
}
// ...
} finally {
    executionLockService.releaseLock(workflowId);  // also remove
}

With this change, sweep() calls decide(workflowId) directly, which acquires and releases the
lock itself. The workflow == null path remains correct for handling genuine external contention
(another thread holding the lock), since decide() still returns null in that case.

The executionLockService field and constructor parameter in WorkflowSweeper can then be
removed entirely as they are no longer used.

Environment

  • Conductor version: 3.23.0 (also reproduced on current main)
  • Lock type: postgres (conductor.workflow-execution-lock.type: postgres)
  • Trigger: any workflow with a SUB_WORKFLOW task

Workaround

Disable the execution lock for single-node deployments:

conductor:
  app:
    workflow-execution-lock-enabled: false
  workflow-execution-lock:
    type: noop_lock

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions