WorkflowSweeper self-deadlocks with non-reentrant locks (Postgres), permanently stalling workflows after SUB_WORKFLOW complete

### Describe the bug

## Description

When `conductor.workflow-execution-lock.type` is set to `postgres` (or any non-reentrant lock
implementation), workflows containing a `SUB_WORKFLOW` task stall permanently after the
sub-workflow completes. The parent workflow remains in `RUNNING` status indefinitely, and the
sweeper logs repeat every ~30 seconds:

```
INFO WorkflowSweeper : Running sweeper for workflow <id>
INFO WorkflowSweeper : can't get a lock on <id>, will try after 30 seconds
```

### Root Cause

`WorkflowSweeper.sweep()` acquires the execution lock at line 162, then calls
`workflowExecutor.decide(workflowId)` which calls `executionLockService.acquireLock()` **again**
internally (`WorkflowExecutorOps.decide()`, line 1089). With a non-reentrant lock (Postgres
advisory locks are non-reentrant by design), the second acquisition always returns `false`.
`decide()` then returns `null`, which `sweep()` interprets as external lock contention and
re-queues the workflow with a backoff delay. This repeats forever.

```java
// WorkflowSweeper.sweep() — acquires lock here:
if (!executionLockService.acquireLock(workflowId)) { ... }

// Then calls decide() which ALSO calls acquireLock() internally:
workflow = workflowExecutor.decide(workflowId);  // always returns null with non-reentrant lock

// null is misread as contention -> re-queue with backoff -> infinite loop
if (workflow == null) {
    queueDAO.push(DECIDER_QUEUE, workflowId, 0, backoffSeconds);
    return;
}
```

The `SUB_WORKFLOW` task is the most common trigger because its completion does not re-enqueue the
parent workflow directly — it relies on the next sweep cycle to detect the completed sub-workflow
and advance the parent. That sweep cycle then hits this deadlock.

## Steps to Reproduce

1. Configure Conductor with `conductor.workflow-execution-lock.type: postgres` and
   `conductor.app.workflow-execution-lock-enabled: true`
2. Define a workflow containing a `SUB_WORKFLOW` task followed by any subsequent task
3. Start the workflow and let the sub-workflow complete
4. Observe the parent workflow remains in `RUNNING` indefinitely

The following unit test reproduces the bug and can be added to `WorkflowSweeperTest`:

```java
/**
 * Reproduction for: sweep() acquires the execution lock, then calls decide(workflowId) which
 * also calls acquireLock() internally. With a non-reentrant lock (e.g. Postgres advisory lock)
 * the second acquisition always fails, decide() returns null, and the sweeper re-queues the
 * workflow with a backoff delay — leaving it permanently stuck even though the sub-workflow has
 * already completed.
 *
 * <p>Expected (correct) behaviour: decide() is called and the workflow advances.
 * Actual (buggy) behaviour: decide() returns null because the sweeper already holds the lock;
 * the workflow is pushed back to the decider queue with a 30-second delay indefinitely.
 */
@Test
public void sweepDoesNotStallWhenSubWorkflowCompletedAndLockIsNonReentrant() {
    TaskModel subWorkflowTask =
            newTask(
                    "sub-workflow-task",
                    TaskType.TASK_TYPE_SUB_WORKFLOW,
                    TaskModel.Status.IN_PROGRESS);
    subWorkflowTask.setSubWorkflowId("sub-workflow-id");
    WorkflowModel workflow = newWorkflow(List.of(subWorkflowTask));

    WorkflowModel subWorkflow = new WorkflowModel();
    subWorkflow.setStatus(WorkflowModel.Status.COMPLETED);
    subWorkflow.setOutput(Map.of("result", "ok"));

    // The sweeper acquires the lock successfully on the first call.
    // The second call simulates a non-reentrant lock: decide(workflowId) calls acquireLock()
    // internally and cannot acquire a lock that the sweeper already holds.
    when(executionLockService.acquireLock(WORKFLOW_ID))
            .thenReturn(true)   // sweeper's own pre-acquire
            .thenReturn(false); // decide()'s internal acquire — non-reentrant lock fails

    when(workflowExecutor.getWorkflow(WORKFLOW_ID, true)).thenReturn(workflow);
    // decide(workflowId) returns null when it cannot acquire the lock
    when(workflowExecutor.decide(WORKFLOW_ID)).thenReturn(null);

    workflowSweeper.sweep(WORKFLOW_ID);

    // Bug: the workflow is pushed back to the decider queue with a backoff delay,
    // causing it to stall permanently even though the sub-workflow is already done.
    verify(queueDAO, never()).push(anyString(), anyString(), anyInt(), anyLong());
}
```

This test **fails** on the current `main` branch, confirming the bug.

## Suggested Fix

Remove the redundant pre-acquisition of the lock from `WorkflowSweeper.sweep()`. The
`decide(workflowId)` method already handles its own locking — the sweeper's pre-acquire is not
only redundant but actively causes the deadlock with non-reentrant lock backends.

```java
// WorkflowSweeper.sweep() — REMOVE these lines:
if (!executionLockService.acquireLock(workflowId)) {
    log.error("Couldn't acquire lock to sweep workflow {}", workflowId);
    return;
}
// ...
} finally {
    executionLockService.releaseLock(workflowId);  // also remove
}
```

With this change, `sweep()` calls `decide(workflowId)` directly, which acquires and releases the
lock itself. The `workflow == null` path remains correct for handling genuine external contention
(another thread holding the lock), since `decide()` still returns `null` in that case.

The `executionLockService` field and constructor parameter in `WorkflowSweeper` can then be
removed entirely as they are no longer used.

## Environment

- Conductor version: `3.23.0` (also reproduced on current `main`)
- Lock type: `postgres` (`conductor.workflow-execution-lock.type: postgres`)
- Trigger: any workflow with a `SUB_WORKFLOW` task

## Workaround

Disable the execution lock for single-node deployments:

```yaml
conductor:
  app:
    workflow-execution-lock-enabled: false
  workflow-execution-lock:
    type: noop_lock
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WorkflowSweeper self-deadlocks with non-reentrant locks (Postgres), permanently stalling workflows after SUB_WORKFLOW complete #1058

Describe the bug

Description

Root Cause

Steps to Reproduce

Suggested Fix

Environment

Workaround

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

WorkflowSweeper self-deadlocks with non-reentrant locks (Postgres), permanently stalling workflows after SUB_WORKFLOW complete #1058

Description

Describe the bug

Description

Root Cause

Steps to Reproduce

Suggested Fix

Environment

Workaround

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions