Description
When using errorStrategy 'finish' with the Azure Batch executor, Nextflow hangs indefinitely if worker nodes become unhealthy (e.g., NodeNotReady) while tasks are running. The JVM never exits. The only recovery is kill -9 or Ctrl+C (which triggers abort() via signal handler, bypassing the bug).
This also affects Google Batch. AWS Batch is partially protected by its service-level node management. Kubernetes is immune due to explicit NodeTerminationException handling.
Root Cause
The hang is caused by a design gap in Session.cancel() — the shutdown path used by errorStrategy 'finish'.
When a task fails and exhausts maxRetries, Session.fault() routes to cancel() (not abort()) for the FINISH error strategy. cancel() sets cancelled=true and forces processesBarrier, but critically:
- Does not force monitorsBarrier
- Does not call shutdown0() (which runs cleanup hooks, including TaskPollingMonitor.cleanup())
This creates a deadlock:
- Session.await() blocks on monitorsBarrier.awaitCompletion() — a while(true) loop with no timeout
- monitorsBarrier waits for TaskPollingMonitor.pollLoop() to exit and call arrive()
- pollLoop()'s break condition when cancelled is session.isCancelled() && runningQueue.size() == 0
- Tasks on dead nodes remain in ACTIVE/RUNNING state (Azure hasn't marked them COMPLETED)
- runningQueue never drains → pollLoop() never exits → monitorsBarrier never completes
- Session.destroy() calls shutdown0(), but only after await() returns — which it never does
- TaskPollingMonitor.cleanup() (which kills remaining tasks) is registered as a shutdown hook and never runs
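The chain above hinges on an unbounded awaitCompletion(). The following minimal stand-in (SimpleBarrier is a hypothetical class, modeled loosely on the register/arrive/awaitCompletion semantics described above, not Nextflow's actual Barrier) shows why: the wait can only return if every registered party eventually calls arrive().

```java
// Minimal stand-in for the barrier pattern described above.
// All names here are illustrative, not Nextflow's real API.
class SimpleBarrier {
    private int registered = 0;
    private int arrived = 0;

    synchronized void register() { registered++; }

    synchronized void arrive() {
        arrived++;
        notifyAll();
    }

    // Mirrors the problematic pattern: an unbounded wait with no timeout.
    // If a registered party never arrives, this blocks forever.
    synchronized void awaitCompletion() throws InterruptedException {
        while (arrived < registered) {
            wait();
        }
    }
}

public class BarrierDemo {
    static String run() {
        SimpleBarrier monitorsBarrier = new SimpleBarrier();
        monitorsBarrier.register();  // the polling monitor registers itself

        // Simulate a monitor whose runningQueue drains, so it does arrive.
        Thread monitor = new Thread(monitorsBarrier::arrive);
        monitor.start();
        try {
            monitor.join();
            monitorsBarrier.awaitCompletion();  // returns only because arrive() ran
        } catch (InterruptedException e) {
            return "interrupted";
        }
        return "completed";
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In the bug scenario the monitor thread is the pollLoop(), and it never reaches arrive() because its exit condition depends on a queue that cannot drain.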
By contrast, Session.abort() forces both barriers and calls shutdown0(), so it never hangs.
Clarification on 409 Errors
The "Unable to cleanup batch task" warnings (HTTP 409 / NodeNotReady) visible in logs are a cosmetic issue, not the cause of the hang. They come from deleteTask() for tasks that did complete but whose cleanup failed; those tasks are correctly evicted from the queue. The hang is caused by tasks that never reach the COMPLETED state.
Steps to Reproduce
- Run a Nextflow pipeline with errorStrategy 'finish' and maxRetries on Azure Batch
- Have one task fail (triggering the FINISH error strategy)
- While other tasks are still running, cause the Azure Batch node to become unhealthy (e.g., node preemption, hardware failure)
- Observe: Nextflow logs show the failed task but never exits. The process hangs indefinitely.
Expected Behavior
Nextflow should eventually exit after a reasonable timeout, killing remaining tasks that cannot complete.
Suggested Fixes
Primary fix — add a cancel timeout to the framework (fixes all executors):
Add a configurable timeout to the cancel() shutdown path. If monitorsBarrier doesn't complete within the timeout, escalate to abort() or forceTermination(). This could be implemented as a timeout parameter on Barrier.awaitCompletion() or a watchdog thread in Session.await().
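A timed awaitCompletion() could look roughly like the sketch below. TimedBarrier, the method signature, and the escalation message are all assumptions for illustration; Nextflow's actual Barrier class may be structured differently.

```java
import java.util.concurrent.TimeUnit;

// Sketch of the proposed fix: a bounded awaitCompletion that lets the
// caller escalate when the barrier never completes. Hypothetical API.
class TimedBarrier {
    private int registered = 0;
    private int arrived = 0;

    synchronized void register() { registered++; }

    synchronized void arrive() {
        arrived++;
        notifyAll();
    }

    // Returns true if all parties arrived, false on timeout.
    synchronized boolean awaitCompletion(long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (arrived < registered) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false;  // timed out: caller decides how to escalate
            }
            TimeUnit.NANOSECONDS.timedWait(this, remaining);
        }
        return true;
    }
}

public class CancelTimeoutDemo {
    static boolean tryAwait() {
        TimedBarrier monitorsBarrier = new TimedBarrier();
        monitorsBarrier.register();  // a monitor that will never arrive
        try {
            return monitorsBarrier.awaitCompletion(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Session.await() equivalent: wait with a bound, then escalate.
        if (!tryAwait()) {
            // In a real fix this would call abort() or forceTermination().
            System.out.println("timeout: escalating to abort()");
        }
    }
}
```

The same bounded-wait shape would work whether the timeout lives inside Barrier.awaitCompletion() or in a watchdog thread around Session.await().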
Defense-in-depth — detect unhealthy nodes in cloud handlers:
Azure and Google Batch handlers should detect tasks stuck on unhealthy nodes and mark them as failed, similar to how K8sTaskHandler handles NodeTerminationException. In AzBatchTaskHandler.taskState0(), check node health when a task remains ACTIVE/RUNNING beyond a threshold and treat unhealthy-node tasks as failed.
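The threshold check could be factored into a small predicate. Everything below (StuckTaskDetector, shouldFail, the state strings) is hypothetical; a real implementation in AzBatchTaskHandler would obtain task state and node health from the Azure Batch API.

```java
import java.time.Duration;
import java.time.Instant;

// Defense-in-depth sketch: treat a task as failed when it has been
// ACTIVE/RUNNING on an unhealthy node beyond a threshold.
// All names here are illustrative, not Nextflow's real API.
class StuckTaskDetector {
    private final Duration threshold;

    StuckTaskDetector(Duration threshold) {
        this.threshold = threshold;
    }

    boolean shouldFail(String taskState, boolean nodeHealthy,
                       Instant runningSince, Instant now) {
        boolean active = taskState.equals("ACTIVE") || taskState.equals("RUNNING");
        boolean overdue = Duration.between(runningSince, now).compareTo(threshold) > 0;
        // Only force-fail tasks that are both stuck and on an unhealthy node.
        return active && !nodeHealthy && overdue;
    }
}

public class DetectorDemo {
    public static void main(String[] args) {
        StuckTaskDetector detector = new StuckTaskDetector(Duration.ofMinutes(5));
        Instant start = Instant.parse("2024-01-01T00:00:00Z");

        // Healthy node: never force-fail, regardless of runtime.
        System.out.println(detector.shouldFail("RUNNING", true, start,
                start.plus(Duration.ofMinutes(30))));   // false

        // Unhealthy node past the threshold: mark as failed.
        System.out.println(detector.shouldFail("RUNNING", false, start,
                start.plus(Duration.ofMinutes(10))));   // true
    }
}
```

Gating on both node health and a time threshold avoids false positives from transient NodeNotReady blips.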
Affected Versions
Believed to affect all current versions. The cancel() / abort() asymmetry has been present since the FINISH error strategy was introduced.
Environment
- Executor: Azure Batch (confirmed), Google Batch (likely), AWS Batch (less likely due to service-level node management)
- Kubernetes: not affected (has explicit node failure handling)