errorStrategy 'finish' causes indefinite hang when cloud executor nodes become unhealthy #6947

@adamrtalbot

Bug

When using errorStrategy 'finish' with the Azure Batch executor, Nextflow hangs indefinitely if worker nodes become unhealthy (e.g., NodeNotReady) while tasks are running. The JVM never exits. The only ways out are kill -9 or Ctrl+C; the latter triggers abort() via the signal handler, which bypasses the bug.

This also affects Google Batch. AWS Batch is partially protected by its service-level node management. Kubernetes is immune due to explicit NodeTerminationException handling.

Root Cause

The hang is caused by a design gap in Session.cancel() — the shutdown path used by errorStrategy 'finish'.

When a task fails and exhausts maxRetries, Session.fault() routes to cancel() (not abort()) for the FINISH error strategy. cancel() sets cancelled=true and forces processesBarrier, but critically it:

  • Does not force monitorsBarrier
  • Does not call shutdown0() (which runs cleanup hooks including TaskPollingMonitor.cleanup())

This creates a deadlock:

  1. Session.await() blocks on monitorsBarrier.awaitCompletion() — a while(true) loop with no timeout
  2. monitorsBarrier waits for TaskPollingMonitor.pollLoop() to exit and call arrive()
  3. pollLoop() break condition when cancelled: session.isCancelled() && runningQueue.size() == 0
  4. Tasks on dead nodes remain in ACTIVE/RUNNING state (Azure hasn't marked them COMPLETED)
  5. runningQueue never drains → pollLoop never exits → monitorsBarrier never completes
  6. Session.destroy() calls shutdown0() but only after await() returns — which it never does
  7. TaskPollingMonitor.cleanup() (which kills remaining tasks) is registered as a shutdown hook and never runs

By contrast, Session.abort() forces both barriers and calls shutdown0(), so it never hangs.
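
The exit condition in steps 3 to 5 can be illustrated with a small model. This is a simplified sketch, not Nextflow's TaskPollingMonitor: the class PollLoopModel, its Task type, and the maxIterations bound are all hypothetical and exist only so the example terminates. It assumes a task on a dead node reports RUNNING on every poll.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the polling loop described above.
// None of these names come from the Nextflow code base.
class PollLoopModel {

    enum State { RUNNING, COMPLETED }

    static final class Task {
        final String name;
        final boolean onDeadNode;   // assumption: the backend never reports completion
        int pollsUntilDone;         // remaining polls before a healthy task completes

        Task(String name, boolean onDeadNode, int pollsUntilDone) {
            this.name = name;
            this.onDeadNode = onDeadNode;
            this.pollsUntilDone = pollsUntilDone;
        }

        State poll() {
            if (onDeadNode) return State.RUNNING;      // stuck forever
            if (pollsUntilDone > 0) pollsUntilDone--;
            return pollsUntilDone == 0 ? State.COMPLETED : State.RUNNING;
        }
    }

    // Returns true if the loop reaches its break condition within maxIterations.
    // The real loop has no such bound; it is here only to keep the model finite.
    static boolean loopExits(Queue<Task> runningQueue, boolean cancelled, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            // Evict tasks that the backend reports as completed.
            runningQueue.removeIf(t -> t.poll() == State.COMPLETED);
            // Break condition when cancelled: the queue must be empty.
            if (cancelled && runningQueue.isEmpty()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Queue<Task> queue = new ArrayDeque<>();
        queue.add(new Task("healthy", false, 2));
        queue.add(new Task("stuck", true, 0));
        System.out.println("loop exits: " + loopExits(queue, true, 1000));
        System.out.println("tasks left in queue: " + queue.size());
    }
}
```

In this model the healthy task is evicted after two polls, while the task on the dead node stays in the queue, so the break condition is never met no matter how many iterations are allowed.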

Clarification on 409 Errors

The "Unable to cleanup batch task" warnings (HTTP 409 / NodeNotReady) visible in the logs are cosmetic, not the cause of the hang. They come from deleteTask() calls for tasks that did complete but whose cleanup failed. Those tasks are correctly evicted from the queue. The hang is caused by tasks that never reach the COMPLETED state.

Steps to Reproduce

  1. Run a Nextflow pipeline with errorStrategy 'finish' and maxRetries on Azure Batch
  2. Have one task fail (triggering the FINISH error strategy)
  3. While other tasks are still running, cause the Azure Batch node to become unhealthy (e.g., node preemption, hardware failure)
  4. Observe: Nextflow logs the failed task but never exits. The process hangs indefinitely.

Expected Behavior

Nextflow should eventually exit after a reasonable timeout, killing remaining tasks that cannot complete.

Suggested Fixes

Primary fix — add a cancel timeout to the framework (fixes all executors):

Add a configurable timeout to the cancel() shutdown path. If monitorsBarrier doesn't complete within the timeout, escalate to abort() or forceTermination(). This could be implemented as a timeout parameter on Barrier.awaitCompletion() or a watchdog thread in Session.await().
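
A minimal sketch of the timeout idea, under stated assumptions: a java.util.concurrent.CountDownLatch stands in for Nextflow's Barrier, and the class CancelWatchdog, the method awaitOrEscalate, and the escalate callback are hypothetical names. In a real patch the callback would be whatever escalation path is chosen (abort() or forceTermination()).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: bounded wait on the monitors, then escalate.
// CountDownLatch is a stand-in for Nextflow's Barrier, not its actual type.
class CancelWatchdog {

    enum Outcome { COMPLETED, ESCALATED }

    static Outcome awaitOrEscalate(CountDownLatch monitorsDone,
                                   long timeout, TimeUnit unit,
                                   Runnable escalate) {
        boolean done;
        try {
            // Unlike an unbounded wait, this returns false once the timeout elapses.
            done = monitorsDone.await(timeout, unit);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and treat it like a timeout.
            Thread.currentThread().interrupt();
            done = false;
        }
        if (done) return Outcome.COMPLETED;
        escalate.run();   // e.g. force the barrier and run cleanup hooks
        return Outcome.ESCALATED;
    }

    public static void main(String[] args) {
        CountDownLatch neverDone = new CountDownLatch(1);   // models a stuck poll loop
        Outcome outcome = awaitOrEscalate(neverDone, 100, TimeUnit.MILLISECONDS,
                () -> System.out.println("cancel timed out, escalating"));
        System.out.println(outcome);
    }
}
```

The timeout would need to be generous enough that healthy long-running tasks can still finish under 'finish' semantics, which is why the text above suggests making it configurable.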

Defense-in-depth — detect unhealthy nodes in cloud handlers:

Azure and Google Batch handlers should detect tasks stuck on unhealthy nodes and mark them as failed, similar to how K8sTaskHandler handles NodeTerminationException. In AzBatchTaskHandler.taskState0(), check node health when a task remains ACTIVE/RUNNING beyond a threshold and treat unhealthy-node tasks as failed.
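
The decision rule described above might look roughly like the following. This is a sketch only: StuckTaskCheck, TaskState, NodeHealth, and shouldFail are hypothetical names, and how node health is actually obtained from the Azure or Google Batch APIs is deliberately left out. An unknown node state is assumed to be treated as healthy, to avoid failing tasks on incomplete information.

```java
import java.time.Duration;

// Hypothetical decision rule for a task handler; not part of Nextflow.
class StuckTaskCheck {

    enum TaskState { ACTIVE, RUNNING, COMPLETED }
    enum NodeHealth { HEALTHY, UNHEALTHY, UNKNOWN }

    // Returns true when a task should be marked as failed because it has sat
    // in a non-terminal state past the threshold on a node reported unhealthy.
    static boolean shouldFail(TaskState state, NodeHealth node,
                              Duration inState, Duration threshold) {
        if (state == TaskState.COMPLETED) return false;         // already terminal
        if (inState.compareTo(threshold) < 0) return false;     // still within grace period
        return node == NodeHealth.UNHEALTHY;                    // UNKNOWN is not failed
    }

    public static void main(String[] args) {
        Duration threshold = Duration.ofMinutes(10);
        System.out.println(shouldFail(TaskState.RUNNING, NodeHealth.UNHEALTHY,
                Duration.ofMinutes(15), threshold));
        System.out.println(shouldFail(TaskState.RUNNING, NodeHealth.HEALTHY,
                Duration.ofMinutes(15), threshold));
    }
}
```

Checking node health only after the threshold keeps the extra API calls off the normal polling path.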

Affected Versions

Believed to affect all current versions. The cancel() / abort() asymmetry has been present since the FINISH error strategy was introduced.

Environment

  • Executor: Azure Batch (confirmed), Google Batch (likely), AWS Batch (less likely due to service-level node management)
  • Kubernetes: not affected (has explicit node failure handling)
