Description
When using errorStrategy 'finish' with the Azure Batch executor, Nextflow hangs indefinitely if worker nodes become unhealthy (e.g., NodeNotReady) while tasks are running. The JVM never exits. The only recovery is kill -9 or Ctrl+C (which triggers abort() via signal handler, bypassing the bug).
This also affects Google Batch. AWS Batch is partially protected by its service-level node management. Kubernetes is immune due to explicit NodeTerminationException handling.
Root Cause
The hang is caused by a design gap in Session.cancel() — the shutdown path used by errorStrategy 'finish'.
When a task fails and exhausts maxRetries, Session.fault() routes to cancel() (not abort()) for the FINISH error strategy. cancel() sets cancelled=true and forces processesBarrier, but critically:
- Does not force monitorsBarrier
- Does not call shutdown0() (which runs cleanup hooks, including TaskPollingMonitor.cleanup())
This creates a deadlock:
- Session.await() blocks on monitorsBarrier.awaitCompletion() — a while(true) loop with no timeout
- monitorsBarrier waits for TaskPollingMonitor.pollLoop() to exit and call arrive()
- pollLoop()'s break condition when cancelled is session.isCancelled() && runningQueue.size() == 0
- Tasks on dead nodes remain in ACTIVE/RUNNING state (Azure hasn't marked them COMPLETED)
- runningQueue never drains → pollLoop() never exits → monitorsBarrier never completes
- Session.destroy() calls shutdown0(), but only after await() returns — which it never does
- TaskPollingMonitor.cleanup() (which kills remaining tasks) is registered as a shutdown hook and never runs
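The chain above hinges on an unbounded awaitCompletion(). The following minimal stand-in (SimpleBarrier is a hypothetical class, modeled loosely on the register/arrive/awaitCompletion semantics described above, not Nextflow's actual Barrier) shows why: the wait can only return if every registered party eventually calls arrive().

```java
// Minimal stand-in for the barrier pattern described above.
// All names here are illustrative, not Nextflow's real API.
class SimpleBarrier {
    private int registered = 0;
    private int arrived = 0;

    synchronized void register() { registered++; }

    synchronized void arrive() {
        arrived++;
        notifyAll();
    }

    // Mirrors the problematic pattern: an unbounded wait with no timeout.
    // If a registered party never arrives, this blocks forever.
    synchronized void awaitCompletion() throws InterruptedException {
        while (arrived < registered) {
            wait();
        }
    }
}

public class BarrierDemo {
    static String run() {
        SimpleBarrier monitorsBarrier = new SimpleBarrier();
        monitorsBarrier.register();  // the polling monitor registers itself

        // Simulate a monitor whose runningQueue drains, so it does arrive.
        Thread monitor = new Thread(monitorsBarrier::arrive);
        monitor.start();
        try {
            monitor.join();
            monitorsBarrier.awaitCompletion();  // returns only because arrive() ran
        } catch (InterruptedException e) {
            return "interrupted";
        }
        return "completed";
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In the bug scenario the monitor thread is the pollLoop(), and it never reaches arrive() because its exit condition depends on a queue that cannot drain.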
By contrast, Session.abort() forces both barriers and calls shutdown0(), so it never hangs.
Clarification on 409 Errors
The "Unable to cleanup batch task" warnings (HTTP 409 / NodeNotReady) visible in logs are a cosmetic issue, not the cause of the hang. They come from deleteTask() for tasks that did complete but whose cleanup failed; those tasks are correctly evicted from the queue. The hang is caused by tasks that never reach the COMPLETED state.
Steps to Reproduce
- Run a Nextflow pipeline with errorStrategy 'finish' and maxRetries on Azure Batch
- Have one task fail (triggering the FINISH error strategy)
- While other tasks are still running, cause the Azure Batch node to become unhealthy (e.g., node preemption, hardware failure)
- Observe: Nextflow logs show the failed task but never exits. The process hangs indefinitely.
Expected Behavior
Nextflow should eventually exit after a reasonable timeout, killing remaining tasks that cannot complete.
Suggested Fixes
Primary fix — add a cancel timeout to the framework (fixes all executors):
Add a configurable timeout to the cancel() shutdown path. If monitorsBarrier doesn't complete within the timeout, escalate to abort() or forceTermination(). This could be implemented as a timeout parameter on Barrier.awaitCompletion() or a watchdog thread in Session.await().
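A timed awaitCompletion() could look roughly like the sketch below. TimedBarrier, the method signature, and the escalation message are all assumptions for illustration; Nextflow's actual Barrier class may be structured differently.

```java
import java.util.concurrent.TimeUnit;

// Sketch of the proposed fix: a bounded awaitCompletion that lets the
// caller escalate when the barrier never completes. Hypothetical API.
class TimedBarrier {
    private int registered = 0;
    private int arrived = 0;

    synchronized void register() { registered++; }

    synchronized void arrive() {
        arrived++;
        notifyAll();
    }

    // Returns true if all parties arrived, false on timeout.
    synchronized boolean awaitCompletion(long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (arrived < registered) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false;  // timed out: caller decides how to escalate
            }
            TimeUnit.NANOSECONDS.timedWait(this, remaining);
        }
        return true;
    }
}

public class CancelTimeoutDemo {
    static boolean tryAwait() {
        TimedBarrier monitorsBarrier = new TimedBarrier();
        monitorsBarrier.register();  // a monitor that will never arrive
        try {
            return monitorsBarrier.awaitCompletion(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Session.await() equivalent: wait with a bound, then escalate.
        if (!tryAwait()) {
            // In a real fix this would call abort() or forceTermination().
            System.out.println("timeout: escalating to abort()");
        }
    }
}
```

The same bounded-wait shape would work whether the timeout lives inside Barrier.awaitCompletion() or in a watchdog thread around Session.await().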
Defense-in-depth — detect unhealthy nodes in cloud handlers:
Azure and Google Batch handlers should detect tasks stuck on unhealthy nodes and mark them as failed, similar to how K8sTaskHandler handles NodeTerminationException. In AzBatchTaskHandler.taskState0(), check node health when a task remains ACTIVE/RUNNING beyond a threshold and treat unhealthy-node tasks as failed.
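The threshold check could be factored into a small predicate. Everything below (StuckTaskDetector, shouldFail, the state strings) is hypothetical; a real implementation in AzBatchTaskHandler would obtain task state and node health from the Azure Batch API.

```java
import java.time.Duration;
import java.time.Instant;

// Defense-in-depth sketch: treat a task as failed when it has been
// ACTIVE/RUNNING on an unhealthy node beyond a threshold.
// All names here are illustrative, not Nextflow's real API.
class StuckTaskDetector {
    private final Duration threshold;

    StuckTaskDetector(Duration threshold) {
        this.threshold = threshold;
    }

    boolean shouldFail(String taskState, boolean nodeHealthy,
                       Instant runningSince, Instant now) {
        boolean active = taskState.equals("ACTIVE") || taskState.equals("RUNNING");
        boolean overdue = Duration.between(runningSince, now).compareTo(threshold) > 0;
        // Only force-fail tasks that are both stuck and on an unhealthy node.
        return active && !nodeHealthy && overdue;
    }
}

public class DetectorDemo {
    public static void main(String[] args) {
        StuckTaskDetector detector = new StuckTaskDetector(Duration.ofMinutes(5));
        Instant start = Instant.parse("2024-01-01T00:00:00Z");

        // Healthy node: never force-fail, regardless of runtime.
        System.out.println(detector.shouldFail("RUNNING", true, start,
                start.plus(Duration.ofMinutes(30))));   // false

        // Unhealthy node past the threshold: mark as failed.
        System.out.println(detector.shouldFail("RUNNING", false, start,
                start.plus(Duration.ofMinutes(10))));   // true
    }
}
```

Gating on both node health and a time threshold avoids false positives from transient NodeNotReady blips.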
Affected Versions
Believed to affect all current versions. The cancel() / abort() asymmetry has been present since the FINISH error strategy was introduced.
Environment
- Executor: Azure Batch (confirmed), Google Batch (likely), AWS Batch (less likely due to service-level node management)
- Kubernetes: not affected (has explicit node failure handling)