Fix LogsCheckpoint hanging head job shutdown on stuck log upload by jorgee · Pull Request #7188 · nextflow-io/nextflow

jorgee · 2026-05-29T10:22:22Z

Problem

A small fraction (<1%) of runs become zombie head jobs: Nextflow sends onFlowComplete to Seqera Platform (so the run shows as completed), but the head-job process never exits and the VM keeps running.

Diagnostic

The hang is in LogsCheckpoint, the observer that periodically uploads the .nextflow.log, report and timeline files to the remote work directory.

Its shutdown path was:

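// Old shutdown path (simplified)
void onFlowComplete() {
    synchronized(lock) { thread.interrupt() }   // must acquire lock before it can interrupt
    thread.join()                               // no timeout
}
// Worker loop body:
synchronized(lock) { handler.saveFiles() }      // FileHelper.copyPath -> cloud SDK; lock is held for the whole upload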

When a network stall leaves a cloud upload inside saveFiles() blocked on a half-open socket:

1. The worker is stuck inside saveFiles() holding lock.
2. The main shutdown thread enters onFlowComplete() and blocks on synchronized(lock), so it can't even deliver the interrupt.
3. Even past that, thread.join() has no timeout, so it would block forever regardless.
4. The shutdown thread is non-daemon → the JVM never exits → zombie head job, after Platform already received "completed".
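The sequence above can be reproduced in isolation. The following is a minimal Java sketch, not the actual LogsCheckpoint code: the class name LockedShutdownDemo, the method observeShutdownState() and the latch standing in for a hung upload are all hypothetical. Both threads are marked daemon only so that the demo itself can exit.

```java
import java.util.concurrent.CountDownLatch;

class LockedShutdownDemo {

    /** Replays the old shutdown path and reports the state the shutdown thread ends up in. */
    static Thread.State observeShutdownState() {
        final Object lock = new Object();
        CountDownLatch workerHoldsLock = new CountDownLatch(1);
        CountDownLatch neverReleased = new CountDownLatch(1);   // hypothetical stand-in for a hung upload

        // Worker: "saveFiles()" never returns and the lock is held the whole time.
        Thread worker = new Thread(() -> {
            synchronized (lock) {
                workerHoldsLock.countDown();
                while (true) {
                    try { neverReleased.await(); } catch (InterruptedException ignored) { }
                }
            }
        }, "checkpoint-worker");
        worker.setDaemon(true);
        worker.start();

        // Shutdown: same shape as the old onFlowComplete().
        Thread shutdown = new Thread(() -> {
            synchronized (lock) { worker.interrupt(); }          // blocks here: the worker owns the lock
            try { worker.join(); } catch (InterruptedException ignored) { }
        }, "shutdown");
        shutdown.setDaemon(true);                                // daemon only so this demo can exit

        try {
            workerHoldsLock.await();
            shutdown.start();
            // Poll until the shutdown thread is parked on the monitor (give up after 5 s).
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (shutdown.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return shutdown.getState();
    }

    public static void main(String[] args) {
        System.out.println("shutdown thread state: " + observeShutdownState());   // BLOCKED
    }
}
```

The shutdown thread never gets past synchronized(lock), so the interrupt is never delivered and join() is never reached. In the real head job that thread is non-daemon, which is what keeps the JVM alive.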

This class has been re-fixed repeatedly (#4166, #6787, #6939, plus the later lock), ping-ponging between two failure modes: setting the interrupt flag made the AWS SDK abort in-flight uploads (AbortedException), while removing it caused a 90 s shutdown stall. The lock added to reconcile them is what turned a transient network stall into a permanent shutdown deadlock.

Solution

Replace the interrupt + lock + unbounded join with a CountDownLatch and a bounded join:

The worker sleeps with stopLatch.await(interval) instead of Thread.sleep. Shutdown calls countDown(), which wakes it immediately and without ever setting the interrupt flag — so the cloud SDK never sees an interrupt and the whole AbortedException/lock saga disappears.
onFlowComplete()/onFlowError() join the worker for at most terminateTimeout. If a saveFiles() upload is genuinely hung, the join times out and the worker (a daemon thread) is abandoned — it cannot keep the JVM alive, so the head job exits cleanly.
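A minimal Java sketch of the latch plus bounded-join pattern described above. It is an illustration under assumptions, not the code in this PR: the class LatchCheckpointSketch, its constructor parameters and the Runnable standing in for handler.saveFiles() are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchCheckpointSketch {
    private final CountDownLatch stopLatch = new CountDownLatch(1);
    private final Runnable saveFiles;               // hypothetical stand-in for handler.saveFiles()
    private final long intervalMillis;
    private final long terminateTimeoutMillis;
    private final Thread worker;

    LatchCheckpointSketch(Runnable saveFiles, long intervalMillis, long terminateTimeoutMillis) {
        this.saveFiles = saveFiles;
        this.intervalMillis = intervalMillis;
        // Thread.join(0) waits forever, so keep the bound strictly positive.
        this.terminateTimeoutMillis = Math.max(1, terminateTimeoutMillis);
        this.worker = new Thread(this::run, "logs-checkpoint");
        this.worker.setDaemon(true);                // an abandoned worker cannot keep the JVM alive
    }

    void start() { worker.start(); }

    private void run() {
        try {
            // await() returns false when the interval elapses and true once stop() counts down.
            while (!stopLatch.await(intervalMillis, TimeUnit.MILLISECONDS)) {
                saveFiles.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Wakes the worker without interrupting it; returns true if it exited within the bound. */
    boolean stop() {
        stopLatch.countDown();
        try {
            worker.join(terminateTimeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        // Idle worker: stop() returns without waiting out the 60 s interval.
        LatchCheckpointSketch idle = new LatchCheckpointSketch(() -> { }, 60_000, 2_000);
        idle.start();
        System.out.println("idle worker stopped: " + idle.stop());     // true

        // Stuck upload: stop() gives up after the 300 ms bound and abandons the daemon.
        CountDownLatch entered = new CountDownLatch(1);
        LatchCheckpointSketch stuck = new LatchCheckpointSketch(() -> {
            entered.countDown();
            while (true) {
                try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
            }
        }, 10, 300);
        stuck.start();
        while (entered.getCount() > 0) Thread.yield();
        System.out.println("stuck worker stopped: " + stuck.stop());   // false
    }
}
```

Because stop() only counts the latch down, a save that is in flight never observes an interrupt; the trade-off is that a genuinely hung save is left running on the daemon thread until the JVM exits.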

This is safe because the final, authoritative log upload is performed separately by CacheCommand; this observer is only a best-effort periodic uploader, so abandoning a stuck attempt loses nothing.

New, configurable bound (default 120s):

config: tower.logs.checkpoint.terminateTimeout
env: TOWER_LOGS_CHECKPOINT_TERMINATE_TIMEOUT

Tests

LogsCheckpointTest (7 cases) covers: the worker is a daemon, periodic checkpointing fires, stop() returns promptly when idle (no full-interval wait), and — the key guarantee — onFlowComplete() returns within terminateTimeout even when saveFiles() blocks forever, abandoning the daemon.

Notes

This is a provider-agnostic backstop and is complementary to #7024 (Apply socket timeout to S3 CRT connections, in 26.04 but not 25.10), which bounds the underlying stuck S3 upload itself. Affected runs on AWS + 25.10.x lack #7024, so the CRT upload could stall indefinitely; this change ensures the head job exits regardless of provider or SDK behavior.

🤖 Generated with Claude Code

Replace the interrupt + lock + unbounded join() shutdown logic with a CountDownLatch and a bounded join, so the head job can always exit even when a log checkpoint upload is stuck on a hung network connection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>

netlify · 2026-05-29T10:22:29Z

✅ Deploy Preview for nextflow-docs-staging canceled.

Name	Link
🔨 Latest commit	`a9c0f17`
🔍 Latest deploy log	https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a1968e1731ac20008cef01d

jorgee wants to merge 1 commit into master from fix-logs-checkpoint-shutdown
