
Fix LogsCheckpoint hanging head job shutdown on stuck log upload #7188

Open: jorgee wants to merge 1 commit into master from fix-logs-checkpoint-shutdown

@jorgee jorgee commented May 29, 2026

Problem

A small fraction (<1%) of runs become zombie head jobs: Nextflow sends onFlowComplete to Seqera Platform (so the run shows as completed), but the head-job process never exits and the VM keeps running.

Diagnosis

The hang is in LogsCheckpoint, the observer that periodically uploads the .nextflow.log, report and timeline files to the remote work directory.

Its shutdown path was:

void onFlowComplete() {
    synchronized(lock) { thread.interrupt() }   // blocks here while the worker holds lock
    thread.join()                                // no timeout
}
// worker, once per checkpoint interval:
synchronized(lock) { handler.saveFiles() }       // FileHelper.copyPath -> cloud SDK

When a network stall leaves a cloud upload inside saveFiles() blocked on a half-open socket:

  1. The worker is stuck inside saveFiles() holding lock.
  2. The main shutdown thread enters onFlowComplete() and blocks on synchronized(lock) — it can't even deliver the interrupt.
  3. Even if it acquired the lock, thread.join() has no timeout, so it would still block forever.
  4. The shutdown thread is non-daemon → the JVM never exits → zombie head job, after Platform already received "completed".
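
The first two steps can be reproduced in isolation. The following is a minimal Java sketch, not the Nextflow code: the class name, the thread names, and the latch that stands in for a stalled saveFiles() upload are all assumptions made for illustration. It reports the state of the shutdown thread while the worker holds the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Illustrative only: the names here are hypothetical, not taken from LogsCheckpoint. */
public class LockedShutdownDemo {

    /** Returns the state of the shutdown thread while the worker still holds the lock. */
    public static Thread.State observeShutdownState() throws InterruptedException {
        final Object lock = new Object();
        final CountDownLatch workerHoldsLock = new CountDownLatch(1);
        final CountDownLatch releaseUpload = new CountDownLatch(1);

        // Worker: stands in for saveFiles() stuck on a stalled upload while holding the lock.
        Thread worker = new Thread(() -> {
            synchronized (lock) {
                workerHoldsLock.countDown();
                try {
                    releaseUpload.await();              // simulated stall
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "worker");
        worker.setDaemon(true);

        // Shutdown: mirrors the old path, which must take the lock before it can interrupt.
        Thread shutdown = new Thread(() -> {
            synchronized (lock) { worker.interrupt(); }
        }, "shutdown");
        shutdown.setDaemon(true);

        worker.start();
        workerHoldsLock.await();                        // the worker now owns the monitor
        shutdown.start();

        // Poll for up to 2 s until the shutdown thread is parked on the monitor.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
        while (shutdown.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = shutdown.getState();

        // Release the simulated stall so both threads can finish.
        releaseUpload.countDown();
        worker.join(2000);
        shutdown.join(2000);
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("shutdown thread state: " + observeShutdownState());
    }
}
```

In this sketch the stall is released so the program can end. In the failure described above nothing releases it, so the shutdown thread never gets past the synchronized block.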

This class has been re-fixed repeatedly (#4166, #6787, #6939, plus the later lock), ping-ponging between two failure modes: setting the interrupt flag made the AWS SDK abort in-flight uploads (AbortedException), while removing it caused a 90 s shutdown stall. The lock added to reconcile them is what turned a transient network stall into a permanent shutdown deadlock.

Solution

Replace the interrupt + lock + unbounded join with a CountDownLatch and a bounded join:

  • The worker sleeps with stopLatch.await(interval) instead of Thread.sleep. Shutdown calls countDown(), which wakes it immediately and without ever setting the interrupt flag — so the cloud SDK never sees an interrupt and the whole AbortedException/lock saga disappears.
  • onFlowComplete()/onFlowError() join the worker for at most terminateTimeout. If a saveFiles() upload is genuinely hung, the join times out and the worker (a daemon thread) is abandoned — it cannot keep the JVM alive, so the head job exits cleanly.
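
The two points above can be sketched in plain Java. This is a hedged illustration under assumptions, not the code in this PR: the class name, the constructor parameters, and the Runnable standing in for handler.saveFiles() are hypothetical. Only the pattern (await on a CountDownLatch instead of sleeping, a daemon worker, and a join bounded by a timeout) comes from the description.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; not the LogsCheckpoint source. */
public class LatchCheckpointSketch {
    private final CountDownLatch stopLatch = new CountDownLatch(1);
    private final Runnable saveFiles;             // stand-in for handler.saveFiles()
    private final long intervalMillis;
    private final long terminateTimeoutMillis;    // must be > 0: join(0) waits forever
    private final Thread thread;

    public LatchCheckpointSketch(Runnable saveFiles, long intervalMillis, long terminateTimeoutMillis) {
        if (terminateTimeoutMillis <= 0) {
            throw new IllegalArgumentException("terminate timeout must be positive");
        }
        this.saveFiles = saveFiles;
        this.intervalMillis = intervalMillis;
        this.terminateTimeoutMillis = terminateTimeoutMillis;
        this.thread = new Thread(this::run, "logs-checkpoint-sketch");
        this.thread.setDaemon(true);              // an abandoned worker cannot keep the JVM alive
    }

    public void start() { thread.start(); }

    private void run() {
        try {
            // await() returns false when the interval elapses and true once stop() counts down,
            // so the loop uploads once per interval and exits promptly on shutdown.
            while (!stopLatch.await(intervalMillis, TimeUnit.MILLISECONDS)) {
                saveFiles.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // not expected: stop() never interrupts
        }
    }

    /** Signals the worker and waits at most the timeout; returns true if the worker finished. */
    public boolean stop() throws InterruptedException {
        stopLatch.countDown();                    // wakes a waiting worker without an interrupt
        thread.join(terminateTimeoutMillis);      // bounded wait
        return !thread.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated upload that never returns.
        Runnable hungUpload = () -> {
            try { new CountDownLatch(1).await(); } catch (InterruptedException ignored) { }
        };
        LatchCheckpointSketch checkpoint = new LatchCheckpointSketch(hungUpload, 10, 200);
        checkpoint.start();
        Thread.sleep(100);                        // let the worker enter the hung upload
        System.out.println("worker finished in time: " + checkpoint.stop());
    }
}
```

When the simulated upload hangs, stop() gives up after the timeout and main returns, leaving the daemon worker behind. When the worker is idle in await(), countDown() wakes it at once, so stop() does not wait out the rest of the interval.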

This is safe because the final, authoritative log upload is performed separately by CacheCommand; this observer is only a best-effort periodic uploader, so abandoning a stuck attempt loses nothing.

New, configurable bound (default 120s):

  • config: tower.logs.checkpoint.terminateTimeout
  • env: TOWER_LOGS_CHECKPOINT_TERMINATE_TIMEOUT

Tests

LogsCheckpointTest (7 cases) covers: the worker is a daemon, periodic checkpointing fires, stop() returns promptly when idle (no full-interval wait), and — the key guarantee — onFlowComplete() returns within terminateTimeout even when saveFiles() blocks forever, abandoning the daemon.

Notes

This is a provider-agnostic backstop and is complementary to #7024 (Apply socket timeout to S3 CRT connections, in 26.04 but not 25.10), which bounds the underlying stuck S3 upload itself. Affected runs on AWS + 25.10.x lack #7024, so the CRT upload could stall indefinitely; this change ensures the head job exits regardless of provider or SDK behavior.

🤖 Generated with Claude Code

Replace the interrupt + lock + unbounded join() shutdown logic with a
CountDownLatch and a bounded join, so the head job can always exit even
when a log checkpoint upload is stuck on a hung network connection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>

netlify Bot commented May 29, 2026

Deploy Preview for nextflow-docs-staging canceled.

Latest commit: a9c0f17
Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a1968e1731ac20008cef01d
