Fix LogsCheckpoint hanging head job shutdown on stuck log upload #7188
Open
jorgee wants to merge 1 commit into
Replace the interrupt + lock + unbounded join() shutdown logic with a CountDownLatch and a bounded join, so the head job can always exit even when a log checkpoint upload is stuck on a hung network connection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
✅ Deploy Preview for nextflow-docs-staging canceled.
Problem
A small fraction (<1%) of runs become zombie head jobs: Nextflow sends `onFlowComplete` to Seqera Platform (so the run shows as completed), but the head-job process never exits and the VM keeps running.

Diagnostic
The hang is in `LogsCheckpoint`, the observer that periodically uploads the `.nextflow.log`, report, and timeline files to the remote work directory.

Its shutdown path relied on a thread interrupt, a shared `lock`, and an unbounded `join()`.
When a network stall leaves a cloud upload inside `saveFiles()` blocked on a half-open socket:
- The worker thread stays inside `saveFiles()` holding `lock`.
- The main thread enters `onFlowComplete()` and blocks on `synchronized(lock)`, so it can't even deliver the interrupt.
- `thread.join()` has no timeout, so it would block forever regardless.

This class has been re-fixed repeatedly (#4166, #6787, #6939, plus the later `lock`), ping-ponging between two failure modes: setting the interrupt flag made the AWS SDK abort in-flight uploads (`AbortedException`), while removing it caused a 90 s shutdown stall. The `lock` added to reconcile them is what turned a transient network stall into a permanent shutdown deadlock.

Solution
Replace the interrupt + lock + unbounded join with a `CountDownLatch` and a bounded join:
- The worker waits on `stopLatch.await(interval)` instead of `Thread.sleep`. Shutdown calls `countDown()`, which wakes it immediately and without ever setting the interrupt flag, so the cloud SDK never sees an interrupt and the whole `AbortedException`/lock saga disappears.
- `onFlowComplete()` / `onFlowError()` `join` the worker for at most `terminateTimeout`. If a `saveFiles()` upload is genuinely hung, the join times out and the worker (a daemon thread) is abandoned. A daemon thread cannot keep the JVM alive, so the head job exits cleanly.

This is safe because the final, authoritative log upload is performed separately by `CacheCommand`; this observer is only a best-effort periodic uploader, so abandoning a stuck attempt loses nothing.

New, configurable bound (default `120s`):
- Config option: `tower.logs.checkpoint.terminateTimeout`
- Environment variable: `TOWER_LOGS_CHECKPOINT_TERMINATE_TIMEOUT`

Tests
`LogsCheckpointTest` (7 cases) covers: the worker is a daemon, periodic checkpointing fires, `stop()` returns promptly when idle (no full-interval wait), and, as the key guarantee, `onFlowComplete()` returns within `terminateTimeout` even when `saveFiles()` blocks forever, abandoning the daemon.

Notes
This is a provider-agnostic backstop and is complementary to #7024 (Apply socket timeout to S3 CRT connections, in 26.04 but not 25.10), which bounds the underlying stuck S3 upload itself. Affected runs on AWS + 25.10.x lack #7024, so the CRT upload could stall indefinitely; this change ensures the head job exits regardless of provider or SDK behavior.
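For reviewers who want to see the latch-plus-bounded-join pattern from the Solution section in isolation, here is a minimal Java sketch. It is not the code in this PR: the class `CheckpointSketch`, its constructor parameters, and the `saveFiles` callback are hypothetical stand-ins for `LogsCheckpoint`, and the timings are arbitrary. It only illustrates the two properties claimed above: an idle worker stops promptly, and a worker stuck in an upload is abandoned after the timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the observer; not the actual LogsCheckpoint class.
class CheckpointSketch {
    private final CountDownLatch stopLatch = new CountDownLatch(1);
    private final long intervalMillis;
    private final long terminateTimeoutMillis; // must be > 0: Thread.join(0) waits forever
    private final Runnable saveFiles;          // assumed upload callback
    private final Thread worker;

    CheckpointSketch(long intervalMillis, long terminateTimeoutMillis, Runnable saveFiles) {
        this.intervalMillis = intervalMillis;
        this.terminateTimeoutMillis = terminateTimeoutMillis;
        this.saveFiles = saveFiles;
        this.worker = new Thread(this::run, "checkpoint-sketch");
        this.worker.setDaemon(true); // a daemon thread cannot keep the JVM alive
    }

    void start() {
        worker.start();
    }

    private void run() {
        try {
            // await() returns false when the interval elapses, true once stop() counts down
            while (!stopLatch.await(intervalMillis, TimeUnit.MILLISECONDS)) {
                saveFiles.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if the worker exited in time, false if it was abandoned.
    boolean stop() throws InterruptedException {
        stopLatch.countDown();               // wakes an idle worker without interrupting it
        worker.join(terminateTimeoutMillis); // bounded wait, never blocks indefinitely
        return !worker.isAlive();
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch never = new CountDownLatch(1); // never counted down: simulates a hung upload
        CheckpointSketch cp = new CheckpointSketch(10, 200, () -> {
            entered.countDown();
            try {
                never.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        cp.start();
        entered.await(); // wait until the simulated upload is stuck
        long t0 = System.nanoTime();
        boolean exited = cp.stop();
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("worker exited: " + exited + ", stop() took about " + elapsedMs + " ms");
    }
}
```

In this sketch `stop()` returns `false` for the stuck worker after roughly the configured timeout, and `main` then returns normally because the abandoned thread is a daemon. One detail worth checking in any real implementation is a zero timeout, since `Thread.join(0)` means "wait forever".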
🤖 Generated with Claude Code