Skip to content

Commit 9ae552c

Browse files
cg505claude
andcommitted
[Jobs] Address review: wait unconditionally; document acquire placement
- Drop the `> 0` guard around the wait; sleep unconditionally after acquiring the lock (the constant is always positive). - Record why the lock acquire sits outside the recovery try/finally: the finally unlinks the signal file, but the lock-loss step-down re-touches it, so the finally is scoped to recovery only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Fh7ZZwmvfVeV3fHzjV2ACi
1 parent 61f7c33 commit 9ae552c

1 file changed

Lines changed: 15 additions & 8 deletions

File tree

sky/jobs/managed_job_refresh_thread.py

Lines changed: 15 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -93,6 +93,14 @@ def _become_leader_and_run(self) -> None:
9393
# its process and could mark the job FAILED_CONTROLLER. The signal
9494
# file makes update_managed_jobs_statuses and the scheduler's
9595
# controller-start path early-return until recovery completes.
96+
# NOTE: the acquire is deliberately NOT inside the try/finally that
97+
# wraps recovery below. The finally unlinks the signal file, but the
98+
# lock-loss step-down path (_suicide_on_lock_loss) re-touches it to
99+
# keep controllers gated through the shutdown drain — so that path must
100+
# not be followed by an unlink. Scoping the finally to recovery only
101+
# keeps those two concerns from fighting. It also means a raise from
102+
# acquire() leaves the gate file in place while run() retries, which is
103+
# what we want (controller starts stay gated until we hold the lock).
96104
signal_file = pathlib.Path(
97105
constants.PERSISTENT_RUN_RESTARTING_SIGNAL_FILE).expanduser()
98106
signal_file.touch()
@@ -102,17 +110,16 @@ def _become_leader_and_run(self) -> None:
102110
self._lock.acquire()
103111
logger.info('Consolidation mode lock acquired')
104112

105-
# Wait briefly before recovery so a prior leader (e.g. the old pod
106-
# during a rolling update) is fully gone first; see the comment on
113+
# Wait before recovery so a prior leader (e.g. the old pod during a
114+
# rolling update) is fully gone first; see the comment on
107115
# _RECOVERY_WAIT_AFTER_ACQUIRE_SECONDS. The signal file touched above
108116
# stays in place, gating controller starts and the FAILED_CONTROLLER
109117
# sweep until recovery completes.
110-
if _RECOVERY_WAIT_AFTER_ACQUIRE_SECONDS > 0:
111-
logger.info(
112-
f'Waiting {_RECOVERY_WAIT_AFTER_ACQUIRE_SECONDS}s after '
113-
'acquiring the consolidation mode lock before running '
114-
'recovery, to let any previous leader finish shutting down')
115-
time.sleep(_RECOVERY_WAIT_AFTER_ACQUIRE_SECONDS)
118+
logger.info(
119+
f'Waiting {_RECOVERY_WAIT_AFTER_ACQUIRE_SECONDS}s after acquiring '
120+
'the consolidation mode lock before running recovery, to let any '
121+
'previous leader finish shutting down')
122+
time.sleep(_RECOVERY_WAIT_AFTER_ACQUIRE_SECONDS)
116123

117124
# The wait above widens the window between acquiring the lock and
118125
# running recovery, during which the lock's underlying session could go

0 commit comments

Comments
 (0)