[Jobs] Add on_before_recovery runtime hook #9966
Code Review
This pull request introduces an on_before_recovery hook to the managed job runtime, allowing the system to capture logs from a failing cluster before it is torn down or relaunched. Feedback on the changes points out that calling this synchronous, I/O-heavy hook directly inside the asynchronous _monitor_one_task method will block the event loop, and recommends wrapping the call in asyncio.to_thread to run it in a separate thread.
e35df01 to ee90f6d
Add an `on_before_recovery` method to the `ManagedJobRuntime` protocol and a module-level dispatch, invoked from the controller's recovery branch before the failing cluster is torn down or relaunched. This lets a registered runtime snapshot the about-to-be-lost run's logs while the cluster is (best-effort) still reachable, so a recovered job's previous run remains inspectable.

The dispatch is defensive (skips runtimes that predate the hook) and the controller swallows exceptions, so a failure here never blocks recovery.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P2K1V9diRh1hJVyPJeRBuM
…d guard

The hook call in _monitor_one_task was synchronous, stalling the event loop (and therefore all in-flight managed-job monitors) while the runtime downloads logs over SSH/network. Wrap it with asyncio.to_thread to match every other blocking call in that coroutine.

Add an is_registered() guard so no thread-pool task is spawned when no runtime is installed, and update the docstring to reflect that the cluster may already be unreachable when the hook fires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s
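The pattern this commit describes (guard on registration, run the blocking hook in a worker thread, swallow its exceptions) can be sketched roughly as follows. This is a toy model, not the controller's code: `register`, `recover`, `SlowRuntime`, and `ticks_during_recovery` are hypothetical names, and the hook takes only a `job_id` here for brevity.

```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)

# Hypothetical module-level registry; the real dispatch lives in
# sky/jobs/runtime.py and may look different.
_runtime = None


def register(runtime) -> None:
    global _runtime
    _runtime = runtime


def is_registered() -> bool:
    return _runtime is not None


def on_before_recovery(job_id: int) -> None:
    # Synchronous and potentially slow: this is where logs would be pulled.
    _runtime.on_before_recovery(job_id)


async def recover(job_id: int) -> None:
    """Recovery branch of a monitor coroutine (illustrative only)."""
    if is_registered():  # no thread-pool task when nothing is installed
        try:
            # Run the blocking hook off the event loop thread.
            await asyncio.to_thread(on_before_recovery, job_id)
        except Exception:  # a failing hook must never block recovery
            logger.exception("on_before_recovery failed; continuing")
    # ... teardown / relaunch would follow here ...


class SlowRuntime:
    """Toy runtime whose hook blocks for a while, like a log download."""

    def __init__(self, delay: float = 0.2) -> None:
        self.delay = delay
        self.calls = []

    def on_before_recovery(self, job_id: int) -> None:
        time.sleep(self.delay)
        self.calls.append(job_id)


async def ticks_during_recovery(job_id: int) -> int:
    """Count event-loop ticks that happen while recovery is in flight."""
    task = asyncio.create_task(recover(job_id))
    ticks = 0
    while not task.done():
        ticks += 1
        await asyncio.sleep(0.01)
    await task
    return ticks
```

If the hook were called directly instead of through `asyncio.to_thread`, the tick counter in `ticks_during_recovery` would barely advance, because the sleeping hook would hold the event loop for the full delay.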
Widen the on_before_recovery protocol + module dispatch with an ``exit_codes`` parameter, and forward the per-node exit codes the controller already computes for the failed run. The controller resets ``exit_codes`` to None each monitor-loop iteration so a stale value from a prior iteration never leaks into the hook; it stays None for infra-level failures (preemption) where there is no app exit.

This lets a runtime record why a run recovered (the app-level exit code), not just its logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s
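The per-iteration reset matters because a loop-scoped variable would otherwise carry a job failure's exit codes into a later preemption. A minimal sketch of that behavior, assuming a simplified event stream (the `monitor` function and the event dictionaries are hypothetical, not the controller's actual data model):

```python
from typing import Callable, Dict, List, Optional


def monitor(events: List[Dict],
            hook: Callable[[Optional[List[int]]], None]) -> None:
    """Toy monitor loop: one event per iteration."""
    for event in events:
        # Reset every iteration so a value from an earlier failure
        # can never leak into a later recovery.
        exit_codes: Optional[List[int]] = None
        if event["kind"] == "healthy":
            continue
        if event["kind"] == "job_failure":
            # App-level failure: per-node exit codes are known.
            exit_codes = event["exit_codes"]
        # For infra-level failures (preemption) exit_codes stays None.
        hook(exit_codes)
```

Feeding this loop a job failure followed by a preemption hands the hook the failure's exit codes first and `None` second.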
ee90f6d to f803421
A runtime's log download on a shared pool cluster must be scoped to the job's on-cluster id, or it can pull a different job's latest run. The controller's own download_log_and_stream already passes job_ids=[job_id_on_pool_cluster] for this reason; thread the same id into the on_before_recovery hook (protocol + dispatch + call site) so a runtime can disambiguate too.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s
When a multi-node run loses a node, rsyncing logs from every node execs into each one, and an unreachable/terminated node aborts the whole fan-out, losing the surviving nodes' logs too. Add head_only so callers can rsync just the head, whose run.log already aggregates every node's task output via the runtime's log streaming.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
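The failure mode and the head-only workaround can be illustrated with a toy model. This is not `CloudVmRayBackend.sync_down_logs`; the function below, its `fetch` callback, and `make_fetch` are hypothetical stand-ins that assume the head node is first in the node list.

```python
from typing import Callable, Dict, Iterable, List


def sync_down_logs(nodes: List[str],
                   fetch: Callable[[str], str],
                   head_only: bool = False) -> Dict[str, str]:
    """Fetch logs per node; with head_only, touch only the first node."""
    targets = nodes[:1] if head_only else nodes
    # Any exception from fetch() aborts the whole fan-out.
    return {node: fetch(node) for node in targets}


def make_fetch(unreachable: Iterable[str]) -> Callable[[str], str]:
    """Build a fake fetcher that fails for nodes that are already gone."""
    dead = set(unreachable)

    def fetch(node: str) -> str:
        if node in dead:
            raise ConnectionError(f"{node} is unreachable")
        return f"run.log from {node}"

    return fetch
```

With one worker marked unreachable, the all-node call raises and returns nothing, while the head-only call still returns the head's log.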
Add a PluginSlot beside the log node picker so a plugin can contribute extra log filters next to the existing node dropdown. Mirrors the jobs.detail.logs context (jobId/taskId) so a plugin can key shared state on the same identity as the log pane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f95af39 to cfa7a46
/smoke-test --kubernetes --jobs-consolidation --no-resource-heavy
Summary
Adds extension points so a registered managed-jobs runtime can preserve a
recovering job's logs before its cluster is torn down, plus a dashboard slot to
surface them. When a managed job recovers, its previous run's logs are otherwise
lost with the replaced cluster; these hooks let a runtime snapshot them first.
Changes:
- `on_before_recovery` hook. New `ManagedJobRuntime.on_before_recovery(handle, backend, job_id, task_id, exit_codes, job_id_on_pool_cluster)` protocol method + a module-level dispatch in `sky/jobs/runtime.py`. The controller (`sky/jobs/controller.py`) invokes it in `_monitor_one_task`'s recovery branch (before cleanup/relaunch, while the cluster is best-effort still reachable) via `asyncio.to_thread` (so the blocking log I/O never stalls the event loop) and gated on `is_registered()`.
  - `exit_codes`: the failed run's per-node exit codes (when recovery was triggered by a job failure), so the hook can record why each node exited.
  - `job_id_on_pool_cluster`: disambiguates which job's logs to pull on a shared pool cluster.
- `sync_down_logs(head_only=...)`. `CloudVmRayBackend.sync_down_logs` gains a `head_only` option to rsync only the head node's `run.log` (which already aggregates every node's output via the runtime's log streaming). During a multi-node recovery a node is often already gone, and the all-node fan-out execs into every node, so one unreachable node aborts the whole download. Head-only sidesteps that and still yields the full aggregated log.
- `jobs.detail.logfilters` dashboard slot. A `PluginSlot` beside the log node picker so a plugin can contribute extra log filters, keyed on the same `jobId`/`taskId` context as the log pane.
Defensive: the dispatch skips runtimes predating the hook, and the controller
swallows hook exceptions so a failure never blocks recovery. No behavior change
on the default OSS path (no runtime registered) or for runtimes that don't
implement the hook.
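A rough sketch of what a protocol method plus a version-skew-tolerant dispatch could look like is shown below. The parameter names come from the description above; the type annotations, the `register` helper, and the boolean return value are assumptions made for illustration and are not taken from `sky/jobs/runtime.py`.

```python
from typing import Any, List, Optional, Protocol


class ManagedJobRuntime(Protocol):
    """Illustrative protocol; only the hook discussed here is shown."""

    def on_before_recovery(self, handle: Any, backend: Any, job_id: int,
                           task_id: int, exit_codes: Optional[List[int]],
                           job_id_on_pool_cluster: Optional[int]) -> None:
        ...


_runtime: Optional[Any] = None  # hypothetical registry slot


def register(runtime: Optional[Any]) -> None:
    global _runtime
    _runtime = runtime


def on_before_recovery(handle: Any, backend: Any, job_id: int, task_id: int,
                       exit_codes: Optional[List[int]] = None,
                       job_id_on_pool_cluster: Optional[int] = None) -> bool:
    """Dispatch to the registered runtime; returns whether the hook ran."""
    # getattr with a default covers both "no runtime registered" and
    # "runtime predates the hook" without raising AttributeError.
    hook = getattr(_runtime, "on_before_recovery", None)
    if not callable(hook):
        return False
    hook(handle, backend, job_id, task_id, exit_codes,
         job_id_on_pool_cluster)
    return True
```

Looking the method up with `getattr` rather than calling it directly is what lets an older runtime object, built before the hook existed, keep working unchanged.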
Test plan
skew (runtime lacking the method), and fires once per recovery otherwise.
`head_only` rsyncs only the head node.

🤖 Generated with Claude Code