[Jobs] Add on_before_recovery runtime hook by Michaelvll · Pull Request #9966 · skypilot-org/skypilot

Michaelvll · 2026-06-26T15:32:36Z

Summary

Adds extension points so a registered managed-jobs runtime can preserve a
recovering job's logs before its cluster is torn down, plus a dashboard slot to
surface them. When a managed job recovers, its previous run's logs are otherwise
lost with the replaced cluster; these hooks let a runtime snapshot them first.

Changes:

on_before_recovery hook. New
ManagedJobRuntime.on_before_recovery(handle, backend, job_id, task_id, exit_codes, job_id_on_pool_cluster)
protocol method + a module-level dispatch in sky/jobs/runtime.py. The
controller (sky/jobs/controller.py) invokes it in _monitor_one_task's
recovery branch — before cleanup/relaunch, while the cluster is (best-effort)
still reachable — via asyncio.to_thread (so the blocking log I/O never
stalls the event loop) and gated on is_registered().
- exit_codes: the failed run's per-node exit codes (when recovery was
  triggered by a job failure), so the hook can record why each node exited.
- job_id_on_pool_cluster: disambiguates which job's logs to pull on a
  shared pool cluster.
sync_down_logs(head_only=...). CloudVmRayBackend.sync_down_logs gains
a head_only option to rsync only the head node's run.log (which already
aggregates every node's output via the runtime's log streaming). During a
multi-node recovery a node is often already gone, and the all-node fan-out
exec's into every node — so one unreachable node aborts the whole download.
Head-only sidesteps that and still yields the full aggregated log.
jobs.detail.logfilters dashboard slot. A PluginSlot beside the log
node picker so a plugin can contribute extra log filters, keyed on the same jobId/taskId
context as the log pane.

Defensive: the dispatch skips runtimes predating the hook, and the controller
swallows hook exceptions so a failure never blocks recovery. No behavior change
on the default OSS path (no runtime registered) or for runtimes that don't
implement the hook.

Test plan

Existing managed-jobs unit tests pass.
Verified the dispatch is a no-op with no runtime registered and under version
skew (runtime lacking the method), and fires once per recovery otherwise.
Verified head_only rsyncs only the head node.

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces an on_before_recovery hook to the managed job runtime, allowing the system to capture logs from a failing cluster before it is torn down or relaunched. Feedback on the changes points out that calling this synchronous, I/O-heavy hook directly inside the asynchronous _monitor_one_task method will block the event loop, and recommends wrapping the call in asyncio.to_thread to run it in a separate thread.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Add an `on_before_recovery` method to the `ManagedJobRuntime` protocol and a module-level dispatch, invoked from the controller's recovery branch before the failing cluster is torn down or relaunched. This lets a registered runtime snapshot the about-to-be-lost run's logs while the cluster is (best-effort) still reachable, so a recovered job's previous run remains inspectable. The dispatch is defensive (skips runtimes that predate the hook) and the controller swallows exceptions, so a failure here never blocks recovery. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01P2K1V9diRh1hJVyPJeRBuM

…d guard The hook call in _monitor_one_task was synchronous, stalling the event loop (and therefore all in-flight managed-job monitors) while the runtime downloads logs over SSH/network. Wrap it with asyncio.to_thread to match every other blocking call in that coroutine. Add an is_registered() guard so no thread-pool task is spawned when no runtime is installed, and update the docstring to reflect that the cluster may already be unreachable when the hook fires. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s

Widen the on_before_recovery protocol + module dispatch with an ``exit_codes`` parameter, and forward the per-node exit codes the controller already computes for the failed run. The controller resets ``exit_codes`` to None each monitor-loop iteration so a stale value from a prior iteration never leaks into the hook; it stays None for infra-level failures (preemption) where there is no app exit. This lets a runtime record why a run recovered (the app-level exit code), not just its logs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s

A runtime's log download on a shared pool cluster must be scoped to the job's on-cluster id, or it can pull a different job's latest run. The controller's own download_log_and_stream already passes job_ids=[job_id_on_pool_cluster] for this reason; thread the same id into the on_before_recovery hook (protocol + dispatch + call site) so a runtime can disambiguate too. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s

When a multi-node run loses a node, rsyncing logs from every node exec's into each one -- and an unreachable/terminated node aborts the whole fan-out, losing the surviving nodes' logs too. Add head_only so callers can rsync just the head, whose run.log already aggregates every node's task output via the runtime's log streaming. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a PluginSlot beside the log node picker so a plugin can contribute extra log filters next to the existing node dropdown. Mirrors the jobs.detail.logs context (jobId/taskId) so a plugin can key shared state on the same identity as the log pane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kevinmingtarja · 2026-07-01T02:29:18Z

/smoke-test --kubernetes --jobs-consolidation --no-resource-heavy

gemini-code-assist Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread sky/jobs/controller.py Outdated

kevinmingtarja self-requested a review June 26, 2026 20:00

kevinmingtarja force-pushed the zhwu/jobs-recovery-log-hooks branch from e35df01 to ee90f6d Compare June 30, 2026 17:33

Michaelvll and others added 3 commits June 30, 2026 10:35

kevinmingtarja force-pushed the zhwu/jobs-recovery-log-hooks branch from ee90f6d to f803421 Compare June 30, 2026 17:35

kevinmingtarja and others added 3 commits June 30, 2026 13:01

kevinmingtarja force-pushed the zhwu/jobs-recovery-log-hooks branch from f95af39 to cfa7a46 Compare July 1, 2026 01:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Jobs] Add on_before_recovery runtime hook#9966

[Jobs] Add on_before_recovery runtime hook#9966
Michaelvll wants to merge 6 commits into
masterfrom
zhwu/jobs-recovery-log-hooks

Michaelvll commented Jun 26, 2026 •

edited by kevinmingtarja

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

kevinmingtarja commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Michaelvll commented Jun 26, 2026 • edited by kevinmingtarja Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

kevinmingtarja commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Michaelvll commented Jun 26, 2026 •

edited by kevinmingtarja

Loading