Skip to content

[Jobs] Add on_before_recovery runtime hook#9966

Open
Michaelvll wants to merge 6 commits into
masterfrom
zhwu/jobs-recovery-log-hooks
Open

[Jobs] Add on_before_recovery runtime hook#9966
Michaelvll wants to merge 6 commits into
masterfrom
zhwu/jobs-recovery-log-hooks

Conversation

@Michaelvll

@Michaelvll Michaelvll commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds extension points so a registered managed-jobs runtime can preserve a
recovering job's logs before its cluster is torn down, plus a dashboard slot to
surface them. When a managed job recovers, its previous run's logs are otherwise
lost with the replaced cluster; these hooks let a runtime snapshot them first.

Changes:

  1. on_before_recovery hook. New
    ManagedJobRuntime.on_before_recovery(handle, backend, job_id, task_id, exit_codes, job_id_on_pool_cluster)
    protocol method + a module-level dispatch in sky/jobs/runtime.py. The
    controller (sky/jobs/controller.py) invokes it in _monitor_one_task's
    recovery branch — before cleanup/relaunch, while the cluster is (best-effort)
    still reachable — via asyncio.to_thread (so the blocking log I/O never
    stalls the event loop) and gated on is_registered().

    • exit_codes: the failed run's per-node exit codes (when recovery was
      triggered by a job failure), so the hook can record why each node exited.
    • job_id_on_pool_cluster: disambiguates which job's logs to pull on a
      shared pool cluster.
  2. sync_down_logs(head_only=...). CloudVmRayBackend.sync_down_logs gains
    a head_only option to rsync only the head node's run.log (which already
    aggregates every node's output via the runtime's log streaming). During a
    multi-node recovery a node is often already gone, and the all-node fan-out
    exec's into every node — so one unreachable node aborts the whole download.
    Head-only sidesteps that and still yields the full aggregated log.

  3. jobs.detail.logfilters dashboard slot. A PluginSlot beside the log
    node picker so a plugin can contribute extra log filters, keyed on the same jobId/taskId
    context as the log pane.

Defensive: the dispatch skips runtimes predating the hook, and the controller
swallows hook exceptions so a failure never blocks recovery. No behavior change
on the default OSS path (no runtime registered) or for runtimes that don't
implement the hook.

Test plan

  • Existing managed-jobs unit tests pass.
  • Verified the dispatch is a no-op with no runtime registered and under version
    skew (runtime lacking the method), and fires once per recovery otherwise.
  • Verified head_only rsyncs only the head node.

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an on_before_recovery hook to the managed job runtime, allowing the system to capture logs from a failing cluster before it is torn down or relaunched. Feedback on the changes points out that calling this synchronous, I/O-heavy hook directly inside the asynchronous _monitor_one_task method will block the event loop, and recommends wrapping the call in asyncio.to_thread to run it in a separate thread.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread sky/jobs/controller.py Outdated
@kevinmingtarja kevinmingtarja self-requested a review June 26, 2026 20:00
@kevinmingtarja kevinmingtarja force-pushed the zhwu/jobs-recovery-log-hooks branch from e35df01 to ee90f6d Compare June 30, 2026 17:33
Michaelvll and others added 3 commits June 30, 2026 10:35
Add an `on_before_recovery` method to the `ManagedJobRuntime` protocol and a
module-level dispatch, invoked from the controller's recovery branch before
the failing cluster is torn down or relaunched. This lets a registered runtime
snapshot the about-to-be-lost run's logs while the cluster is (best-effort)
still reachable, so a recovered job's previous run remains inspectable.

The dispatch is defensive (skips runtimes that predate the hook) and the
controller swallows exceptions, so a failure here never blocks recovery.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P2K1V9diRh1hJVyPJeRBuM
…d guard

The hook call in _monitor_one_task was synchronous, stalling the event
loop (and therefore all in-flight managed-job monitors) while the
runtime downloads logs over SSH/network. Wrap it with asyncio.to_thread
to match every other blocking call in that coroutine.

Add an is_registered() guard so no thread-pool task is spawned when no
runtime is installed, and update the docstring to reflect that the
cluster may already be unreachable when the hook fires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s
Widen the on_before_recovery protocol + module dispatch with an
``exit_codes`` parameter, and forward the per-node exit codes the
controller already computes for the failed run. The controller resets
``exit_codes`` to None each monitor-loop iteration so a stale value from
a prior iteration never leaks into the hook; it stays None for
infra-level failures (preemption) where there is no app exit.

This lets a runtime record why a run recovered (the app-level exit
code), not just its logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s
@kevinmingtarja kevinmingtarja force-pushed the zhwu/jobs-recovery-log-hooks branch from ee90f6d to f803421 Compare June 30, 2026 17:35
kevinmingtarja and others added 3 commits June 30, 2026 13:01
A runtime's log download on a shared pool cluster must be scoped to the
job's on-cluster id, or it can pull a different job's latest run. The
controller's own download_log_and_stream already passes
job_ids=[job_id_on_pool_cluster] for this reason; thread the same id
into the on_before_recovery hook (protocol + dispatch + call site) so a
runtime can disambiguate too.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0135Wz4EsfxEDrZXxKT7Ts7s
When a multi-node run loses a node, rsyncing logs from every node
exec's into each one -- and an unreachable/terminated node aborts the
whole fan-out, losing the surviving nodes' logs too. Add head_only so
callers can rsync just the head, whose run.log already aggregates every
node's task output via the runtime's log streaming.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a PluginSlot beside the log node picker so a plugin can contribute
extra log filters next to the existing node dropdown. Mirrors the
jobs.detail.logs context (jobId/taskId) so a plugin can key shared state
on the same identity as the log pane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kevinmingtarja kevinmingtarja force-pushed the zhwu/jobs-recovery-log-hooks branch from f95af39 to cfa7a46 Compare July 1, 2026 01:09
@kevinmingtarja

Copy link
Copy Markdown
Collaborator

/smoke-test --kubernetes --jobs-consolidation --no-resource-heavy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants