[debug dump] collect managed-jobs controller submit-job logs #9955

Open
ishankaul1 wants to merge 5 commits into skypilot-org:master from ishankaul1:debug-dump-collect-controller-submit-logs

@ishankaul1

Collaborator

Summary

Adds the managed-jobs controller's per-submission submit-job-*.log files to the
debug dump. These logs record the "Started N controllers" count for each
submission — the signal for diagnosing a controller over-count, which is not
reconstructable from the per-job <jobid>.log or job_info already collected.

Implementation

  • Collected via the existing controller debug-dump manifest: the on-controller
    CodeGen (collect_debug_dump_manifest) globs
    ~/sky_logs/managed_jobs/submit-job-*.log and adds matches to file_paths, so
    they ride the manifest's existing parallel rsync — no separate enumeration
    round-trip and no changes to debug_utils.py.
  • Scoped to the requested jobs: filenames are parsed via the inverse of
    _job_ids_to_str, so a range-named batch like submit-job-580-588.log is
    matched when any requested job falls in the range. This mirrors the scoping
    in _collect_controller_system_log_paths, so a long-lived controller's
    entire submission history isn't dragged into every dump.
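
The glob-then-scope step described above can be sketched roughly as follows. This is an illustration only, not the code in the PR: `collect_submit_log_paths` and `_NAME_RE` are hypothetical names, the filename pattern handles just a single id or a single inclusive range (the real format is whatever `_job_ids_to_str` emits), and returning an empty list for empty `job_ids` or a missing directory is an assumption.

```python
import glob
import os
import re
from typing import Iterable, List

# Hypothetical pattern: one id ("submit-job-7.log") or one inclusive range
# ("submit-job-580-588.log"). The real naming follows _job_ids_to_str.
_NAME_RE = re.compile(r'^submit-job-(\d+)(?:-(\d+))?\.log$')


def collect_submit_log_paths(log_dir: str, job_ids: Iterable[int]) -> List[str]:
    """Return submit-job logs in log_dir that cover any requested job id."""
    wanted = set(job_ids)
    # Assumption: nothing requested or no log directory means nothing to add.
    if not wanted or not os.path.isdir(log_dir):
        return []
    matched = []
    for path in sorted(glob.glob(os.path.join(log_dir, 'submit-job-*.log'))):
        m = _NAME_RE.match(os.path.basename(path))
        if m is None:
            continue  # Unparseable name; the PR warns and skips such files.
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        # Range-contains match: keep the file if any requested id is inside.
        if any(start <= job_id <= end for job_id in wanted):
            matched.append(path)
    return matched
```

In this sketch the returned paths would simply be appended to the manifest's file_paths, so the existing rsync picks them up with no extra round-trip.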

Notes

submit-job-*.log is written in consolidation mode, where the submission runs on
the API server without a Ray runtime and is therefore logged to an explicit path.
In non-consolidation mode the analogous submission output is captured by Ray's
controller.log, and the "Started N controllers" signal also lands in the
controller's skylet.log (already collected by the dump), so non-consolidation
deployments aren't left without the signal.

Test plan

  • Unit tests (tests/unit_tests/test_jobs_utils.py):

    • TestParseSubmitLogJobIds — single id, inclusive range, mixed, malformed.
    • TestControllerSubmitLogScoping — range-contains match, unrelated excluded,
      empty job_ids, missing dir, unparseable filename → warned + skipped.
  • Manual e2e (consolidation mode, local Kubernetes):

    1. Enable jobs consolidation mode; restart the API server.
    2. sky jobs launch -n e2e --infra k8s -y "echo hi; sleep 45".
    3. sky debug-dump -j <id> --output dump.zip.
    4. Confirmed managed_jobs/controller_submit_logs/submit-job-<id>.log is present
      in the zip with the real submission-log contents.

    Also verified on a separate (non-consolidation) controller that the new code is
    installed and correctly collects zero submit-job logs (none exist in that mode)
    without error.
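
Step 4 of the manual check could also be scripted. A minimal sketch, assuming only that the dump is a regular zip archive and that submit logs land under a managed_jobs/controller_submit_logs/ directory somewhere inside it (possibly below a top-level folder); `submit_logs_in_dump` is a hypothetical helper, not part of SkyPilot.

```python
import zipfile
from typing import List

# Directory name taken from the manual e2e step; any prefix above it in the
# archive is tolerated because the exact zip layout is an assumption here.
_SUBMIT_LOG_DIR = 'managed_jobs/controller_submit_logs/'


def submit_logs_in_dump(zip_path: str) -> List[str]:
    """List archive members that look like collected submit-job logs."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(
            name for name in zf.namelist()
            if _SUBMIT_LOG_DIR in name and name.endswith('.log'))
```

For a consolidation-mode dump, a non-empty result confirms the logs were collected; for a non-consolidation controller the list is expected to be empty.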

Add ~/sky_logs/managed_jobs/submit-job-*.log to the debug dump. These
per-submission logs record the "Started N controllers" count -- the
controller over-count signal that pins leaked controllers to a specific
submission -- which is not reconstructable from the per-job data already
collected.

Collected via the existing controller debug-dump manifest: the on-
controller CodeGen (collect_debug_dump_manifest) globs the submission
logs and adds them to file_paths, so they ride the manifest's existing
parallel rsync instead of a separate enumeration round-trip. Scoped to
the requested jobs, mirroring the controller_system log scoping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces functionality to collect managed-job submission log file paths (submit-job-*.log) scoped to specific job IDs during debug dump collection, along with corresponding unit tests. The review feedback highlights a potential memory issue (OOM risk) when expanding large job ID ranges into a set, and suggests refactoring the parser to return a list of integer ranges (List[Tuple[int, int]]) instead. Consequently, the feedback also recommends updating the log collection logic to perform range-based intersection checks and adjusting the unit tests to match this new return type.

Comment thread sky/jobs/utils.py Outdated
Comment thread sky/jobs/utils.py
Comment thread tests/unit_tests/test_jobs_utils.py Outdated
ishankaul1 and others added 2 commits June 25, 2026 16:36
…ansion)

Keep submit-job filename id-sets as inclusive (start, end) ranges instead of
expanding them into a set of ints, so a pathological filename like
submit-job-1-100000000.log can't blow up memory. Match requested jobs via
interval membership. Also skip non-file glob matches and reject inverted
ranges.
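
A rough sketch of the interval approach this commit describes, for illustration only. `parse_id_ranges` and `any_job_in_ranges` are hypothetical names, and the assumption that a mixed id string joins tokens with commas (for example "3,580-588") is not confirmed by this page.

```python
from typing import Iterable, List, Optional, Tuple

Range = Tuple[int, int]  # Inclusive (start, end).


def parse_id_ranges(spec: str) -> Optional[List[Range]]:
    """Parse an id string into inclusive ranges; None if malformed."""
    ranges: List[Range] = []
    for token in spec.split(','):  # Comma separator is an assumption.
        start_s, sep, end_s = token.partition('-')
        if not sep:
            end_s = start_s  # A single id is a one-element range.
        if not (start_s.isdecimal() and end_s.isdecimal()):
            return None
        start, end = int(start_s), int(end_s)
        if start > end:
            return None  # Reject inverted ranges such as "9-3".
        ranges.append((start, end))
    return ranges


def any_job_in_ranges(job_ids: Iterable[int], ranges: List[Range]) -> bool:
    """Interval membership: no range is ever expanded into individual ids."""
    return any(start <= job_id <= end
               for job_id in job_ids
               for start, end in ranges)
```

With this shape, a spec like "1-100000000" costs one tuple rather than a hundred million set entries, and matching is proportional to the number of requested jobs times the number of ranges.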

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ishankaul1 ishankaul1 changed the title from "[Core] debug-dump: collect managed-jobs controller submit-job logs" to "[debug dump] collect managed-jobs controller submit-job logs" Jun 25, 2026
@Michaelvll Michaelvll requested a review from cg505 June 26, 2026 00:25
