[debug dump] collect managed-jobs controller submit-job logs #9955
ishankaul1 wants to merge 5 commits into
Add ~/sky_logs/managed_jobs/submit-job-*.log to the debug dump. These per-submission logs record the "Started N controllers" count -- the controller over-count signal that pins leaked controllers to a specific submission -- which is not reconstructable from the per-job data already collected.

Collected via the existing controller debug-dump manifest: the on-controller CodeGen (collect_debug_dump_manifest) globs the submission logs and adds them to file_paths, so they ride the manifest's existing parallel rsync instead of a separate enumeration round-trip. Scoped to the requested jobs, mirroring the controller_system log scoping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces functionality to collect managed-job submission log file paths (submit-job-*.log) scoped to specific job IDs during debug dump collection, along with corresponding unit tests. The review feedback highlights a potential memory issue (OOM risk) when expanding large job ID ranges into a set, and suggests refactoring the parser to return a list of integer ranges (List[Tuple[int, int]]) instead. Consequently, the feedback also recommends updating the log collection logic to perform range-based intersection checks and adjusting the unit tests to match this new return type.
…ansion)

Keep submit-job filename id-sets as inclusive (start, end) ranges instead of expanding them into a set of ints, so a pathological filename like submit-job-1-100000000.log can't blow up memory. Match requested jobs via interval membership. Also skip non-file glob matches and reject inverted ranges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
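For readers following the review thread, the range-based parsing described in this commit could look roughly like the sketch below. It is not the code from this PR: the function names `parse_submit_log_ranges` and `any_job_in_ranges` are hypothetical, and the filename grammar (comma-separated tokens, each either a single id or an inclusive `start-end` range) is an assumption about what `_job_ids_to_str` produces.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed filename grammar: submit-job-<tokens>.log, where <tokens> is a
# comma-separated list of "N" or "A-B". This is a guess, not the PR's format.
_SUBMIT_LOG_RE = re.compile(r'^submit-job-(?P<ids>[0-9,\-]+)\.log$')


def parse_submit_log_ranges(filename: str) -> Optional[List[Tuple[int, int]]]:
    """Return inclusive (start, end) ranges, or None if the name is malformed."""
    match = _SUBMIT_LOG_RE.match(filename)
    if match is None:
        return None
    ranges: List[Tuple[int, int]] = []
    for token in match.group('ids').split(','):
        parts = token.split('-')
        if len(parts) == 1 and parts[0].isdigit():
            start = end = int(parts[0])
        elif len(parts) == 2 and all(p.isdigit() for p in parts):
            start, end = int(parts[0]), int(parts[1])
        else:
            return None
        if start > end:
            # Inverted ranges are rejected rather than silently swapped.
            return None
        ranges.append((start, end))
    return ranges


def any_job_in_ranges(job_ids: Iterable[int],
                      ranges: List[Tuple[int, int]]) -> bool:
    """Interval membership: no range is ever expanded into individual ids."""
    return any(start <= job_id <= end
               for job_id in job_ids
               for start, end in ranges)
```

With this shape, a name like `submit-job-1-100000000.log` costs one tuple instead of a hundred million set entries, and the match cost scales with the number of requested jobs times the number of ranges.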
Summary
Adds the managed-jobs controller's per-submission `submit-job-*.log` files to the debug dump. These logs record the "Started N controllers" count for each submission, which is the signal for diagnosing a controller over-count and is not reconstructable from the per-job `<jobid>.log` or `job_info` already collected.

Implementation
- CodeGen (`collect_debug_dump_manifest`) globs `~/sky_logs/managed_jobs/submit-job-*.log` and adds matches to `file_paths`, so they ride the manifest's existing parallel rsync: no separate enumeration round-trip and no changes to `debug_utils.py`.
- Collection is scoped to the requested jobs. Filenames follow the `_job_ids_to_str` format, so a range-named batch like `submit-job-580-588.log` is matched when any requested job falls in the range. This mirrors how `_collect_controller_system_log_paths` scopes, so a long-lived controller's entire submission history isn't dragged into every dump.
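The glob-and-scope step above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `collect_submit_log_paths` and `_parse_ranges` are hypothetical names, the filename grammar (comma-separated ids or inclusive `A-B` ranges) is assumed, and returning nothing for an empty `job_ids` is a guess at the intended behaviour.

```python
import glob
import logging
import os
import re
from typing import Iterable, List, Optional, Tuple

logger = logging.getLogger(__name__)

# Assumed grammar: submit-job-<N | A-B>[,<N | A-B>...].log
_NAME_RE = re.compile(
    r'^submit-job-([0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)\.log$')


def _parse_ranges(filename: str) -> Optional[List[Tuple[int, int]]]:
    """Compact stand-in parser so this block is self-contained."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    ranges = []
    for token in match.group(1).split(','):
        start_str, _, end_str = token.partition('-')
        start, end = int(start_str), int(end_str or start_str)
        if start > end:
            return None  # inverted range
        ranges.append((start, end))
    return ranges


def collect_submit_log_paths(log_dir: str, job_ids: Iterable[int]) -> List[str]:
    """Return submit-job log paths whose id ranges contain a requested job."""
    wanted = list(job_ids)
    if not wanted:
        # Assumption: no requested jobs means nothing is collected.
        return []
    pattern = os.path.join(os.path.expanduser(log_dir), 'submit-job-*.log')
    paths = []
    # A missing directory simply yields no glob matches.
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip directories or other non-file matches
        ranges = _parse_ranges(os.path.basename(path))
        if ranges is None:
            logger.warning('Skipping unparseable submit log: %s', path)
            continue
        if any(start <= job <= end for job in wanted for start, end in ranges):
            paths.append(path)
    return paths
```

A caller would then extend the manifest's `file_paths` with the returned list, which is what lets the files travel over the existing rsync pass.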
Notes
`submit-job-*.log` is written in consolidation mode, where the submission runs on the API server without a Ray runtime and is therefore logged to an explicit path. In non-consolidation mode the analogous submission output is captured by Ray's `controller.log`, and the "Started N controllers" signal also lands in the controller's `skylet.log` (already collected by the dump), so non-consolidation deployments aren't left without the signal.
Test plan
Unit tests (`tests/unit_tests/test_jobs_utils.py`):
- `TestParseSubmitLogJobIds`: single id, inclusive range, mixed, malformed.
- `TestControllerSubmitLogScoping`: range-contains match, unrelated excluded, empty `job_ids`, missing dir, unparseable filename → warned + skipped.

Manual e2e (consolidation mode, local Kubernetes):
1. `sky jobs launch -n e2e --infra k8s -y "echo hi; sleep 45"`.
2. `sky debug-dump -j <id> --output dump.zip`.
3. `managed_jobs/controller_submit_logs/submit-job-<id>.log` is present in the zip with the real submission-log contents.

Also verified on a separate (non-consolidation) controller that the new code is
installed and correctly collects zero submit-job logs (none exist in that mode)
without error.
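
The manual check of the zip contents could be scripted along these lines. This is only a sketch: the helper name `submit_logs_in_dump` is hypothetical, and it assumes the archive may nest `managed_jobs/controller_submit_logs/` under a top-level directory, so it matches the path as a substring rather than a prefix.

```python
import zipfile
from typing import List

# Directory named in the test plan; its position inside the archive is assumed.
_SUBMIT_LOG_DIR = 'managed_jobs/controller_submit_logs/'


def submit_logs_in_dump(zip_path: str) -> List[str]:
    """List archive members that look like collected submit-job logs."""
    with zipfile.ZipFile(zip_path) as dump:
        return sorted(
            name for name in dump.namelist()
            if _SUBMIT_LOG_DIR in name and name.endswith('.log'))
```

On a consolidation-mode dump this should return at least one entry for the requested job; on a non-consolidation controller an empty list is the expected result.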