-
Notifications
You must be signed in to change notification settings - Fork 19
feat(slurm): implement squeue+sacct hybrid for accurate job status #506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(slurm): implement squeue+sacct hybrid for accurate job status #506
Conversation
Signed-off-by: Adam Rajfer <[email protected]>
|
/ok to test 4abd03c |
| then falls back to sacct for completed/historical jobs that squeue doesn't show. | ||
| Args: | ||
| slurm_job_ids: List of SLURM job IDs to query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that most of the bugs are coming from the fact that we cannot reliably tell the slurm job id for a specific job. We are trying to read this from a file, but there can be some race conditions and manual restarts that can make the file to be out-of-sync from reality.
For the concrete case we discussed offline, will this fix the status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should try to handle cases where a user does something manually, e.g. restarts the job.
Also I think the file with job IDs is the closest thing to the truth that we can get. If we tried to get the information from all user's jobs, we'd open a new can of worms - most folks run different things, not only evaluations, it's hard to predict what corner cases we'd hit.
Signed-off-by: Adam Rajfer <[email protected]>
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
WalkthroughThe SLURM executor refactored status retrieval to return tuples of (status, job_id) instead of plain status strings. Two new helper functions separate queries for active jobs (squeue) and historical jobs (sacct), with squeue preferred and sacct as fallback. Updated tests validate the new tuple format and combined query behavior. Changes
Sequence DiagramsequenceDiagram
participant Caller
participant _query_slurm_jobs_status
participant squeue
participant sacct
Caller->>_query_slurm_jobs_status: Query job statuses for [job_ids]
_query_slurm_jobs_status->>squeue: _query_squeue_for_jobs (active jobs)
alt squeue succeeds
squeue-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
_query_slurm_jobs_status-->>Caller: Return squeue results
else squeue returns no data
_query_slurm_jobs_status->>sacct: _query_sacct_for_jobs (historical jobs)
sacct-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
_query_slurm_jobs_status-->>Caller: Return sacct results
else combine partial results
_query_slurm_jobs_status->>sacct: Query remaining jobs in sacct
sacct-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
_query_slurm_jobs_status-->>Caller: Return merged {squeue + sacct}
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (2)
1086-1090: Potential false positive in dependency matching.The substring check
if known_job_id in dependencycould incorrectly match job IDs that are substrings of other job IDs. For example, ifknown_job_id="123"anddependency="afternotok:123456", this would incorrectly match.Consider using a more precise match:
for dep_job_id, dep_status, dependency in dependent_jobs: for known_job_id in slurm_job_ids: - if known_job_id in dependency and known_job_id not in squeue_statuses: + # Use word boundary matching to avoid substring false positives + import re + if re.search(rf'\b{re.escape(known_job_id)}\b', dependency) and known_job_id not in squeue_statuses: squeue_statuses[known_job_id] = dep_status, dep_job_id breakAlternatively, parse the dependency string properly (e.g., split on
:and,to extract exact job IDs).
1065-1091: Consider logging squeue failures for debugging.When squeue fails (returncode != 0), the function silently returns an empty dict and falls back to sacct. While this is correct behavior, logging a warning would help diagnose issues in production.
squeue_statuses = {} dependent_jobs = [] if completed_process.returncode == 0: squeue_output = completed_process.stdout.decode("utf-8") # ... parsing logic ... + else: + logger.warning( + "squeue query failed, falling back to sacct", + returncode=completed_process.returncode, + stderr=completed_process.stderr.decode("utf-8") if completed_process.stderr else "", + ) return squeue_statuses
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py(4 hunks)packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py (1)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (3)
_query_squeue_for_jobs(1022-1092)_query_slurm_jobs_status(982-1019)_query_sacct_for_jobs(1095-1142)
🔇 Additional comments (9)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (3)
391-394: LGTM - Correct tuple access for new status format.The code correctly accesses
[0]to extract the status string from the new(status, current_job_id)tuple format returned by_query_slurm_jobs_status.
982-1019: Well-designed hybrid approach for accurate job status.The implementation correctly:
- Queries
squeuefirst for active jobs (more accurate for running jobs)- Falls back to
sacctonly for jobs not found insqueue- Combines results with squeue data taking precedence
This addresses the PR objective of fixing jobs showing incorrect status during active execution.
1095-1142: LGTM - Clean refactor of sacct query logic.The function is properly refactored to return the new tuple format
(status, slurm_job_id)while maintaining the existing sacct parsing logic.packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py (6)
1258-1258: LGTM - Mock correctly updated for tuple format.The mock return value is properly updated to use the new
(status, job_id)tuple format.
1752-1777: Good test coverage for squeue parsing.The test correctly validates parsing of various job ID formats including regular jobs, array jobs (
123456790_0), and bracket notation (123456791[1-10]).
1779-1809: Test validates dependency resolution behavior.This test validates the current substring matching behavior for dependent jobs. Note that if the implementation is updated to use word boundary matching (as suggested in the executor review), this test will need to be updated accordingly.
1811-1854: Good coverage of hybrid squeue+sacct approach.The test properly validates that:
- Running jobs are fetched from squeue
- Completed jobs fall back to sacct
- Results are correctly combined
1692-1729: Test correctly updated for new hybrid implementation.The test properly mocks both squeue (returning empty) and sacct commands, validating the fallback behavior when squeue doesn't find active jobs.
1856-1881: LGTM - Good unit test for sacct helper.The test validates the new
_query_sacct_for_jobsfunction returns the correct tuple format for each job.
Fixes: Jobs showing incorrect status during active execution
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.