
Conversation

Contributor

@AdamRajfer AdamRajfer commented Nov 28, 2025

Fixes: Jobs showing incorrect status during active execution

  • Use squeue for real-time active job status (more reliable)
  • Fall back to sacct for completed/historical jobs
  • Add comprehensive tests for new squeue parsing logic

Summary by CodeRabbit

  • Improvements
    • Improved SLURM job status monitoring: statuses are now queried from two sources (squeue for active jobs, sacct for completed/historical jobs), giving more reliable tracking of job states and lifecycle information.


@AdamRajfer AdamRajfer requested review from a team as code owners November 28, 2025 18:58

copy-pr-bot bot commented Nov 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@AdamRajfer AdamRajfer enabled auto-merge (squash) December 1, 2025 12:37
@AdamRajfer
Contributor Author

/ok to test 4abd03c

Review comment thread on the SLURM executor (quoted docstring context):

    then falls back to sacct for completed/historical jobs that squeue doesn't show.
    Args:
        slurm_job_ids: List of SLURM job IDs to query.
Contributor


I suspect that most of the bugs come from the fact that we cannot reliably tell the SLURM job ID for a specific job. We try to read it from a file, but race conditions and manual restarts can cause the file to be out of sync with reality.

For the concrete case we discussed offline, will this fix the status?

Contributor


I don't think we should try to handle cases where a user does something manually, e.g. restarts the job.

Also, I think the file with job IDs is the closest thing to the truth that we can get. If we tried to get the information from all of the user's jobs, we'd open a new can of worms - most folks run different things, not only evaluations, and it's hard to predict what corner cases we'd hit.

@AdamRajfer AdamRajfer disabled auto-merge December 1, 2025 16:17
@gchlebus

@coderabbitai full review


coderabbitai bot commented Dec 15, 2025

✅ Actions performed

Full review triggered.


coderabbitai bot commented Dec 15, 2025

Walkthrough

The SLURM executor's status retrieval was refactored to return tuples of (status, job_id) instead of plain status strings. Two new helper functions separate the queries for active jobs (squeue) from historical jobs (sacct), with squeue preferred and sacct used as a fallback. Updated tests validate the new tuple format and the combined query behavior.
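
The flow described above can be illustrated with a minimal sketch (the function names follow the walkthrough, but the signatures, squeue/sacct flags, and parsing details are assumptions for illustration, not the project's actual implementation):

    import subprocess
    from typing import Dict, List, Tuple

    def _query_squeue_for_jobs(slurm_job_ids: List[str]) -> Dict[str, Tuple[str, str]]:
        """Query squeue for jobs the scheduler still knows about (pending/running)."""
        result = subprocess.run(
            ["squeue", "--jobs", ",".join(slurm_job_ids), "--noheader", "--format", "%i %T"],
            capture_output=True,
        )
        statuses: Dict[str, Tuple[str, str]] = {}
        if result.returncode == 0:
            for line in result.stdout.decode("utf-8").splitlines():
                parts = line.split()
                if len(parts) >= 2:
                    job_id, state = parts[0], parts[1]
                    statuses[job_id] = (state, job_id)
        return statuses

    def _query_sacct_for_jobs(slurm_job_ids: List[str]) -> Dict[str, Tuple[str, str]]:
        """Query sacct for jobs squeue no longer reports (completed/failed/historical)."""
        result = subprocess.run(
            ["sacct", "--jobs", ",".join(slurm_job_ids), "--noheader", "--parsable2",
             "--format", "JobID,State"],
            capture_output=True,
        )
        statuses: Dict[str, Tuple[str, str]] = {}
        if result.returncode == 0:
            for line in result.stdout.decode("utf-8").splitlines():
                job_id, _, state = line.partition("|")
                if job_id and "." not in job_id:  # skip job steps such as 123456.batch
                    statuses[job_id] = (state.strip(), job_id)
        return statuses

    def _query_slurm_jobs_status(slurm_job_ids: List[str]) -> Dict[str, Tuple[str, str]]:
        """Prefer squeue (real-time) and fall back to sacct for jobs squeue doesn't show."""
        statuses = _query_squeue_for_jobs(slurm_job_ids)
        missing = [jid for jid in slurm_job_ids if jid not in statuses]
        if missing:
            statuses.update(_query_sacct_for_jobs(missing))
        return statuses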

Changes

Cohort / File(s) Summary
SLURM Executor Status Refactoring
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py
Updated _query_slurm_jobs_status to return Dict[str, Tuple[str, str]] mapping job IDs to (status, current_job_id) tuples. Added new helper functions _query_squeue_for_jobs and _query_sacct_for_jobs to separately query active and historical SLURM jobs. Modified control flow to prefer squeue results with sacct fallback, combined into aggregated results. Updated docstrings and status accessor logic to handle tuple format.
SLURM Executor Test Updates
packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py
Updated existing test mocks and assertions to reflect tuple-based status returns (("STATE", "STATE_ID") instead of "STATE"). Added new test cases for _query_squeue_for_jobs and _query_sacct_for_jobs helpers including ID pattern and dependency handling. Introduced test_query_slurm_jobs_status_combined_approach to validate multi-source squeue/sacct workflow. Adjusted all status mapping expectations across affected tests.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant _query_slurm_jobs_status
    participant squeue
    participant sacct
    
    Caller->>_query_slurm_jobs_status: Query job statuses for [job_ids]
    _query_slurm_jobs_status->>squeue: _query_squeue_for_jobs (active jobs)
    alt squeue succeeds
        squeue-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
        _query_slurm_jobs_status-->>Caller: Return squeue results
    else squeue returns no data
        _query_slurm_jobs_status->>sacct: _query_sacct_for_jobs (historical jobs)
        sacct-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
        _query_slurm_jobs_status-->>Caller: Return sacct results
    else combine partial results
        _query_slurm_jobs_status->>sacct: Query remaining jobs in sacct
        sacct-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
        _query_slurm_jobs_status-->>Caller: Return merged {squeue + sacct}
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Status tuple format change requires verification that all call sites correctly index [0] for status retrieval and handle the new second element (current_job_id)
  • Dual-query logic in _query_slurm_jobs_status needs careful review of fallback and merge behavior between squeue and sacct
  • Test coverage across multiple test functions updating mock return values and assertions — ensure consistency in all cases
  • Helper function integration — verify _query_squeue_for_jobs and _query_sacct_for_jobs return types match expectations throughout the executor

Poem

🐰 Hop hop! Status now wears a hat—
Tuple of two where once was flat.
Query the queue, then archives deep,
Job IDs paired in tuples we keep!
A dual-dance of jobs, both swift and slow. 🎯

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main change: implementing a hybrid squeue+sacct approach for SLURM job status queries to fix status accuracy issues during active execution.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 94.74%, which meets the required threshold of 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (2)

1086-1090: Potential false positive in dependency matching.

The substring check if known_job_id in dependency could incorrectly match job IDs that are substrings of other job IDs. For example, if known_job_id="123" and dependency="afternotok:123456", this would incorrectly match.

Consider using a more precise match:

         for dep_job_id, dep_status, dependency in dependent_jobs:
             for known_job_id in slurm_job_ids:
-                if known_job_id in dependency and known_job_id not in squeue_statuses:
+                # Use word boundary matching to avoid substring false positives
+                import re
+                if re.search(rf'\b{re.escape(known_job_id)}\b', dependency) and known_job_id not in squeue_statuses:
                     squeue_statuses[known_job_id] = dep_status, dep_job_id
                     break

Alternatively, parse the dependency string properly (e.g., split on : and , to extract exact job IDs).
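
For reference, a minimal sketch of that alternative, assuming the squeue dependency field looks like afterok:123456 or afternotok:123456,singleton (the separator set and the helper name are assumptions for illustration):

    import re
    from typing import Set

    def extract_dependency_job_ids(dependency: str) -> Set[str]:
        """Split the dependency string on ':', ',', '(' , ')' and '?' and keep only numeric job IDs."""
        tokens = re.split(r"[:,()?]", dependency)
        return {tok for tok in tokens if tok.isdigit()}

    # Exact membership avoids the substring false positive described above:
    assert "123" not in extract_dependency_job_ids("afternotok:123456")
    assert "123456" in extract_dependency_job_ids("afternotok:123456")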


1065-1091: Consider logging squeue failures for debugging.

When squeue fails (returncode != 0), the function silently returns an empty dict and falls back to sacct. While this is correct behavior, logging a warning would help diagnose issues in production.

     squeue_statuses = {}
     dependent_jobs = []
     if completed_process.returncode == 0:
         squeue_output = completed_process.stdout.decode("utf-8")
         # ... parsing logic ...
+    else:
+        logger.warning(
+            "squeue query failed, falling back to sacct",
+            returncode=completed_process.returncode,
+            stderr=completed_process.stderr.decode("utf-8") if completed_process.stderr else "",
+        )

     return squeue_statuses
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dec21fc and e405cdb.

📒 Files selected for processing (2)
  • packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (4 hunks)
  • packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py (1)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (3)
  • _query_squeue_for_jobs (1022-1092)
  • _query_slurm_jobs_status (982-1019)
  • _query_sacct_for_jobs (1095-1142)
🔇 Additional comments (9)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (3)

391-394: LGTM - Correct tuple access for new status format.

The code correctly accesses [0] to extract the status string from the new (status, current_job_id) tuple format returned by _query_slurm_jobs_status.
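
Illustratively, a call site under the new format might look like this (variable names are assumed; only the indexing pattern reflects the change being reviewed):

    statuses = _query_slurm_jobs_status(["123456", "123457"])
    for job_id, (state, current_job_id) in statuses.items():
        print(job_id, state, current_job_id)  # state is element [0] of the tuple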


982-1019: Well-designed hybrid approach for accurate job status.

The implementation correctly:

  1. Queries squeue first for active jobs (more accurate for running jobs)
  2. Falls back to sacct only for jobs not found in squeue
  3. Combines results with squeue data taking precedence

This addresses the PR objective of fixing jobs showing incorrect status during active execution.


1095-1142: LGTM - Clean refactor of sacct query logic.

The function is properly refactored to return the new tuple format (status, slurm_job_id) while maintaining the existing sacct parsing logic.

packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py (6)

1258-1258: LGTM - Mock correctly updated for tuple format.

The mock return value is properly updated to use the new (status, job_id) tuple format.


1752-1777: Good test coverage for squeue parsing.

The test correctly validates parsing of various job ID formats including regular jobs, array jobs (123456790_0), and bracket notation (123456791[1-10]).
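
The normalization such parsing implies can be sketched with a small, hypothetical helper (not the project's code; it only assumes the formats listed above):

    def base_job_id(squeue_job_id: str) -> str:
        """Strip array-task suffixes so '123456790_0' and '123456791[1-10]' map to the base job ID."""
        return squeue_job_id.split("_", 1)[0].split("[", 1)[0]

    assert base_job_id("123456789") == "123456789"
    assert base_job_id("123456790_0") == "123456790"
    assert base_job_id("123456791[1-10]") == "123456791"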


1779-1809: Test validates dependency resolution behavior.

This test validates the current substring matching behavior for dependent jobs. Note that if the implementation is updated to use word boundary matching (as suggested in the executor review), this test will need to be updated accordingly.


1811-1854: Good coverage of hybrid squeue+sacct approach.

The test properly validates that:

  1. Running jobs are fetched from squeue
  2. Completed jobs fall back to sacct
  3. Results are correctly combined

1692-1729: Test correctly updated for new hybrid implementation.

The test properly mocks both squeue (returning empty) and sacct commands, validating the fallback behavior when squeue doesn't find active jobs.


1856-1881: LGTM - Good unit test for sacct helper.

The test validates the new _query_sacct_for_jobs function returns the correct tuple format for each job.

