
Conversation

Contributor

@AdamRajfer AdamRajfer commented Nov 28, 2025

Fixes: Jobs showing incorrect status during active execution

  • Use squeue for real-time active job status (more reliable)
  • Fall back to sacct for completed/historical jobs
  • Add comprehensive tests for new squeue parsing logic

Summary by CodeRabbit

  • Improvements
    • Improved SLURM job status monitoring: statuses are now queried from two sources (squeue for active jobs, sacct for completed/historical jobs), giving more reliable tracking of job states and lifecycle information.


@AdamRajfer AdamRajfer requested review from a team as code owners November 28, 2025 18:58

copy-pr-bot bot commented Nov 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@AdamRajfer AdamRajfer enabled auto-merge (squash) December 1, 2025 12:37
@AdamRajfer
Contributor Author

/ok to test 4abd03c

Review comment thread on the SLURM executor (quoted docstring context):

    then falls back to sacct for completed/historical jobs that squeue doesn't show.
    Args:
        slurm_job_ids: List of SLURM job IDs to query.
Contributor


I suspect that most of the bugs come from the fact that we cannot reliably tell the SLURM job ID for a specific job. We try to read it from a file, but race conditions and manual restarts can cause the file to be out of sync with reality.

For the concrete case we discussed offline, will this fix the status?

Contributor


I don't think we should try to handle cases where a user does something manually, e.g. restarts the job.

Also, I think the file with job IDs is the closest thing to the truth that we can get. If we tried to get the information from all of the user's jobs, we'd open a new can of worms - most folks run different things, not only evaluations, and it's hard to predict what corner cases we'd hit.

@AdamRajfer AdamRajfer disabled auto-merge December 1, 2025 16:17
@gchlebus

@coderabbitai full review


coderabbitai bot commented Dec 15, 2025

✅ Actions performed

Full review triggered.


coderabbitai bot commented Dec 15, 2025

Walkthrough

The SLURM executor's status retrieval was refactored to return tuples of (status, job_id) instead of plain status strings. Two new helper functions separate the queries for active jobs (squeue) from historical jobs (sacct), with squeue preferred and sacct used as a fallback. Updated tests validate the new tuple format and the combined query behavior.
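
The flow described above can be illustrated with a minimal sketch (the function names follow the walkthrough, but the signatures, squeue/sacct flags, and parsing details are assumptions for illustration, not the project's actual implementation):

    import subprocess
    from typing import Dict, List, Tuple

    def _query_squeue_for_jobs(slurm_job_ids: List[str]) -> Dict[str, Tuple[str, str]]:
        """Query squeue for jobs the scheduler still knows about (pending/running)."""
        result = subprocess.run(
            ["squeue", "--jobs", ",".join(slurm_job_ids), "--noheader", "--format", "%i %T"],
            capture_output=True,
        )
        statuses: Dict[str, Tuple[str, str]] = {}
        if result.returncode == 0:
            for line in result.stdout.decode("utf-8").splitlines():
                parts = line.split()
                if len(parts) >= 2:
                    job_id, state = parts[0], parts[1]
                    statuses[job_id] = (state, job_id)
        return statuses

    def _query_sacct_for_jobs(slurm_job_ids: List[str]) -> Dict[str, Tuple[str, str]]:
        """Query sacct for jobs squeue no longer reports (completed/failed/historical)."""
        result = subprocess.run(
            ["sacct", "--jobs", ",".join(slurm_job_ids), "--noheader", "--parsable2",
             "--format", "JobID,State"],
            capture_output=True,
        )
        statuses: Dict[str, Tuple[str, str]] = {}
        if result.returncode == 0:
            for line in result.stdout.decode("utf-8").splitlines():
                job_id, _, state = line.partition("|")
                if job_id and "." not in job_id:  # skip job steps such as 123456.batch
                    statuses[job_id] = (state.strip(), job_id)
        return statuses

    def _query_slurm_jobs_status(slurm_job_ids: List[str]) -> Dict[str, Tuple[str, str]]:
        """Prefer squeue (real-time) and fall back to sacct for jobs squeue doesn't show."""
        statuses = _query_squeue_for_jobs(slurm_job_ids)
        missing = [jid for jid in slurm_job_ids if jid not in statuses]
        if missing:
            statuses.update(_query_sacct_for_jobs(missing))
        return statuses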

Changes

Cohort / File(s) Summary
SLURM Executor Status Refactoring
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py
Updated _query_slurm_jobs_status to return Dict[str, Tuple[str, str]] mapping job IDs to (status, current_job_id) tuples. Added new helper functions _query_squeue_for_jobs and _query_sacct_for_jobs to separately query active and historical SLURM jobs. Modified control flow to prefer squeue results with sacct fallback, combined into aggregated results. Updated docstrings and status accessor logic to handle tuple format.
SLURM Executor Test Updates
packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py
Updated existing test mocks and assertions to reflect tuple-based status returns (("STATE", "STATE_ID") instead of "STATE"). Added new test cases for _query_squeue_for_jobs and _query_sacct_for_jobs helpers including ID pattern and dependency handling. Introduced test_query_slurm_jobs_status_combined_approach to validate multi-source squeue/sacct workflow. Adjusted all status mapping expectations across affected tests.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant _query_slurm_jobs_status
    participant squeue
    participant sacct
    
    Caller->>_query_slurm_jobs_status: Query job statuses for [job_ids]
    _query_slurm_jobs_status->>squeue: _query_squeue_for_jobs (active jobs)
    alt squeue succeeds
        squeue-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
        _query_slurm_jobs_status-->>Caller: Return squeue results
    else squeue returns no data
        _query_slurm_jobs_status->>sacct: _query_sacct_for_jobs (historical jobs)
        sacct-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
        _query_slurm_jobs_status-->>Caller: Return sacct results
    else combine partial results
        _query_slurm_jobs_status->>sacct: Query remaining jobs in sacct
        sacct-->>_query_slurm_jobs_status: {job_id: (status, current_id)}
        _query_slurm_jobs_status-->>Caller: Return merged {squeue + sacct}
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Status tuple format change requires verification that all call sites correctly index [0] for status retrieval and handle the new second element (current_job_id)
  • Dual-query logic in _query_slurm_jobs_status needs careful review of fallback and merge behavior between squeue and sacct
  • Test coverage across multiple test functions updating mock return values and assertions — ensure consistency in all cases
  • Helper function integration — verify _query_squeue_for_jobs and _query_sacct_for_jobs return types match expectations throughout the executor

Poem

🐰 Hop hop! Status now wears a hat—
Tuple of two where once was flat.
Query the queue, then archives deep,
Job IDs paired in tuples we keep!
A dual-dance of jobs, both swift and slow. 🎯

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main change: implementing a hybrid squeue+sacct approach for SLURM job status queries to fix status accuracy issues during active execution.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 94.74%, which meets the required threshold of 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (2)

1086-1090: Potential false positive in dependency matching.

The substring check if known_job_id in dependency could incorrectly match job IDs that are substrings of other job IDs. For example, if known_job_id="123" and dependency="afternotok:123456", this would incorrectly match.

Consider using a more precise match:

         for dep_job_id, dep_status, dependency in dependent_jobs:
             for known_job_id in slurm_job_ids:
-                if known_job_id in dependency and known_job_id not in squeue_statuses:
+                # Use word boundary matching to avoid substring false positives
+                import re
+                if re.search(rf'\b{re.escape(known_job_id)}\b', dependency) and known_job_id not in squeue_statuses:
                     squeue_statuses[known_job_id] = dep_status, dep_job_id
                     break

Alternatively, parse the dependency string properly (e.g., split on : and , to extract exact job IDs).
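
For reference, a minimal sketch of that alternative, assuming the squeue dependency field looks like afterok:123456 or afternotok:123456,singleton (the separator set and the helper name are assumptions for illustration):

    import re
    from typing import Set

    def extract_dependency_job_ids(dependency: str) -> Set[str]:
        """Split the dependency string on ':', ',', '(' , ')' and '?' and keep only numeric job IDs."""
        tokens = re.split(r"[:,()?]", dependency)
        return {tok for tok in tokens if tok.isdigit()}

    # Exact membership avoids the substring false positive described above:
    assert "123" not in extract_dependency_job_ids("afternotok:123456")
    assert "123456" in extract_dependency_job_ids("afternotok:123456")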


1065-1091: Consider logging squeue failures for debugging.

When squeue fails (returncode != 0), the function silently returns an empty dict and falls back to sacct. While this is correct behavior, logging a warning would help diagnose issues in production.

     squeue_statuses = {}
     dependent_jobs = []
     if completed_process.returncode == 0:
         squeue_output = completed_process.stdout.decode("utf-8")
         # ... parsing logic ...
+    else:
+        logger.warning(
+            "squeue query failed, falling back to sacct",
+            returncode=completed_process.returncode,
+            stderr=completed_process.stderr.decode("utf-8") if completed_process.stderr else "",
+        )

     return squeue_statuses
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dec21fc and e405cdb.

📒 Files selected for processing (2)
  • packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (4 hunks)
  • packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py (1)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (3)
  • _query_squeue_for_jobs (1022-1092)
  • _query_slurm_jobs_status (982-1019)
  • _query_sacct_for_jobs (1095-1142)
🔇 Additional comments (9)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (3)

391-394: LGTM - Correct tuple access for new status format.

The code correctly accesses [0] to extract the status string from the new (status, current_job_id) tuple format returned by _query_slurm_jobs_status.
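
Illustratively, a call site under the new format might look like this (variable names are assumed; only the indexing pattern reflects the change being reviewed):

    statuses = _query_slurm_jobs_status(["123456", "123457"])
    for job_id, (state, current_job_id) in statuses.items():
        print(job_id, state, current_job_id)  # state is element [0] of the tuple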


982-1019: Well-designed hybrid approach for accurate job status.

The implementation correctly:

  1. Queries squeue first for active jobs (more accurate for running jobs)
  2. Falls back to sacct only for jobs not found in squeue
  3. Combines results with squeue data taking precedence

This addresses the PR objective of fixing jobs showing incorrect status during active execution.


1095-1142: LGTM - Clean refactor of sacct query logic.

The function is properly refactored to return the new tuple format (status, slurm_job_id) while maintaining the existing sacct parsing logic.

packages/nemo-evaluator-launcher/tests/unit_tests/test_slurm_executor.py (6)

1258-1258: LGTM - Mock correctly updated for tuple format.

The mock return value is properly updated to use the new (status, job_id) tuple format.


1752-1777: Good test coverage for squeue parsing.

The test correctly validates parsing of various job ID formats including regular jobs, array jobs (123456790_0), and bracket notation (123456791[1-10]).
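
The normalization such parsing implies can be sketched with a small, hypothetical helper (not the project's code; it only assumes the formats listed above):

    def base_job_id(squeue_job_id: str) -> str:
        """Strip array-task suffixes so '123456790_0' and '123456791[1-10]' map to the base job ID."""
        return squeue_job_id.split("_", 1)[0].split("[", 1)[0]

    assert base_job_id("123456789") == "123456789"
    assert base_job_id("123456790_0") == "123456790"
    assert base_job_id("123456791[1-10]") == "123456791"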


1779-1809: Test validates dependency resolution behavior.

This test validates the current substring matching behavior for dependent jobs. Note that if the implementation is updated to use word boundary matching (as suggested in the executor review), this test will need to be updated accordingly.


1811-1854: Good coverage of hybrid squeue+sacct approach.

The test properly validates that:

  1. Running jobs are fetched from squeue
  2. Completed jobs fall back to sacct
  3. Results are correctly combined

1692-1729: Test correctly updated for new hybrid implementation.

The test properly mocks both squeue (returning empty) and sacct commands, validating the fallback behavior when squeue doesn't find active jobs.


1856-1881: LGTM - Good unit test for sacct helper.

The test validates the new _query_sacct_for_jobs function returns the correct tuple format for each job.

