fix(multinode-slurm): fix visibility of job component URLs#625

Merged
fgalko-oss merged 2 commits into main from awarno/multinode-fix
Jan 12, 2026

@AWarno
Contributor

@AWarno AWarno commented Jan 12, 2026

  1. Ensure execution of the client, proxy, and exporter on node 0.
  2. Terminate the process on proxy failure.

Signed-off-by: Anna Warno <awarno@nvidia.com>
@AWarno AWarno self-assigned this Jan 12, 2026
@AWarno AWarno requested a review from a team as a code owner January 12, 2026 17:21
@AWarno AWarno added the bug Something isn't working label Jan 12, 2026
@AWarno AWarno requested a review from a team as a code owner January 12, 2026 17:21
Signed-off-by: Anna Warno <awarno@nvidia.com>
@fgalko-oss fgalko-oss merged commit 17b2cd1 into main Jan 12, 2026
49 checks passed
@fgalko-oss fgalko-oss deleted the awarno/multinode-fix branch January 12, 2026 19:41
@pruprakash

QA RCCA Analysis

Date: 2026-04-20
Analyst: AI QA Agent (issues-rca skill)
Issue: #625 — fix(multinode-slurm): fix visibility of job component URLs


1. Fix Reference

#625 uses the conventional commit prefix fix(multinode-slurm):, which indicates that it is itself the tracked fix item. It is a pull request that was merged into main on 2026-01-12 (commit 17b2cd1). Two changes were documented in the body:

  1. Ensure execution of client, proxy, and exporter on node 0 — pinning these components to the head node so their URLs are accessible to users.
  2. Terminate the process on proxy failure — preventing silent hangs when the proxy crashes mid-evaluation.

No separate fix PR number was referenced.


2. Root Cause

In multi-node SLURM deployments, the launcher started the client, proxy, and exporter processes across all allocated nodes without pinning them to a specific node rank. When these processes started on worker nodes (rank > 0), their port bindings were on worker-node network interfaces — addresses that were typically not accessible from the head node where the user queries job status and retrieves endpoint URLs. The result was that nemo-evaluator-launcher ls or job status queries returned URLs bound to worker nodes that were unreachable, making the deployment appear broken. The secondary bug was that a proxy failure did not propagate — the main process continued running, causing an infinite silent hang.
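The secondary bug (a proxy failure that does not propagate) can be illustrated with a minimal Python sketch. It is not the launcher's actual implementation, which this analysis does not show; `run_with_proxy` and its parameters are hypothetical names chosen for illustration. The idea is to poll the proxy while the main process runs and to terminate the main process as soon as the proxy exits.

```python
import subprocess


def run_with_proxy(proxy_cmd, main_cmd, poll_interval=0.1):
    """Run main_cmd alongside proxy_cmd (hypothetical helper, illustration only).

    Returns the main process's exit code if it finishes first, or a
    non-zero code if the proxy exits while the main process is still running.
    """
    proxy = subprocess.Popen(proxy_cmd)
    main = subprocess.Popen(main_cmd)
    try:
        while True:
            try:
                # Wait briefly for the main process to finish on its own.
                return main.wait(timeout=poll_interval)
            except subprocess.TimeoutExpired:
                pass
            if proxy.poll() is not None:
                # The proxy died first: stop the main process instead of
                # letting it hang, and report a failure to the caller.
                main.terminate()
                main.wait()
                return proxy.returncode or 1
    finally:
        # Never leave either child process behind.
        for proc in (proxy, main):
            if proc.poll() is None:
                proc.terminate()
                proc.wait()
```

With this pattern a crashed proxy produces a non-zero exit code that the SLURM job can surface, rather than a job that keeps running without making progress.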


3. Trigger Config

Trigger conditions (AND — all must be present):

  • Multi-node SLURM deployment (≥ 2 nodes allocated)
  • nemo-evaluator-launcher starts the evaluation job across nodes
  • Client, proxy, or exporter component starts on a worker node (rank > 0) rather than node 0
  • User queries job URLs from the head node

Deterministic? Non-deterministic before the fix — whether components started on node 0 or a worker depended on SLURM's process placement. After the fix (pinned to node 0), it is deterministic.
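A minimal sketch of what pinning to node 0 could look like, assuming the node rank is read from the Slurm-provided `SLURM_NODEID` environment variable. The names `node_rank`, `components_to_start`, and `AUXILIARY_COMPONENTS` are hypothetical and are not taken from the launcher source.

```python
import os

# Components that the fix pins to node 0, as listed in the PR description.
AUXILIARY_COMPONENTS = ("client", "proxy", "exporter")


def node_rank(env=None):
    """Return this node's relative ID in the allocation (0 is the first node)."""
    env = os.environ if env is None else env
    # Assumption: outside of Slurm the variable is unset, so treat it as node 0.
    return int(env.get("SLURM_NODEID", "0"))


def components_to_start(env=None):
    """Return the auxiliary components this node should start (hypothetical policy)."""
    return AUXILIARY_COMPONENTS if node_rank(env) == 0 else ()
```

Because placement now depends only on the node rank, every run starts the three components on the same node, which is what makes the post-fix behavior deterministic.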


4. Nature of Bug

Primary classification: Functional correctness bug. Multi-node deployments produced URLs that could not be reached from the head node, so the deployment was effectively unusable.

Impact scope: All users running nemo-evaluator-launcher on ≥ 2 SLURM nodes.

NOT affected: Single-node SLURM evaluations, local executor evaluations, Ray-based deployments.


5. Functional Test Coverage

Verdict: PARTIAL

| Test | File | Key Config | What it covers |
|---|---|---|---|
| test_slurm_executor_importable | evaluator/testcases/rcca/launcher/test_multinode_slurm_url_visibility.py | No SLURM needed | SLURM executor module is present and importable |
| test_slurm_executor_has_node_rank_awareness | same | No SLURM needed | Source inspection: executor references node rank / SLURM_NODEID / node 0 concepts |
| test_proxy_component_has_failure_handling | same | No SLURM needed | Source inspection: proxy or executor source contains process termination logic |
| test_job_component_url_references_node_address | same | No SLURM needed | Source scan: URL construction uses node address (not hardcoded localhost) |
| test_slurm_executor_exposes_job_url_method | same | No SLURM needed | SlurmExecutor exposes get_status or URL retrieval method |

6. Gaps and Limitations

Gap 1 — No live multi-node test (out of scope — hardware constraint):
Verifying that components actually start on node 0 and URLs are accessible from the head node requires ≥ 2 physical SLURM nodes, allocated GPUs, and a real evaluation job. This is fully outside CI scope.

Gap 2 — Source inspection is heuristic:
The node rank awareness test checks for string indicators in the source. If the fix uses a different naming convention (e.g., SLURM_GTIDS or a custom rank variable), the test may fail even though node-rank handling is present, which is a false alarm rather than a real regression.
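The heuristic and its blind spot can be sketched as follows. This is an illustration only: the indicator list and the function name `find_node_rank_indicators` are assumptions, not the contents of the actual test file.

```python
# Hypothetical indicator strings; the real test may look for different ones.
NODE_RANK_INDICATORS = ("SLURM_NODEID", "SLURM_PROCID", "node_rank", "node 0")


def find_node_rank_indicators(source, indicators=NODE_RANK_INDICATORS):
    """Return the indicators that appear in the source text (case-insensitive)."""
    lowered = source.lower()
    return [name for name in indicators if name.lower() in lowered]


# Source that gates on SLURM_NODEID is detected.
detected = find_node_rank_indicators('if [ "$SLURM_NODEID" -eq 0 ]; then start_proxy; fi')

# Source that derives the rank from a different variable is missed, so a
# test built on this scan would fail even though rank handling exists.
missed = find_node_rank_indicators('rank=${SLURM_GTIDS%%,*}; [ "$rank" = 0 ] && start_proxy')
```

Here `detected` contains `SLURM_NODEID` while `missed` is empty, which shows why a string scan is only a coarse guard.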

Overall gap assessment: Medium regression risk. The structural tests guard against code regressions that remove node-rank handling, but live multi-node routing validation requires SLURM hardware.


7. New Test Added

| Field | Value |
|---|---|
| Test file | evaluator/testcases/rcca/launcher/test_multinode_slurm_url_visibility.py |
| Test functions | test_slurm_executor_importable, test_slurm_executor_has_node_rank_awareness, test_proxy_component_has_failure_handling, test_job_component_url_references_node_address, test_slurm_executor_exposes_job_url_method |
| QA repo | nmfw_tests (local: evaluator/testcases/rcca/launcher/) |
| PR | Pending |
| What it validates | SLURM executor has node rank awareness; URL construction uses node address; proxy failure handling present |
| How it would catch a regression | If node rank handling is removed from the executor source, test_slurm_executor_has_node_rank_awareness fails. If URL construction reverts to localhost, test_job_component_url_references_node_address fails. |

8. Conclusion

Issue #625 was caused by job components (client, proxy, exporter) not being pinned to node 0 in multi-node SLURM deployments, resulting in URLs bound to worker-node interfaces that were invisible from the head node; a secondary bug allowed proxy failures to cause silent hangs. The fix pinned all components to node 0 and added proxy failure termination. A new five-function structural regression test has been added to guard the node rank awareness and URL construction code paths; live multi-node validation remains out of scope for automated CI.


Auto-generated by the issues-rca skill — QA RCCA Analysis v2
