fix(multinode-slurm): fix visibility of job component URLs#625
fix(multinode-slurm): fix visibility of job component URLs#625fgalko-oss merged 2 commits intomainfrom
Conversation
Signed-off-by: Anna Warno <awarno@nvidia.com>
Signed-off-by: Anna Warno <awarno@nvidia.com>
QA RCCA AnalysisDate: 2026-04-20 1. Fix ReferenceThis issue uses the conventional commit format
No separate fix PR number was referenced. 2. Root CauseIn multi-node SLURM deployments, the launcher started the client, proxy, and exporter processes across all allocated nodes without pinning them to a specific node rank. When these processes started on worker nodes (rank > 0), their port bindings were on worker-node network interfaces — addresses that were typically not accessible from the head node where the user queries job status and retrieves endpoint URLs. The result was that 3. Trigger ConfigTrigger conditions (AND — all must be present):
Deterministic? Non-deterministic before the fix — whether components started on node 0 or a worker depended on SLURM's process placement. After the fix (pinned to node 0), it is deterministic. 4. Nature of BugPrimary classification: Functional correctness bug — multi-node deployments produced inaccessible URLs, rendering the deployment effectively broken for multi-node setups Impact scope: All users running NOT affected: Single-node SLURM evaluations, local executor evaluations, Ray-based deployments. 5. Functional Test CoverageVerdict: PARTIAL
6. Gaps and LimitationsGap 1 — No live multi-node test (out of scope — hardware constraint): Gap 2 — Source inspection is heuristic: Overall gap assessment: Medium regression risk. The structural tests guard against code regressions that remove node-rank handling, but live multi-node routing validation requires SLURM hardware. 7. New Test Added
8. ConclusionIssue #625 was caused by job components (client, proxy, exporter) not being pinned to node 0 in multi-node SLURM deployments, resulting in URLs bound to worker-node interfaces that were invisible from the head node; a secondary bug allowed proxy failures to cause silent hangs. The fix pinned all components to node 0 and added proxy failure termination. A new five-function structural regression test has been added to guard the node rank awareness and URL construction code paths; live multi-node validation remains out of scope for automated CI. Auto-generated by the issues-rca skill — QA RCCA Analysis v2 |
Uh oh!
There was an error while loading. Please reload this page.