Open
Labels: bug (Something isn't working)
Description
Describe the bug
NVRxStragglerDetection reports NaN for every GPU score, both per rank and in the gpu_relative_perf summary. Tested on 2x4 GB200 nodes (8 ranks in total). Example from the logs:
INFO:megatron.bridge.NVRxStragglerDetection:
GPU individual performance:
Rank=7 Node=<NODE_ID> Score=nan
Rank=6 Node=<NODE_ID> Score=nan
Rank=5 Node=<NODE_ID> Score=nan
Rank=4 Node=<NODE_ID> Score=nan
Rank=3 Node=<NODE_ID> Score=nan
Rank=2 Node=<NODE_ID> Score=nan
Rank=1 Node=<NODE_ID> Score=nan
Rank=0 Node=<NODE_ID> Score=nan
INFO:megatron.bridge.NVRxStragglerDetection:gpu_relative_perf scores: {'gpu_relative_perf/min': nan, 'gpu_relative_perf/median': nan, 'gpu_relative_perf/max': nan}
Steps/Code to reproduce bug
The .py script and the .sh Slurm launch script are attached. Tested in the nvcr.io/nvidia/nemo:25.11.nemotron_3_nano container.
Expected behavior
Straggler detection outputs valid, finite GPU performance scores for every rank instead of NaN.
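For anyone triaging this, a minimal sketch of how the failure condition could be checked programmatically. It only inspects a score dictionary shaped like the one in the log excerpt above; `nan_score_keys` is a hypothetical helper written for this report, not part of the NVRx or Megatron Bridge API.

```python
import math


def nan_score_keys(scores):
    """Return the sorted keys whose value is NaN.

    Hypothetical helper for illustration only; not an NVRx function.
    """
    return sorted(key for key, value in scores.items() if math.isnan(value))


# Values copied from the gpu_relative_perf line in the log excerpt.
reported = {
    "gpu_relative_perf/min": float("nan"),
    "gpu_relative_perf/median": float("nan"),
    "gpu_relative_perf/max": float("nan"),
}

bad_keys = nan_score_keys(reported)
if bad_keys:
    print("NaN scores found for:", ", ".join(bad_keys))
```

With the values from this issue, all three summary keys are flagged. A healthy run should produce an empty list.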
straggler_detection_pretrain.py