NVRxStragglerDetection outputs nans as GPU performance scores #2081

@OlegSudakov

Describe the bug

NVRxStragglerDetection reports NaN for every GPU performance score, both per rank and in the gpu_relative_perf summary. Tested on 2x4 GB200 nodes (8 ranks). Example from the logs:

INFO:megatron.bridge.NVRxStragglerDetection:
GPU individual performance:
  Rank=7 Node=<NODE_ID> Score=nan
  Rank=6 Node=<NODE_ID> Score=nan
  Rank=5 Node=<NODE_ID> Score=nan
  Rank=4 Node=<NODE_ID> Score=nan
  Rank=3 Node=<NODE_ID> Score=nan
  Rank=2 Node=<NODE_ID> Score=nan
  Rank=1 Node=<NODE_ID> Score=nan
  Rank=0 Node=<NODE_ID> Score=nan

INFO:megatron.bridge.NVRxStragglerDetection:gpu_relative_perf scores: {'gpu_relative_perf/min': nan, 'gpu_relative_perf/median': nan, 'gpu_relative_perf/max': nan}
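
For triage, the per-rank lines above can be scanned automatically. The following is a minimal sketch, assuming the log keeps the `Rank=… Node=… Score=…` layout shown in the excerpt; `find_non_finite_scores` is a hypothetical helper written for this report, not part of NVRx or Megatron Bridge.

```python
import math
import re

# Matches per-rank lines in the layout shown in the log excerpt above.
RANK_LINE = re.compile(r"Rank=(\d+)\s+Node=(\S+)\s+Score=(\S+)")


def find_non_finite_scores(log_text: str) -> list[int]:
    """Return the ranks whose reported score is NaN or infinite.

    Hypothetical helper for log triage; not an NVRx API.
    """
    bad_ranks = []
    for match in RANK_LINE.finditer(log_text):
        rank = int(match.group(1))
        score = float(match.group(3))  # float("nan") parses to NaN
        if not math.isfinite(score):
            bad_ranks.append(rank)
    return sorted(bad_ranks)
```

Run against the excerpt above, this would flag all eight ranks, which suggests the problem is not specific to a single GPU or node.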

Steps/Code to reproduce bug

The .py reproduction script and the .sh Slurm launch script are attached below. Tested in the nvcr.io/nvidia/nemo:25.11.nemotron_3_nano container.

Expected behavior

Straggler detection should output valid, finite GPU performance scores instead of NaN.
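
The expectation can be written as a simple check. This is a sketch only: it assumes the summary is available as a plain dict of floats with keys like those printed in the log, and `assert_scores_finite` is a hypothetical name, not an existing NVRx or Megatron Bridge function.

```python
import math


def assert_scores_finite(scores: dict[str, float]) -> None:
    """Raise ValueError if any score is NaN or infinite.

    Hypothetical check; the dict shape is assumed from the log output.
    """
    bad_keys = sorted(k for k, v in scores.items() if not math.isfinite(v))
    if bad_keys:
        raise ValueError(f"non-finite straggler scores: {bad_keys}")
```

With the values from the log (`min`, `median`, and `max` all NaN), this check would fail on all three keys.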

slurm_launch_gb200.sh

straggler_detection_pretrain.py

Labels: bug
