Describe the bug
The currently pinned Megatron Bridge commit 4df8c97392bc9c3906739e658f203f6810008821 does not perform correct NUMA binding on the GB300 platform: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/4df8c97392bc9c3906739e658f203f6810008821/scripts/performance/utils/executors.py#L124
This was fixed here: NVIDIA-NeMo/Megatron-Bridge@487bbaa
The current behavior results in incorrect NUMA binding for the training processes (all four end up bound to NUMA0), which causes a performance loss.
Affected workloads: https://github.com/search?q=repo%3ANVIDIA%2Fdgxc-benchmarking%204df8c97392bc9c3906739e658f203f6810008821&type=code
Steps/Code to reproduce bug
Install the latest tag v25.12.02 and select the gb300 GPU. Launch any of the following workloads on a GB300 Slurm cluster: nemotron-h, deepseek_v3, qwen3, llama3.1.
Expected behavior
Use a later Megatron-Bridge commit with GB300 support so that tasks 0 and 1 are bound to NUMA0 and tasks 2 and 3 are bound to NUMA1.
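For reference, the expected mapping can be sketched as a simple rank-to-NUMA-node split. This is an illustrative sketch only (the function name and defaults of 4 tasks per node and 2 NUMA domains are assumptions, not Megatron-Bridge's actual code):

```python
def numa_node_for_task(local_rank: int, tasks_per_node: int = 4, numa_nodes: int = 2) -> int:
    """Map a local task rank to a NUMA node by splitting ranks evenly across nodes.

    Hypothetical helper mirroring the expected GB300 behavior described above:
    with 4 tasks and 2 NUMA domains, tasks 0-1 go to NUMA0 and tasks 2-3 to NUMA1.
    """
    tasks_per_numa = tasks_per_node // numa_nodes
    return local_rank // tasks_per_numa

# Expected GB300 binding for 4 local tasks:
print([numa_node_for_task(r) for r in range(4)])  # → [0, 0, 1, 1]
```

The buggy behavior corresponds to every rank mapping to node 0 instead of this even split.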
Environment details (please complete the following information):
- Environment location: Cloud (Nebius)