Megatron-Bridge commit needs update to support GB300 [nemotron-h, deepseek_v3, qwen3, llama3.1] #38

@cyril-k

Describe the bug
The commit currently used for Megatron-Bridge, 4df8c97392bc9c3906739e658f203f6810008821, does not apply the correct NUMA binding on the GB300 platform: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/4df8c97392bc9c3906739e658f203f6810008821/scripts/performance/utils/executors.py#L124
This was fixed upstream in NVIDIA-NeMo/Megatron-Bridge@487bbaa.

With the current commit, all 4 training processes end up bound to NUMA0 instead of being split across NUMA nodes, which leads to a performance loss.
Affected workloads: https://github.com/search?q=repo%3ANVIDIA%2Fdgxc-benchmarking%204df8c97392bc9c3906739e658f203f6810008821&type=code
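
One way to confirm the incorrect binding described above is to have each task report which NUMA nodes its CPU affinity covers. The following is a minimal sketch, assuming a Linux host that exposes the standard `/sys/devices/system/node/node*/cpulist` files; the helper names (`parse_cpulist`, `nodes_for_cpus`, `current_numa_nodes`) are hypothetical and are not part of Megatron-Bridge or this repository.

```python
import glob
import os
import re


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist (a node without CPUs) yields no ids
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def nodes_for_cpus(cpus: set[int], node_cpus: dict[int, set[int]]) -> list[int]:
    """Return the NUMA node ids whose CPUs overlap the given CPU set."""
    return sorted(node for node, owned in node_cpus.items() if cpus & owned)


def current_numa_nodes() -> list[int]:
    """NUMA nodes covered by this process's CPU affinity (Linux only)."""
    node_cpus = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node_id = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as f:
            node_cpus[node_id] = parse_cpulist(f.read())
    return nodes_for_cpus(os.sched_getaffinity(0), node_cpus)
```

Printing `current_numa_nodes()` together with the Slurm `SLURM_LOCALID` environment variable from inside each task should show every task reporting node 0 if the behavior matches this report.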

Steps/Code to reproduce bug
Install the latest tag, v25.12.02, selecting gb300 as the GPU type. Then launch any of the following workloads on a GB300 Slurm cluster: nemotron-h, deepseek_v3, qwen3, llama3.1.

Expected behavior
Update to a later Megatron-Bridge commit that supports GB300, so that tasks 0 and 1 are bound to NUMA0 and tasks 2 and 3 are bound to NUMA1.
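
The expected mapping can be written down as a small function. This is only an illustrative sketch of the task-to-NUMA assignment stated above, assuming 4 tasks per node split evenly over 2 NUMA nodes; it is not the code from the upstream fix, and the names `numa_node_for_task` and `numactl_prefix` are hypothetical. The `numactl` options `--cpunodebind` and `--membind` are standard.

```python
def numa_node_for_task(local_rank: int, tasks_per_node: int = 4, numa_nodes: int = 2) -> int:
    """Map a node-local task id to a NUMA node, splitting tasks evenly.

    Defaults reflect the assumption in this issue: 4 tasks, 2 NUMA nodes.
    """
    if tasks_per_node % numa_nodes != 0:
        raise ValueError("tasks_per_node must be divisible by numa_nodes")
    if not 0 <= local_rank < tasks_per_node:
        raise ValueError("local_rank out of range")
    tasks_per_numa = tasks_per_node // numa_nodes
    return local_rank // tasks_per_numa


def numactl_prefix(local_rank: int) -> list[str]:
    """Build a numactl command prefix binding CPU and memory to one node."""
    node = numa_node_for_task(local_rank)
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"]


# Tasks 0 and 1 map to NUMA0, tasks 2 and 3 map to NUMA1.
mapping = {rank: numa_node_for_task(rank) for rank in range(4)}
```

With the defaults, `mapping` is `{0: 0, 1: 0, 2: 1, 3: 1}`, in contrast to the reported behavior where every task lands on node 0.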

Environment details (please complete the following information):

  • Environment location: Cloud (Nebius)

Labels: bug