fix(training): run K8s trainer from data-mover-synced code (r0.5.0 backport) by ko3n1g · Pull Request #4320 · NVIDIA-NeMo/Megatron-Bridge

ko3n1g · 2026-06-12T07:14:14Z

Claude summary

Background / motivation

Backport of fix(training): run K8s trainer from data-mover-synced code #4319 onto the r0.5.0 release line, to validate the K8s data-mover code-sync path for v0.5.0 (mbridge-v0.5.0-debug1).
On Kubeflow, the trainer ran the image's /opt/Megatron-Bridge and never the --mbridge-ref checkout synced onto the workdir PVC: the entrypoint was hardcoded to the image path, and the run.Script PYTHONPATH does not reach the trainer container.

What changed

Cherry-pick of the launcher fix (scripts/performance/setup_experiment.py, utils/executors.py): when kubeflow_workdir_local_path is set, point the entrypoint at /nemo_run/scripts/performance/run_recipe.py and front-load /nemo_run/src on PYTHONPATH via custom_env_vars. When it is unset, behavior is unchanged (the image code runs).
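A minimal sketch of the selection logic described above, not the code from this PR. The function name `resolve_trainer_launch` is hypothetical, and the exact image-side script path is an assumption (the description only says the entrypoint was hardcoded under /opt/Megatron-Bridge).

```python
from typing import Dict, Optional, Tuple

# Assumption: the image-side default mirrors the synced layout under /opt/Megatron-Bridge.
IMAGE_ENTRYPOINT = "/opt/Megatron-Bridge/scripts/performance/run_recipe.py"
# Paths named in the PR description.
SYNCED_ENTRYPOINT = "/nemo_run/scripts/performance/run_recipe.py"
SYNCED_SRC = "/nemo_run/src"


def resolve_trainer_launch(
    kubeflow_workdir_local_path: Optional[str],
    custom_env_vars: Optional[Dict[str, str]] = None,
) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: pick the entrypoint and env for the trainer container."""
    env = dict(custom_env_vars or {})  # copy so the caller's dict is not mutated
    if not kubeflow_workdir_local_path:
        # No synced workdir: keep running the code baked into the image.
        return IMAGE_ENTRYPOINT, env
    # Front-load the synced source so it wins over the image copy.
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = SYNCED_SRC + (":" + existing if existing else "")
    return SYNCED_ENTRYPOINT, env
```

Passing the path through the environment dict, rather than relying on the launcher script's own environment, reflects the point that the run.Script PYTHONPATH does not reach the trainer container.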

Details

Pairs with nemo-ci !2522 (stages src/+scripts/, sets KUBEFLOW_WORKDIR_LOCAL_PATH).
Mainline PR: fix(training): run K8s trainer from data-mover-synced code #4319.
Cherry-pick applied cleanly onto origin/r0.5.0 (-x recorded).

Tested

Validation run launched on nemo-ci branch ko3n1g-ci-mbridge-k8s-code-sync with --test-image nvcr.io/nvidia/nemo:26.04, --mbridge-ref ko3n1g/fix/k8s-code-sync-r0.5.0, --mcore-ref core_r0.18.0, test-case moonlight_16b_64gpu_gb200_release.

On Kubeflow the trainer entrypoint was hardcoded to the image's /opt/Megatron-Bridge, so a --mbridge-ref checkout copied onto the workdir PVC by nemo-run's data-mover pod was never executed: the launch.sh symlink (/nemo_run -> code_dir) was built but the torchrun command still pointed at the image. When kubeflow_workdir_local_path is set, point the entrypoint at /nemo_run/scripts/performance/run_recipe.py and front-load /nemo_run/src on PYTHONPATH (via custom_env_vars, since the run.Script env is not propagated to the trainer container) so megatron.bridge resolves to the synced source while megatron.core et al. still come from the image. When it is unset, behavior is unchanged (image code). Also correct the executors.py comment: KubeflowExecutor.package() ignores the packager and never overlays /opt/Megatron-Bridge. Signed-off-by: oliver könig <okoenig@nvidia.com> (cherry picked from commit 028da1d)

copy-pr-bot · 2026-06-12T07:14:18Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Add /nemo_run/3rdparty/Megatron-LM to the trainer PYTHONPATH so megatron.core resolves to the synced --mcore-ref checkout (version-matched with the synced megatron.bridge), not the image's mismatched mcore. Signed-off-by: oliver könig <okoenig@nvidia.com>
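The effect of search-path ordering can be illustrated generically. The sketch below uses a throwaway package named `demo_core` in two temporary directories as stand-ins for the image copy and the synced checkout; it is not the actual megatron layout, and both helper names are hypothetical.

```python
import importlib.machinery
import os
import tempfile
from typing import List, Optional


def make_package(root: str, name: str) -> str:
    """Create an empty package called `name` under `root` and return `root`."""
    os.makedirs(os.path.join(root, name))
    with open(os.path.join(root, name, "__init__.py"), "w") as handle:
        handle.write("")
    return root


def first_match(name: str, search_path: List[str]) -> Optional[str]:
    """Return the file a top-level import of `name` would load from `search_path`."""
    spec = importlib.machinery.PathFinder.find_spec(name, search_path)
    return None if spec is None else spec.origin


# Stand-ins for the image copy and the synced checkout (both hypothetical).
image_root = make_package(tempfile.mkdtemp(prefix="image_"), "demo_core")
synced_root = make_package(tempfile.mkdtemp(prefix="synced_"), "demo_core")

# Whichever directory comes first on the search path supplies the package.
print(first_match("demo_core", [synced_root, image_root]))
```

Running the same kind of check inside the trainer container (for example, printing the `__file__` of the imported package) is one way to confirm which checkout is actually in use.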

compile_helpers() ran on each node's local rank 0 AFTER init_process_group but before the first NCCL collective (barrier). The first-time C++ build is slow, so on K8s, where the NCCL bootstrap connect budget is short, peers fail the barrier with 'connection refused' / 'remote process exited' while a rank is mid-compile. (Slurm's rendezvous tolerated it; K8s does not. K8s historically used images with the helper prebuilt, so on-the-fly compile was never exercised there.) Move the compile before torch.distributed init so the rendezvous (long, tolerant store-connect retries) absorbs the delay; a prebuilt .so stays a fast no-op. Per-node local-rank-0 build is unchanged. Signed-off-by: oliver könig <okoenig@nvidia.com>
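The reordering can be sketched as follows. This is an illustration of the ordering only: the three callables are injected placeholders rather than the real `compile_helpers`, `torch.distributed.init_process_group`, and barrier, and the function name `startup` is hypothetical. `LOCAL_RANK` is the per-node rank variable set by torchrun.

```python
import os
from typing import Callable, Optional


def startup(
    compile_helpers: Callable[[], None],
    init_process_group: Callable[[], None],
    barrier: Callable[[], None],
    local_rank: Optional[int] = None,
) -> None:
    """Hypothetical startup order: build first, then rendezvous, then first collective."""
    if local_rank is None:
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank == 0:
        # Slow on a first build, a fast no-op when the .so already exists.
        compile_helpers()
    # Peers that arrive early wait here, inside the rendezvous,
    # instead of inside an NCCL collective.
    init_process_group()
    barrier()
```

With this order, a slow build delays only the point at which a rank joins the rendezvous; by the time the barrier runs, no rank is still compiling.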

When the megatron/core datasets dir is on a filesystem shared across all trainer nodes (e.g. a K8s PVC), a per-node compile has many ranks run g++/ld against the same .so concurrently, which over NFS fails with 'ld: final link failed: Stale file handle' — killing a rank and cascading into a NCCL barrier failure. flock can't fix it here: the NFS has no working lock daemon, so flock(LOCK_EX) hangs. Instead global rank 0 builds the .so once before distributed init (single writer, no race; pre-init so the tolerant rendezvous absorbs the build, not a NCCL collective). The per-node compile after init is retained for disconnected per-node filesystems (e.g. Slurm/Lustre): a no-op on a shared FS, an independent build per node otherwise. A prebuilt .so stays a fast no-op. Signed-off-by: oliver könig <okoenig@nvidia.com>
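A toy simulation of the two-stage build described above, assuming a set of file names stands in for each filesystem. All names here (`SO_NAME`, `build_if_missing`, `startup`) are hypothetical, and ranks are run sequentially; in a real job the other ranks block in the rendezvous until global rank 0 joins, which is what makes the pre-init build safe.

```python
from typing import List, Set

SO_NAME = "helpers.so"  # hypothetical artifact name


def build_if_missing(fs: Set[str], events: List[str], who: str) -> None:
    """Build the shared object unless this filesystem already has it."""
    if SO_NAME in fs:
        return  # prebuilt or already built: fast no-op
    fs.add(SO_NAME)
    events.append("build:" + who)


def startup(rank: int, local_rank: int, fs: Set[str], events: List[str]) -> None:
    # Stage 1: single writer before distributed init.
    if rank == 0:
        build_if_missing(fs, events, "global-rank-0")
    events.append("init_process_group")  # placeholder for the rendezvous
    # Stage 2: per-node build, needed only when nodes do not share a filesystem.
    if local_rank == 0:
        build_if_missing(fs, events, "local-rank-0-of-rank-%d" % rank)
    events.append("barrier")


# Shared filesystem: both nodes see the same set, so only one build happens.
shared: Set[str] = set()
shared_events: List[str] = []
startup(0, 0, shared, shared_events)
startup(8, 0, shared, shared_events)

# Disconnected filesystems: each node builds its own copy.
node0: Set[str] = set()
node1: Set[str] = set()
split_events: List[str] = []
startup(0, 0, node0, split_events)
startup(8, 0, node1, split_events)
```

The sketch avoids file locking entirely, matching the commit's reasoning that flock is not usable on the NFS in question.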

Plumb KUBEFLOW_SETUP_COMMANDS_JSON (read directly from env to avoid $(argument_builder) word-splitting of space-containing shell commands) through setup_experiment -> kubeflow_executor -> nemo-run KubeflowExecutor.setup_commands, which runs them once per pod in launch.sh before the job. Feature-detected (hasattr) so an older pinned nemo-run without the field is a no-op, not an error. Used to install a dependency missing from a (broken RC) image into the container venv without rebuilding. Signed-off-by: oliver könig <okoenig@nvidia.com>
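A minimal sketch of the plumbing described above. It assumes the variable holds a JSON list of shell command strings, which the commit message does not spell out; the helper names `read_setup_commands` and `apply_setup_commands` are hypothetical, and the executor is any object that may or may not expose a `setup_commands` attribute.

```python
import json
import os
from typing import List, Mapping


def read_setup_commands(
    env: Mapping[str, str] = os.environ,
    var: str = "KUBEFLOW_SETUP_COMMANDS_JSON",
) -> List[str]:
    """Read commands straight from the environment so spaces survive intact."""
    raw = env.get(var, "")
    if not raw:
        return []
    commands = json.loads(raw)
    # Assumption: the payload is a JSON list of strings.
    if not isinstance(commands, list) or not all(isinstance(c, str) for c in commands):
        raise ValueError(var + " must be a JSON list of strings")
    return commands


def apply_setup_commands(executor: object, commands: List[str]) -> bool:
    """Set the field only if this executor version has it; otherwise do nothing."""
    if commands and hasattr(executor, "setup_commands"):
        setattr(executor, "setup_commands", list(commands))
        return True
    return False
```

Encoding the commands as JSON in a single variable keeps each command as one string, so a command such as `pip install some-package` is not split on its spaces by the shell.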

moonlight 16B pretrains on the GB200/GB300 NVL72 target, where HybridEP is the recommended MoE token dispatcher (per the dispatcher-selection guide) over DeepEP. Use the flex dispatcher with the hybridep backend (num_sms=16). Signed-off-by: oliver könig <okoenig@nvidia.com>

The fine-grained 64-expert MoE issues many tiny per-expert/router/attn kernels per step, so the step is host-launch-bound with CUDA graphs off. Newer base images (26.06.rc0) raised per-kernel launch latency, surfacing a ~9% moonlight-only step-time regression. Capture the launch-heavy attn/moe_router/moe_preprocess scopes with TE graphs to remove that host overhead. Mirrors nemotronh/nemotron_3_super. Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g added 3 commits June 12, 2026 09:07

ko3n1g force-pushed the ko3n1g/fix/k8s-code-sync-r0.5.0 branch from df84ea9 to fec2dbe Compare June 12, 2026 11:16

ko3n1g added 5 commits June 12, 2026 12:25

Revert "perf(recipe): enable TE-scoped CUDA graphs for moonlight 16B …

5870cf0

…pretrain" This reverts commit 907c805.

Revert "perf(recipe): switch moonlight 16B pretrain to HybridEP dispa…

49affe2

…tcher" This reverts commit 02c41f8.

ko3n1g wants to merge 9 commits into r0.5.0 from ko3n1g/fix/k8s-code-sync-r0.5.0
