
fix(training): run K8s trainer from data-mover-synced code (r0.5.0 backport)#4320

Draft: ko3n1g wants to merge 9 commits into r0.5.0 from ko3n1g/fix/k8s-code-sync-r0.5.0



@ko3n1g ko3n1g commented Jun 12, 2026

Claude summary

Background / motivation

  • Backport of #4319 (fix(training): run K8s trainer from data-mover-synced code) onto the r0.5.0 release line, to validate the K8s data-mover code-sync path for v0.5.0 (mbridge-v0.5.0-debug1).
  • On Kubeflow the trainer ran the image's /opt/Megatron-Bridge rather than the --mbridge-ref checkout synced onto the workdir PVC: the entrypoint was hardcoded to the image path, and the run.Script PYTHONPATH does not reach the trainer container.

What changed

  • Cherry-pick of the launcher fix (scripts/performance/setup_experiment.py, utils/executors.py): when kubeflow_workdir_local_path is set, point the entrypoint at /nemo_run/scripts/performance/run_recipe.py and front-load /nemo_run/src on PYTHONPATH via custom_env_vars. When it is unset, behavior is unchanged (the image code runs).
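
A minimal sketch of the selection logic described above, not the actual launcher code. The function name `resolve_trainer_launch` is hypothetical, and the image-side script path is an assumption (the PR only states that the image code lives under /opt/Megatron-Bridge).

```python
from typing import Dict, Optional, Tuple

# Assumed location of the entrypoint inside the image (illustrative only).
IMAGE_ENTRYPOINT = "/opt/Megatron-Bridge/scripts/performance/run_recipe.py"
# Paths named in this PR for the data-mover-synced checkout.
SYNCED_ENTRYPOINT = "/nemo_run/scripts/performance/run_recipe.py"
SYNCED_SRC = "/nemo_run/src"


def resolve_trainer_launch(
    kubeflow_workdir_local_path: Optional[str],
    custom_env_vars: Optional[Dict[str, str]] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (entrypoint, env) for the trainer container."""
    env = dict(custom_env_vars or {})
    if not kubeflow_workdir_local_path:
        # No synced workdir: keep running the code baked into the image.
        return IMAGE_ENTRYPOINT, env
    # Front-load the synced source so megatron.bridge resolves to it first.
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = f"{SYNCED_SRC}:{existing}" if existing else SYNCED_SRC
    return SYNCED_ENTRYPOINT, env
```

The environment is returned as part of the result because, per the description, the run.Script environment does not reach the trainer container, so the path has to travel through custom_env_vars.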

Details

Tested

  • Validation run launched on nemo-ci branch ko3n1g-ci-mbridge-k8s-code-sync with --test-image nvcr.io/nvidia/nemo:26.04, --mbridge-ref ko3n1g/fix/k8s-code-sync-r0.5.0, --mcore-ref core_r0.18.0, test-case moonlight_16b_64gpu_gb200_release.

On Kubeflow the trainer entrypoint was hardcoded to the image's
/opt/Megatron-Bridge, so a --mbridge-ref checkout copied onto the workdir
PVC by nemo-run's data-mover pod was never executed: the launch.sh symlink
(/nemo_run -> code_dir) was built but the torchrun command still pointed at
the image.

When kubeflow_workdir_local_path is set, point the entrypoint at
/nemo_run/scripts/performance/run_recipe.py and front-load /nemo_run/src on
PYTHONPATH (via custom_env_vars, since the run.Script env is not propagated to
the trainer container) so megatron.bridge resolves to the synced source while
megatron.core et al. still come from the image. Without it, behavior is
unchanged (image code).

Also correct the executors.py comment: KubeflowExecutor.package() ignores the
packager and never overlays /opt/Megatron-Bridge.

Signed-off-by: oliver könig <okoenig@nvidia.com>
(cherry picked from commit 028da1d)

copy-pr-bot Bot commented Jun 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ko3n1g added 3 commits June 12, 2026 09:07
Add /nemo_run/3rdparty/Megatron-LM to the trainer PYTHONPATH so megatron.core
resolves to the synced --mcore-ref checkout (version-matched with the synced
megatron.bridge), not the image's mismatched mcore.

Signed-off-by: oliver könig <okoenig@nvidia.com>
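
One way to confirm which checkout a module resolves to inside the trainer container is to ask the import system without importing the module. The helper names below are hypothetical and not part of this PR; only standard-library calls are used.

```python
import importlib.util
from typing import Optional

SYNCED_ROOT = "/nemo_run/"  # symlink target named in this PR


def module_location(name: str) -> Optional[str]:
    """Return the file or directory a module would be loaded from, if any."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package (e.g. "megatron") is not importable.
        return None
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no origin; fall back to their first search path.
    locations = list(spec.submodule_search_locations or [])
    return locations[0] if locations else None


def comes_from_synced_checkout(name: str) -> bool:
    """True if the module resolves under the synced /nemo_run tree."""
    location = module_location(name)
    return bool(location) and location.startswith(SYNCED_ROOT)
```

Calling `comes_from_synced_checkout("megatron.core")` in the trainer would be expected to return True once /nemo_run/3rdparty/Megatron-LM is on PYTHONPATH ahead of the image install. That expectation follows from the commit message and has not been verified here.
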
compile_helpers() ran on each node's local rank 0 AFTER init_process_group but
before the first NCCL collective (barrier). The first-time C++ build is slow, so
on K8s, where the NCCL bootstrap connect budget is short, peers fail the
barrier with 'connection refused' / 'remote process exited' while a rank is
mid-compile. (Slurm's rendezvous tolerated it; K8s does not. K8s historically
used images with the helper prebuilt, so on-the-fly compile was never exercised
there.) Move the compile before torch.distributed init so the rendezvous (long,
tolerant store-connect retries) absorbs the delay; a prebuilt .so stays a fast
no-op. Per-node local-rank-0 build is unchanged.

Signed-off-by: oliver könig <okoenig@nvidia.com>
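
The reordering can be sketched as follows. This is an illustration with injected callables, not the real training entrypoint: `launch_rank` and `is_built` are hypothetical names, and the explicit "already built" check stands in for the build being a fast no-op when the .so exists.

```python
from typing import Callable


def launch_rank(
    local_rank: int,
    is_built: Callable[[], bool],
    compile_helpers: Callable[[], None],
    init_process_group: Callable[[], None],
    barrier: Callable[[], None],
) -> None:
    """Run the slow build before distributed init rather than after it."""
    # Local rank 0 of each node builds first; other ranks proceed to init,
    # where the rendezvous store retries until the slow rank joins.
    if local_rank == 0 and not is_built():
        compile_helpers()
    init_process_group()
    # By the first NCCL collective no rank is still compiling.
    barrier()
```
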
When the megatron/core datasets dir is on a filesystem shared across all
trainer nodes (e.g. a K8s PVC), a per-node compile has many ranks run g++/ld
against the same .so concurrently, which over NFS fails with 'ld: final link
failed: Stale file handle' — killing a rank and cascading into a NCCL barrier
failure. flock can't fix it here: the NFS has no working lock daemon, so
flock(LOCK_EX) hangs. Instead global rank 0 builds the .so once before
distributed init (single writer, no race; pre-init so the tolerant rendezvous
absorbs the build, not a NCCL collective). The per-node compile after init is
retained for disconnected per-node filesystems (e.g. Slurm/Lustre): a no-op on
a shared FS, an independent build per node otherwise. A prebuilt .so stays a
fast no-op.

Signed-off-by: oliver könig <okoenig@nvidia.com>
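
A small simulation of the two-phase build described above, to show which ranks actually compile on a shared versus a per-node filesystem. It is a sketch: `simulate_builds` is a hypothetical name, and it assumes global rank 0 is local rank 0 on node 0.

```python
from typing import List, Tuple


def simulate_builds(
    num_nodes: int, ranks_per_node: int, shared_fs: bool
) -> List[Tuple[str, int]]:
    """Return (phase, global_rank) for every rank that performs a real build."""

    def filesystem_of(node: int) -> str:
        # One filesystem for everyone (e.g. a PVC) or one per node.
        return "shared" if shared_fs else f"node{node}"

    built = set()
    log: List[Tuple[str, int]] = []

    # Phase 1, before distributed init: global rank 0 is the single writer.
    built.add(filesystem_of(0))
    log.append(("pre-init", 0))

    # Phase 2, after init: local rank 0 of each node builds only if its
    # filesystem does not already hold the .so.
    for node in range(num_nodes):
        fs = filesystem_of(node)
        if fs not in built:
            built.add(fs)
            log.append(("post-init", node * ranks_per_node))
    return log
```

On a shared filesystem only the pre-init build runs, so no two ranks ever link the same .so concurrently. On per-node filesystems each remaining node still gets its own independent build after init.
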
@ko3n1g ko3n1g force-pushed the ko3n1g/fix/k8s-code-sync-r0.5.0 branch from df84ea9 to fec2dbe Compare June 12, 2026 11:16
ko3n1g added 5 commits June 12, 2026 12:25
Plumb KUBEFLOW_SETUP_COMMANDS_JSON (read directly from env to avoid
$(argument_builder) word-splitting of space-containing shell commands) through
setup_experiment -> kubeflow_executor -> nemo-run KubeflowExecutor.setup_commands,
which runs them once per pod in launch.sh before the job. Feature-detected
(hasattr) so an older pinned nemo-run without the field is a no-op, not an error.
Used to install a dependency missing from a (broken RC) image into the container
venv without rebuilding.

Signed-off-by: oliver könig <okoenig@nvidia.com>
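
A hedged sketch of the plumbing, assuming the variable holds a JSON list of shell command strings (the commit message does not state the exact format). The helper names are hypothetical; `setup_commands` is the executor field named above.

```python
import json
import os
from typing import List, Mapping, Optional


def read_setup_commands(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Parse KUBEFLOW_SETUP_COMMANDS_JSON; an unset or empty value means none."""
    environ = os.environ if environ is None else environ
    raw = environ.get("KUBEFLOW_SETUP_COMMANDS_JSON", "")
    if not raw:
        return []
    commands = json.loads(raw)
    if not isinstance(commands, list) or not all(isinstance(c, str) for c in commands):
        raise ValueError("KUBEFLOW_SETUP_COMMANDS_JSON must be a JSON list of strings")
    return commands


def apply_setup_commands(executor: object, commands: List[str]) -> bool:
    """Set executor.setup_commands only if this executor version has the field."""
    if commands and hasattr(executor, "setup_commands"):
        executor.setup_commands = list(commands)
        return True
    # Older executor without the field: silently skip instead of failing.
    return False
```

Reading JSON from the environment keeps each command intact as one string, which is the point of bypassing the word-splitting argument builder.
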
moonlight 16B pretrains on the GB200/GB300 NVL72 target, where HybridEP is
the recommended MoE token dispatcher (per the dispatcher-selection guide) over
DeepEP. Use the flex dispatcher with the hybridep backend (num_sms=16).

Signed-off-by: oliver könig <okoenig@nvidia.com>
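
The dispatcher switch amounts to three settings. The sketch below uses a stand-in dataclass rather than the real model config; the field names and the defaults are assumptions chosen to mirror the wording of the commit message, and should be checked against the recipe before use.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MoEDispatchConfig:
    """Hypothetical stand-in for the MoE dispatch fields of a model config."""

    moe_token_dispatcher_type: str = "alltoall"  # assumed default
    moe_flex_dispatcher_backend: str = "deepep"  # assumed default
    moe_hybridep_num_sms: int = 16               # assumed default


def use_hybridep(cfg: MoEDispatchConfig, num_sms: int = 16) -> MoEDispatchConfig:
    """Return a copy that selects the flex dispatcher with the hybridep backend."""
    return replace(
        cfg,
        moe_token_dispatcher_type="flex",
        moe_flex_dispatcher_backend="hybridep",
        moe_hybridep_num_sms=num_sms,
    )
```
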
The fine-grained 64-expert MoE issues many tiny per-expert/router/attn kernels
per step, so the step is host-launch-bound with CUDA graphs off. Newer base
images (26.06.rc0) raised per-kernel launch latency, surfacing a ~9%
moonlight-only step-time regression. Capture the launch-heavy
attn/moe_router/moe_preprocess scopes with TE graphs to remove that host
overhead. Mirrors nemotronh/nemotron_3_super.

Signed-off-by: oliver könig <okoenig@nvidia.com>