fix(training): run K8s trainer from data-mover-synced code (r0.5.0 backport) #4320
Draft
ko3n1g wants to merge 9 commits into
On Kubeflow the trainer entrypoint was hardcoded to the image's /opt/Megatron-Bridge, so a --mbridge-ref checkout copied onto the workdir PVC by nemo-run's data-mover pod was never executed: the launch.sh symlink (/nemo_run -> code_dir) was built but the torchrun command still pointed at the image. When kubeflow_workdir_local_path is set, point the entrypoint at /nemo_run/scripts/performance/run_recipe.py and front-load /nemo_run/src on PYTHONPATH (via custom_env_vars, since the run.Script env is not propagated to the trainer container) so megatron.bridge resolves to the synced source while megatron.core et al. still come from the image. Without it, behavior is unchanged (image code). Also correct the executors.py comment: KubeflowExecutor.package() ignores the packager and never overlays /opt/Megatron-Bridge. Signed-off-by: oliver könig <okoenig@nvidia.com> (cherry picked from commit 028da1d)
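The entrypoint selection described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `setup_experiment.py` or `executors.py`: the function name `resolve_trainer_entrypoint` and the constants are hypothetical, and the image-side location of `run_recipe.py` under `/opt/Megatron-Bridge` is an assumption.

```python
from typing import Dict, Optional, Tuple

# Assumed locations; only the /nemo_run paths are stated in the commit message.
IMAGE_ROOT = "/opt/Megatron-Bridge"  # code baked into the container image
SYNCED_ROOT = "/nemo_run"            # launch.sh symlink to the synced code_dir
RECIPE = "scripts/performance/run_recipe.py"


def resolve_trainer_entrypoint(
    kubeflow_workdir_local_path: Optional[str],
    custom_env_vars: Optional[Dict[str, str]] = None,
) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: return (entrypoint, env) for the trainer container."""
    env = dict(custom_env_vars or {})  # copy so the caller's dict is not mutated

    if not kubeflow_workdir_local_path:
        # No synced workdir: behavior is unchanged, run the image code.
        return f"{IMAGE_ROOT}/{RECIPE}", env

    # Synced workdir: run the synced script and put the synced source first
    # on PYTHONPATH so megatron.bridge resolves to it.
    synced_src = f"{SYNCED_ROOT}/src"
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = f"{synced_src}:{existing}" if existing else synced_src
    return f"{SYNCED_ROOT}/{RECIPE}", env
```

The environment is passed through `custom_env_vars` here because, per the commit message, the `run.Script` env does not reach the trainer container.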
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Add /nemo_run/3rdparty/Megatron-LM to the trainer PYTHONPATH so megatron.core resolves to the synced --mcore-ref checkout (version-matched with the synced megatron.bridge), not the image's mismatched mcore. Signed-off-by: oliver könig <okoenig@nvidia.com>
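One way to confirm which checkout a package actually resolves to is to inspect its import spec. The sketch below is a hypothetical diagnostic, not part of this PR; the helper names `module_location` and `resolves_under` are illustrative, and it uses only the standard library.

```python
import importlib.util
import os
from typing import Optional


def module_location(name: str) -> Optional[str]:
    """Return the resolved file or directory a module would be imported from."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "megatron") is not importable.
        return None
    if spec is None:
        return None
    if spec.origin and spec.origin not in ("built-in", "frozen", "namespace"):
        return os.path.realpath(spec.origin)
    # Namespace packages have no origin file; fall back to their first search path.
    locations = list(spec.submodule_search_locations or [])
    return os.path.realpath(locations[0]) if locations else None


def resolves_under(name: str, root: str) -> bool:
    """True if `name` would be imported from somewhere below `root`."""
    location = module_location(name)
    if location is None:
        return False
    root = os.path.realpath(root)  # follows symlinks such as /nemo_run
    return os.path.commonpath([location, root]) == root
```

Inside the trainer container, `resolves_under("megatron.core", "/nemo_run/3rdparty/Megatron-LM")` would be expected to return `True` once the path from this commit is on PYTHONPATH, and `False` if the image's mcore is still being picked up.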
compile_helpers() ran on each node's local rank 0 AFTER init_process_group but before the first NCCL collective (barrier). The first-time C++ build is slow, so on K8s, where the NCCL bootstrap connect budget is short, peers fail the barrier with 'connection refused' / 'remote process exited' while a rank is mid-compile. (Slurm's rendezvous tolerated it; K8s does not. K8s historically used images with the helper prebuilt, so on-the-fly compile was never exercised there.) Move the compile before torch.distributed init so the rendezvous (long, tolerant store-connect retries) absorbs the delay; a prebuilt .so stays a fast no-op. Per-node local-rank-0 build is unchanged. Signed-off-by: oliver könig <okoenig@nvidia.com>
When the megatron/core datasets dir is on a filesystem shared across all trainer nodes (e.g. a K8s PVC), a per-node compile has many ranks run g++/ld against the same .so concurrently, which over NFS fails with 'ld: final link failed: Stale file handle' — killing a rank and cascading into a NCCL barrier failure. flock can't fix it here: the NFS has no working lock daemon, so flock(LOCK_EX) hangs. Instead global rank 0 builds the .so once before distributed init (single writer, no race; pre-init so the tolerant rendezvous absorbs the build, not a NCCL collective). The per-node compile after init is retained for disconnected per-node filesystems (e.g. Slurm/Lustre): a no-op on a shared FS, an independent build per node otherwise. A prebuilt .so stays a fast no-op. Signed-off-by: oliver könig <okoenig@nvidia.com>
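The resulting ordering can be sketched as below. This is an illustration under stated assumptions, not the actual initialization code: the function name `init_with_helper_build` is hypothetical, the three callables stand in for `compile_helpers()`, the `torch.distributed` init, and the first barrier, and the ranks are read from the `RANK` and `LOCAL_RANK` variables that torchrun sets.

```python
import os
from typing import Callable, Mapping, Optional


def init_with_helper_build(
    compile_helpers: Callable[[], None],
    init_process_group: Callable[[], None],
    barrier: Callable[[], None],
    env: Optional[Mapping[str, str]] = None,
) -> None:
    """Hypothetical sketch of the build ordering described in the commit."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))

    # 1. Before distributed init: global rank 0 is the only writer, so a
    #    shared filesystem never sees concurrent g++/ld runs on the same .so.
    if rank == 0:
        compile_helpers()

    # 2. Rendezvous. It cannot complete until rank 0 joins, so its tolerant
    #    retries absorb the build time instead of a NCCL collective.
    init_process_group()

    # 3. After init: per-node build for node-local filesystems. On a shared
    #    filesystem the .so already exists, so this is a no-op.
    if local_rank == 0:
        compile_helpers()

    barrier()
```

Passing the steps in as callables keeps the sketch independent of torch and makes the call order easy to check in isolation.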
df84ea9 to fec2dbe
Plumb KUBEFLOW_SETUP_COMMANDS_JSON (read directly from env to avoid $(argument_builder) word-splitting of space-containing shell commands) through setup_experiment -> kubeflow_executor -> nemo-run KubeflowExecutor.setup_commands, which runs them once per pod in launch.sh before the job. Feature-detected (hasattr) so an older pinned nemo-run without the field is a no-op, not an error. Used to install a dependency missing from a (broken RC) image into the container venv without rebuilding. Signed-off-by: oliver könig <okoenig@nvidia.com>
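A minimal sketch of this plumbing is shown below. It assumes the variable holds a JSON list of shell command strings, which the commit message does not spell out; the helper names `read_setup_commands` and `apply_setup_commands` are hypothetical, and only the `setup_commands` attribute name comes from the commit message.

```python
import json
import os
from typing import List, Mapping, Optional, Sequence


def read_setup_commands(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Read setup commands straight from the environment (assumed JSON list)."""
    env = os.environ if env is None else env
    raw = env.get("KUBEFLOW_SETUP_COMMANDS_JSON", "").strip()
    if not raw:
        return []
    commands = json.loads(raw)
    if not isinstance(commands, list) or not all(isinstance(c, str) for c in commands):
        raise ValueError("KUBEFLOW_SETUP_COMMANDS_JSON must be a JSON list of strings")
    return commands


def apply_setup_commands(executor: object, commands: Sequence[str]) -> bool:
    """Set executor.setup_commands if the executor supports it.

    Returns False (a no-op) when there is nothing to set or when an older
    executor class does not have the field.
    """
    if not commands or not hasattr(executor, "setup_commands"):
        return False
    executor.setup_commands = list(commands)
    return True
```

Encoding the commands as JSON keeps each command intact as one string, spaces included, which is the word-splitting problem the commit message attributes to passing them through the argument builder.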
moonlight 16B pretrains on the GB200/GB300 NVL72 target, where HybridEP is the recommended MoE token dispatcher (per the dispatcher-selection guide) over DeepEP. Use the flex dispatcher with the hybridep backend (num_sms=16). Signed-off-by: oliver könig <okoenig@nvidia.com>
The fine-grained 64-expert MoE issues many tiny per-expert/router/attn kernels per step, so the step is host-launch-bound with CUDA graphs off. Newer base images (26.06.rc0) raised per-kernel launch latency, surfacing a ~9% moonlight-only step-time regression. Capture the launch-heavy attn/moe_router/moe_preprocess scopes with TE graphs to remove that host overhead. Mirrors nemotronh/nemotron_3_super. Signed-off-by: oliver könig <okoenig@nvidia.com>
…pretrain" This reverts commit 907c805.
…tcher" This reverts commit 02c41f8.
Claude summary
Background / motivation
r0.5.0 release line, to validate the K8s data-mover code-sync path for v0.5.0 (mbridge-v0.5.0-debug1).
/opt/Megatron-Bridge, never the --mbridge-ref checkout synced onto the workdir PVC; the entrypoint was hardcoded to the image path and the run.Script PYTHONPATH does not reach the trainer container.
What changed
scripts/performance/setup_experiment.py, utils/executors.py): when kubeflow_workdir_local_path is set, point the entrypoint at /nemo_run/scripts/performance/run_recipe.py and front-load /nemo_run/src on PYTHONPATH via custom_env_vars. Without it, behavior is unchanged (image code).
Details
!2522 (stages src/ + scripts/, sets KUBEFLOW_WORKDIR_LOCAL_PATH).
origin/r0.5.0 (-x recorded).
Tested
ko3n1g-ci-mbridge-k8s-code-sync with --test-image nvcr.io/nvidia/nemo:26.04, --mbridge-ref ko3n1g/fix/k8s-code-sync-r0.5.0, --mcore-ref core_r0.18.0, test-case moonlight_16b_64gpu_gb200_release.