-
Notifications
You must be signed in to change notification settings - Fork 42
M bridge updates #767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
M bridge updates #767
Conversation
📝 WalkthroughWalkthroughAdds new Megatron-Bridge test configurations for multiple GPU architectures (B200, GB200, GB300, H100) with Qwen3 30B model settings. Refactors megatron_bridge.py to enforce explicit git_repos configuration and remove inferential logic for repo selection. Renames artifact output paths across the module with a "cloudai_" prefix and updates corresponding test expectations. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR updates the Megatron-Bridge workload implementation to align with the stable M-Bridge 25.11 release (r0.2.0) by making git repository specification mandatory in test TOML files, enabling reproducibility through explicit commit pinning. Key changes:
Confidence Score: 5/5
Important Files Changed
|
Greptile's behavior is changing!From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Fix all issues with AI Agents 🤖
In @conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml:
- Line 38: The hf_token field in the TOML is empty and will fail the non-empty
validation in the megatron_bridge validator; update the hf_token key to contain
a valid Hugging Face token (or a placeholder like "<HF_TOKEN>" that will be
replaced at deployment or an env-var reference) so the validator in
megatron_bridge accepts it, or remove/override this setting if you intend to
inherit a token from elsewhere.
In @conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml:
- Line 38: The hf_token field in the TOML is empty which triggers validation
failure; update the hf_token entry in megatron_bridge_qwen_30b.toml so it is not
an empty string—either set it to a valid HuggingFace token string (or a
placeholder like "<HF_TOKEN>") or switch to an environment-variable-based value
used elsewhere in configs (e.g., reference the same env var pattern used by
other test files) so loading the test definition passes validation.
In @conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml:
- Around line 23-26: The git repo entry uses the tag string "r0.2.0" in the
commit field which is less reproducible; replace that tag with the exact commit
SHA that the tag points to (the value currently under commit) so the
[[git_repos]] block (url, commit, mount_as) pins to an immutable commit SHA
instead of "r0.2.0" to ensure deterministic checkouts.
- Around line 28-37: The container_image value under [cmd_args] is using '#' as
a separator ("nvcr.io#nvidia/nemo:25.11.01"); update the container_image entry
to use the standard Docker image format with slashes (e.g.,
"nvcr.io/nvidia/nemo:25.11.01") so the registry, organization and image are
correctly separated; ensure you only change the container_image string and leave
other keys (gpu_type, model_name, model_size, etc.) unchanged.
In @conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml:
- Line 38: The test config currently sets hf_token = "" which will trigger the
non-empty token check in the validator in
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (the hf_token
validation that raises ValueError for empty/whitespace-only tokens); fix by
either giving a non-empty placeholder token in this TOML (e.g., a local-test
token) or modify the validator logic in megatron_bridge.py to allow empty
hf_token by skipping the non-empty check (or making the field optional) so it no
longer raises ValueError for an empty string.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.tomlconf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.tomlconf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.tomlconf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.tomlsrc/cloudai/workloads/megatron_bridge/megatron_bridge.pysrc/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.pytests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:16.200Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
📚 Learning: 2025-12-23T00:23:16.200Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:16.200Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.
Applied to files:
conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.tomlconf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.tomlsrc/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.pyconf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.tomltests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.pyconf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.tomlsrc/cloudai/workloads/megatron_bridge/megatron_bridge.py
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.
Applied to files:
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.pytests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.pysrc/cloudai/workloads/megatron_bridge/megatron_bridge.py
🧬 Code graph analysis (2)
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (1)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (2)
MegatronBridgeTestDefinition(92-445)megatron_bridge_repo(144-153)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (1)
src/cloudai/_core/installables.py (1)
GitRepo(87-115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🔇 Additional comments (19)
conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml (2)
37-37: Verify compute_dtype inconsistency across GPU types.This config uses
compute_dtype = "fp8_cs"while the b200 and gb200 configs usecompute_dtype = "fp8_mx". Confirm whether this difference is intentional based on H100 hardware capabilities or if it should be standardized.
23-26: LGTM: Git repository configuration aligns with new validation.The
[[git_repos]]block correctly specifies the Megatron-Bridge repository with a pinned commit and explicit mount point, satisfying the new validation requirements introduced inmegatron_bridge.py.conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml (1)
23-26: LGTM: Repository configuration is correct.The git_repos block properly defines the Megatron-Bridge repository with pinned version and mount point.
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (4)
106-112: LGTM: Repository selection logic is well-designed.The static helper correctly identifies the Megatron-Bridge repository by URL substring or mount path, and properly normalizes the mount point using
model_copy. This approach is clean and maintains immutability.
114-129: LGTM: Validation enforces explicit repository pinning.The validator ensures users explicitly specify the Megatron-Bridge repository via
[[git_repos]], improving reproducibility and alignment with the M-Bridge 26.02 release requirements noted in the PR objectives.
146-152: LGTM: Property implementation is consistent with validation.The property correctly uses the selection helper and provides a clear error message if the Megatron-Bridge repo is missing, though the validator should catch this earlier.
83-89: Remove this concern—test configs with empty hf_token are already handled.Test TOML files in
conf/experimental/megatron_bridge/test/*/megatron_bridge_qwen_30b.tomldo sethf_token = "", buttests/test_test_definitions.pyexplicitly skips these configurations before any parsing occurs:if toml_dict.get("test_template_name") == "MegatronBridge": cmd_args = toml_dict.get("cmd_args", {}) or {} if cmd_args.get("hf_token", None) == "": pytest.skip("MegatronBridge example config requires user to set cmd_args.hf_token.")The skip happens at line 91-94, before
Parser.parse_tests()at line 100 would trigger config loading and validation. Tests never instantiate these placeholder configs, so the validator is never invoked and no inconsistency exists.conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml (2)
23-26: LGTM: Git repository configuration now enforced.Adding the
[[git_repos]]block satisfies the new mandatory requirement for explicit Megatron-Bridge repository pinning.
39-40: LGTM: New feature flags align with M-Bridge 26.02.The
enable_vboostanddetachparameters support the M-Bridge r0.2.0 API as noted in the PR objectives.src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (1)
80-80: LGTM: Consistent naming convention with cloudai_ prefix.The renaming of generated artifacts to use the
cloudai_prefix improves clarity about which files are CloudAI-managed versus user-provided. The changes are purely cosmetic with no behavioral impact.Also applies to: 114-115
conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml (3)
1-21: LGTM!The copyright header, metadata, and configuration structure are correct.
39-40: LGTM!The
enable_vboostanddetachsettings are correctly configured, withenable_vboost=truealigning with the new vboost configuration feature mentioned in the PR objectives.
38-38: Empty hf_token is intentional for example configuration files.The empty
hf_tokenis valid and expected in example TOML files. The test framework explicitly skips tests with empty hf_token (seetests/test_test_definitions.pylines 93-94):if cmd_args.get("hf_token", None) == "": pytest.skip("MegatronBridge example config requires user to set cmd_args.hf_token.")The validator in
MegatronBridgeCmdArgsrejects empty tokens only when the class is instantiated programmatically viamodel_validate(), not during TOML parsing. This is the correct design pattern for example configurations that users must customize before actual execution.tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (6)
2-2: LGTM!Copyright year updated correctly to span 2025-2026.
54-60: LGTM!The
git_reposfield correctly configures the Megatron-Bridge repository pin as now required by the updatedMegatronBridgeTestDefinition. The use oftype: ignore[arg-type]is appropriate for the dict-to-GitRepo conversion.
90-107: LGTM!This test effectively validates that the Megatron-Bridge commit can be pinned via
git_reposand correctly retrieved. The use of a full commit SHA in the test (rather than a tag) provides a good example of explicit versioning for reproducibility.
126-132: LGTM!Consistent addition of
git_reposto ensure the test fixture conforms to the updatedMegatronBridgeTestDefinitionschema.
162-162: LGTM!The wrapper script path references are correctly updated to use the new
cloudai_megatron_bridge_submit_and_parse_jobid.shfilename, consistent with the CloudAI prefix convention introduced in this PR.Also applies to: 170-170, 203-203
218-218: LGTM!The generated command file path and content assertions are correctly updated to reflect the
cloudai_naming prefix, completing the consistent migration across all test expectations.Also applies to: 223-223
conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
Show resolved
Hide resolved
conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
Show resolved
Hide resolved
conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
Show resolved
Hide resolved
conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
Show resolved
Hide resolved
conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
Show resolved
Hide resolved
alexmanle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for the enhanced clarity!
Summary
This PR updates the following
r2.0). To align CloudAI with stable M-Bridge release, we make it mandatory to specify the git repo, commit hash for repetabilityTest Plan
Additional Notes
Include any other notes or comments about the pull request here. This can include challenges faced, future considerations, or context that reviewers might find helpful.