@srivatsankrishnan srivatsankrishnan commented Jan 5, 2026

Summary

This PR makes the following updates:

  • Make the git repo field mandatory in the test TOML. M-Bridge changes the flags for certain fields in its 26.02 release; the last stable release is 25.11 (r0.2.0). To align CloudAI with the stable M-Bridge release, specifying the git repo and commit hash is now mandatory for repeatability.
  • Enable the vboost config in the test TOML.
  • Minor changes to the names of CloudAI-generated files.
  • Add configs for GB200/B200/H100 systems.

Test Plan

  • CI/CD
  • Real system
$ cloudai run --system-config ../cloudaix/conf/common/system/lyris.toml --tests-dir conf/experimental/megatron_bridge/test --test-scenario conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml
[INFO] System Name: lyris
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: megatron_bridge_qwen_30b
[INFO] Checking if workloads components are installed.
[INFO] Test Scenario: megatron_bridge_qwen_30b

Section Name: megatron_bridge_qwen_30b
  Test Name: megatron_bridge_qwen_30b
  Description: Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Scenario results will be stored at: results/megatron_bridge_qwen_30b_2026-01-05_13-49-30
[INFO] Starting test: megatron_bridge_qwen_30b (results at: results/megatron_bridge_qwen_30b_2026-01-05_13-49-30/megatron_bridge_qwen_30b/0)
[INFO] Running test: megatron_bridge_qwen_30b
[INFO] Submitted slurm job: 625897
[INFO] Job completed: megatron_bridge_qwen_30b (iteration 1 of 1)
[INFO] Generated scenario report at results/megatron_bridge_qwen_30b_2026-01-05_13-49-30/megatron_bridge_qwen_30b.html
[INFO] Scenario results
┌──────────────────────────┬────────┬─────────────────────────────────────────────────────────────────────────────────┐
│ Case                     │ Status │ Details                                                                         │
├──────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ megatron_bridge_qwen_30b │ PASSED │ results/megatron_bridge_qwen_30b_2026-01-05_13-49-30/megatron_bridge_qwen_30b/0 │
└──────────────────────────┴────────┴─────────────────────────────────────────────────────────────────────────────────┘

coderabbitai bot commented Jan 5, 2026

Walkthrough

Adds new Megatron-Bridge test configurations for multiple GPU architectures (B200, GB200, GB300, H100) with Qwen3 30B model settings. Refactors megatron_bridge.py to enforce explicit git_repos configuration and remove inferential logic for repo selection. Renames artifact output paths across the module with a "cloudai_" prefix and updates corresponding test expectations.

Changes

Cohort: New Megatron-Bridge Test Configurations
File(s): conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml, conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml, conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml, conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
Summary: Introduces four new TOML deployment configurations for Qwen3 30B model testing across different GPU types (B200, GB200, GB300, H100). Each specifies git_repos pointing to Megatron-Bridge (commit r0.2.0), container image, model parameters, GPU distribution, compute settings (fp8_mx), and vBoost enablement.

Cohort: Megatron-Bridge Workload Refactoring
File(s): src/cloudai/workloads/megatron_bridge/megatron_bridge.py
Summary: Removes nemo_container_version and megatron_bridge_ref fields from MegatronBridgeCmdArgs. Adds field-level validator validate_git_repos_has_megatron_bridge_repo enforcing Megatron-Bridge presence in git_repos. Introduces static helper method _select_megatron_bridge_repo for repo localization. Reworks megatron_bridge_repo property to use explicit repo selection; removes inferential methods _infer_megatron_bridge_ref and _map_container_version_to_mbridge_ref.

Cohort: Artifact Path Renaming
File(s): src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
Summary: Prefixes generated artifact filenames with "cloudai_": generated_command.sh → cloudai_generated_command.sh, megatron_bridge_submit_and_parse_jobid.sh → cloudai_megatron_bridge_submit_and_parse_jobid.sh, megatron_bridge_launcher.log → cloudai_megatron_bridge_launcher.log. Updates copyright year. No logic changes.

Cohort: Test Updates
File(s): tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
Summary: Adds git_repos configuration to test MegatronBridgeTestDefinition instantiations. Updates output path expectations to match new "cloudai_" prefixed filenames. Introduces new test test_git_repos_can_pin_megatron_bridge_commit validating commit persistence from git_repos. Updates copyright year.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Four new configs hop into place,
Repos pinned with grace and space,
Cloudai prefixes lead the way,
Validation guards the test-run day,
Inference shed, explicit rules stay!

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

  • Check name: Title check
    Status: ❓ Inconclusive
    Explanation: The title 'M bridge updates' is vague and does not clearly convey the main changes; it uses an abbreviation without context and lacks specificity about what was updated.
    Resolution: Use a more descriptive title such as 'Make git repo field mandatory in Megatron-Bridge test configs and add system configurations' to clearly summarize the primary changes.

✅ Passed checks (1 passed)

  • Check name: Description check
    Status: ✅ Passed
    Explanation: The description clearly outlines the main changes including mandatory git repo field, vboost enablement, file naming changes, and new system configs, with a test plan demonstrating successful execution.
@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review January 5, 2026 22:17
greptile-apps bot commented Jan 5, 2026

Greptile Summary

This PR updates the Megatron-Bridge workload implementation to align with the stable M-Bridge 25.11 release (r0.2.0) by making git repository specification mandatory in test TOML files, enabling reproducibility through explicit commit pinning.

Key changes:

  • Removed deprecated nemo_container_version and megatron_bridge_ref fields from MegatronBridgeCmdArgs, replacing automatic version mapping with user-specified git repos
  • Added field validator validate_git_repos_has_megatron_bridge_repo() to enforce mandatory [[git_repos]] specification with Megatron-Bridge repository
  • Implemented _select_megatron_bridge_repo() helper to extract and normalize the Megatron-Bridge repo from git_repos list
  • Added support for enable_vboost configuration parameter
  • Prefixed CloudAI-generated files with cloudai_ for better identification (cloudai_generated_command.sh, cloudai_megatron_bridge_submit_and_parse_jobid.sh, cloudai_megatron_bridge_launcher.log)
  • Added configuration files for GB200, B200, H100, and GB300 GPU platforms with appropriate settings for each architecture
  • Updated copyright year to 2025-2026 across all modified files
  • Updated all tests to include mandatory git_repos field and reflect new file naming conventions

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are a well-structured refactoring that improves reproducibility and maintainability. The mandatory git_repos validation ensures version pinning for stable releases. Tests have been updated to reflect the changes. There are no breaking changes to existing functionality beyond the intentional requirement to specify git_repos.
  • No files require special attention

Important Files Changed

  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py: Removed deprecated container version fields and added mandatory git repo validation for reproducibility
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py: Updated generated file naming to include cloudai prefix for clarity
  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py: Updated tests to require git_repos field and reflect new cloudai-prefixed file naming

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant TestDefinition
    participant ValidationLayer
    participant CommandGen
    participant MegatronBridge

    User->>CloudAI: Load test TOML with [[git_repos]]
    CloudAI->>TestDefinition: Create MegatronBridgeTestDefinition
    TestDefinition->>ValidationLayer: Validate git_repos field
    alt git_repos is empty
        ValidationLayer-->>User: Error: git_repos required
    else git_repos missing Megatron-Bridge
        ValidationLayer-->>User: Error: Megatron-Bridge repo required
    else git_repos valid
        ValidationLayer->>TestDefinition: Pass validation
    end
    
    TestDefinition->>TestDefinition: Select Megatron-Bridge repo
    TestDefinition->>TestDefinition: Normalize mount_as path
    
    User->>CloudAI: Run test
    CloudAI->>CommandGen: Generate execution command
    CommandGen->>CommandGen: Build launcher flags (enable_vboost, detach, etc.)
    CommandGen->>CommandGen: Create cloudai_* prefixed files
    CommandGen->>MegatronBridge: Submit job via wrapper script
    MegatronBridge-->>CloudAI: Return job ID
    CloudAI-->>User: Test execution started

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Fix all issues with AI Agents 🤖
In @conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml:
- Line 38: The hf_token field in the TOML is empty and will fail the non-empty
validation in the megatron_bridge validator; update the hf_token key to contain
a valid Hugging Face token (or a placeholder like "<HF_TOKEN>" that will be
replaced at deployment or an env-var reference) so the validator in
megatron_bridge accepts it, or remove/override this setting if you intend to
inherit a token from elsewhere.

In @conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml:
- Line 38: The hf_token field in the TOML is empty which triggers validation
failure; update the hf_token entry in megatron_bridge_qwen_30b.toml so it is not
an empty string—either set it to a valid HuggingFace token string (or a
placeholder like "<HF_TOKEN>") or switch to an environment-variable-based value
used elsewhere in configs (e.g., reference the same env var pattern used by
other test files) so loading the test definition passes validation.

In @conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml:
- Around line 23-26: The git repo entry uses the tag string "r0.2.0" in the
commit field which is less reproducible; replace that tag with the exact commit
SHA that the tag points to (the value currently under commit) so the
[[git_repos]] block (url, commit, mount_as) pins to an immutable commit SHA
instead of "r0.2.0" to ensure deterministic checkouts.
- Around line 28-37: The container_image value under [cmd_args] is using '#' as
a separator ("nvcr.io#nvidia/nemo:25.11.01"); update the container_image entry
to use the standard Docker image format with slashes (e.g.,
"nvcr.io/nvidia/nemo:25.11.01") so the registry, organization and image are
correctly separated; ensure you only change the container_image string and leave
other keys (gpu_type, model_name, model_size, etc.) unchanged.

In @conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml:
- Line 38: The test config currently sets hf_token = "" which will trigger the
non-empty token check in the validator in
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (the hf_token
validation that raises ValueError for empty/whitespace-only tokens); fix by
either giving a non-empty placeholder token in this TOML (e.g., a local-test
token) or modify the validator logic in megatron_bridge.py to allow empty
hf_token by skipping the non-empty check (or making the field optional) so it no
longer raises ValueError for an empty string.
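
The tag-versus-SHA point raised for the gb300 config can be checked mechanically. The following is a minimal sketch, assuming a SHA-1 repository where a full commit ID is 40 hexadecimal characters; the helper name is hypothetical and is not part of CloudAI.

```python
import re

# A full SHA-1 commit ID is exactly 40 hexadecimal characters.
# (Assumption: the repository uses SHA-1 object names, not SHA-256.)
FULL_SHA = re.compile(r"[0-9a-fA-F]{40}")


def is_full_commit_sha(ref: str) -> bool:
    """Return True if ref looks like a full commit SHA rather than a tag or branch."""
    return FULL_SHA.fullmatch(ref) is not None


for ref in ("r0.2.0", "main", "0123456789abcdef0123456789abcdef01234567"):
    kind = "full SHA" if is_full_commit_sha(ref) else "tag or branch name"
    print(f"{ref}: {kind}")
```

A check like this only inspects the shape of the string; it cannot confirm that the commit exists in the repository.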
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6da438c and b122c68.

📒 Files selected for processing (7)
  • conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:16.200Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
📚 Learning: 2025-12-23T00:23:16.200Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:16.200Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.

Applied to files:

  • conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Applied to files:

  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
🧬 Code graph analysis (2)
tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (1)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (2)
  • MegatronBridgeTestDefinition (92-445)
  • megatron_bridge_repo (144-153)
src/cloudai/workloads/megatron_bridge/megatron_bridge.py (1)
src/cloudai/_core/installables.py (1)
  • GitRepo (87-115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🔇 Additional comments (19)
conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml (2)

37-37: Verify compute_dtype inconsistency across GPU types.

This config uses compute_dtype = "fp8_cs" while the b200 and gb200 configs use compute_dtype = "fp8_mx". Confirm whether this difference is intentional based on H100 hardware capabilities or if it should be standardized.


23-26: LGTM: Git repository configuration aligns with new validation.

The [[git_repos]] block correctly specifies the Megatron-Bridge repository with a pinned commit and explicit mount point, satisfying the new validation requirements introduced in megatron_bridge.py.

conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml (1)

23-26: LGTM: Repository configuration is correct.

The git_repos block properly defines the Megatron-Bridge repository with pinned version and mount point.

src/cloudai/workloads/megatron_bridge/megatron_bridge.py (4)

106-112: LGTM: Repository selection logic is well-designed.

The static helper correctly identifies the Megatron-Bridge repository by URL substring or mount path, and properly normalizes the mount point using model_copy. This approach is clean and maintains immutability.
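
To make the described behavior concrete, here is a minimal sketch of selecting a repo by URL substring or mount path and returning a normalized copy. It is not the code from megatron_bridge.py: the GitRepo stand-in is a plain frozen dataclass, dataclasses.replace plays the role that model_copy plays in the PR, and the mount path constant is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

# Assumed canonical mount point; the real value lives in the CloudAI source.
MOUNT_PATH = "/opt/Megatron-Bridge"


@dataclass(frozen=True)
class GitRepo:
    """Simplified stand-in for CloudAI's GitRepo model."""

    url: str
    commit: str
    mount_as: Optional[str] = None


def select_megatron_bridge_repo(repos: List[GitRepo]) -> Optional[GitRepo]:
    """Return the first Megatron-Bridge repo with a normalized mount path, or None."""
    for repo in repos:
        matches_url = "megatron-bridge" in repo.url.lower()
        matches_mount = repo.mount_as == MOUNT_PATH
        if matches_url or matches_mount:
            # replace() builds a new object, so the caller's list is untouched.
            return replace(repo, mount_as=MOUNT_PATH)
    return None


repos = [
    GitRepo(url="https://example.com/other.git", commit="main"),
    GitRepo(url="https://github.com/NVIDIA-NeMo/Megatron-Bridge.git", commit="r0.2.0"),
]
selected = select_megatron_bridge_repo(repos)
print(selected)
```

Returning a copy rather than mutating the matched entry keeps the user-supplied git_repos list exactly as it was parsed.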


114-129: LGTM: Validation enforces explicit repository pinning.

The validator ensures users explicitly specify the Megatron-Bridge repository via [[git_repos]], improving reproducibility and alignment with the stable M-Bridge 25.11 release (r0.2.0) noted in the PR objectives.
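
The two failure cases this validation covers (an empty list, and a list without a Megatron-Bridge entry) can be sketched as below. This is an illustration over plain dicts, not the Pydantic field validator from the PR; the function name and error messages are hypothetical.

```python
from typing import Dict, List


def validate_git_repos(git_repos: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Reject configs that do not explicitly pin a Megatron-Bridge repository."""
    if not git_repos:
        raise ValueError("git_repos is required: add a [[git_repos]] entry")
    has_bridge = any(
        "megatron-bridge" in repo.get("url", "").lower() for repo in git_repos
    )
    if not has_bridge:
        raise ValueError("git_repos must include the Megatron-Bridge repository")
    return git_repos


valid = [{"url": "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git", "commit": "r0.2.0"}]
print(validate_git_repos(valid))

for bad in ([], [{"url": "https://example.com/other.git", "commit": "main"}]):
    try:
        validate_git_repos(bad)
    except ValueError as err:
        print(f"rejected: {err}")
```

Failing at load time means a missing pin surfaces before any job is submitted.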


146-152: LGTM: Property implementation is consistent with validation.

The property correctly uses the selection helper and provides a clear error message if the Megatron-Bridge repo is missing, though the validator should catch this earlier.


83-89: Remove this concern—test configs with empty hf_token are already handled.

Test TOML files in conf/experimental/megatron_bridge/test/*/megatron_bridge_qwen_30b.toml do set hf_token = "", but tests/test_test_definitions.py explicitly skips these configurations before any parsing occurs:

if toml_dict.get("test_template_name") == "MegatronBridge":
    cmd_args = toml_dict.get("cmd_args", {}) or {}
    if cmd_args.get("hf_token", None) == "":
        pytest.skip("MegatronBridge example config requires user to set cmd_args.hf_token.")

The skip happens at lines 91-94, before Parser.parse_tests() at line 100 would trigger config loading and validation. Tests never instantiate these placeholder configs, so the validator is never invoked and no inconsistency exists.

conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml (2)

23-26: LGTM: Git repository configuration now enforced.

Adding the [[git_repos]] block satisfies the new mandatory requirement for explicit Megatron-Bridge repository pinning.


39-40: LGTM: New feature flags align with M-Bridge 25.11 (r0.2.0).

The enable_vboost and detach parameters support the M-Bridge r0.2.0 API as noted in the PR objectives.

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py (1)

80-80: LGTM: Consistent naming convention with cloudai_ prefix.

The renaming of generated artifacts to use the cloudai_ prefix improves clarity about which files are CloudAI-managed versus user-provided. The changes are purely cosmetic with no behavioral impact.

Also applies to: 114-115
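
For readers updating their own tooling to the new names, the rename amounts to prefixing the filename component of each generated path. The helper below is a hypothetical sketch of that rule and is not taken from slurm_command_gen_strategy.py; only the three filenames come from this PR.

```python
from pathlib import Path

PREFIX = "cloudai_"


def with_cloudai_prefix(path: Path) -> Path:
    """Prefix the filename with 'cloudai_' unless it already has the prefix."""
    if path.name.startswith(PREFIX):
        return path
    return path.with_name(PREFIX + path.name)


old_names = [
    "generated_command.sh",
    "megatron_bridge_submit_and_parse_jobid.sh",
    "megatron_bridge_launcher.log",
]
for name in old_names:
    # "run_dir" is a placeholder directory used only for this example.
    print(with_cloudai_prefix(Path("run_dir") / name))
```

Only the final path component changes, so the results directory layout stays the same.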

conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml (3)

1-21: LGTM!

The copyright header, metadata, and configuration structure are correct.


39-40: LGTM!

The enable_vboost and detach settings are correctly configured, with enable_vboost=true aligning with the new vboost configuration feature mentioned in the PR objectives.


38-38: Empty hf_token is intentional for example configuration files.

The empty hf_token is valid and expected in example TOML files. The test framework explicitly skips tests with empty hf_token (see tests/test_test_definitions.py lines 93-94):

if cmd_args.get("hf_token", None) == "":
    pytest.skip("MegatronBridge example config requires user to set cmd_args.hf_token.")

The validator in MegatronBridgeCmdArgs rejects empty tokens only when the class is instantiated programmatically via model_validate(), not during TOML parsing. This is the correct design pattern for example configurations that users must customize before actual execution.
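
The non-empty check referred to here can be pictured as follows. This is a simplified sketch of a token check that rejects empty and whitespace-only values; the function name and error message are assumptions, and the real validator lives on MegatronBridgeCmdArgs.

```python
def validate_hf_token(token: str) -> str:
    """Reject empty or whitespace-only Hugging Face tokens."""
    if not token.strip():
        raise ValueError("hf_token must be a non-empty string")
    return token


for candidate in ("", "   ", "hf_example_placeholder"):
    try:
        validate_hf_token(candidate)
        print(f"{candidate!r}: accepted")
    except ValueError as err:
        print(f"{candidate!r}: rejected ({err})")
```

An example config that ships with hf_token = "" therefore has to be filled in by the user before this check will pass.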

tests/slurm_command_gen_strategy/test_megatron_bridge_slurm_command_gen_strategy.py (6)

2-2: LGTM!

Copyright year updated correctly to span 2025-2026.


54-60: LGTM!

The git_repos field correctly configures the Megatron-Bridge repository pin as now required by the updated MegatronBridgeTestDefinition. The use of type: ignore[arg-type] is appropriate for the dict-to-GitRepo conversion.


90-107: LGTM!

This test effectively validates that the Megatron-Bridge commit can be pinned via git_repos and correctly retrieved. The use of a full commit SHA in the test (rather than a tag) provides a good example of explicit versioning for reproducibility.


126-132: LGTM!

Consistent addition of git_repos to ensure the test fixture conforms to the updated MegatronBridgeTestDefinition schema.


162-162: LGTM!

The wrapper script path references are correctly updated to use the new cloudai_megatron_bridge_submit_and_parse_jobid.sh filename, consistent with the CloudAI prefix convention introduced in this PR.

Also applies to: 170-170, 203-203


218-218: LGTM!

The generated command file path and content assertions are correctly updated to reflect the cloudai_ naming prefix, completing the consistent migration across all test expectations.

Also applies to: 223-223

@alexmanle alexmanle left a comment

Looks good, thanks for the enhanced clarity!

@srivatsankrishnan srivatsankrishnan merged commit 7b63c79 into NVIDIA:main Jan 5, 2026
5 checks passed