M bridge Documentation #765

srivatsankrishnan · 2025-12-31T18:00:37Z

Summary

Documentation for Megatron-Bridge.

Test Plan

- CI/CD

Additional Notes

coderabbitai · 2025-12-31T18:00:43Z

Walkthrough

This pull request introduces the MegatronBridge workload to the project by clearing a test configuration token placeholder, adding a workload entry to the documentation index, and creating a new documentation file with usage examples and API references.

Changes

Cohort / File(s)	Summary
Configuration Updates `conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml`	Cleared `hf_token` field from placeholder value to empty string
Documentation Additions `doc/workloads/index.rst`	Added megatron_bridge entry to Available Workloads table with status indicators (✅, ❌, ❌, ❌)
Documentation — New Workload Guide `doc/workloads/megatron_bridge.rst`	New documentation file with usage examples, TOML configuration snippets, and API documentation referencing MegatronBridgeCmdArgs and MegatronBridgeTestDefinition

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A bridge for Megatron, grand and wide,
Documentation flows with workload pride,
Config cleaned, the index grows,
New examples bloom where guidance flows! ✨

Pre-merge checks

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'M bridge Documentation' is vague and uses abbreviation ('M bridge') that doesn't clearly convey the specific change; it's partially related but lacks clarity about what 'M bridge' refers to.	Expand the title to be more specific and clear, such as 'Add Megatron Bridge workload documentation' to better describe the changeset.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description is related to the changeset, mentioning documentation for Megatron-Bridge and referencing CI/CD testing, which aligns with the actual changes.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99f9158 and 5bb5d53.

📒 Files selected for processing (3)

conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml
doc/workloads/index.rst
doc/workloads/megatron_bridge.rst

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-12-23T00:23:16.200Z

Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:16.200Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.

Applied to files:

doc/workloads/megatron_bridge.rst

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Greptile Review

🔇 Additional comments (4)

doc/workloads/index.rst (1)

21-21: LGTM! Entry follows the correct format.

The new megatron_bridge workload entry is properly formatted as a doc reference and correctly positioned alphabetically. The status indicators (✅, ❌, ❌, ❌) appropriately reflect Slurm-only support, consistent with similar workloads.

Note: Line 20 shows "MegatronRun" without :doc: markup, while the new entry correctly uses :doc:`megatron_bridge`. This suggests MegatronRun may not have documentation yet, which is a pre-existing inconsistency.

doc/workloads/megatron_bridge.rst (3)

1-5: LGTM! Clear and concise introduction.

The header and description effectively introduce the MegatronBridge workload, correctly identifying the test_template_name and its purpose for training and finetuning tasks.

7-64: Excellent examples covering multiple usage patterns.

The three usage examples (Test TOML, Test Scenario, and Test-in-Scenario) are comprehensive and well-structured. They clearly demonstrate:

Basic test configuration with all required parameters

Scenario-level test references

Inline test definitions within scenarios

The TOML formatting and parameter examples are clear and actionable.

Minor note: Lines 28 and 63 show hf_token = "hf_xxx" as a placeholder, which is more informative than the empty value used in conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml. Consider documenting when the token is required versus optional.

66-81: LGTM! Proper API documentation setup.

The API documentation section correctly uses Sphinx autodoc directives to reference:

MegatronBridgeCmdArgs for command-line arguments

MegatronBridgeTestDefinition for test configuration

The module paths and autodoc options (:members:, :show-inheritance:) are properly configured.

greptile-apps · 2025-12-31T18:03:44Z

Greptile Summary

This PR adds comprehensive documentation for the MegatronBridge workload to CloudAI. The documentation includes usage examples (test TOML, test scenario, and test-in-scenario patterns), command arguments reference, and test definition API docs.

Key changes:

Added doc/workloads/megatron_bridge.rst with complete usage examples showing how to configure MegatronBridge tests
Updated doc/workloads/index.rst to include MegatronBridge in the workloads table
Modified hf_token in example TOML from placeholder to empty string

Issue found:

The empty hf_token in the TOML config will cause runtime validation errors, as the validate_hf_token method requires a non-empty token string

Confidence Score: 3/5

This PR is mostly safe but contains one critical configuration issue that will cause runtime failures.
The documentation additions are well-written and complete, but the TOML configuration change introduces a breaking validation error. The empty hf_token will fail the pydantic validator at runtime, preventing the test from running.
Pay close attention to conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml: the empty hf_token must be fixed before this can be used.

Important Files Changed

Filename	Overview
doc/workloads/megatron_bridge.rst	New documentation file for MegatronBridge workload with usage examples and API references - well-structured and complete
doc/workloads/index.rst	Added MegatronBridge entry to workloads table with correct Slurm support indicator
conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml	Changed hf_token to empty string, which will cause runtime validation failure

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant MegatronBridge
    participant Slurm
    participant Container
    participant NeMoRun
    participant GPU

    User->>CloudAI: Submit test with TOML config
    CloudAI->>CloudAI: Parse test definition
    CloudAI->>CloudAI: Validate cmd_args (including hf_token)
    CloudAI->>MegatronBridge: Initialize test with MegatronBridgeCmdArgs
    MegatronBridge->>MegatronBridge: constraint_check() validates parallelism
    MegatronBridge->>CloudAI: Install dependencies (Docker image, NeMo-Run, Megatron-Bridge repo)
    CloudAI->>Slurm: Generate and submit Slurm job
    Slurm->>Container: Launch container with model config
    Container->>NeMoRun: Execute setup_experiment.py
    NeMoRun->>GPU: Initialize distributed training (TP/PP/CP/DP)
    GPU->>GPU: Run pretrain/finetune task
    GPU->>Container: Write training logs
    Container->>CloudAI: Return job results
    CloudAI->>User: Report training metrics
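
The `constraint_check()` step in the diagram is described only as validating parallelism. As a rough illustration of what such a check commonly involves in Megatron-style training, the sketch below derives the data-parallel size from the GPU count and the TP/PP/CP sizes and rejects combinations that do not divide evenly. The function name and error messages are hypothetical and are not taken from `megatron_bridge.py`; the real check may enforce different or additional rules.

```python
def data_parallel_size(num_gpus: int, tp: int, pp: int, cp: int) -> int:
    """Return the implied data-parallel (DP) size, or raise if the layout is invalid.

    Hypothetical helper: assumes DP = num_gpus / (TP * PP * CP), the usual
    Megatron-style decomposition. Not the actual CloudAI constraint_check().
    """
    if min(num_gpus, tp, pp, cp) < 1:
        raise ValueError("all sizes must be positive integers")

    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError(
            f"num_gpus={num_gpus} is not divisible by TP*PP*CP={model_parallel}"
        )
    return num_gpus // model_parallel
```

For example, 16 GPUs with TP=2, PP=2, CP=1 leave a data-parallel size of 4, while 8 GPUs with TP=3 would be rejected.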

greptile-apps

Additional Comments (1)

conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml, line 35

logic: Empty hf_token triggers validation error at runtime (megatron_bridge.py:85-91 raises ValueError: "cmd_args.hf_token is required"). Revert to placeholder or document that users must set this before running.
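
To make the reported behavior concrete, here is a minimal standard-library sketch of a required-token check. It is not the code at `megatron_bridge.py:85-91`: the real class is a pydantic model, whereas `CmdArgsSketch` is a hypothetical dataclass stand-in that models only `hf_token`. Only the error message is taken from the comment above; treating whitespace-only values as empty is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CmdArgsSketch:
    """Hypothetical stand-in for MegatronBridgeCmdArgs; only hf_token is modeled."""

    hf_token: str = ""

    def __post_init__(self) -> None:
        # Assumption: whitespace-only values count as empty.
        if not self.hf_token.strip():
            raise ValueError("cmd_args.hf_token is required")


def is_valid(token: str) -> bool:
    """Return True when the token passes the check, False otherwise."""
    try:
        CmdArgsSketch(hf_token=token)
    except ValueError:
        return False
    return True
```

With this sketch, `is_valid("hf_xxx")` passes and `is_valid("")` fails, which is the situation the empty string in the committed TOML produces until a user fills in a token.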

3 files reviewed, 1 comment

srivatsankrishnan · 2025-12-31T20:05:46Z

@greptile

This is by design. The HF token is something each user has to get from Hugging Face, and it's a secret key that cannot be distributed. Hence the runtime validation error forces users to set it.
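
One way a user could satisfy this requirement without writing the secret into a committed file is to fall back to an environment variable. The sketch below is an illustration only: `resolve_hf_token` is a hypothetical helper, and nothing in this PR says CloudAI reads `HF_TOKEN` itself (that name is simply the variable the Hugging Face tooling conventionally uses).

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(configured: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the token from the config value, else from the HF_TOKEN variable.

    Hypothetical helper, not part of CloudAI. Raises the same message as the
    validator discussed in this thread when no token is available.
    """
    env = os.environ if env is None else env
    token = configured.strip() or env.get("HF_TOKEN", "").strip()
    if not token:
        raise ValueError("cmd_args.hf_token is required")
    return token
```

A value set in the TOML takes precedence; the environment is consulted only when the configured value is empty, and the error is still raised if neither source provides a token.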

greptile-apps

I understand - this is intentional behavior to force users to provide their own HuggingFace token rather than distributing a placeholder that could be accidentally committed. The runtime validation error is a security feature, not a bug.

However, for better user experience, you might want to consider updating the documentation to make this clearer. The current documentation shows hf_token = "hf_xxx" but doesn't explain that this is a required field that users must obtain from HuggingFace.

Would you like me to suggest documentation improvements to make this requirement more explicit for users?

greptile-apps

Got it, thanks for the clarification! That makes perfect sense - HF tokens are sensitive credentials that shouldn't be committed to the repository. The runtime validation error is indeed the correct behavior to ensure users provide their own token.

The empty string approach is actually better than a placeholder like "REPLACE_ME_WITH_HF_TOKEN" since it's more obvious that the field needs to be filled and prevents accidental attempts to use a placeholder value.

The documentation you've added clearly shows the expected format (hf_token = "hf_xxx"), which should help users understand they need to provide their own token. This approach maintains security while providing clear guidance.

srivatsankrishnan · 2025-12-31T20:07:18Z

Thanks. You can close this comment then.

srivatsankrishnan added 2 commits December 31, 2025 09:58

documentation

23d7e43

fix the hf_token

5bb5d53

srivatsankrishnan marked this pull request as ready for review December 31, 2025 18:01

srivatsankrishnan requested review from alexmanle, amaslenn and jeffnvidia as code owners December 31, 2025 18:01

greptile-apps bot reviewed Dec 31, 2025

alexmanle approved these changes Jan 2, 2026

srivatsankrishnan merged commit 33ab41e into NVIDIA:main Jan 2, 2026
5 checks passed
