@srivatsankrishnan
Contributor

Summary

Documentation for Megatron-Bridge.

Test Plan

- CI/CD

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Dec 31, 2025

📝 Walkthrough

This pull request introduces the MegatronBridge workload to the project by clearing a test configuration token placeholder, adding a workload entry to the documentation index, and creating a new documentation file with usage examples and API references.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration Updates (conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml) | Cleared hf_token field from placeholder value to empty string |
| Documentation Additions (doc/workloads/index.rst) | Added megatron_bridge entry to Available Workloads table with status indicators (✅, ❌, ❌, ❌) |
| Documentation: New Workload Guide (doc/workloads/megatron_bridge.rst) | New documentation file with usage examples, TOML configuration snippets, and API documentation referencing MegatronBridgeCmdArgs and MegatronBridgeTestDefinition |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A bridge for Megatron, grand and wide,
Documentation flows with workload pride,
Config cleaned, the index grows,
New examples bloom where guidance flows! ✨

Pre-merge checks

❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'M bridge Documentation' is vague and uses an abbreviation ('M bridge') that doesn't clearly convey the specific change; it's partially related but lacks clarity about what 'M bridge' refers to. | Expand the title to be more specific and clear, such as 'Add Megatron Bridge workload documentation' to better describe the changeset. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description is related to the changeset, mentioning documentation for Megatron-Bridge and referencing CI/CD testing, which aligns with the actual changes. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99f9158 and 5bb5d53.

📒 Files selected for processing (3)
  • conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml
  • doc/workloads/index.rst
  • doc/workloads/megatron_bridge.rst
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-23T00:23:16.200Z
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 764
File: src/cloudai/workloads/megatron_bridge/megatron_bridge.py:98-101
Timestamp: 2025-12-23T00:23:16.200Z
Learning: In src/cloudai/workloads/megatron_bridge/megatron_bridge.py, the nemo_run_repo GitRepo uses commit="main" intentionally. Nemo Run is a Slurm executor (not a framework) used by Megatron Bridge to launch recipes, and tracking the main branch is acceptable for this dependency.

Applied to files:

  • doc/workloads/megatron_bridge.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🔇 Additional comments (4)
doc/workloads/index.rst (1)

21-21: LGTM! Entry follows the correct format.

The new megatron_bridge workload entry is properly formatted as a doc reference and correctly positioned alphabetically. The status indicators (✅, ❌, ❌, ❌) appropriately reflect Slurm-only support, consistent with similar workloads.

Note: Line 20 shows "MegatronRun" without :doc: markup, while the new entry correctly uses :doc:`megatron_bridge`. This suggests MegatronRun may not have documentation yet, which is a pre-existing inconsistency.

doc/workloads/megatron_bridge.rst (3)

1-5: LGTM! Clear and concise introduction.

The header and description effectively introduce the MegatronBridge workload, correctly identifying the test_template_name and its purpose for training and finetuning tasks.


7-64: Excellent examples covering multiple usage patterns.

The three usage examples (Test TOML, Test Scenario, and Test-in-Scenario) are comprehensive and well-structured. They clearly demonstrate:

  • Basic test configuration with all required parameters
  • Scenario-level test references
  • Inline test definitions within scenarios

The TOML formatting and parameter examples are clear and actionable.

Minor note: Lines 28 and 63 show hf_token = "hf_xxx" as a placeholder, which is more informative than the empty value used in conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml. Consider documenting when the token is required versus optional.


66-81: LGTM! Proper API documentation setup.

The API documentation section correctly uses Sphinx autodoc directives to reference:

  • MegatronBridgeCmdArgs for command-line arguments
  • MegatronBridgeTestDefinition for test configuration

The module paths and autodoc options (:members:, :show-inheritance:) are properly configured.


@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review December 31, 2025 18:01
@greptile-apps
Contributor

greptile-apps bot commented Dec 31, 2025

Greptile Summary

This PR adds comprehensive documentation for the MegatronBridge workload to CloudAI. The documentation includes usage examples (test TOML, test scenario, and test-in-scenario patterns), command arguments reference, and test definition API docs.

Key changes:

  • Added doc/workloads/megatron_bridge.rst with complete usage examples showing how to configure MegatronBridge tests
  • Updated doc/workloads/index.rst to include MegatronBridge in the workloads table
  • Modified hf_token in example TOML from placeholder to empty string

Issue found:

  • The empty hf_token in the TOML config will cause runtime validation errors, as the validate_hf_token method requires a non-empty token string

Confidence Score: 3/5

  • This PR is mostly safe but contains one critical configuration issue that will cause runtime failures
  • The documentation additions are well-written and complete, but the TOML configuration change introduces a breaking validation error. The empty hf_token will fail the pydantic validator at runtime, preventing the test from running.
  • Pay close attention to conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml - the empty hf_token must be fixed before this can be used

Important Files Changed

| Filename | Overview |
| --- | --- |
| doc/workloads/megatron_bridge.rst | New documentation file for MegatronBridge workload with usage examples and API references - well-structured and complete |
| doc/workloads/index.rst | Added MegatronBridge entry to workloads table with correct Slurm support indicator |
| conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml | Changed hf_token to empty string, which will cause runtime validation failure |

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant MegatronBridge
    participant Slurm
    participant Container
    participant NeMoRun
    participant GPU

    User->>CloudAI: Submit test with TOML config
    CloudAI->>CloudAI: Parse test definition
    CloudAI->>CloudAI: Validate cmd_args (including hf_token)
    CloudAI->>MegatronBridge: Initialize test with MegatronBridgeCmdArgs
    MegatronBridge->>MegatronBridge: constraint_check() validates parallelism
    MegatronBridge->>CloudAI: Install dependencies (Docker image, NeMo-Run, Megatron-Bridge repo)
    CloudAI->>Slurm: Generate and submit Slurm job
    Slurm->>Container: Launch container with model config
    Container->>NeMoRun: Execute setup_experiment.py
    NeMoRun->>GPU: Initialize distributed training (TP/PP/CP/DP)
    GPU->>GPU: Run pretrain/finetune task
    GPU->>Container: Write training logs
    Container->>CloudAI: Return job results
    CloudAI->>User: Report training metrics
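
The `constraint_check()` step in the diagram is not shown in this PR. As a rough illustration of the kind of parallelism check it might perform, the sketch below assumes the common Megatron-style rule that the GPU count must be divisible by TP × PP × CP, with the quotient being the data-parallel size. The function name, signature, and error messages are hypothetical and are not taken from CloudAI.

```python
def data_parallel_size(num_gpus: int, tp: int, pp: int, cp: int = 1) -> int:
    """Derive the data-parallel size, or raise if the layout does not fit."""
    if min(num_gpus, tp, pp, cp) < 1:
        raise ValueError("num_gpus, tp, pp and cp must all be positive")
    model_parallel = tp * pp * cp
    # The GPUs must split evenly into model-parallel groups.
    if num_gpus % model_parallel != 0:
        raise ValueError(
            f"num_gpus={num_gpus} is not divisible by tp*pp*cp={model_parallel}"
        )
    return num_gpus // model_parallel
```

Rejecting an inconsistent layout at this point is cheaper than discovering it after a Slurm job has been queued and a container launched.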

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. conf/experimental/megatron_bridge/test/megatron_bridge_qwen_30b.toml, line 35 (link)

    logic: Empty hf_token triggers validation error at runtime (megatron_bridge.py:85-91 raises ValueError: "cmd_args.hf_token is required"). Revert to placeholder or document that users must set this before running.
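
For readers unfamiliar with the check being discussed, here is a dependency-free sketch of the behavior described above. The real validator lives in megatron_bridge.py and, per this review, is a pydantic validator named `validate_hf_token`; the dataclass below is a hypothetical stand-in that only reproduces the quoted error message. Treating whitespace-only tokens as empty is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CmdArgsSketch:
    """Hypothetical stand-in for MegatronBridgeCmdArgs (one field only)."""

    hf_token: str = ""

    def __post_init__(self) -> None:
        # Assumption: whitespace-only tokens count as empty.
        if not self.hf_token.strip():
            raise ValueError("cmd_args.hf_token is required")


if __name__ == "__main__":
    try:
        CmdArgsSketch(hf_token="")
    except ValueError as err:
        print(f"rejected: {err}")
```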

3 files reviewed, 1 comment


@srivatsankrishnan
Contributor Author

srivatsankrishnan commented Dec 31, 2025

@greptile

This is by design. The HF token is a secret key that each user must obtain from Hugging Face, and it cannot be distributed. The runtime validation error therefore forces users to set it themselves.
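
A pattern that fits this design is to leave `hf_token` empty in the committed TOML and inject the secret at run time. The sketch below is an assumption about how a wrapper script could do that, not something CloudAI is known to provide: `resolve_hf_token` is a hypothetical helper, and `HF_TOKEN` is used because it is the environment variable Hugging Face tooling conventionally reads.

```python
import os
from collections.abc import Mapping


def resolve_hf_token(configured: str, env: Mapping[str, str] = os.environ) -> str:
    """Prefer the configured token; otherwise fall back to HF_TOKEN in env."""
    token = configured.strip() or env.get("HF_TOKEN", "").strip()
    if not token:
        # Same message as the validation error quoted in this thread.
        raise ValueError("cmd_args.hf_token is required")
    return token
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.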

@greptile-apps greptile-apps bot left a comment

I understand - this is intentional behavior to force users to provide their own HuggingFace token rather than distributing a placeholder that could be accidentally committed. The runtime validation error is a security feature, not a bug.

However, for better user experience, you might want to consider updating the documentation to make this clearer. The current documentation shows hf_token = "hf_xxx" but doesn't explain that this is a required field that users must obtain from HuggingFace.

Would you like me to suggest documentation improvements to make this requirement more explicit for users?

@greptile-apps greptile-apps bot left a comment

Got it, thanks for the clarification! That makes perfect sense - HF tokens are sensitive credentials that shouldn't be committed to the repository. The runtime validation error is indeed the correct behavior to ensure users provide their own token.

The empty string approach is actually better than a placeholder like "REPLACE_ME_WITH_HF_TOKEN" since it's more obvious that the field needs to be filled and prevents accidental attempts to use a placeholder value.

The documentation you've added clearly shows the expected format (hf_token = "hf_xxx"), which should help users understand they need to provide their own token. This approach maintains security while providing clear guidance.

@srivatsankrishnan
Contributor Author

Thanks. You can close this comment then.

@srivatsankrishnan srivatsankrishnan merged commit 33ab41e into NVIDIA:main Jan 2, 2026
5 checks passed