
feat: megatron bridge adaptation #1056

Open
gursimar wants to merge 2 commits into inclusionAI:main from gursimar:megatron-bridge-adaptation

Conversation

@gursimar
Contributor

@gursimar gursimar commented Mar 19, 2026

Description

First PR for the adaptation of Megatron-Bridge into AReaL. See RFC #1055 for more details.

Implementation details

1. Introduced a new parameter, bridge_type, whose default value is mbridge, so it does not break or change the flow of existing code.

actor:
  megatron:
    bridge_type: mbridge 

2. Implemented megatron-bridge model creation when bridge_type=megatron-bridge

  • make_mcore_model supports model creation based on bridge_type
  • get_bridge function allows creation of appropriate bridge type based on yaml param
  • _save_model_to_hf and _load_model_from_hf functions are adapted to support megatron-bridge.
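The selection described above can be pictured with a minimal sketch. Everything here is hypothetical and for illustration only: `MegatronConfigSketch`, `get_bridge_sketch`, and the two stub classes are stand-ins, not the PR's actual `get_bridge` or the real mbridge / Megatron-Bridge objects. Only the key name `bridge_type` and its two values come from the description.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the two bridge backends. The real objects come
# from the mbridge and Megatron-Bridge libraries and are not reproduced here.
class MBridgeStub:
    name = "mbridge"


class MegatronBridgeStub:
    name = "megatron-bridge"


# Registry keyed by the value of actor.megatron.bridge_type in the YAML config.
_BRIDGES = {
    "mbridge": MBridgeStub,
    "megatron-bridge": MegatronBridgeStub,
}


@dataclass
class MegatronConfigSketch:
    # Defaulting to "mbridge" keeps existing configs on their current path.
    bridge_type: str = "mbridge"


def get_bridge_sketch(cfg: MegatronConfigSketch):
    """Return a bridge instance for cfg.bridge_type, or fail loudly."""
    try:
        return _BRIDGES[cfg.bridge_type]()
    except KeyError:
        valid = ", ".join(sorted(_BRIDGES))
        raise ValueError(
            f"unknown bridge_type {cfg.bridge_type!r}; expected one of: {valid}"
        ) from None


default_bridge = get_bridge_sketch(MegatronConfigSketch())
new_bridge = get_bridge_sketch(MegatronConfigSketch(bridge_type="megatron-bridge"))
print(default_bridge.name, new_bridge.name)
```

A config that omits `bridge_type` resolves to the mbridge stub, which is the backward-compatible behaviour the description claims; an unrecognised value raises instead of silently falling back.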

3. Testing

  • Qwen3-0.6B works as expected with TP=1, with TP=2, and with TP=2, PP=2.

4. Resolved dependency conflicts

5. Added docs pages

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):
No breaking change; the PR is backward compatible.

Additional Context



@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the Megatron-Bridge library into AReaL, providing an alternative backend for managing Megatron models alongside the existing mbridge. This enhancement offers greater flexibility in model conversion and creation workflows, allowing users to leverage different bridge implementations. The changes are designed to be fully backward compatible, ensuring that existing setups continue to operate seamlessly while enabling the adoption of the new bridge for future development and optimization.

Highlights

  • New Bridge Type Parameter: Introduced a new bridge_type parameter within MegatronEngineConfig to allow selection between mbridge (default) and megatron-bridge for model handling.
  • Megatron-Bridge Integration: Implemented comprehensive support for creating Megatron models using the megatron-bridge backend, including specific configurations for variable sequence lengths and P2P communication.
  • Adapted Model Persistence: Modified the model saving (_save_model_to_hf) and loading (_load_model_from_hf) functions to correctly interact with the megatron-bridge implementation.
  • Backward Compatibility: Ensured that the existing mbridge functionality remains the default, maintaining backward compatibility for current configurations.
  • Initial Testing: Conducted initial functional testing with Qwen3-0.6B across various Tensor Parallelism (TP) and Pipeline Parallelism (PP) setups (TP=1, TP=2, TP=2,PP=2).


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for megatron-bridge as an alternative to mbridge for creating Megatron models. This is controlled by a new bridge_type configuration parameter. The changes correctly adapt model creation, saving, and loading logic based on the selected bridge. The implementation looks solid, but I have a few suggestions to improve code quality by removing a debug print statement and refactoring some duplicated and repetitive code blocks.

@garrett4wade
Collaborator

Hi @gursimar, thanks for the contribution, but IMO the current form has several issues that should be addressed:

  1. Could you please update pyproject.toml to resolve the dependency conflict?
  2. Could you please add a minimal unit-test to test the integration and benchmark the speed of save/load using megatron bridge vs mbridge?
  3. If megatron-bridge is generally preferred or there are some trade-offs, we should draft a new document about this new feature.

@gursimar gursimar changed the title from "Megatron bridge adaptation" to "feat: megatron bridge adaptation" Mar 20, 2026
@gursimar gursimar force-pushed the megatron-bridge-adaptation branch from 7ec6c3b to f49f94b Compare March 20, 2026 21:38
@gursimar
Contributor Author

gursimar commented Mar 20, 2026

Accuracy was validated to be similar.
The experiment was run on 8 GPUs with Qwen3-0.6B on gsm-8k.
pp1_tp1_dp4
pp1_tp1_dp4_train

[Four SwanLab chart screenshots, captured 3-20-2026]

@gursimar
Contributor Author

gursimar commented Mar 20, 2026

@garrett4wade
Save/load benchmarking using the attached script

Using Qwen3-0.6B (TP=1, PP=1)

| baseline | runs | save (s) | load (s) | total (s) |
|---|---|---|---|---|
| mbridge-fast | 10 | 1.64 ±0.08 | 0.23 ±0.00 | 1.87 ±0.08 |
| mbridge-standard | 10 | 1.73 ±0.04 | 0.61 ±0.02 | 2.34 ±0.04 |
| megatron-bridge | 10 | 3.05 ±0.03 | 0.41 ±0.02 | 3.46 ±0.03 |

Using Qwen2.5-14B-Instruct (TP=1, PP=1)

| baseline | runs | save (s) | load (s) | total (s) |
|---|---|---|---|---|
| mbridge-fast | 10 | 84.14 ±1.47 | 5.52 ±0.19 | 89.66 ±1.43 |
| mbridge-standard | 10 | 86.69 ±1.27 | 6.38 ±0.25 | 93.07 ±1.23 |
| megatron-bridge | 10 | 86.68 ±0.88 | 5.43 ±0.11 | 92.11 ±0.90 |
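For readers who want to reproduce numbers in this shape, here is a rough sketch of a repeated-timing helper. It is not the attached benchmark_bridges.py; `time_runs` and `fmt` are hypothetical names, and treating the ± figure as a sample standard deviation over the runs is an assumption, since the comment does not say how it was computed.

```python
import statistics
import time


def time_runs(fn, runs=10):
    """Call fn `runs` times; return (mean, sample std) of wall-clock seconds.

    Needs runs >= 2, because a sample standard deviation is undefined for
    a single measurement.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def fmt(mean, std):
    """Format a measurement the way the tables above do, e.g. '1.64 ±0.08'."""
    return f"{mean:.2f} ±{std:.2f}"


if __name__ == "__main__":
    # Placeholder workload; a real benchmark would call the save or load path.
    mean, std = time_runs(lambda: sum(range(100_000)), runs=10)
    print("placeholder workload:", fmt(mean, std))
```

Save and load would each be wrapped in their own callable so that the two phases are timed separately, matching the separate columns in the tables.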

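megatron-bridge is slower on Qwen3-0.6B (3.46 s total vs. 1.87 s and 2.34 s for the two mbridge baselines) and roughly on par on Qwen2.5-14B-Instruct (92.11 s vs. 89.66 s and 93.07 s).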
Nevertheless, there are important reasons why megatron-bridge can still be useful. See RFC #1055
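To put the save/load gap into relative terms, the totals from the two tables can be turned into ratios. The snippet below only does arithmetic on the reported mean totals; the helper name `relative_to` is made up for this illustration, and the ± spreads are ignored.

```python
# Mean total (save + load) seconds, copied from the benchmark tables.
totals = {
    "Qwen3-0.6B": {
        "mbridge-fast": 1.87,
        "mbridge-standard": 2.34,
        "megatron-bridge": 3.46,
    },
    "Qwen2.5-14B-Instruct": {
        "mbridge-fast": 89.66,
        "mbridge-standard": 93.07,
        "megatron-bridge": 92.11,
    },
}


def relative_to(model, baseline):
    """Ratio of the megatron-bridge total to the given baseline's total."""
    rows = totals[model]
    return rows["megatron-bridge"] / rows[baseline]


for model in totals:
    for baseline in ("mbridge-fast", "mbridge-standard"):
        ratio = relative_to(model, baseline)
        print(f"{model}: megatron-bridge / {baseline} = {ratio:.2f}x")
```

On the small model the ratio is about 1.85x against mbridge-fast and about 1.48x against mbridge-standard; on the 14B model it is about 1.03x and 0.99x, so the difference mostly disappears once the checkpoint is large.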

I think it's better to address optimized save/load in a separate, future PR.

Script: benchmark_bridges.py

Let me know if something else needs to be addressed for merging this PR.

- tested megatron-bridge integration with TP, PP > 1, keeping mbridge backward compatibility
- darwin on x86_64 needs special handling because torch > 2.9.1 stops supporting it
- some package conflicts caused by megatron-bridge are resolved by overriding to previous versions
@gursimar gursimar force-pushed the megatron-bridge-adaptation branch from f49f94b to da01bc4 Compare March 20, 2026 23:40