Releases · NVIDIA-NeMo/Megatron-Bridge
NVIDIA Megatron-Bridge 0.3.0
Highlights
- Model Collection Support
- Performance
- NVFP4 support for Llama3 models.
- HybridEP support for NVL8 systems (PR#494)
- MLA performance improvement with cudnn layernorm and cudnn 9.18
- LN+MXFP8 quantization fusion with TE.sequence and cudnn backend
- Supports FSDP for MoE models with MXFP8 (PR#2135, PR#2239)
- Support Muon Optimizer (PR#683)
- NVFP4 Llama Playbook (PR#1409)
- Training & Functionality
- LoRA Bridge (initial): RL LoRA support for VeRL / nemo-rl (PR#1766)
- Multi-token prediction (MTP): Qwen3 dense examples (PR#2138)
- Decentralized parallel group (M4) end to end support and examples (PR#2011, examples)
- Context Parallelism (CP) with sequence packing in LLMs (PR#1867)
- Context Parallelism (CP) with sequence packing in VLMs (PR#1997)
- Callbacks integration (PR#2063)
- Low memory save for model importing from HF (fix Deepseek V3 and Kimi-K2 import) (PR#1949)
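Context parallelism with sequence packing (the CP entries above) relies on concatenating variable-length sequences into one buffer and tracking boundaries with cumulative sequence lengths (`cu_seqlens`), so attention kernels can mask each sequence independently. A minimal sketch of the packing step; the function name and padding scheme are illustrative, not Megatron-Bridge APIs:

```python
def pack_sequences(seqs, pad_id=0, max_len=16):
    """Concatenate variable-length token sequences into one buffer,
    recording cumulative boundaries (cu_seqlens) so downstream
    attention can treat each sequence independently."""
    packed, cu_seqlens = [], [0]
    for seq in seqs:
        if len(packed) + len(seq) > max_len:
            break  # a real packer would open a new bin here
        packed.extend(seq)
        cu_seqlens.append(len(packed))
    packed.extend([pad_id] * (max_len - len(packed)))  # pad to fixed length
    return packed, cu_seqlens

packed, cu = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=12)
# cu marks sequence boundaries at offsets 0, 3, 5, 9
```

With context parallelism, each CP rank then works on a shard of this packed buffer, using `cu_seqlens` to keep per-sequence attention correct.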
- Community Contributions
- @HollowMan6: MoE router weight adapter wrapper (PR#1834), temporary disable adapter support (PR#1811), flexible LoRA target_modules (PR#1799), separate layernorm mappings (PR#1808), shared_experts MoE fix (PR#1800), LoRA split QKV with GQA fix (PR#1818), Moonlight/Kimi rotary_emb export fix (PR#1838), configurable use_arbitrary_attention_mask (PR#1807)
- @Hayak3: Fix Qwen3-VL unsupported normalization arg (PR#1970)
- @shaltielshmid: Disable FP8 during CPU initialization for export (PR#1815)
- @therealnaveenkamal: MLFlow integration (PR#2112)
- @kannankumar: Fill-in-the-Middle (FIM) dataset support (PR#2066)
- A big thank you to our community contributors for their valuable support!
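The Fill-in-the-Middle (FIM) dataset support contributed above trains a model to infill: a sample is split into prefix/middle/suffix and reordered with sentinel tokens (the PSM layout). A minimal sketch; the sentinel strings and split policy here are illustrative placeholders, not the tokens any particular tokenizer uses:

```python
import random

def apply_fim(tokens, rng, fim_rate=0.5,
              pre="<fim_prefix>", mid="<fim_middle>", suf="<fim_suffix>"):
    """Fill-in-the-Middle transform (PSM layout): split a sample into
    prefix/middle/suffix and reorder so the model learns infilling.
    Sentinel strings are illustrative placeholders."""
    if rng.random() > fim_rate:
        return tokens  # leave sample in ordinary left-to-right order
    lo, hi = sorted(rng.sample(range(len(tokens)), 2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    return [pre] + prefix + [suf] + suffix + [mid] + middle

rng = random.Random(0)
out = apply_fim(list("abcdef"), rng, fim_rate=1.0)
# all original tokens survive; the model predicts the middle last
```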
Changelog Details
- concise naming | weak scaling | save cfg to file by @malay-nagda :: PR: #1246
- cg_scope valid list and default none by @malay-nagda :: PR: #1264
- chore: Merge fp8 args by @ko3n1g :: PR: #1279
- cg and nan grad norm fix by @malay-nagda :: PR: #1309
- feat: Support PEFT weight mapping and merge LoRA adapters when export to hf by @HollowMan6 :: PR: #1310
- Add Nemotron nano v2 vl by @cuichenx :: PR: #1136
- Replay "Ko3n1g/ci/cleanup recipe evaluator (#1349)" by @ko3n1g :: PR: #1377
- Gemma3 VL LoRA Recipe + Documentations by @suiyoubi :: PR: #1388
- Add GLM4.5 FT Recipe by @suiyoubi :: PR: #1382
- Adding FLA as dependency for Qwen3-Next by @adityavavreNVDA :: PR: #1359
- fix: default to ncclcomm overlap bootstrap backend by @ananthsub :: PR: #1395
- Add Qwen2/2.5 FT recipes by @ananthsub :: PR: #1385
- [PEFT/LoRA] fix: using ETP instead of TP for expert layers by @HollowMan6 :: PR: #1380
- Llama3 PEFT- 8B, 70B by @malay-nagda :: PR: #1381
- Add option for LoRA with Transformer Engine op fuser by @michal2409 :: PR: #1324
- [OMNIML-2937] Support Megatron Bridge quantized checkpoint export to HF unified checkpoint by @yueshen2016 :: PR: #1302
- HybridEP support by @erhoo82 :: PR: #1367
- expose option to dump config to file during end to end tests by @ananthsub :: PR: #1400
- [OMNIML-2935] PTQ support of MOE model (Qwen-3) on Megatron-Bridge by @yueshen2016 :: PR: #1405
- Revert "feat: Dependabot automerge if successful (#1051)" by @pablo-garay :: PR: #1428
- Update perf docs by @gautham-kollu :: PR: #1426
- Add Qwen3VL support (dense and moe) by @yashaswikarnati :: PR: #1174
- Fix llama3-8b NVFP4 recipe by @adityavavreNVDA :: PR: #1347
- fix GPT-OSS perf scripts by @erhoo82 :: PR: #1438
- Add functional test for finetuning with sequence packing by @ananthsub :: PR: #861
- feat: Pass custom srun args into Run by @ko3n1g :: PR: #1440
- Fix typo in dataclass from callable => typing.Callable in nemotron_h_provider.py by @shaltielshmid :: PR: #1442
- pass the support of deepep for B200 and B300 GPUs by @erhoo82 :: PR: #1436
- cuda graph fine grained scope | hybridEP | a2a overlap by @malay-nagda :: PR: #1348
- nvfp4 for dense models by @sanandaraj5597 :: PR: #1453
- Added Qwen 3 next perf scripts by @sanandaraj5597 :: PR: #1451
- reset gradient_accumulation_fusion with megatron fsdp by @ananthsub :: PR: #1386
- guard trust_remote_code by @dimapihtar :: PR: #1291
- fix lint checks on main by @ananthsub :: PR: #1463
- DSv3- gb200 base cfg fix | b200 no a2a overlap by @malay-nagda :: PR: #1476
- sequence_length -> seq_length by @dimapihtar :: PR: #1023
- feat: Add whitelist support for mismatched params in load_hf_weights by @yaoyu-33 :: PR: #1447
- [docs] Update readme with supported models/recipes by @ananthsub :: PR: #1455
- Add Gemma2 recipes by @ananthsub :: PR: #1383
- [docs] Add release section for changelog and software component versions by @ananthsub :: PR: #1490
- [docs] Add 0.2.0 version picker by @ananthsub :: PR: #1488
- Reduced precision (BF16, FP8, MXFP8, NVFP4) training tutorial using Megatron-Bridge by @sergiopperez :: PR: #1409
- Update conversion compare script and add accelerate dependency by @yaoyu-33 :: PR: #1344
- [main] Fix functional conftest to handle optional nvdlfw-inspect dependency by @ananthsub :: PR: #1496
- [docs] Update supported model docs by @ananthsub :: PR: #1503
- fix: Escape user inputs in data tutorials by @ananthsub :: PR: #1465
- Bridge instantiate_utils: drop unexpected config keys with warning by @yaoyu-33 :: PR: #1203
- Make container image point to last known release container by @gautham-kollu :: PR: #1443
- Revamp recipe tutorials by @ananthsub :: PR: #1308
- [docs] 25.11 release notes by @ananthsub :: PR: #1504
- Add generic scripts for training by @ananthsub :: PR: #1390
- Nemotron nano v2 finetune by @cuichenx :: PR: #1391
- Replay: M4 Remove parallel state usage in train loops, train steps and utils #1175 + Bug fix by @yaoyu-33 :: PR: #1445
- track dtype in scatter to tp ranks by @ananthsub :: PR: #1509
- Update performance scripts to align with llmb requirements by @scsudhakaran :: PR: #1416
- fix qwen3_vl by changing sequence_length to seq_length by @shifangx :: PR: #1511
- Update GPT-OSS pretrain config parameters by @cuichenx :: PR: #1375
- feat: mcore trigger mbridge by @pablo-garay :: PR: #1441
- fix: cleanup by @pablo-garay :: PR: #1540
- Revert strong-scaling support for DeepSeek-V3 by @scsudhakaran :: PR: #1548
- Add fallback for shared embedding flag by @yaoyu-33 :: PR: #1521
- Wan Bridge (checkpoints conversion) by @huvunvidia :: PR: #1550
- feat: defer flop calculation to model_provider "get_num_floating_point_operations" if provided by @yaoyu-33 :: PR: #1446
- refactor: Unify launchers by @ko3n1g :: PR: #1519
- bug fixes- unify launchers by @malay-nagda :: PR: #1573
- ci: Bump MCore and ModelOpt by @chtruong814 :: PR: #1551
- docs: Update documentation.md to include install submodules command by @chenopis :: PR: #1576
- fix: Fix load failure when load_megatron_model loads a model trained with uneven pp by @yaoyu-33 :: PR: #1579
- Added 25.11 starter pack by @sanandaraj5597 :: PR: #1596
- fix: Wandb mocking by @ko3n1g :: PR: #1587
- fix: Use model seq length as default if no CLI is provided by @ko3n1g :: PR: #1600
- scripts: Update help string of args.detach by @ko3n1g :: PR: #1589
- ci: Add DGXC executor by @ko3n1g :: PR: #1584
- fix: Fix model parallel initialization ordering by @yaoyu-33 :: PR: #1574
- fix: Missing return of parse_additional_slurm_params by @ko3n1g :: PR: #1619
- Add fix for users who want to provide a path on disk to a custom HF tokenizer by @jstjohn :: PR: #1594
- fix: wandb exp name in recipe path by @ko3n1g :: PR: #1623
- Rename TensorRT Model Optimizer to Model Optimizer by @AAnoosheh :: PR: #1484
- Cleanup partial CG objects by @gautham-kollu :: PR: #1615
- [Canonical LoRA] fix: use correct q_out_features for linear_q by @HollowMan6 :: PR: #1627
- [Canonical LoRA] fix: forward under expert layers by @HollowMan6 :: PR: #1628
- qwen3 235b config update by @malay-nagda :: PR: #1613
- chore: Update codeowners of performance scripts by @ko3n1g :: PR: #1641
- Re-use higher-level config override util in tutorials by @ananthsub :: PR: #1524
- docs: add wayfinder readme.md files for each docs directory by @chenopis :: PR: #1617
- ci: Fix DGXC env vars by @ko3n1g :: PR: #1629
- Support strong scaling ...
NVIDIA Megatron-Bridge 0.2.2
- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.
NVIDIA Megatron-Bridge 0.2.1
- Performance
- Activation offloading to host memory support with pipelining
- Supports the high activation memory needs of MoE models training with dynamic shapes
- Fixed the Nemotron model FLOPs calculation
- Model Collection Support
- Ministral 3
- Enhanced LoRA support
- LoRA support for Mamba layers (for Nemotron Nano V2 and NemotronH finetuning)
NVIDIA Megatron-Bridge 0.2.0
- LLM
- HuggingFace Conversion + training recipes:
- GPT-OSS
- Qwen3 Next
- Nemotron-H
- Nemotron Nano v2
- Moonlight
- OlMoE
- GLM 4.5
- Gemma 3
- HuggingFace conversion support:
- Llama Nemotron
- Mistral
- Gemma
- Gemma 2
- VLM
- Nemotron Nano v2 VL
- Qwen 3 VL
- Qwen2.5 VL
- Gemma3 VL
- Megatron-Bridge support for new benchmarks
- Benchmarks (same workloads as GB200 system) for GB300 system
- GPT-OSS 120B
- Qwen3-Next 80B_A3B
- Support for linear attention on Blackwell - Gated Delta Networks
- Pre-training with NVFP4 precision: Llama3 8B, Llama3 70B, Llama3.1 405B
- Megatron-Bridge support for benchmarks previously existing only for NeMo 2.0
- Nemotron-H 56B
- Fine-tuning (SFT and LoRA): Llama3 8B and Llama3 70B
- HybridEP: DeepSeek V3 benchmarks on GB200 and GB300 systems now use HybridEP
- CUDA Graphs
- Full-model iteration CUDA graph used for dense models: Llama3 8B, Llama3 70B, Llama3.1 405B
- Fine-grained Transformer component specific CUDA Graphs used for MoE models
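NVFP4 pre-training stores tensors in 4-bit floating point with per-block scale factors, so a shared scale lets the tiny value grid cover each block's dynamic range. A toy sketch of block-scaled quantization to convey the idea only; real NVFP4 uses E2M1 values with hardware-defined block sizes and FP8 scales, not the integer grid below:

```python
def quantize_block(x, block=4, qmax=6.0):
    """Toy block-scaled quantization: each block shares one scale so a
    few-bit value grid covers the block's dynamic range. Illustrative
    only -- not the actual NVFP4 (E2M1 + FP8 scale) format."""
    out = []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        amax = max(abs(v) for v in blk) or 1.0
        scale = amax / qmax                    # per-block scale factor
        q = [round(v / scale) for v in blk]    # quantize to integer grid
        out.extend(v * scale for v in q)       # dequantize to inspect error
    return out

vals = [0.1, -0.5, 2.0, 1.5, 100.0, -50.0, 25.0, 10.0]
deq = quantize_block(vals)
# per-block scaling keeps both the small and the large block representable
```

Per-tensor scaling would either clip the 100.0 or crush the 0.1; per-block scales are what make 4-bit training precision viable.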
- NVIDIA Model Optimization Integration
- Knowledge Distillation
- Post training quantization export
- Quantization aware training
- Support for expert layers
- Supported merging adapters for export to HuggingFace @HollowMan6
- Finetuning dataset improvements: OpenAI messages format conversion, chat template support
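The OpenAI messages conversion above turns a list of role-tagged chat messages into a single training string. A minimal sketch of the idea; the role tags below are an illustrative template, not the chat template any particular tokenizer ships with:

```python
def messages_to_text(messages):
    """Render OpenAI-style chat messages into one training string.
    The role tags here are an illustrative template only; real runs
    apply the tokenizer's own chat template."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}")
    return "\n".join(parts) + "\n<|end|>"

sample = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
text = messages_to_text(sample)
```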
- Integration with NVIDIA-DLFW-Inspect for tensor statistics collection & monitoring
- Broader Community Adoption: integrated Megatron-Bridge into the training pipelines of VeRL (PR), Slime (PR), and Sky-RL (PR).
- Special thanks to the community contributors for this release: @HollowMan6, @fzyzcjy, @erictang000, @hawkoli1987.
NVIDIA Megatron-Bridge 0.1.0rc4
- Fix docs build
- Update performance scripts
NVIDIA Megatron-Bridge 0.1.0rc3
- Model Collection Support
- Llama
- Qwen 2, Qwen 3, Qwen 3 MoE
- DeepSeek
- Mamba
- Migration guide from NeMo 2 to Megatron-Bridge
- Contribution guide for adding a new model
- Checkpoint conversion from Hugging Face to Megatron
- Performance
- MoE LLM
- Change the model to dropless with balanced gating
- Fusion of operators in router function
- Global permutation fusion with A2A dispatcher
- EP A2A communication overlap with computation in both 1F1B pipelining and non-pipelined training
- Precision-aware optimizer update to support BF16 states
- Megatron FSDP
- Migration from mcore FSDP to megatron FSDP
- Fusion of weight gradient copy to reduce-scatter communication buffer to WGRAD GEMM
- Removed redundant optimizer operations
- Use Zero1 (opt and master param sharding) in the replica domain of hybrid FSDP to further lower memory usage
- IB-SHARP support for the IB AllReduce of hybrid FSDP in a patch with NCCL2.28
- MXFP8
- Improved act grad all-gather overlap performance via userbuffer
- Parameter all-gather overlap with computation while the communication buffer sharing with reduce-scatter
- Fusion of MXFP8 scaling factor swizzling kernels
- Use PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead
- Others
- Full iteration cuda graph for dense model without pipelining
- Fusion of activation and cast operations (currently tensor-wise scaling only)
- Store SwiGLU input in FP8 to save activation memory
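The "dropless with balanced gating" change above refers to top-k MoE routing where no token is discarded: each token picks its k highest-scoring experts and the selected scores are normalized into combine weights, while expert batch sizes vary per step instead of being capped. A pure-Python sketch of the gating math only; real routers run this batched on device with auxiliary load-balancing losses:

```python
import math

def topk_router(logits, k=2):
    """Top-k MoE gating: each token keeps its k highest-scoring experts,
    and the kept scores are softmax-normalized into combine weights.
    Dropless routing means no token is discarded for capacity reasons."""
    routed = []
    for row in logits:  # one row of expert scores per token
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        exps = [math.exp(row[e]) for e in top]
        denom = sum(exps)
        routed.append([(e, w / denom) for e, w in zip(top, exps)])
    return routed

out = topk_router([[2.0, 0.5, 1.0, -1.0]], k=2)
# the token routes to experts 0 and 2; weights sum to 1
```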
NVIDIA Megatron-Bridge 0.1.0a0
- Llama and Qwen
- Pretrain/SFT
- PEFT
- Recipe structure with examples for plain python & NeMo Run usage
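The PEFT support in this first release is LoRA-style adaptation: the base weight W is frozen and only a low-rank update scaled by alpha/r is trained. A pure-Python sketch of the forward math under those standard LoRA conventions; the matrix-vector helper and shapes are illustrative:

```python
def lora_forward(x, W, A, B, alpha=16, r=2):
    """LoRA forward pass: y = W @ x + (alpha / r) * B @ (A @ x).
    W is frozen; only the low-rank factors A (r x d_in) and
    B (d_out x r) are trained. Pure-Python math for illustration."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Standard LoRA init zeroes B, so training starts from the base model.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5], [0.25, -0.25]]
B_zero = [[0.0, 0.0], [0.0, 0.0]]
y = lora_forward(x, W, A, B_zero)
# with B zeroed, the adapted layer reproduces the base layer exactly
```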