Releases · NVIDIA-NeMo/Megatron-Bridge
NVIDIA Megatron-Bridge 0.3.0
Highlights
- Model Collection Support
- Performance
- NVFP4 support for Llama3 models.
- HybridEP support for NVL8 systems (PR#494)
- MLA performance improvement with cudnn layernorm and cudnn 9.18
- LN+MXFP8 quantization fusion with TE.sequence and cudnn backend
- Supports FSDP for MoE models with MXFP8 (PR#2135, PR#2239)
- Support Muon Optimizer (PR#683)
- NVFP4 Llama Playbook (PR#1409)
- Training & Functionality
- LoRA Bridge (initial): RL LoRA support for VeRL / nemo-rl (PR#1766)
- Multi-token prediction (MTP): Qwen3 dense examples (PR#2138)
- Decentralized parallel group (M4) end to end support and examples (PR#2011, examples)
- Context Parallelism (CP) with sequence packing in LLMs (PR#1867)
- Context Parallelism (CP) with sequence packing in VLMs (PR#1997)
- Callbacks integration (PR#2063)
- Low memory save for model importing from HF (fix Deepseek V3 and Kimi-K2 import) (PR#1949)
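Context parallelism with sequence packing (the CP entries above) relies on concatenating variable-length sequences into one buffer and tracking boundaries with cumulative sequence lengths (`cu_seqlens`), so attention kernels can mask each sequence independently. A minimal sketch of the packing step; the function name and padding scheme are illustrative, not Megatron-Bridge APIs:

```python
def pack_sequences(seqs, pad_id=0, max_len=16):
    """Concatenate variable-length token sequences into one buffer,
    recording cumulative boundaries (cu_seqlens) so downstream
    attention can treat each sequence independently."""
    packed, cu_seqlens = [], [0]
    for seq in seqs:
        if len(packed) + len(seq) > max_len:
            break  # a real packer would open a new bin here
        packed.extend(seq)
        cu_seqlens.append(len(packed))
    packed.extend([pad_id] * (max_len - len(packed)))  # pad to fixed length
    return packed, cu_seqlens

packed, cu = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=12)
# cu marks sequence boundaries at offsets 0, 3, 5, 9
```

With context parallelism, each CP rank then works on a shard of this packed buffer, using `cu_seqlens` to keep per-sequence attention correct.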
- Community Contributions
- @HollowMan6: MoE router weight adapter wrapper (PR#1834), temporary disable adapter support (PR#1811), flexible LoRA target_modules (PR#1799), separate layernorm mappings (PR#1808), shared_experts MoE fix (PR#1800), LoRA split QKV with GQA fix (PR#1818), Moonlight/Kimi rotary_emb export fix (PR#1838), configurable use_arbitrary_attention_mask (PR#1807)
- @Hayak3: Fix Qwen3-VL unsupported normalization arg (PR#1970)
- @shaltielshmid: Disable FP8 during CPU initialization for export (PR#1815)
- @therealnaveenkamal: MLFlow integration (PR#2112)
- @kannankumar: Fill-in-the-Middle (FIM) dataset support (PR#2066)
- A big thank you to our community contributors for their valuable support!
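The Fill-in-the-Middle (FIM) dataset support contributed above trains a model to infill: a sample is split into prefix/middle/suffix and reordered with sentinel tokens (the PSM layout). A minimal sketch; the sentinel strings and split policy here are illustrative placeholders, not the tokens any particular tokenizer uses:

```python
import random

def apply_fim(tokens, rng, fim_rate=0.5,
              pre="<fim_prefix>", mid="<fim_middle>", suf="<fim_suffix>"):
    """Fill-in-the-Middle transform (PSM layout): split a sample into
    prefix/middle/suffix and reorder so the model learns infilling.
    Sentinel strings are illustrative placeholders."""
    if rng.random() > fim_rate:
        return tokens  # leave sample in ordinary left-to-right order
    lo, hi = sorted(rng.sample(range(len(tokens)), 2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    return [pre] + prefix + [suf] + suffix + [mid] + middle

rng = random.Random(0)
out = apply_fim(list("abcdef"), rng, fim_rate=1.0)
# all original tokens survive; the model predicts the middle last
```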
Changelog Details
- concise naming | weak scaling | save cfg to file by @malay-nagda :: PR: #1246
- cg_scope valid list and default none by @malay-nagda :: PR: #1264
- chore: Merge fp8 args by @ko3n1g :: PR: #1279
- cg and nan grad norm fix by @malay-nagda :: PR: #1309
- feat: Support PEFT weight mapping and merge LoRA adapters when export to hf by @HollowMan6 :: PR: #1310
- Add Nemotron nano v2 vl by @cuichenx :: PR: #1136
- Replay "Ko3n1g/ci/cleanup recipe evaluator (#1349)" by @ko3n1g :: PR: #1377
- Gemma3 VL LoRA Recipe + Documentations by @suiyoubi :: PR: #1388
- Add GLM4.5 FT Recipe by @suiyoubi :: PR: #1382
- Adding FLA as dependency for Qwen3-Next by @adityavavreNVDA :: PR: #1359
- fix: default to ncclcomm overlap bootstrap backend by @ananthsub :: PR: #1395
- Add Qwen2/2.5 FT recipes by @ananthsub :: PR: #1385
- [PEFT/LoRA] fix: using ETP instead of TP for expert layers by @HollowMan6 :: PR: #1380
- Llama3 PEFT- 8B, 70B by @malay-nagda :: PR: #1381
- Add option for LoRA with Transformer Engine op fuser by @michal2409 :: PR: #1324
- [OMNIML-2937] Support Megatron Bridge quantized checkpoint export to HF unified checkpoint by @yueshen2016 :: PR: #1302
- HybridEP support by @erhoo82 :: PR: #1367
- expose option to dump config to file during end to end tests by @ananthsub :: PR: #1400
- [OMNIML-2935] PTQ support of MOE model (Qwen-3) on Megatron-Bridge by @yueshen2016 :: PR: #1405
- Revert "feat: Dependabot automerge if successful (#1051)" by @pablo-garay :: PR: #1428
- Update perf docs by @gautham-kollu :: PR: #1426
- Add Qwen3VL support (dense and moe) by @yashaswikarnati :: PR: #1174
- Fix llama3-8b NVFP4 recipe by @adityavavreNVDA :: PR: #1347
- fix GPT-OSS perf scripts by @erhoo82 :: PR: #1438
- Add functional test for finetuning with sequence packing by @ananthsub :: PR: #861
- feat: Pass custom srun args into Run by @ko3n1g :: PR: #1440
- Fix typo in dataclass from callable => typing.Callable in nemotron_h_provider.py by @shaltielshmid :: PR: #1442
- pass the support of deepep for B200 and B300 GPUs by @erhoo82 :: PR: #1436
- cuda graph fine grained scope | hybridEP | a2a overlap by @malay-nagda :: PR: #1348
- nvfp4 for dense models by @sanandaraj5597 :: PR: #1453
- Added Qwen 3 next perf scripts by @sanandaraj5597 :: PR: #1451
- reset gradient_accumulation_fusion with megatron fsdp by @ananthsub :: PR: #1386
- guard trust_remote_code by @dimapihtar :: PR: #1291
- fix lint checks on main by @ananthsub :: PR: #1463
- DSv3- gb200 base cfg fix | b200 no a2a overlap by @malay-nagda :: PR: #1476
- sequence_length -> seq_length by @dimapihtar :: PR: #1023
- feat: Add whitelist support for mismatched params in load_hf_weights by @yaoyu-33 :: PR: #1447
- [docs] Update readme with supported models/recipes by @ananthsub :: PR: #1455
- Add Gemma2 recipes by @ananthsub :: PR: #1383
- [docs] Add release section for changelog and software component versions by @ananthsub :: PR: #1490
- [docs] Add 0.2.0 version picker by @ananthsub :: PR: #1488
- Reduced precision (BF16, FP8, MXFP8, NVFP4) training tutorial using Megatron-Bridge by @sergiopperez :: PR: #1409
- Update conversion compare script and add accelerate dependency by @yaoyu-33 :: PR: #1344
- [main] Fix functional conftest to handle optional nvdlfw-inspect dependency by @ananthsub :: PR: #1496
- [docs] Update supported model docs by @ananthsub :: PR: #1503
- fix: Escape user inputs in data tutorials by @ananthsub :: PR: #1465
- Bridge instantiate_utils: drop unexpected config keys with warning by @yaoyu-33 :: PR: #1203
- Make container image point to last known release container by @gautham-kollu :: PR: #1443
- Revamp recipe tutorials by @ananthsub :: PR: #1308
- [docs] 25.11 release notes by @ananthsub :: PR: #1504
- Add generic scripts for training by @ananthsub :: PR: #1390
- Nemotron nano v2 finetune by @cuichenx :: PR: #1391
- Replay: M4 Remove parallel state usage in train loops, train steps and utils #1175 + Bug fix by @yaoyu-33 :: PR: #1445
- track dtype in scatter to tp ranks by @ananthsub :: PR: #1509
- Update performance scripts to align with llmb requirements by @scsudhakaran :: PR: #1416
- fix qwen3_vl by changing sequence_length to seq_length by @shifangx :: PR: #1511
- Update GPT-OSS pretrain config parameters by @cuichenx :: PR: #1375
- feat: mcore trigger mbridge by @pablo-garay :: PR: #1441
- fix: cleanup by @pablo-garay :: PR: #1540
- Revert strong-scaling support for DeepSeek-V3 by @scsudhakaran :: PR: #1548
- Add fallback for shared embedding flag by @yaoyu-33 :: PR: #1521
- Wan Bridge (checkpoints conversion) by @huvunvidia :: PR: #1550
- feat: defer flop calculation to model_provider "get_num_floating_point_operations" if provided by @yaoyu-33 :: PR: #1446
- refactor: Unify launchers by @ko3n1g :: PR: #1519
- bug fixes- unify launchers by @malay-nagda :: PR: #1573
- ci: Bump MCore and ModelOpt by @chtruong814 :: PR: #1551
- docs: Update documentation.md to include install submodules command by @chenopis :: PR: #1576
- fix: Fix load failure when load_megatron_model loads a model trained with uneven pp by @yaoyu-33 :: PR: #1579
- Added 25.11 starter pack by @sanandaraj5597 :: PR: #1596
- fix: Wandb mocking by @ko3n1g :: PR: #1587
- fix: Use model seq length as default if no CLI is provided by @ko3n1g :: PR: #1600
- scripts: Update help string of args.detach by @ko3n1g :: PR: #1589
- ci: Add DGXC executor by @ko3n1g :: PR: #1584
- fix: Fix model parallel initialization ordering by @yaoyu-33 :: PR: #1574
- fix: Missing return of parse_additional_slurm_params by @ko3n1g :: PR: #1619
- Add fix for users who want to provide a path on disk to a custom HF tokenizer by @jstjohn :: PR: #1594
- fix: wandb exp name in recipe path by @ko3n1g :: PR: #1623
- Rename TensorRT Model Optimizer to Model Optimizer by @AAnoosheh :: PR: #1484
- Cleanup partial CG objects by @gautham-kollu :: PR: #1615
- [Canonical LoRA] fix: use correct q_out_features for linear_q by @HollowMan6 :: PR: #1627
- [Canonical LoRA] fix: forward under expert layers by @HollowMan6 :: PR: #1628
- qwen3 235b config update by @malay-nagda :: PR: #1613
- chore: Update codeowners of performance scripts by @ko3n1g :: PR: #1641
- Re-use higher-level config override util in tutorials by @ananthsub :: PR: #1524
- docs: add wayfinder readme.md files for each docs directory by @chenopis :: PR: #1617
- ci: Fix DGXC env vars by @ko3n1g :: PR: #1629
- Support strong scaling ...
NVIDIA Megatron-Bridge 0.2.2
- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.
NVIDIA Megatron-Bridge 0.2.1
- Performance
- Activation offloading to host memory support with pipelining
- Supports the high activation memory needs of MoE models training with dynamic shapes
- Fixed the Nemotron model FLOPs calculation
- Model Collection Support
- Ministral 3
- Enhanced LoRA support
- LoRA support for Mamba layers (for Nemotron Nano V2 and NemotronH finetuning)
NVIDIA Megatron-Bridge 0.2.0
- LLM
- HuggingFace Conversion + training recipes:
- GPT-OSS
- Qwen3 Next
- Nemotron-H
- Nemotron Nano v2
- Moonlight
- OlMoE
- GLM 4.5
- Gemma 3
- HuggingFace conversion support:
- Llama Nemotron
- Mistral
- Gemma
- Gemma 2
- VLM
- Nemotron Nano v2 VL
- Qwen 3 VL
- Qwen2.5 VL
- Gemma3 VL
- Megatron-Bridge support for new benchmarks
- Benchmarks (same workloads as GB200 system) for GB300 system
- GPT-OSS 120B
- Qwen3-Next 80B_A3B
- Support for linear attention on Blackwell - Gated Delta Networks
- Pre-training with NVFP4 precision: Llama3 8B, Llama3 70B, Llama3.1 405B
- Megatron-Bridge support for benchmarks previously existing only for NeMo 2.0
- Nemotron-H 56B
- Fine-tuning (SFT and LoRA): Llama3 8B and Llama3 70B
- HybridEP: DeepSeek V3 benchmarks on GB200 and GB300 systems now use HybridEP
- CUDA Graphs
- Full-model iteration CUDA graph used for dense models: Llama3 8B, Llama3 70B, Llama3.1 405B
- Fine-grained Transformer component specific CUDA Graphs used for MoE models
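NVFP4 pre-training stores tensors in 4-bit floating point with per-block scale factors, so a shared scale lets the tiny value grid cover each block's dynamic range. A toy sketch of block-scaled quantization to convey the idea only; real NVFP4 uses E2M1 values with hardware-defined block sizes and FP8 scales, not the integer grid below:

```python
def quantize_block(x, block=4, qmax=6.0):
    """Toy block-scaled quantization: each block shares one scale so a
    few-bit value grid covers the block's dynamic range. Illustrative
    only -- not the actual NVFP4 (E2M1 + FP8 scale) format."""
    out = []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        amax = max(abs(v) for v in blk) or 1.0
        scale = amax / qmax                    # per-block scale factor
        q = [round(v / scale) for v in blk]    # quantize to integer grid
        out.extend(v * scale for v in q)       # dequantize to inspect error
    return out

vals = [0.1, -0.5, 2.0, 1.5, 100.0, -50.0, 25.0, 10.0]
deq = quantize_block(vals)
# per-block scaling keeps both the small and the large block representable
```

Per-tensor scaling would either clip the 100.0 or crush the 0.1; per-block scales are what make 4-bit training precision viable.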
- NVIDIA Model Optimization Integration
- Knowledge Distillation
- Post training quantization export
- Quantization aware training
- Support for expert layers
- Supported merging adapters for export to HuggingFace @HollowMan6
- Finetuning dataset improvements: OpenAI messages format conversion, chat template support
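The OpenAI messages conversion above turns a list of role-tagged chat messages into a single training string. A minimal sketch of the idea; the role tags below are an illustrative template, not the chat template any particular tokenizer ships with:

```python
def messages_to_text(messages):
    """Render OpenAI-style chat messages into one training string.
    The role tags here are an illustrative template only; real runs
    apply the tokenizer's own chat template."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}")
    return "\n".join(parts) + "\n<|end|>"

sample = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
text = messages_to_text(sample)
```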
- Integration with NVIDIA-DLFW-Inspect for tensor statistics collection & monitoring
- Broader Community Adoption: integrated Megatron-Bridge into the training pipelines of VeRL (PR), Slime (PR), and Sky-RL (PR).
- Special thanks to the community contributors for this release: @HollowMan6, @fzyzcjy, @erictang000, @hawkoli1987.
NVIDIA Megatron-Bridge 0.1.0rc4
- Fix docs build
- Update performance scripts
NVIDIA Megatron-Bridge 0.1.0rc3
- Model Collection Support
- Llama
- Qwen 2, Qwen 3, Qwen 3 MoE
- DeepSeek
- Mamba
- Migration guide from NeMo 2 to Megatron-Bridge
- Contribution guide for adding a new model
- Checkpoint conversion from Hugging Face to Megatron
- Performance
- MoE LLM
- Change the model to dropless with balanced gating
- Fusion of operators in router function
- Global permutation fusion with A2A dispatcher
- EP A2A communication overlap with computation in both 1F1B pipelining and non-pipelined training
- Precision-aware optimizer update to support BF16 states
- Megatron FSDP
- Migration from mcore FSDP to megatron FSDP
- Fusion of weight gradient copy to reduce-scatter communication buffer to WGRAD GEMM
- Removed redundant optimizer operations
- Use Zero1 (opt and master param sharding) in the replica domain of hybrid FSDP to further lower memory usage
- IB-SHARP support for the IB AllReduce of hybrid FSDP in a patch with NCCL2.28
- MXFP8
- Improved act grad all-gather overlap performance via userbuffer
- Parameter all-gather overlap with computation while the communication buffer sharing with reduce-scatter
- Fusion of MXFP8 scaling factor swizzling kernels
- Use PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead
- Others
- Full iteration cuda graph for dense model without pipelining
- Fusion of activation and cast operations (currently tensor-wise scaling only)
- Store SwiGLU input in FP8 to save activation memory
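The "dropless with balanced gating" change above refers to top-k MoE routing where no token is discarded: each token picks its k highest-scoring experts and the selected scores are normalized into combine weights, while expert batch sizes vary per step instead of being capped. A pure-Python sketch of the gating math only; real routers run this batched on device with auxiliary load-balancing losses:

```python
import math

def topk_router(logits, k=2):
    """Top-k MoE gating: each token keeps its k highest-scoring experts,
    and the kept scores are softmax-normalized into combine weights.
    Dropless routing means no token is discarded for capacity reasons."""
    routed = []
    for row in logits:  # one row of expert scores per token
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        exps = [math.exp(row[e]) for e in top]
        denom = sum(exps)
        routed.append([(e, w / denom) for e, w in zip(top, exps)])
    return routed

out = topk_router([[2.0, 0.5, 1.0, -1.0]], k=2)
# the token routes to experts 0 and 2; weights sum to 1
```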
NVIDIA Megatron-Bridge 0.1.0a0
- Llama and Qwen
- Pretrain/SFT
- PEFT
- Recipe structure with examples for plain python & NeMo Run usage
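The PEFT support in this first release is LoRA-style adaptation: the base weight W is frozen and only a low-rank update scaled by alpha/r is trained. A pure-Python sketch of the forward math under those standard LoRA conventions; the matrix-vector helper and shapes are illustrative:

```python
def lora_forward(x, W, A, B, alpha=16, r=2):
    """LoRA forward pass: y = W @ x + (alpha / r) * B @ (A @ x).
    W is frozen; only the low-rank factors A (r x d_in) and
    B (d_out x r) are trained. Pure-Python math for illustration."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Standard LoRA init zeroes B, so training starts from the base model.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5], [0.25, -0.25]]
B_zero = [[0.0, 0.0], [0.0, 0.0]]
y = lora_forward(x, W, A, B_zero)
# with B zeroed, the adapted layer reproduces the base layer exactly
```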