This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.
- Bug fixes
Yanked release.
- Features
- Performance
- Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% E2E) (MR !3912)
- Use new TE interface for user buffers (MR !3886)
- Add CPU activation offloading via TE (MR !4286)
- Add configurable double buffering (MR !4026)
- Add Muon optimizer and distributed optimizer support (MR !4106)
- Add setting to support Adam or AdamW optimizer (MR !3866)
- MoE
- Model support
- Add YaRN support for GPT-OSS (MR !4044)
- Add support for Qwen3-Next arguments (MR !4070)
- Add FP8 init for MTP (MR !3958)
- Add fp8_dpa option for FP8 scaling (MR !4053)
- Add RADIO-g support to converter and tester (MR !4371)
- Add audio semantic reasoning data for voice chat and speech instructions (MR !4397)
- FSDP
- Inference
- Add CUDA Graph runner lookup table cache (up to 2x E2E speedup) (MR !4082)
- Add MoE dropping and padding router for CUDA Graph + decode (MR !3816)
- Dynamic audio shapes with variable sequence lengths (2.5x throughput improvement) (MR !4274)
- Integrate unified memory for dynamic inference context (MR !3985)
- Post-training
- RL
- Ease of use
- Performance
- Bug fixes
- Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
- Fix loss mask cloning to prevent incorrect updates (MR !4164)
- Fix metadata loss in checkpoints (MR !4182)
- Fix FSDP grad accum fusion support (MR !4018)
- Fix non-TE optimizer checkpoint issue (MR !3931)
- Fix BERT virtual pipeline parallelism (MR !3993)
- Fix gc.freeze() slowdown by adding gc.collect() on last layer (MR !4003)
- Fix full iteration CUDA graph non-tensor handling (MR !4019)
- Fix model_auto_sync mis-set and add gradient assertion (MR !4062)
- Fix HF import dtype and checkpoint loading issues (MR !4095)
- Fix missing initialization in ProcessGroupCollection (MR !4159)
- Fix sink attention TP (MR !4173)
- Fix num_microbatches calculation (MR !4199)
- Fix 1f1b overlap unit tests for MTP standalone (MR !4210)
- Fix stale state dict handling (MR !4226)
- Fix dataset divergence with tokenizer PAD handling (MR !4231)
- Fix parameter initialization (MR !4296)
- Ensure tensor-parallel attributes set regardless of initialization flag (MR !4312)
- Known issues
- Features
- Inference
- Post-training
- ModelOpt updates (MR !3268)
- Add speculative decoding AR validation feature
- Add DeepSeek and Qwen model configs
- Performance
- MoE
- We're actively optimizing large-scale fine-grained MoE performance on the Blackwell platform.
- Features:
- Memory Optimization
- Performance Optimization
- Bug fixes:
- Fix router input jitter dtype (MR !3774)
- Model support
- Ease of use
- Bug fixes
- Use mscale_all_dim for softmax_factor (MR !2800)
- Fix FP8 param blockwise scaling unit test (MR !3480)
- Fix unit test blockwise scaling (MR !3491)
- Optimize prefill for token-less requests (MR !3499)
- Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
- Fix CUDA graph logic for flexible pp layout (MR !3505)
- Load FP8 models with strict=False (MR !3508)
- Skip rope check for torch < 1.4.0 (MR !3528)
- Disable Apex tests for stability (MR !3539)
- Fix typo in parallel_state expert parallelism (MR !3548)
- Guard modelopt on macOS (MR !3549)
- Retry on CUDA function failure (MR !3554)
- Fix NCCL mem pool creation error (MR !3557)
- Fix get_rotary_seq_len return type (MR !3559)
- Retry on CUDA function failure (MR !3560)
- Fix NCCL allocator attribute error (MR !3565)
- Ensure multi-prompt inference works (MR !3568)
- Fix MD5 on FIPS systems (MR !3577)
- Fix dynamic context and inference bugs (MR !3582)
- Fix TE version for interleaved fused RoPE (MR !3586)
- Fix MTP with MoE and TP logging (MR !3594)
- Guard TE import fix (MR !3596)
- Add assertion for NCCL UB case (MR !3599)
- Remove Encoder PP related Functions (MR !3604)
- Fix segfaults in tests (MR !3605)
- Fix TE error in distributed optimizer (MR !3625)
- Remove redundant barrier in checkpoint flow (MR !3626)
- Support VPP MTP, fix logging (MR !3630)
- Add retry mechanism for `free(): invalid pointer` errors (MR !3632)
- Fix test_replication.py issues (MR !3633)
- Fix typo in parallel_state (MR !3634)
- Fix CUDA graph logic determination (MR !3635)
- Fix TE installation error (MR !3636)
- Ensure correct sharding type in local tests (MR !3643)
- Fix cudagraphed backward buffer reuse for last layer (MR !3645)
- Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
- Fix dynamic example script errors (MR !3653)
- Guard TE import fix (MR !3666)
- Breaking changes:
- `megatron.core.distributed.custom_fsdp` refactored to `megatron.core.distributed.fsdp.src.megatron_fsdp`
- Known issues
- Support bf16 dtype for optimizer states to use the precision-aware optimizer in TransformerEngine
- MoE
- Features:
- Flexible Asymmetric Virtual Pipeline Parallelism with Custom Pipeline Layout (--pipeline-model-parallel-layout)
- Add support to pass custom parallelism groups to MoE modules.
- Add Hybrid Shard Data-Parallel support for MoE models (--num-distributed-optimizer-instances)
- Support EP + custom FSDP training for DeepSeek-V3
- FP8 support for Multi-Token-Prediction
- Memory Optimization
- Fine-grained recomputation to reduce activation memory. (--recompute-modules with --recompute-granularity selective)
- Memory-efficient token permutation by moving the probs multiplication from unpermutation to the activation function of GroupedMLP.
- Performance Optimization
- MLA RoPE fusion kernel and YARN embedding cache.
- FP8 padding optimization of MoE models by padding the routing map.
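The FP8 padding item above can be illustrated with a minimal sketch. The idea is that per-expert token counts are rounded up to a GEMM-friendly alignment; the alignment value of 16 below is an assumed example, and the function name is illustrative, not the library's API.

```python
# Pad per-expert token counts up to a multiple of the FP8 GEMM alignment,
# so every expert's GEMM gets FP8-friendly dimensions. Illustrative only;
# the alignment of 16 is an assumption, not a value taken from the library.
def pad_routing_counts(counts, align=16):
    # -(-c // align) is ceiling division in pure Python
    return [-(-c // align) * align for c in counts]
```

For example, counts `[1, 16, 17]` pad to `[16, 16, 32]`, so only the ragged tail of each expert's token block is padded rather than the whole batch.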
- Bug fixes:
- Fix the aux loss calculation when expert_bias or group-limited routing is used. This changes load_balancing_loss values compared to the previous version.
- Fix packed sequence support for MLA
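For context on the aux-loss fix above, here is a minimal sketch of the standard load-balancing loss, assuming the common Switch-Transformer-style formulation (loss = E * sum_i f_i * P_i). Function and variable names are illustrative, not the library's internals.

```python
# Sketch of the MoE load-balancing (aux) loss: E * sum_i f_i * P_i, where
# f_i is the fraction of tokens routed to expert i and P_i is the mean
# router probability for expert i. Top-1 routing assumed for simplicity.
def load_balancing_loss(router_probs, expert_indices, num_experts):
    n = len(expert_indices)
    # f_i: fraction of tokens dispatched to each expert
    f = [sum(1 for e in expert_indices if e == i) / n for i in range(num_experts)]
    # P_i: mean router probability mass assigned to each expert
    p = [sum(probs[i] for probs in router_probs) / n for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))
```

With perfectly uniform routing the loss is 1.0; routing every token to one expert doubles it for E=2, which is why corrections to the f_i or P_i terms (as in the expert-bias fix) shift reported load_balancing_loss values.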
- Known Issues:
- MTP is not compatible with the flexible pipeline layout; will be fixed in MR !3594.
- MTP convergence issue with TP2; will be fixed in MR !3594.
- Features:
- Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
- Context parallel: fix loss scaling when calculate_per_token_loss=True
- Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
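The context-parallel loss-scaling fix above boils down to which normalizer is used. A minimal sketch, assuming the usual semantics of calculate_per_token_loss=True: each context-parallel rank holds a slice of the sequence, and the correct scaling divides by the global unmasked-token count rather than each rank's local count. All names here are illustrative, not the library's code.

```python
# Per-rank partial sums: total masked loss and unmasked-token count for a
# rank's slice of the sequence.
def local_sums(token_losses, loss_mask):
    s = sum(l for l, m in zip(token_losses, loss_mask) if m)
    c = sum(loss_mask)
    return s, c

# Correct per-token loss: divide the globally summed loss by the globally
# summed token count (as if both were all-reduced across CP ranks).
def global_per_token_loss(per_rank):
    total = sum(s for s, _ in per_rank)
    count = sum(c for _, c in per_rank)
    return total / count
```

Averaging each rank's local mean instead would weight short slices too heavily; summing both numerator and denominator across ranks avoids that bias.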
- Inference
- Support in-flight batching and chunked KV cache
- Reduce memory usage:
- by not materializing full attention mask
- by only materializing logits for the last token during decode
- by removing an obsolete tensor reference
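The second memory saving above (last-token logits during decode) can be sketched as follows. During incremental decode, only the final position's hidden state needs the vocab-sized output projection, so logits for one token are materialized instead of the whole sequence. Shapes and names are illustrative.

```python
# Project hidden states to logits; during decode (last_token_only=True),
# only the final position is pushed through the vocab-sized projection,
# shrinking the logits buffer from [seq, vocab] to [1, vocab].
def project_logits(hidden, weight, last_token_only):
    """hidden: [seq][d], weight: [vocab][d] -> logits [seq or 1][vocab]."""
    rows = hidden[-1:] if last_token_only else hidden
    return [[sum(h * w for h, w in zip(row, wrow)) for wrow in weight]
            for row in rows]
```

Since decode only samples from the last position anyway, the truncated projection is exact, and the saving scales with sequence length times vocabulary size.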
- Hybrid Model
- Inference
- Add CUDA graph support
- Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
- Fix a shape issue when materializing logits for Mamba model
- Improve initialization of Mamba layers
- Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
- Make num_floating_point_operations work with hybrid model
- Make hybrid_conversion.py work with mixer that uses TE linear
- Add FP8 support
- Fix Mamba dt_bias tensor parallelism
- Support multimodal tokenizer
- Improve data parallelism scaling
- Inference
- MoE
- Features:
- DeepEP support, compatible with all the parallelisms and token drop / dropless
- Important precision improvement: Enable FP32/FP64 routing and unpermutation using --moe-router-dtype. FP32 is recommended for all fine-grained MoE training
- CUDA Graph support for MoE
- Multi-Token Prediction (MTP) Support
- Fused indices_to_multihot kernel for DeepEP dispatcher
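The fused indices_to_multihot kernel above performs, in essence, the following conversion: per-token top-k expert indices become a dense 0/1 multi-hot mask over experts, which is the layout the DeepEP dispatcher consumes. This pure-Python version only illustrates the semantics of the fused CUDA kernel, not its implementation.

```python
# Convert per-token top-k expert indices into a dense multi-hot mask:
# indices [num_tokens][k] -> mask [num_tokens][num_experts] with 1s at
# each token's selected experts.
def indices_to_multihot(indices, num_experts):
    multihot = [[0] * num_experts for _ in indices]
    for t, row in enumerate(indices):
        for e in row:
            multihot[t][e] = 1
    return multihot
```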
- Bug fixes:
- Fix Hang Issue with MoE+Dense Hybrid models
- Update theoretical memory and tflops estimation for MoE and MLA
- Fix MoE Aux loss scaling for per token loss
- Fixes for group-limited routing and expert bias. We verified these fixes through DeepSeek-V3 end-to-end verification
- Known issues:
- The ckpt trained with Custom FSDP for MoE may not be compatible with 3D parallel training.
- Features:
- Add multi-datacenter training support through N/S connection
- MoE
- Features
- Support DeepSeek-V3 fine-tuning
- Aux-loss-free load balancing strategy
- Node-limited routing and Device-limited routing support.
- Tensor Parallelism support for MLA and Sequence Auxiliary Loss
- MTP (with TP and PP support) is coming soon.
- Permutation / Unpermutation fusion kernel from TransformerEngine.
- Uneven virtual pipeline parallel split support in first and last PP stage.
- Support DeepSeek-V3 fine-tuning
- Bug fixes:
- Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP.
- Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding.
- Known Issues:
- When training the Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params.
- Features
- Add MX-FP16 support for optimizer and master weights
- CUDA Graph memory optimizations
- Enable UCC backend for PP communication
- Optimizer CPU offload support for memory savings
- Models
- Initial RADIO/CRADIO implementation
- llama3.2 support
- Hybrid Model
- Support quantization via TensorRT Model Optimizer
- Add MLA to MCore
- Enable FP8 for GroupedMLP
- MoE Parallel Folding
- Enhance MoE Architecture: Support MoE Layer Frequency Patterns and Configurable MoE FFN Hidden Size
- Multimodal: NVLM training and evaluation support in MCore
- Mamba Hybrid
- Increase performance and reduce memory footprint of Triton language/compiler distributed caching
- Add more unit testing and fix bugs
- Uneven pipeline parallelism
- Enable pipeline parallelism where first and last ranks have fewer transformer layers than the intermediate ranks
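One way the uneven split above can be computed is sketched below: the first and last ranks take a reduced layer count (leaving headroom for the embedding and loss computations), and the middle ranks divide the remainder evenly. The split rule and names are illustrative, not the exact upstream algorithm.

```python
# Distribute transformer layers over pipeline ranks with fewer layers on
# the first and last ranks; middle ranks split the remainder evenly.
# Assumes (total - first - last) divides evenly over the middle ranks.
def layer_split(total_layers, pp_size, first, last):
    middle = total_layers - first - last
    per_mid = middle // (pp_size - 2)
    return [first] + [per_mid] * (pp_size - 2) + [last]
```

For example, 30 layers over 4 pipeline ranks with 6 layers on each end gives [6, 9, 9, 6], balancing end-rank embedding/loss work against the extra layers on middle ranks.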
- Per layer CUDAGraph support for GPT training with Transformer Engine modules
- Enable different TP sizes for the vision encoder
- Enable pipeline parallelism for T5 & Llava models
- Support multi-tile multi-image input in Llava models
- MoE
- FP8 support
- Runtime upcycling support
- Dispatcher implementation optimizations
- Shared expert support with overlapping optimizations
- Qwen Model support
- Known Issues
- When using sequence parallelism, dropout in the transformer block forward pass does not use the appropriate RNG context.
- NVRx / Fault tolerance
- Fault and hang detection in addition to existing straggler detection
- Graceful exit and auto restart
- Multimodal
- Added initial support for training vision language models using the LLaVA architecture
- Added initial support for inference with multimodal inputs
- End-to-end multimodal example from data collection to training to evaluation is provided in examples/multimodal
- MoE
- Context Parallel support.
- Distributed checkpoint support for grouped GEMM.
- Mamba
- MoE
- Token drop support
- Several efficiency optimizations
- Improved model parallelism
- Memory optimizations
- Distributed checkpointing
- Enabled for Retro
- Asynchronous checkpoint saving
- Several minor bug fixes, speed improvements, and memory optimizations
- MoE (Mixture of Experts)
- Performance optimization
- Communication optimization for multi GPU and Single GPU
- 23% improvement (323 TFLOPS/GPU) over MCore 0.5.0 on Mixtral with Hopper BF16
- GroupedMLP enhancement for Hopper
- DP overlapping: support overlapping computation with gradient reduction and parameter gathering.
- All-to-All based Token Dispatcher
- Layer-wise logging for load balancing loss.
- Improved expert parallel support including distributed optimizer.
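Behind the all-to-all token dispatcher listed above sits a permute/unpermute step: tokens are reordered so those bound for the same expert are contiguous (ready for the all-to-all exchange and grouped GEMM), then restored to their original positions afterward. This is an illustrative pure-Python reconstruction, not the CUDA implementation.

```python
# Group tokens by destination expert (stable sort by expert id), returning
# the permuted tokens and the permutation order needed to undo it.
def permute(tokens, expert_ids):
    order = sorted(range(len(tokens)), key=lambda t: expert_ids[t])
    return [tokens[t] for t in order], order

# Restore permuted tokens (e.g. expert outputs) to their original order.
def unpermute(permuted, order):
    out = [None] * len(permuted)
    for dst, src in enumerate(order):
        out[src] = permuted[dst]
    return out
```

The round trip is lossless: dispatch permutes, experts process contiguous blocks, and combine unpermutes, so the layer output lines up token-for-token with its input.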
- Performance optimization
- Distributed optimizer
- RETRO
- Data processing
- BERT
- Distributed checkpointing
- Dist checkpointing
- PyTorch native distributed backend
- Improved saving/loading speed
- TensorRT-LLM Export
- Integration with TensorRT Model Optimizer Post-training quantization (PTQ)
- Text generation driver to perform PTQ in Megatron-LM
- Llama2 and Nemotron3-8b examples to use TensorRT-LLM unified build API to build engine after training.
- Several minor enhancements, bug fixes, and documentation updates
Megatron core documentation is now live!
- MoE (Mixture of Experts)
- Support for Z-loss, Load balancing and Sinkhorn
- Layer and communications refactor
- Richer parallelism mappings: EP can be combined with other model parallel techniques for larger MoE variants, e.g. EP + TP + DP + SP + PP
- Token dropless architecture with Top-K routing
- Performance optimization with GroupedGEMM when the number of local experts is > 1
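The dropless top-K routing listed above can be sketched minimally: every token keeps its k highest-scoring experts (no capacity-based dropping), and the selected scores are renormalized to sum to 1. The function below is an illustration of the routing semantics only; the production router operates on batched tensors.

```python
# Top-k "dropless" routing for one token: select the k best experts by
# router score and renormalize the selected scores into combine weights.
def topk_route(scores, k):
    """scores: [num_experts] router scores -> (expert_ids, normalized_weights)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in ranked)
    return ranked, [scores[i] / total for i in ranked]
```

Because no token is dropped, per-expert workloads become variable-length, which is exactly what the GroupedGEMM optimization in the preceding bullet handles.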
- Distributed checkpointing
- Interleaved rotary embedding
- Masked WordPiece datasets for BERT and T5
- Raw and mock datasets
- Activation offloading to CPU
- Rope and Swiglu fusion
- Sliding window attention (via Transformer Engine)
- Timers
- BERT
- RETRO
- T5
- Mixture of Experts support for GPT
- Model parallel efficient Distributed Data Parallel (DDP)
- Context Parallel (2D Tensor Parallel) support
- GPT Dataset
- Blended Dataset