
Changelog

NVIDIA Megatron Core 0.15.3

This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.

NVIDIA Megatron Core 0.15.2

  • Bug fixes
    • Various small fixes for Megatron-FSDP. #2346
    • [Megatron-FSDP] Support both old and new DeviceMesh APIs. #2575
    • [Megatron-FSDP] Build default FSDP DeviceMesh, and remove model arg from fully_shard_optimizer(). #2471

NVIDIA Megatron Core 0.15.1

Yanked release.

NVIDIA Megatron Core 0.15.0

  • Features
    • Performance
      • Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% E2E) (MR !3912)
      • Use new TE interface for user buffers (MR !3886)
      • Add CPU activation offloading via TE (MR !4286)
      • Add configurable double buffering (MR !4026)
      • Add Muon optimizer and distributed optimizer support (MR !4106)
      • Add setting to support Adam or AdamW optimizer (MR !3866)
    • MoE
      • Add DTensor support for EP and DSv3 modules (MR !3955)
      • Add HybridEP backend to Flex Dispatcher (MR !4237)
      • Support FP8 recomputation for MoE components (MR !4030)
      • Implement NVFP4 Zero Padding for MoE (MR !4225)
      • Compute shared experts before router (MR !4068)
      • Enable bias in expert MLP (MR !3858)
    • Model support
      • Add YaRN support for GPT-OSS (MR !4044)
      • Add support for Qwen3-Next arguments (MR !4070)
      • Add FP8 init for MTP (MR !3958)
      • Add fp8_dpa option for FP8 scaling (MR !4053)
      • Add RADIO-g support to converter and tester (MR !4371)
      • Add audio semantic reasoning data for voice chat and speech instructions (MR !4397)
    • FSDP
      • Enable joint training of parallel modules (MR !3850)
      • Add support for multimodule communication (MR !4235)
    • Inference
      • Add CUDA Graph runner lookup table cache (up to 2x E2E speedup) (MR !4082)
      • Add MoE dropping and padding router for CUDA Graph + decode (MR !3816)
      • Dynamic audio shapes with variable sequence lengths (2.5x throughput improvement) (MR !4274)
      • Integrate unified memory for dynamic inference context (MR !3985)
    • Post-training
      • Add GPT-OSS ModelOpt support with quantization, import/export (MR !4169)
      • Enable KD support with hybrid training loop (MR !4021)
      • Add ModelOpt pruning example (MR !4022)
    • RL
      • Add importance sampling and partial rollouts to Megatron RL (MR !4000)
      • Add sequence packing for RL (MR !4191)
    • Ease of use
      • Handle CUDA absence during import (MR !4120)
      • Add granary dataloader functionality (MR !4291)
      • Enable SWA mixing with attention (MR !3855)
  • Bug fixes
    • Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
    • Fix loss mask cloning to prevent incorrect updates (MR !4164)
    • Fix metadata loss in checkpoints (MR !4182)
    • Fix FSDP grad accum fusion support (MR !4018)
    • Fix non-TE optimizer checkpoint issue (MR !3931)
    • Fix BERT virtual pipeline parallelism (MR !3993)
    • Fix gc.freeze() slowdown by adding gc.collect() on last layer (MR !4003)
    • Fix full iteration CUDA graph non-tensor handling (MR !4019)
    • Fix model_auto_sync mis-set and add gradient assertion (MR !4062)
    • Fix HF import dtype and checkpoint loading issues (MR !4095)
    • Fix missing initialization in ProcessGroupCollection (MR !4159)
    • Fix sink attention TP (MR !4173)
    • Fix num_microbatches calculation (MR !4199)
    • Fix 1f1b overlap unit tests for MTP standalone (MR !4210)
    • Fix stale state dict handling (MR !4226)
    • Fix dataset divergence with tokenizer PAD handling (MR !4231)
    • Fix parameter initialization (MR !4296)
    • Ensure tensor-parallel attributes set regardless of initialization flag (MR !4312)
  • Known issues
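The Adam/AdamW optimizer setting added above (MR !3866) comes down to where weight decay is applied. A minimal scalar sketch of the distinction (illustrative only, not MCore's optimizer code; bias correction omitted):

```python
def adam_like_step(p, g, m, v, lr=0.1, beta1=0.9, beta2=0.999,
                   eps=1e-8, wd=0.01, decoupled=False):
    """One scalar optimizer step. decoupled=False folds weight decay into
    the gradient (classic Adam + L2); decoupled=True applies it directly
    to the parameter (AdamW)."""
    if not decoupled:            # Adam: L2 regularization via the gradient
        g = g + wd * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    p = p - lr * m / (v ** 0.5 + eps)
    if decoupled:                # AdamW: decay the weight directly
        p = p - lr * wd * p
    return p, m, v

# With wd=0 the two variants coincide; with wd>0 they diverge.
pa, _, _ = adam_like_step(1.0, 0.5, 0.0, 0.0, wd=0.0, decoupled=False)
pw, _, _ = adam_like_step(1.0, 0.5, 0.0, 0.0, wd=0.0, decoupled=True)
assert pa == pw
```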

NVIDIA Megatron Core 0.14.0

  • Features
    • Inference
      • Add async support for DynamicInferenceEngine (MR !3187)
      • Pad input tensors and enable FP8 weights for FP8 inference (MR !3341)
      • Force inference to always gather logits with tensor parallelism (MR !3442)
      • Multi batch size CUDA Graphs for Dynamic Inference (MR !3402)
    • Post-training
      • ModelOpt updates (MR !3268)
        • Add speculative decoding AR validation feature
        • Add DeepSeek and Qwen model configs
    • Performance
      • ModelCommProcessGroup integration (MR !3391)
      • Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (MR !3398)
        • Flexible creation and management of communication groups
      • Add support for Spike No More embedding initializations and weight decay skipping (MR !3500)
    • MoE
      • We are actively optimizing large-scale fine-grained MoE performance on the Blackwell platform.
      • Memory optimization
        • Support recomputation for FP8 layernorm/moe_act/shared_experts (MR !3465)
        • Support optimizer offloading for DSV3 FP8 training (MR !3659)
      • Bug fixes:
        • Fix router input jitter dtype (MR !3774)
    • Ease of use
      • Add uv support for source installs (MR !3615)
      • Automated weekly prereleases (MR !3574)
  • Bug fixes
    • Use mscale_all_dim for softmax_factor (MR !2800)
    • Fix FP8 param blockwise scaling unit test (MR !3480)
    • Fix unit test blockwise scaling (MR !3491)
    • Optimize prefill for token-less requests (MR !3499)
    • Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
    • Fix CUDA graph logic for flexible pp layout (MR !3505)
    • Load FP8 models with strict=False (MR !3508)
    • Skip rope check for torch < 1.4.0 (MR !3528)
    • Disable Apex tests for stability (MR !3539)
    • Fix typo in parallel_state expert parallelism (MR !3548)
    • Guard modelopt on macOS (MR !3549)
    • Retry on CUDA function failure (MR !3554)
    • Fix NCCL mem pool creation error (MR !3557)
    • Fix get_rotary_seq_len return type (MR !3559)
    • Retry on CUDA function failure (MR !3560)
    • Fix NCCL allocator attribute error (MR !3565)
    • Ensure multi-prompt inference works (MR !3568)
    • Fix MD5 on FIPS systems (MR !3577)
    • Fixes dynamic context and inference bugs (MR !3582)
    • Fix TE version for interleaved fused RoPE (MR !3586)
    • Fix MTP with MoE and TP logging (MR !3594)
    • Guard TE import fix (MR !3596)
    • Add assertion for NCCL UB case (MR !3599)
    • Remove Encoder PP related Functions (MR !3604)
    • Fix segfaults in tests (MR !3605)
    • Fix TE error in distributed optimizer (MR !3625)
    • Remove redundant barrier in checkpoint flow (MR !3626)
    • Support VPP MTP, fix logging (MR !3630)
    • Retry mechanism for free(): invalid pointer errors (MR !3632)
    • Fix test_replication.py issues (MR !3633)
    • Fix typo in parallel_state (MR !3634)
    • Fix CUDA graph logic determination (MR !3635)
    • Fix TE installation error (MR !3636)
    • Ensure correct sharding type in local tests (MR !3643)
    • Fix cudagraphed backward buffer reuse for last layer (MR !3645)
    • Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
    • Fix dynamic example script errors (MR !3653)
    • Guard TE import fix (MR !3666)
  • Breaking changes:
    • megatron.core.distributed.custom_fsdp has been refactored into megatron.core.distributed.fsdp.src.megatron_fsdp
  • Known issues
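HyperCommGrid (MR !3398) organizes ranks into an N-dimensional communication grid. As a rough illustration of the indexing such a grid rests on (this is not the HyperCommGrid API), a flat rank maps to and from grid coordinates like so:

```python
def rank_to_coords(rank, dims):
    """Map a flat rank to N-D grid coordinates, last dimension fastest-varying."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return list(reversed(coords))

def coords_to_rank(coords, dims):
    """Inverse mapping: N-D grid coordinates back to a flat rank."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + c
    return rank

# e.g. a hypothetical 2 x 2 x 4 grid (say TP x PP x DP) over 16 ranks
dims = [2, 2, 4]
assert rank_to_coords(5, dims) == [0, 1, 1]
assert coords_to_rank([0, 1, 1], dims) == 5
```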

NVIDIA Megatron Core 0.13.0

  • Support the bf16 dtype for optimizer states to use the precision-aware optimizer in TransformerEngine
  • MoE
    • Features:
      • Flexible Asymmetric Virtual Pipeline Parallelism with Custom Pipeline Layout (--pipeline-model-parallel-layout)
      • Add support to pass custom parallelism groups to MoE modules.
      • Add Hybrid Shard Data-Parallel support for MoE models (--num-distributed-optimizer-instances)
      • Support EP + custom FSDP training for DeepSeek-V3
      • FP8 support for Multi-Token-Prediction
    • Memory Optimization
      • Fine-grained recomputation to reduce activation memory. (--recompute-modules with --recompute-granularity selective)
      • Memory-efficient token permutation by moving the probs multiplication from unpermutation into the activation function of GroupedMLP.
    • Performance Optimization
      • MLA RoPE fusion kernel and YARN embedding cache.
      • FP8 padding optimization of MoE models by padding the routing map.
    • Bug fixes:
      • Fix the aux loss calculation when expert_bias or group-limited routing is used. This changes load_balancing_loss values compared to the previous version.
      • Fix packed sequence support for MLA
    • Known Issues:
      • MTP is not compatible with the flexible pipeline layout; to be fixed in !3594.
      • MTP convergence issue with TP2; to be fixed in !3594.
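The memory-efficient token permutation above works because per-token probability scaling commutes with unpermutation, so the multiply can be fused into GroupedMLP's activation instead of being applied to the unpermuted buffer. A toy sketch of that equivalence (illustrative only, not MCore code):

```python
def unpermute(values, perm):
    """perm[i] gives the original position of permuted element i."""
    out = [None] * len(values)
    for i, p in enumerate(perm):
        out[p] = values[i]
    return out

expert_out = [2.0, 4.0, 6.0]    # per-token expert outputs, permuted order
probs      = [0.5, 0.25, 0.75]  # routing probability of each permuted token
perm       = [2, 0, 1]

# Baseline: unpermute first, then multiply by probs in original order
baseline = [t * p for t, p in zip(unpermute(expert_out, perm),
                                  unpermute(probs, perm))]

# Fused: multiply inside the expert (e.g. in the activation), then unpermute
fused = unpermute([t * p for t, p in zip(expert_out, probs)], perm)

assert baseline == fused
```

Because the fused form never materializes a separate scaled copy of the unpermuted activations, it saves activation memory.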

NVIDIA Megatron Core 0.12.0

  • Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
  • Context parallel: fix loss scaling when calculate_per_token_loss=True
  • Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
  • Inference
    • Support in-flight batching and chunked KV cache
    • Reduce memory usage by:
      • not materializing the full attention mask
      • only materializing logits for the last token during decode
      • removing an obsolete tensor reference
  • Hybrid Model
    • Inference
      • Add CUDA graph support
      • Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
      • Fix a shape issue when materializing logits for Mamba model
    • Improve initialization of Mamba layers
    • Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
    • Make num_floating_point_operations work with hybrid model
    • Make hybrid_conversion.py work with mixer that uses TE linear
    • Add FP8 support
    • Fix Mamba dt_bias tensor parallelism
    • Support multimodal tokenizer
    • Improve data parallelism scaling
  • MoE
    • Features:
      • DeepEP support, compatible with all the parallelisms and token drop / dropless
      • Important precision improvement: enable FP32/FP64 routing and unpermutation via --moe-router-dtype. FP32 is recommended for all fine-grained MoE training
      • CUDA Graph support for MoE
      • Multi-Token Prediction (MTP) Support
      • Fused indices_to_multihot kernel for DeepEP dispatcher
    • Bug fixes:
      • Fix Hang Issue with MoE+Dense Hybrid models
      • Update theoretical memory and tflops estimation for MoE and MLA
      • Fix MoE Aux loss scaling for per token loss
      • Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 end-to-end runs
    • Known issues:
      • Checkpoints trained with custom FSDP for MoE may not be compatible with 3D-parallel training.
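The --moe-router-dtype recommendation above exists because softmax over expert logits is sensitive to low-precision dtypes. The computation it affects is roughly the following toy top-k router (Python floats stand in for FP32; illustrative only, not MCore's router):

```python
import math

def topk_router(logits, k):
    """Toy top-k router: numerically stable softmax over expert logits,
    then keep the k highest-probability experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    experts = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    return experts, [probs[i] for i in experts]

experts, weights = topk_router([1.0, 3.0, 2.0, 0.5], k=2)
assert experts == [1, 2]          # the two largest logits win
assert weights[0] >= weights[1]   # weights come back sorted
```

Running this softmax in FP32 rather than BF16 keeps small probability differences from being rounded away, which is why FP32 routing is recommended for fine-grained MoE training.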

NVIDIA Megatron Core 0.11.0

  • Add multi-datacenter training support through N/S connection
  • MoE
    • Features
      • Support DeepSeek-V3 fine-tuning
        • Aux-loss-free load balancing strategy
        • Node-limited routing and Device-limited routing support.
        • Tensor Parallelism support for MLA and Sequence Auxiliary Loss
        • MTP (with TP and PP support) is coming soon.
      • Permutation / Unpermutation fusion kernel from TransformerEngine.
      • Uneven virtual pipeline parallel split support in first and last PP stage.
    • Bug fixes:
      • Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP.
      • Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding.
    • Known Issues:
      • When training the Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params.
  • Add MX-FP16 support for optimizer and master weights
  • CUDA Graph memory optimizations
  • Enable UCC backend for PP communication
  • Optimizer CPU offload support for memory savings
  • Models
    • Initial RADIO/CRADIO implementation
    • llama3.2 support
  • Hybrid Model
    • Support quantization via TensorRT Model Optimizer
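The aux-loss-free load-balancing strategy listed above (from DeepSeek-V3 fine-tuning support) steers routing with a per-expert bias updated from observed load, instead of an auxiliary loss term. A toy sketch of such a bias update (hypothetical update rule and step size, not the MCore implementation):

```python
def update_expert_bias(bias, tokens_per_expert, step=0.01):
    """Nudge each expert's routing bias up when it received fewer tokens
    than average, and down when it received more, so routing gradually
    rebalances without an auxiliary loss."""
    avg = sum(tokens_per_expert) / len(tokens_per_expert)
    return [b + step * (1 if load < avg else -1)
            for b, load in zip(bias, tokens_per_expert)]

# Expert 0 was underloaded, expert 1 overloaded: biases move to compensate.
bias = update_expert_bias([0.0, 0.0, 0.0], [10, 30, 20])
assert bias == [0.01, -0.01, -0.01]
```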

NVIDIA Megatron Core 0.10.0

  • Adding MLA to MCore
  • Enable FP8 for GroupedMLP
  • MoE Parallel Folding
  • Enhance MoE Architecture: Support MoE Layer Frequency Patterns and Configurable MoE FFN Hidden Size
  • Multimodal: NVLM training and evaluation support in MCore
  • Mamba Hybrid
    • Increase performance and reduce memory footprint of Triton language/compiler distributed caching
    • Add more unit testing and fix bugs
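The configurable MoE layer-frequency support above interleaves dense and MoE layers through the network. A hypothetical illustration of one such pattern, one MoE layer every N layers (the actual MCore configuration interface differs):

```python
def moe_layer_pattern(num_layers, freq):
    """Toy layer-frequency pattern: every `freq`-th layer is MoE,
    the rest are dense (illustrative only)."""
    return ["moe" if (i + 1) % freq == 0 else "dense"
            for i in range(num_layers)]

assert moe_layer_pattern(6, 3) == ["dense", "dense", "moe",
                                   "dense", "dense", "moe"]
```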

NVIDIA Megatron Core 0.9.0

  • Uneven pipeline parallelism
    • Enable pipeline parallelism where first and last ranks have fewer transformer layers than the intermediate ranks
  • Per layer CUDAGraph support for GPT training with Transformer Engine modules
  • Enable different TP sizes for the vision encoder
  • Enable pipeline parallelism for T5 & Llava models
  • Support multi-tile multi-image input in Llava models
  • MoE
    • FP8 support
    • Runtime upcycling support
    • Dispatcher implementation optimizations
    • Shared expert support with overlapping optimizations
    • Qwen model support
  • Known Issues
    • When using sequence parallel, during the transformer block forward pass, dropout is not using the appropriate rng context.
  • NVRx / Fault tolerance
    • Fault and hang detection in addition to existing straggler detection
    • Graceful exit and auto restart
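Uneven pipeline parallelism, described above, gives the first and last ranks fewer transformer layers to balance the extra embedding and loss computation they carry. A toy split (illustrative only, not MCore's partitioning code):

```python
def split_layers(num_layers, pp_size, first, last):
    """Toy uneven pipeline split: the first and last ranks get the given
    layer counts, and the middle ranks share the remainder evenly."""
    middle = pp_size - 2
    remaining = num_layers - first - last
    assert middle > 0 and remaining % middle == 0, "split must divide evenly"
    return [first] + [remaining // middle] * middle + [last]

# 32 layers over 4 pipeline ranks, with lighter first and last stages
assert split_layers(32, 4, 6, 6) == [6, 10, 10, 6]
```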

NVIDIA Megatron Core 0.8.0

  • Multimodal
    • Added initial support for training vision language models using the LLaVA architecture
    • Added initial support for inference with multimodal inputs
    • End-to-end multimodal example from data collection to training to evaluation is provided in examples/multimodal
  • MoE
    • Context Parallel support.
    • Distributed checkpoint support for grouped GEMM.
  • Mamba

NVIDIA Megatron Core 0.7.0

  • MoE
    • Token drop support
    • Several efficiency optimizations
    • Improved model parallelism
    • Memory optimizations
  • Distributed checkpointing
    • Enabled for Retro
    • Asynchronous checkpoint saving
  • Several minor bug fixes, speed improvements, and memory optimizations

NVIDIA Megatron Core 0.6.0

  • MoE (Mixture of Experts)
    • Performance optimization
      • Communication optimizations for multi-GPU and single-GPU
      • 23% improvement (323 TFLOPS/GPU) over MCore 0.5.0 on Mixtral with Hopper BF16
      • GroupedMLP enhancement for Hopper
      • DP overlapping: support overlapping computation with gradient reduction and parameter gathering.
    • All-to-All based Token Dispatcher
    • Layer-wise logging for load balancing loss.
    • Improved expert parallel support including distributed optimizer.
  • Distributed optimizer
  • RETRO
    • Data processing
  • BERT
    • Distributed checkpointing
  • Dist checkpointing
    • PyTorch native distributed backend
    • Improved saving/loading speed
  • TensorRT-LLM Export
    • Integration with TensorRT Model Optimizer Post-training quantization (PTQ)
    • Text generation driver to perform PTQ in Megatron-LM
    • Llama2 and Nemotron3-8b examples to use TensorRT-LLM unified build API to build engine after training.
  • Several minor enhancements, bug fixes, and documentation updates
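The all-to-all-based token dispatcher above sends each token to the rank hosting its expert and scatters results back afterward. A single-process toy analogue of that dispatch/combine round trip (illustrative only, not the MCore dispatcher):

```python
def dispatch(tokens, expert_of, num_experts):
    """Bucket tokens by destination expert, recording each token's origin
    so the combine step can restore the original order."""
    buckets = [[] for _ in range(num_experts)]
    origins = [[] for _ in range(num_experts)]
    for i, tok in enumerate(tokens):
        e = expert_of[i]
        buckets[e].append(tok)
        origins[e].append(i)
    return buckets, origins

def combine(buckets, origins, n):
    """Scatter expert outputs back to their original token positions."""
    out = [None] * n
    for bucket, origin in zip(buckets, origins):
        for tok, i in zip(bucket, origin):
            out[i] = tok
    return out

tokens = ["t0", "t1", "t2", "t3"]
buckets, origins = dispatch(tokens, [1, 0, 1, 0], num_experts=2)
assert buckets == [["t1", "t3"], ["t0", "t2"]]
assert combine(buckets, origins, 4) == tokens  # round trip is lossless
```

In the distributed setting the two bucket exchanges are NCCL all-to-all collectives; the bookkeeping is the same.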

NVIDIA Megatron Core 0.5.0

Key Features and Enhancements

Megatron Core documentation is now live!

Model Features

  • MoE (Mixture of Experts)
    • Support for Z-loss, Load balancing and Sinkhorn
    • Layer and communications refactor
    • Richer parallelism mappings: EP can be combined with other model-parallel techniques for larger MoE variants, e.g. EP + TP + DP + SP + PP
    • Token dropless architecture with Top-K routing
    • Performance optimization with GroupedGEMM when the number of local experts is > 1
    • Distributed checkpointing
  • Interleaved rotary embedding

Datasets

  • Masked WordPiece datasets for BERT and T5
  • Raw and mock datasets

Parallelism

Performance

  • Activation offloading to CPU
  • Rope and Swiglu fusion
  • Sliding window attention (via Transformer Engine)
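Sliding window attention, listed above, restricts each query position to a recent window of keys instead of the full causal prefix. A toy causal sliding-window mask (illustrative only; the actual kernels live in Transformer Engine):

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: position i may attend to positions
    max(0, i - window + 1) .. i, i.e. at most `window` recent keys."""
    return [[1 if i - window < j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(4, 2)
assert mask == [[1, 0, 0, 0],
                [1, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 1, 1]]
```

Each row has at most `window` ones, so attention cost per token stays constant in sequence length.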

General Improvements

  • Timers

NVIDIA Megatron Core 0.4.0

Key Features and Enhancements

Models

  • BERT
  • RETRO
  • T5

Parallelism

  • Mixture of Experts support for GPT
  • Model parallel efficient Distributed Data Parallel (DDP)
  • Context Parallel (2D Tensor Parallel) support

Datasets

  • GPT Dataset
  • Blended Dataset