Fix a bug in FastNAS pruning (computer vision models) where model parameters were sorted twice, corrupting the parameter ordering.
Fix Q/DQ/Cast node placement for 'FP32 required' tensors of custom ops in the ONNX quantization workflow.
New Features
Add MoE pruning support (e.g. Qwen3-30B-A3B, gpt-oss-20b) for the num_moe_experts, moe_ffn_hidden_size, and moe_shared_expert_intermediate_size parameters in Minitron pruning (mcore_minitron); see the sketch after this list.
Add FP8/NVFP4 KV cache quantization support for Megatron Core models (see the sketch after this list).
Add a KL divergence loss-based auto_quantize method. See the auto_quantize API docs for more details; a combined sketch with search-state resume follows this list.
Add support for saving and resuming auto_quantize search state. This speeds up the auto_quantize process by skipping the score estimation step if the search state is provided.
Add a trt_plugins_precision flag in ONNX autocast to specify the precision of custom ops, mirroring the equivalent flag in the ONNX quantization workflow.
Add support for PyTorch Geometric quantization (see the sketch after this list).
Add per-tensor and per-channel MSE calibrator support.
Add support for PTQ/QAT checkpoint export and loading for running fakequant evaluation in vLLM. See examples/vllm_serve/README.md for more details.
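For MoE Minitron pruning, a minimal sketch, assuming model is an in-memory Megatron Core MoE model (e.g. Qwen3-30B-A3B) and forward_loop runs a few calibration batches through it; the target values are placeholders, not recommendations:

```python
import modelopt.torch.prune as mtp

# Placeholder targets for the pruned MoE architecture; choose values for your model.
export_config = {
    "num_moe_experts": 64,
    "moe_ffn_hidden_size": 1024,
    "moe_shared_expert_intermediate_size": 1024,
}

# model: Megatron Core MoE model already loaded in memory.
# forward_loop(model): runs calibration data through the model to rank prunable units.
model, _ = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints={"export_config": export_config},
    dummy_input=None,  # unused by mcore_minitron
    config={"forward_loop": forward_loop},
)
```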
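For FP8/NVFP4 KV cache quantization of a Megatron Core model, a sketch that merges a KV cache quantizer config into a base weight/activation config; the FP8_KV_CFG name is an assumption here, so verify it against the quantization config reference:

```python
import copy

import modelopt.torch.quantization as mtq

# Base weight/activation config plus KV cache quantizer entries.
# The *_KV_CFG name is an assumption; check the quantization docs for the exact name.
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
quant_cfg["quant_cfg"].update(mtq.FP8_KV_CFG["quant_cfg"])

# model: Megatron Core model; forward_loop(model): runs calibration batches.
model = mtq.quantize(model, quant_cfg, forward_loop)
```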
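For the KL divergence scoring method and search-state resume in auto_quantize, a combined sketch; the score-method and search-state arguments use hypothetical names, so consult the auto_quantize API docs for the exact spelling:

```python
import modelopt.torch.quantization as mtq

# calib_loader yields model inputs; the lambdas map batches to outputs and loss.
model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},
    quantization_formats=["NVFP4_DEFAULT_CFG", "FP8_DEFAULT_CFG"],
    data_loader=calib_loader,
    forward_step=lambda m, batch: m(**batch),
    loss_func=lambda output, batch: output.loss,
    # Hypothetical argument names for the KL divergence scoring method and the
    # saved/resumed search state; check the auto_quantize API docs.
    score_func="kl_div",
    checkpoint="auto_quantize_search_state.pth",
)
```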
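For PyTorch Geometric quantization, a self-contained sketch that post-training-quantizes a small GCN with the default INT8 config; the model and calibration data are toy placeholders:

```python
import torch
import modelopt.torch.quantization as mtq
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self, in_channels: int, hidden: int, out_channels: int):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden)
        self.conv2 = GCNConv(hidden, out_channels)

    def forward(self, x, edge_index):
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)


model = GCN(16, 32, 7)


def forward_loop(m):
    # Feed representative graphs through the model to calibrate the quantizers.
    x = torch.randn(100, 16)
    edge_index = torch.randint(0, 100, (2, 400))
    m(x, edge_index)


model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```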
NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer. GitHub will automatically redirect the old repository path (NVIDIA/TensorRT-Model-Optimizer) to the new one (NVIDIA/Model-Optimizer). The documentation URL has also changed to nvidia.github.io/Model-Optimizer.
Bump the TensorRT-LLM test Docker image to 1.2.0rc4.
Bump minimum recommended transformers version to 4.53.
Replace the ONNX simplification package onnxsim with onnxslim.