ModelOpt 0.40.0 Release

@kevalmorabia97 released this 12 Dec 10:27
411912e

Bug Fixes

  • Fix a bug in FastNAS pruning for computer vision models where the model parameters were sorted twice, scrambling the ordering (a usage sketch of the pruning API follows this list).
  • Fix Q/DQ/Cast node placement around 'FP32 required' tensors of custom ops in the ONNX quantization workflow.
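
For context, the API affected by the FastNAS fix is `mtp.prune` with the `fastnas` mode. The sketch below is a minimal usage example, not code from this release: the model, dummy input, toy calibration data, and score function are placeholders you would replace for your own CV workload, and the 60% FLOPs target is illustrative.

```python
import torch
import torchvision
import modelopt.torch.prune as mtp

model = torchvision.models.resnet18()       # any CV model works; placeholder choice
dummy_input = torch.randn(1, 3, 224, 224)   # example input used to trace the model

# Toy stand-ins: in practice, supply a real calibration dataloader and a
# score function that returns a validation metric for a candidate subnet.
train_loader = [torch.randn(2, 3, 224, 224) for _ in range(8)]

def score_func(candidate_model):
    # Placeholder: evaluate the candidate subnet and return a score to maximize.
    return 0.0

# Minimal FastNAS pruning sketch (the search whose parameter ordering the fix
# above corrects). The FLOPs constraint value is illustrative only.
pruned_model, prune_info = mtp.prune(
    model,
    mode="fastnas",
    constraints={"flops": "60%"},      # keep roughly 60% of the original FLOPs
    dummy_input=dummy_input,
    config={
        "data_loader": train_loader,   # calibration data for scoring subnets
        "score_func": score_func,      # ranks candidate subnets during the search
    },
)
```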

New Features

  • Add MoE pruning support (e.g. Qwen3-30B-A3B, gpt-oss-20b) for the num_moe_experts, moe_ffn_hidden_size, and moe_shared_expert_intermediate_size parameters in Minitron pruning (mcore_minitron); a hedged pruning sketch follows this list.
  • Add specdec_bench example to benchmark speculative decoding performance. See examples/specdec_bench/README.md for more details.
  • Add FP8/NVFP4 KV cache quantization support for Megatron Core models.
  • Add a KL divergence loss-based auto_quantize scoring method. See the auto_quantize API docs for more details.
  • Add support for saving and resuming the auto_quantize search state. This speeds up auto_quantize by skipping the score estimation step when a saved search state is provided (an auto_quantize sketch follows this list).
  • Add the trt_plugins_precision flag to ONNX autocast to indicate the precision of custom ops, mirroring the flag that already exists in the quantization workflow.
  • Add support for PyTorch Geometric quantization.
  • Add per-tensor and per-channel MSE calibrator support.
  • Add support for PTQ/QAT checkpoint export and loading for running fakequant evaluation in vLLM. See examples/vllm_serve/README.md for more details.
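
For the new MoE pruning dimensions, a Minitron (`mcore_minitron`) pruning call might look roughly like the sketch below. The target values are illustrative, `model` and `forward_loop` are placeholders for a Megatron Core model and its calibration loop, and the full set of supported export_config keys is documented in the Minitron pruning docs.

```python
import modelopt.torch.prune as mtp

# Hedged sketch: pruning a Megatron Core MoE model along the newly supported
# MoE dimensions. The target values below are illustrative only.
export_config = {
    "num_moe_experts": 64,
    "moe_ffn_hidden_size": 1024,
    "moe_shared_expert_intermediate_size": 2048,
}

pruned_model, _ = mtp.prune(
    model,                                   # Megatron Core MoE model (placeholder)
    mode="mcore_minitron",
    constraints={"export_config": export_config},
    dummy_input=None,                        # not needed when a forward_loop is provided
    config={"forward_loop": forward_loop},   # calibration forward loop (placeholder)
)
```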
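
The two auto_quantize items above slot into the usual auto_quantize call, sketched below under some assumptions: the argument names follow the general auto_quantize pattern, `calib_loader` and the lambdas are user-supplied placeholders, the return signature is assumed, and the options that select the KL divergence scoring method and the search-state save/resume path are intentionally not spelled out here; see the auto_quantize API docs for their exact names.

```python
import modelopt.torch.quantization as mtq

# Hedged auto_quantize sketch. The KL divergence scoring method and the
# search-state save/resume feature added in this release are configured via
# additional arguments described in the auto_quantize API docs (names omitted
# here to avoid guessing). The (model, search_state) return shape is assumed.
model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},          # illustrative average bit-width target
    data_loader=calib_loader,                     # calibration batches (placeholder)
    forward_step=lambda model, batch: model(**batch),
    loss_func=lambda output, batch: output.loss,  # loss used for sensitivity scoring
    quantization_formats=[mtq.NVFP4_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG],  # candidate formats
)
```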

Misc

  • NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer. GitHub automatically redirects the old repository path (NVIDIA/TensorRT-Model-Optimizer) to the new one (NVIDIA/Model-Optimizer). The documentation URL has also changed to nvidia.github.io/Model-Optimizer.
  • Bump the TensorRT-LLM test Docker image to 1.2.0rc4.
  • Bump the minimum recommended transformers version to 4.53.
  • Replace the ONNX simplification package onnxsim with onnxslim.