Upgraded the ONNX Runtime dependency to 1.24 to fix missing graph outputs when using the TensorRT Execution Provider.
Backward Breaking Changes
The default --kv_cache_qformat in hf_ptq.py changed from fp8 to fp8_cast. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass --kv_cache_qformat fp8 (see the example below).
Removed KV cache scale clamping (clamp_(min=1.0)) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (--kv_cache_qformat fp8 or nvfp4), consider using the casting methods (fp8_cast or nvfp4_cast) instead.
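For example, to keep calibrated FP8 KV cache scales (the pre-change behavior), pass the flag explicitly; other arguments are elided here:

    # New default: constant-amax casting, KV cache calibration is skipped.
    python hf_ptq.py ... --kv_cache_qformat fp8_cast

    # Pre-change behavior: data-driven KV cache calibration.
    python hf_ptq.py ... --kv_cache_qformat fp8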
New Features
Add fp8_cast and nvfp4_cast modes for --kv_cache_qformat in hf_ptq.py. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new use_constant_amax field in QuantizerAttributeConfig controls this behavior.
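A minimal sketch of the new field on a quantizer attribute config; the surrounding arguments are illustrative, not the exact KV cache config shipped with modelopt:

    from modelopt.torch.quantization.config import QuantizerAttributeConfig

    # Sketch: an FP8 (E4M3) quantizer that skips data-driven calibration
    # and uses the constant amax (448.0, the FP8 E4M3 max) instead.
    kv_cfg = QuantizerAttributeConfig(
        num_bits=(4, 3),         # FP8 E4M3
        use_constant_amax=True,  # new field in this release
    )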
Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
hf_ptq.py now saves the quantization summary and the MoE expert token-count table to the export directory.
Add --moe_calib_experts_ratio flag in hf_ptq.py to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to None (disabled).
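For example, to exercise half of the routed experts on each calibration forward pass (0.5 is an arbitrary value; other arguments elided):

    python hf_ptq.py ... --moe_calib_experts_ratio 0.5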
Add sparse attention optimization for transformer models (modelopt.torch.sparsity.attention_sparsity). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See examples/llm_sparsity/attention_sparsity/README.md for usage.
Add support for rotating the input before quantization for RHT (Randomized Hadamard Transform).
Add support for advanced weight scale search for NVFP4 quantization and its export path.
Enable PTQ workflow for Qwen3.5 MoE models.
Enable PTQ workflow for the Kimi-K2.5 model.
Add nvfp4_omlp_only quantization format for NVFP4 quantization. This is similar to nvfp4_mlp_only but also quantizes the output projection layer in attention.
Add nvfp4_experts_only quantization config that targets only MoE routed expert layers (excluding shared experts) with NVFP4 quantization.
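Assuming both new format names are accepted through hf_ptq.py's --qformat flag like the existing NVFP4 variants, selection looks like this (other arguments elided):

    # MLP layers plus the attention output projection.
    python hf_ptq.py ... --qformat nvfp4_omlp_only

    # Routed MoE expert layers only.
    python hf_ptq.py ... --qformat nvfp4_experts_only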
pass_through_bwd in the quantization config now defaults to True. Set it to False to use STE with zeroed outlier gradients, which may give better QAT accuracy.
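A sketch of opting back into the STE behavior by overriding the attribute on a stock config; the key placement below follows the usual modelopt config layout but is illustrative:

    import copy

    import modelopt.torch.quantization as mtq

    # Start from a stock NVFP4 config and disable pass-through backward
    # for weight quantizers (sketch; apply to other quantizers as needed).
    cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
    cfg["quant_cfg"]["*weight_quantizer"]["pass_through_bwd"] = False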
Add compute_quantization_mse API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
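A hedged sketch of the intended call shape; the import path and argument names below are assumptions, so check the API reference for the actual signature:

    # Hypothetical usage: measure per-quantizer MSE on calibration data,
    # restricted to attention modules via a wildcard filter.
    from modelopt.torch.quantization import compute_quantization_mse  # path assumed

    mse_per_quantizer = compute_quantization_mse(
        model, forward_loop, filter_pattern="*attn*"  # argument names assumed
    )
    for name, mse in mse_per_quantizer.items():
        print(name, mse)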
Autotune: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: python -m modelopt.onnx.quantization.autotune. See the Autotune guide in the documentation.
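The tool ships as a module entry point; a minimal invocation (see the Autotune guide for the full flag list):

    # Entry point from the docs; --help lists the available options.
    python -m modelopt.onnx.quantization.autotune --help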
Add get_auto_quantize_config API to extract a flat quantization config from auto_quantize search results, enabling re-quantization at different effective bit targets without re-running calibration.
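A hedged sketch of the intended flow; the import path, argument names, and the 6.0-bit target are assumptions, and model, forward_loop, and search_results stand in for objects from a prior auto_quantize run:

    import modelopt.torch.quantization as mtq
    from modelopt.torch.quantization import get_auto_quantize_config  # path assumed

    # Hypothetical: reuse a finished auto_quantize search to build a flat
    # config at a new effective-bits target, then re-quantize directly,
    # skipping calibration.
    flat_cfg = get_auto_quantize_config(search_results, effective_bits=6.0)
    model = mtq.quantize(model, flat_cfg, forward_loop)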
Improve auto_quantize checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
Add support for quantizing Nemotron-3 (NemotronHForCausalLM) models, including handling of NemotronH MoE experts in auto_quantize grouping and scoring rules.
Add support for block-granular RHT for non-power-of-2 dimensions.
Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes.
Deprecations
Removed MT-Bench (FastChat) support from examples/llm_eval. The run_fastchat.sh and gen_model_answer.py scripts have been deleted, and the mtbench task has been removed from the llm_ptq example scripts.
Remove deprecated NeMo-2.0 Framework references.
Misc
Migrated project metadata from setup.py to a fully declarative pyproject.toml.
Enable experimental Python 3.13 wheel support and unit tests in CI/CD.