Passes
- SelectiveMixedPrecision: Add sensitivity-score-based algorithms
- Add Stable Diffusion LoRA pass
- Add ONNX conversion pass support for diffusers models
- MatMulNBitsToQDQ: Support 2-bit (a sketch of the QDQ pattern follows this list)
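
For context, MatMulNBitsToQDQ rewrites the ONNX Runtime MatMulNBits contrib op into a standard DequantizeLinear → MatMul (QDQ) pattern. The minimal sketch below builds that pattern with the `onnx.helper` API; the tensor names, shapes, block size, and the INT4 element type are illustrative assumptions (2-bit follows the same structure with a narrower storage type), not Olive's actual implementation.

```python
# Minimal sketch of the QDQ pattern MatMulNBitsToQDQ targets:
# MatMulNBits(A, packed_B, scales) -> DequantizeLinear(B_q, B_scales) -> MatMul(A, B_dq)
import numpy as np
import onnx
from onnx import TensorProto, helper

K, N, block_size = 64, 32, 16

# Quantized weight stored as INT4 here; a 2-bit weight would use the same
# shape logic with a narrower element type (assumption for illustration).
b_q = helper.make_tensor(
    "B_q", TensorProto.INT4, [K, N], np.zeros(K * N, dtype=np.int8).tolist()
)
# Block-wise scales: one scale per block_size rows along axis 0.
scales = helper.make_tensor(
    "B_scales", TensorProto.FLOAT, [K // block_size, N],
    np.ones((K // block_size) * N, dtype=np.float32).tolist(),
)

dq = helper.make_node(
    "DequantizeLinear", ["B_q", "B_scales"], ["B_dq"], axis=0, block_size=block_size
)
mm = helper.make_node("MatMul", ["A", "B_dq"], ["Y"])

graph = helper.make_graph(
    [dq, mm], "qdq_matmul",
    [helper.make_tensor_value_info("A", TensorProto.FLOAT, ["M", K])],
    [helper.make_tensor_value_info("Y", TensorProto.FLOAT, ["M", N])],
    initializer=[b_q, scales],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
onnx.checker.check_model(model)
```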
Quantization
- Quantization: Keep embeddings tied in SelectiveMixedPrecision; clean up overrides
- Add support for tensor-wise mixed precision in Quark ONNX quantization
- Support multiple modes of ignored scopes simultaneously in Intel® OpenVINO weight compression and quantization passes
- Automatically tie KV cache I/O quantizers in AimetQuantization
- Quantization: Fix annotation
- Quantization: Add utilities to pack and unpack uint8 storage (a packing sketch follows this list)
- Quantization: Generalize Olive quantized model loading in model builder
- VitisAI AMD NPU LLM quantization: Add Windows + CUDA support for the Quark quantizer
- Quantization: Add tie_quant_modules utility and new tests
- A few fixes for the benchmark CLI and Quark quantization
- Quantization: Enable 2-bit in QuantModules
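
Two of the entries above (the uint8 pack/unpack utilities and 2-bit QuantModules support) concern sub-byte weight storage. The sketch below shows one way to pack 2-bit codes into uint8 — four codes per byte, low bits first; it is an illustration only, and Olive's actual utilities, bit ordering, and padding behavior may differ.

```python
# Minimal sketch (not Olive's actual utility) of 2-bit <-> uint8 packing.
import numpy as np

def pack_2bit(values: np.ndarray) -> np.ndarray:
    """Pack an array of 2-bit codes (0..3) into uint8, 4 codes per byte."""
    flat = values.astype(np.uint8).ravel()
    assert flat.size % 4 == 0, "pad to a multiple of 4 before packing"
    flat = flat.reshape(-1, 4)
    # Code 0 goes in the lowest two bits, code 3 in the highest two.
    return (
        flat[:, 0] | (flat[:, 1] << 2) | (flat[:, 2] << 4) | (flat[:, 3] << 6)
    ).astype(np.uint8)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_2bit; yields 4 codes per input byte."""
    p = packed.ravel()
    return np.stack([(p >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).ravel()

codes = np.random.randint(0, 4, size=16)
assert np.array_equal(unpack_2bit(pack_2bit(codes)), codes)
```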
Bug Fixes and other updates
- HfModelHandler: Check for tokenizer_config.json instead of try/else
- Fix cache output model name bug
- Fix a bug with an incorrect parameter type
- Update MatMulAddToGemm graph surgery to handle a ReLU present after the Add
- UT: Remove azureml-evaluate-mlflow; update optimum and autoawq dependencies
- Add dict check for HF model patch in Conversion pass
- UT: Use the same input PyTorch model in the OpenVINO test
- Remove g_idx input from MatMulNBits graph surgery
- Add diffusers model handler
- Add SD LoRA data container and preprocessing functions
- Add MB bf16 support for the caption-onnx-graph CLI
- Replace TRANSFORMERS_CACHE with HF_HUB_CACHE
- Disable the LM head during the prefill phase in the GenAI config
- Update device selection for bf16 in the caption-onnx CLI
- Add a flag to apply DeduplicateHashedInitializersPass after graph surgery
- Add with_prior_preservation option for DreamBooth
- Add RenameOutputDims, PackedAttentionToPackedMHA, and PackedAttentionToLoopMHA surgeons; these will be used for the Qwen VL model
- Add I/O configs for popular models (an example follows this list)
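
For reference, Olive describes a model's inputs and outputs with an `io_config`. The snippet below is a hedged example of what such a config looks like for a decoder-style transformer; the concrete tensor names and shapes are placeholders, not one of the shipped configs.

```python
# Illustrative io_config for a decoder-style model; values are placeholders.
io_config = {
    "input_names": ["input_ids", "attention_mask"],
    "input_shapes": [[1, 128], [1, 128]],
    "input_types": ["int64", "int64"],
    "output_names": ["logits"],
    # Dynamic axes let the exported graph accept variable batch/sequence sizes.
    "dynamic_axes": {
        "input_ids": {"0": "batch_size", "1": "sequence_length"},
        "attention_mask": {"0": "batch_size", "1": "sequence_length"},
        "logits": {"0": "batch_size", "1": "sequence_length"},
    },
}
```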