This folder contains examples of Olive recipes for DeepSeek-R1-Distill-Qwen-14B optimization.
The Olive recipe DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json produces an INT4 AWQ quantized model using NVIDIA's TensorRT Model Optimizer toolkit.
- Install Olive with NVIDIA TensorRT Model Optimizer toolkit
- Run the following command to install Olive with TensorRT Model Optimizer.
pip install olive-ai[nvmo]
- If TensorRT Model Optimizer needs to be installed from a local wheel, follow the steps below.
pip install olive-ai
pip install <modelopt-wheel>[onnx]
- Make sure that TensorRT Model Optimizer is installed correctly.
python -c "from modelopt.onnx.quantization.int4 import quantize as quantize_int4"
- Refer to the TensorRT Model Optimizer documentation for detailed installation instructions and setup dependencies.
- Install suitable onnxruntime and onnxruntime-genai packages
- Install the onnxruntime and onnxruntime-genai packages that have NvTensorRTRTXExecutionProvider support. Refer to the documentation for the NvTensorRtRtx execution provider to set up its dependencies/requirements.
- Note that, by default, TensorRT Model Optimizer comes with onnxruntime-directml, and the onnxruntime-genai-cuda package comes with onnxruntime-gpu. To use an onnxruntime package with NvTensorRTRTXExecutionProvider support, you might need to uninstall the other existing onnxruntime packages.
- Make sure that, at the end, only one onnxruntime package is installed. Use a command like the following to validate the onnxruntime installation.
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
- Install additional requirements.
- Install the packages listed in the requirements text file.
pip install -r requirements-nvmo-awq.txt
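
The "only one onnxruntime package" condition from the onnxruntime step above can also be checked from Python. The following is a minimal sketch, not part of Olive or onnxruntime: the function name `installed_onnxruntime_packages` is hypothetical, and the name-prefix matching (including the list of excluded add-on packages) is a heuristic assumption.

```python
from importlib import metadata

# Add-on packages that share the "onnxruntime-" prefix but are not the
# core runtime itself (assumed list; extend it for your environment).
NON_RUNTIME_PREFIXES = ("onnxruntime-genai", "onnxruntime-extensions", "onnxruntime-tools")


def installed_onnxruntime_packages(names=None):
    """Return the sorted names of installed onnxruntime runtime packages.

    If `names` is None, the names are read from the current environment.
    """
    if names is None:
        names = [dist.metadata["Name"] for dist in metadata.distributions()]
    found = set()
    for name in names:
        if not name:
            continue  # skip distributions with broken metadata
        normalized = name.lower().replace("_", "-")
        is_variant = normalized.startswith("onnxruntime-") and not normalized.startswith(
            NON_RUNTIME_PREFIXES
        )
        if normalized == "onnxruntime" or is_variant:
            found.add(normalized)
    return sorted(found)


if __name__ == "__main__":
    packages = installed_onnxruntime_packages()
    print("onnxruntime packages found:", packages)
    if len(packages) != 1:
        print("Expected exactly one onnxruntime package; uninstall the extras.")
```

If more than one package is reported, uninstall the extras with pip and re-run the provider check shown earlier.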
olive run --config DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json

The Olive recipe DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json has two passes: (a) ModelBuilder and (b) NVModelOptQuantization. The ModelBuilder pass generates the FP16 model for NvTensorRTRTXExecutionProvider (aka NvTensorRtRtx EP). Subsequently, the NVModelOptQuantization pass performs INT4 AWQ quantization to produce the 4-bit optimized model. In the quantization pass, calibration uses execution providers from the available/installed onnxruntime execution providers. The calibration_providers field can be used to select a specific execution provider for calibration (assuming it is available/installed).
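
To see which passes a recipe defines and whether calibration_providers is set, the recipe file can be inspected with a few lines of Python. This is a rough sketch that assumes the recipe JSON has a top-level "passes" object keyed by pass name, with a "type" entry per pass; the helper names `load_recipe` and `summarize_passes` are hypothetical.

```python
import json


def load_recipe(path):
    """Read an Olive recipe JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_passes(recipe):
    """Return (pass_name, pass_type, calibration_providers) tuples in recipe order.

    Assumes a top-level "passes" mapping; calibration_providers is None
    for passes that do not set the field.
    """
    summary = []
    for name, cfg in recipe.get("passes", {}).items():
        summary.append((name, cfg.get("type"), cfg.get("calibration_providers")))
    return summary
```

For example, `summarize_passes(load_recipe("DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json"))` should list the ModelBuilder pass followed by the NVModelOptQuantization pass, if the recipe follows the assumed layout.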
- Note that when NvTensorRTRTXExecutionProvider is used for INT4 AWQ quantization, a profile (min/max/opt ranges) of the model's input shapes is created internally using details from the model's config (e.g. config.json in the HuggingFace model card). This input-shapes profile is used during onnxruntime session creation. Make sure that config.json is available in the model directory if tokenizer_dir is a model path (instead of a model name).
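
The config.json requirement above can be checked before running the recipe. The sketch below is only illustrative: the helper name `missing_config_json` is hypothetical, and it assumes that any tokenizer_dir value that is not an existing local directory is a model name.

```python
from pathlib import Path


def missing_config_json(tokenizer_dir):
    """Return True if tokenizer_dir is a local directory without config.json.

    Values that are not existing directories are treated as model names
    (an assumption of this sketch), so they return False.
    """
    path = Path(tokenizer_dir)
    return path.is_dir() and not (path / "config.json").is_file()
```

A True result means the model directory needs a config.json before the quantization pass can build the input-shapes profile.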
If you run into any issue related to quantization with the TensorRT Model Optimizer toolkit, refer to its FAQs for potential help or suggestions.