DeepSeek-R1-Distill-Qwen-14B optimization

This folder contains examples of Olive recipes for DeepSeek-R1-Distill-Qwen-14B optimization.

INT4 AWQ Quantized Model Generation

The Olive recipe DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json produces an INT4 AWQ quantized model using NVIDIA's TensorRT Model Optimizer toolkit.

Setup

Install Olive with the NVIDIA TensorRT Model Optimizer toolkit
- Run the following command to install Olive with TensorRT Model Optimizer.
```
pip install olive-ai[nvmo]
```
- If TensorRT Model Optimizer needs to be installed from a local wheel, follow the steps below instead.
```
pip install olive-ai
pip install <modelopt-wheel>[onnx]
```
- Verify that TensorRT Model Optimizer is installed correctly. The following import should complete without errors.
```
python -c "from modelopt.onnx.quantization.int4 import quantize as quantize_int4"
```
- Refer to the TensorRT Model Optimizer documentation for detailed installation instructions and setup dependencies.
Install suitable onnxruntime and onnxruntime-genai packages
- Install the onnxruntime and onnxruntime-genai packages that have NvTensorRTRTXExecutionProvider support. Refer to the documentation for the NvTensorRtRtx execution provider to set up its dependencies and requirements.
- Note that, by default, TensorRT Model Optimizer comes with onnxruntime-directml, and the onnxruntime-genai-cuda package comes with onnxruntime-gpu. To use an onnxruntime package with NvTensorRTRTXExecutionProvider support, you might therefore need to uninstall the other onnxruntime packages that are already installed.
- Make sure that, in the end, only one onnxruntime package is installed. Use a command like the following to validate the onnxruntime installation and list the execution providers it exposes.
```
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```
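The "only one onnxruntime package" condition can also be checked from Python. The following is a minimal sketch using the standard-library importlib.metadata module. The helper names are hypothetical, and the exclusion list (onnxruntime-genai and onnxruntime-extensions are treated as separate libraries that do not conflict) is an assumption to adjust for your environment.
```python
from importlib import metadata

# Assumption: distributions with these prefixes are companion libraries,
# not alternative builds of the onnxruntime runtime itself.
NON_RUNTIME_PREFIXES = ("onnxruntime-genai", "onnxruntime-extensions")


def filter_onnxruntime_packages(names):
    """Return the sorted, normalized names that look like onnxruntime runtime builds."""
    found = set()
    for raw in names:
        name = (raw or "").lower().replace("_", "-")
        if name.startswith("onnxruntime") and not name.startswith(NON_RUNTIME_PREFIXES):
            found.add(name)
    return sorted(found)


def installed_onnxruntime_packages():
    """Inspect the current environment for installed onnxruntime runtime packages."""
    names = [dist.metadata.get("Name") for dist in metadata.distributions()]
    return filter_onnxruntime_packages(names)


if __name__ == "__main__":
    packages = installed_onnxruntime_packages()
    print("onnxruntime packages found:", packages)
    if len(packages) > 1:
        print("More than one onnxruntime package is installed; uninstall the extras.")
```
If the script reports more than one package, uninstall the ones that do not provide NvTensorRTRTXExecutionProvider and rerun the provider check above.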
Install additional requirements
- Install the packages listed in the requirements text file.
```
pip install -r requirements-nvmo-awq.txt
```

Steps to run

olive run --config DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json

Recipe details

The Olive recipe DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json has 2 passes: (a) ModelBuilder and (b) NVModelOptQuantization. The ModelBuilder pass generates the FP16 model for NvTensorRTRTXExecutionProvider (aka NvTensorRtRtx EP). The NVModelOptQuantization pass then performs INT4 AWQ quantization to produce the 4-bit optimized model. For calibration, the quantization pass uses execution providers from those available in the installed onnxruntime package. The calibration_providers field can be used to select specific execution providers for calibration (assuming they are available/installed).
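
As an illustration of the calibration_providers field, here is a small sketch that sets it on a recipe loaded as a Python dict. It assumes the recipe keeps its passes in a top-level "passes" mapping whose entries carry a "type" field, and that pass options sit either directly on the entry or under a nested "config" key, depending on the Olive version. The helper name and the trimmed-down example recipe are hypothetical; check the actual JSON file for the real layout.
```python
import copy
import json


def set_calibration_providers(recipe, providers):
    """Return a copy of the recipe with calibration_providers set on each
    NVModelOptQuantization pass. The input recipe is left unchanged."""
    updated = copy.deepcopy(recipe)
    for pass_cfg in updated.get("passes", {}).values():
        if str(pass_cfg.get("type", "")).lower() != "nvmodeloptquantization":
            continue
        # Assumption: options live on the pass entry, or under "config" if present.
        nested = pass_cfg.get("config")
        target = nested if isinstance(nested, dict) else pass_cfg
        target["calibration_providers"] = list(providers)
    return updated


# Hypothetical, trimmed-down recipe used only for illustration.
example_recipe = {
    "passes": {
        "builder": {"type": "ModelBuilder"},
        "quantization": {"type": "NVModelOptQuantization"},
    }
}

patched = set_calibration_providers(example_recipe, ["NvTensorRTRTXExecutionProvider"])
print(json.dumps(patched["passes"]["quantization"], indent=2))
```
The same edit can of course be made by hand in the JSON file; the sketch only shows where the field is expected to go under the stated assumptions.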

Note that when NvTensorRTRTXExecutionProvider is used for INT4 AWQ quantization, a profile (min/max/opt ranges) of the model's input shapes is created internally from the model's config (e.g. config.json in the HuggingFace model card). This input-shapes profile is used during onnxruntime session creation. Make sure that config.json is available in the model directory if tokenizer_dir is a local model path (instead of a model name).
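
The config.json requirement can be checked before starting a run. The sketch below is a hypothetical pre-flight helper, not part of Olive: it assumes that a tokenizer_dir value pointing to an existing local directory is a model path, and that anything else is a model name to be resolved remotely.
```python
import tempfile
from pathlib import Path


def missing_model_config(tokenizer_dir):
    """Return True when tokenizer_dir is a local directory without config.json."""
    path = Path(tokenizer_dir)
    if not path.is_dir():
        # Not a local directory: treated as a model name, nothing to check locally.
        return False
    return not (path / "config.json").is_file()


# Demonstration on a temporary directory, before and after adding config.json.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_model_config(tmp)
    (Path(tmp) / "config.json").write_text("{}")
    after = missing_model_config(tmp)

print("missing before:", before, "| missing after:", after)
```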

Troubleshoot

If you run into any issue with quantization using the TensorRT Model Optimizer toolkit, refer to its FAQs for potential help or suggestions.
