DeepSeek-R1-Distill-Qwen-14B optimization

This folder contains examples of Olive recipes for DeepSeek-R1-Distill-Qwen-14B optimization.

INT4 AWQ Quantized Model Generation

The Olive recipe DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json produces an INT4 AWQ quantized model using NVIDIA's TensorRT Model Optimizer toolkit.

Setup

  1. Install Olive with NVIDIA TensorRT Model Optimizer toolkit

    • Run the following command to install Olive with TensorRT Model Optimizer:
    pip install olive-ai[nvmo]
    • If TensorRT Model Optimizer needs to be installed from a local wheel, follow the steps below:

      pip install olive-ai
      pip install <modelopt-wheel>[onnx]
    • Verify that TensorRT Model Optimizer is installed correctly. The following import should succeed without errors:

      python -c "from modelopt.onnx.quantization.int4 import quantize as quantize_int4"
    • Refer to the TensorRT Model Optimizer documentation for detailed installation instructions and setup dependencies.

  2. Install suitable onnxruntime and onnxruntime-genai packages

    • Install the onnxruntime and onnxruntime-genai packages that have NvTensorRTRTXExecutionProvider support. Refer to the documentation for the NvTensorRtRtx execution provider to set up its dependencies and requirements.
    • Note that, by default, TensorRT Model Optimizer installs onnxruntime-directml, and the onnxruntime-genai-cuda package installs onnxruntime-gpu. To use an onnxruntime package with NvTensorRTRTXExecutionProvider support, you might therefore need to uninstall the other onnxruntime packages first.
    • Make sure that only one onnxruntime package remains installed at the end. A command like the following can be used to validate the onnxruntime installation:
    python -c "import onnxruntime as ort; print(ort.get_available_providers())"
  3. Install additional requirements.

    • Install the packages listed in the requirements file:
    pip install -r requirements-nvmo-awq.txt
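
The "only one onnxruntime package" requirement from step 2 can also be checked directly against the installed distributions. The following is a minimal sketch using the standard-library importlib.metadata module. The helper names and the list of excluded prefixes are illustrative assumptions, not part of Olive or onnxruntime, and the prefix filter is only a heuristic.

```python
from importlib import metadata

# Packages that share the "onnxruntime" prefix but are assumed not to ship the
# core runtime (heuristic list; extend it if your environment needs it).
NON_RUNTIME_PREFIXES = ("onnxruntime-genai", "onnxruntime-extensions")


def core_onnxruntime_packages(dist_names):
    """Return the sorted, normalized names that look like core onnxruntime builds."""
    core = set()
    for name in dist_names:
        normalized = name.lower().replace("_", "-")
        if not normalized.startswith("onnxruntime"):
            continue
        if normalized.startswith(NON_RUNTIME_PREFIXES):
            continue
        core.add(normalized)
    return sorted(core)


def installed_distribution_names():
    """List the names of all distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.append(name)
    return names


if __name__ == "__main__":
    found = core_onnxruntime_packages(installed_distribution_names())
    if len(found) == 1:
        print(f"OK: single onnxruntime package installed: {found[0]}")
    else:
        print(f"Expected exactly one onnxruntime package, found: {found}")
```

If more than one name is reported (for example onnxruntime-gpu alongside onnxruntime-directml), uninstall the extra packages and rerun the provider check from step 2.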

Steps to run

olive run --config DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json

Recipe details

The Olive recipe DeepSeek-R1-Distill-Qwen-14B_nvmo_int4_awq.json has two passes: (a) ModelBuilder and (b) NVModelOptQuantization. The ModelBuilder pass generates the FP16 model for NvTensorRTRTXExecutionProvider (also known as the NvTensorRtRtx EP). The NVModelOptQuantization pass then performs INT4 AWQ quantization to produce the 4-bit optimized model. For calibration, the quantization pass uses execution providers from those available in the installed onnxruntime package. The calibration_providers field can be used to select a specific execution provider for calibration, provided it is available and installed.
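
As an illustration of the calibration_providers field, the sketch below sets it on the quantization pass of a recipe that has already been loaded into a Python dictionary (for instance with json.load). It assumes the usual Olive layout of a top-level "passes" mapping in which each pass has a "type" key with its parameters alongside it. The function name is hypothetical, and the provider string in the example is a placeholder whose accepted format should be checked against the Olive documentation for the pass.

```python
import copy


def set_calibration_providers(config, providers):
    """Return a copy of an Olive config with calibration_providers set on
    every NVModelOptQuantization pass (assumed layout: config["passes"][name]["type"])."""
    updated = copy.deepcopy(config)
    for pass_config in updated.get("passes", {}).values():
        if pass_config.get("type", "").lower() == "nvmodeloptquantization":
            pass_config["calibration_providers"] = list(providers)
    return updated


if __name__ == "__main__":
    # Minimal stand-in for the recipe; the real file contains more fields.
    recipe = {
        "passes": {
            "builder": {"type": "ModelBuilder"},
            "quantization": {"type": "NVModelOptQuantization"},
        }
    }
    # Placeholder provider name: the exact string format is an assumption.
    updated = set_calibration_providers(recipe, ["NvTensorRTRTXExecutionProvider"])
    print(updated["passes"]["quantization"])
```

The function returns a modified copy rather than editing in place, so the original recipe stays available for comparison before the result is written back to a JSON file.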

  • Note that when NvTensorRTRTXExecutionProvider is used for INT4 AWQ quantization, a profile (min/max/opt ranges) of the model's input shapes is created internally from the model's config (e.g. config.json in the Hugging Face model repository). This input-shape profile is used when the onnxruntime session is created. If tokenizer_dir is a local model path rather than a model name, make sure that config.json is available in that model directory.
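
The config.json requirement in the note above can be checked before starting a long quantization run. This is a minimal sketch with a hypothetical helper name. It assumes that any tokenizer_dir value which is not an existing local directory is a model name, which is a simplification.

```python
from pathlib import Path


def missing_config_json(tokenizer_dir):
    """Return True when tokenizer_dir is a local directory without config.json."""
    path = Path(tokenizer_dir)
    if not path.is_dir():
        # Not a local directory: treat it as a model name and skip the check.
        return False
    return not (path / "config.json").is_file()


if __name__ == "__main__":
    tokenizer_dir = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
    if missing_config_json(tokenizer_dir):
        print(f"config.json not found in {tokenizer_dir}")
    else:
        print("No local config.json problem detected")
```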

Troubleshoot

If you run into issues with quantization using the TensorRT Model Optimizer toolkit, refer to its FAQs for potential help or suggestions.