Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Whisper-large-v3-turbo Optimization with ONNX Runtime QNN EP

This folder outlines the process for optimizing the Whisper-large-v3-turbo model using ONNX Runtime with the QNN Execution Provider. It includes steps for exporting FP32 models, generating representative data for static quantization, creating QDQ models, model evaluation and performing audio transcription using the optimized models.

Generate data for static quantization

To get better results, we need to generate real data from original FP32 model instead of using random data for static quantization. Here we use 100 samples of librispeech dataset to generate the required real data which requires around 164 GB of disk space.

Additional requirements and considerations:

  1. Memory requirements during conversion The conversion pipeline itself (including model conversion and quantization) is memory-intensive. At least 30 GB of available system memory is required to complete the conversion process successfully. For stability and to avoid out-of-memory failures, it is strongly recommended to run this process on a machine with 64 GB RAM.

  2. Model compilation for non-CPU Execution Providers When using a non-CPU Execution Provider (e.g., QNN, or other accelerators), the model must be compiled before execution. This compilation step happens automatically at first run but can take a noticeable amount of time depending on the backend and model size. Please account for this additional latency when running the calibration or quantization pipeline.

First generate FP32 onnx models:

  1. Encoder FP32 model

    olive run --config whisper_large_v3_turbo_encoder_fp32.json

  2. Decoder FP32 model

    olive run --config whisper_large_v3_turbo_decoder_fp32.json

Then download and generate data:

  1. python .\qnn_run.py --audio-path .\data\librispeech_asr_clean_test --encoder "models\whisper_encoder_fp32\model\model.onnx" --decoder "models\whisper_decoder_fp32\model.onnx" --model_id "openai/whisper-large-v3-turbo" --save_data .\data\quantization_data --num_data 100

Generate QDQ models

  1. olive run --config whisper_large_v3_turbo_encoder_qdq.json
  2. olive run --config whisper_large_v3_turbo_decoder_qdq.json

(Optional) Use whisper_large_v3_turbo_encoder_qdq_ctx.json and whisper_large_v3_turbo_decoder_qdq_ctx.json to create onnx models with QNN context binaries embedded in them.

To transcribe a single sample:

python .\qnn_run.py --audio-path .\data\librispeech_asr_clean_test\1320-122617-0000.npy --encoder "models\whisper_encoder_qdq\model.onnx" --decoder "models\whisper_decoder_qdq\model.onnx" --model_id "openai/whisper-large-v3-turbo" --execution_provider QNNExecutionProvider --device_str npu