olive-recipes/openai-whisper-large-v3-turbo/aitk at main · microsoft/olive-recipes

Files in this folder:

.gitignore
README.md
_copy.json.config
audio_processor_config_default.json
convert_whisper_to_ovir.py
info.yml
model_project.config
ov_evaluate.py
ov_npu_workflow.json
ov_npu_workflow.json.config
ov_npu_workflow_inference_sample.ipynb
ov_workflow.json
ov_workflow.json.config
ov_workflow.py
ov_workflow_inference_sample.ipynb
qnn_app.py
qnn_evaluate.py
qnn_run.py
qnn_workflow.json
qnn_workflow.json.config
qnn_workflow.py
qnn_workflow_inference_sample.ipynb
requirements.txt
whisper_decoder_load.py
whisper_encoder_load.py
whisper_large_v3_turbo_decoder_fp32.json
whisper_large_v3_turbo_decoder_qdq.json
whisper_large_v3_turbo_default_ov_npu.json
whisper_large_v3_turbo_encapsulate.json
whisper_large_v3_turbo_encoder_fp32.json
whisper_large_v3_turbo_encoder_qdq.json
winml.py

Whisper-large-v3-turbo Optimization with ONNX Runtime QNN EP

This folder outlines the process for optimizing the Whisper-large-v3-turbo model using ONNX Runtime with the QNN Execution Provider. It covers exporting FP32 models, generating representative data for static quantization, creating QDQ models, evaluating the models, and transcribing audio with the optimized models.

Generate data for static quantization

For better accuracy, static quantization should be calibrated with real data produced by the original FP32 model rather than with random data. Here we use 100 samples from the LibriSpeech dataset to generate this data, which requires around 164 GB of disk space.
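As a rough pre-flight check, the figures above (about 164 GB for 100 samples, so roughly 1.64 GB per sample) can be turned into a disk-space estimate before running `qnn_run.py`. The sketch below assumes disk usage scales linearly with `--num_data` and uses decimal gigabytes; `estimated_disk_gb` and `has_enough_disk` are hypothetical helpers, not part of this recipe.

```python
import shutil
from pathlib import Path

# Figures from this README: 100 LibriSpeech samples need about 164 GB.
README_SAMPLES = 100
README_DISK_GB = 164.0

# Assumption: disk usage grows roughly linearly with the number of samples.
GB_PER_SAMPLE = README_DISK_GB / README_SAMPLES  # about 1.64 GB per sample


def estimated_disk_gb(num_data: int) -> float:
    """Estimate disk space (decimal GB) needed for --num_data samples."""
    return num_data * GB_PER_SAMPLE


def has_enough_disk(target_dir: str, num_data: int) -> bool:
    """Return True if the drive holding target_dir has room for num_data samples."""
    # disk_usage needs an existing path, so walk up to the first existing parent.
    path = Path(target_dir).resolve()
    while not path.exists():
        path = path.parent
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= estimated_disk_gb(num_data)


if __name__ == "__main__":
    print(f"Estimated space for 100 samples: {estimated_disk_gb(100):.0f} GB")
    print("Enough space:", has_enough_disk("./data/quantization_data", 100))
```

Lowering `--num_data` reduces the space needed, at the cost of fewer calibration samples.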

Additional requirements and considerations:

Memory requirements during conversion: The conversion pipeline itself (model conversion and quantization) is memory-intensive. At least 30 GB of available system memory is required to complete it successfully. For stability and to avoid out-of-memory failures, running it on a machine with 64 GB of RAM is strongly recommended.

Model compilation for non-CPU Execution Providers: When using a non-CPU Execution Provider (e.g., QNN or another accelerator), the model must be compiled before execution. This compilation happens automatically on the first run but can take a noticeable amount of time depending on the backend and model size. Account for this extra latency when running the calibration or quantization pipeline.
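A similar pre-flight check can be sketched for memory. The 30 GB and 64 GB thresholds come from the text above; `memory_status` and `detect_memory_gb` are hypothetical helpers, and reading the machine's memory assumes the third-party `psutil` package is installed (it may not be listed in this recipe's requirements.txt).

```python
# Thresholds from this README: at least 30 GB of available memory is required,
# and a machine with 64 GB of RAM is strongly recommended.
MIN_AVAILABLE_GB = 30
RECOMMENDED_TOTAL_GB = 64


def memory_status(available_gb: float, total_gb: float) -> str:
    """Classify a machine against the README's memory guidance."""
    if available_gb < MIN_AVAILABLE_GB:
        return "insufficient"
    if total_gb < RECOMMENDED_TOTAL_GB:
        return "minimum"  # may work, but out-of-memory failures are more likely
    return "recommended"


def detect_memory_gb():
    """Return (available_gb, total_gb) using psutil, or None if it is missing."""
    try:
        import psutil  # assumption: third-party package, installed separately
    except ImportError:
        return None
    vm = psutil.virtual_memory()
    # Decimal GB (1e9 bytes), so a 64 GiB machine still reports above 64.
    return vm.available / 1e9, vm.total / 1e9


if __name__ == "__main__":
    mem = detect_memory_gb()
    if mem is None:
        print("Install psutil to check memory automatically.")
    else:
        print(memory_status(*mem))
```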

First, generate the FP32 ONNX models:

Encoder FP32 model

olive run --config whisper_large_v3_turbo_encoder_fp32.json

Decoder FP32 model

olive run --config whisper_large_v3_turbo_decoder_fp32.json
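If you prefer to drive the two exports from Python, the sketch below builds the same `olive run --config ...` commands and runs them in order with `subprocess`. It assumes the `olive` CLI is on PATH; `olive_commands` and `run_all` are hypothetical helpers, and with `dry_run=True` they only print the commands.

```python
import subprocess

# Config files named in this README, in the order they should run.
FP32_CONFIGS = [
    "whisper_large_v3_turbo_encoder_fp32.json",
    "whisper_large_v3_turbo_decoder_fp32.json",
]


def olive_commands(configs):
    """Build one `olive run --config <file>` command per config file."""
    return [["olive", "run", "--config", cfg] for cfg in configs]


def run_all(configs, dry_run=True):
    """Run each Olive workflow in order, stopping at the first failure."""
    for cmd in olive_commands(configs):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a workflow fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(FP32_CONFIGS)  # prints only; pass dry_run=False to execute
```

The same helper can run the QDQ step by passing `whisper_large_v3_turbo_encoder_qdq.json` and `whisper_large_v3_turbo_decoder_qdq.json` with `dry_run=False`.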

Then download the LibriSpeech test data and generate the calibration data:

python .\qnn_run.py --audio-path .\data\librispeech_asr_clean_test --encoder "models\whisper_encoder_fp32\model\model.onnx" --decoder "models\whisper_decoder_fp32\model.onnx" --model_id "openai/whisper-large-v3-turbo" --save_data .\data\quantization_data --num_data 100
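The command above uses Windows-style paths. A minimal sketch that builds the same `qnn_run.py` arguments with `pathlib`, so path separators match the host OS, is below. The flags and paths are copied from the command above (including the extra `model` folder in the encoder path); `calibration_args` is a hypothetical helper, and `sys.executable` is used in place of `python` so the current interpreter runs the script.

```python
import sys
from pathlib import Path
from typing import List

# Folder layout copied from the command above; adjust it if your Olive
# output folders differ.
MODELS = Path("models")
DATA = Path("data")


def calibration_args(num_data: int = 100) -> List[str]:
    """Build the qnn_run.py command that generates calibration data."""
    return [
        sys.executable, "qnn_run.py",
        "--audio-path", str(DATA / "librispeech_asr_clean_test"),
        "--encoder", str(MODELS / "whisper_encoder_fp32" / "model" / "model.onnx"),
        "--decoder", str(MODELS / "whisper_decoder_fp32" / "model.onnx"),
        "--model_id", "openai/whisper-large-v3-turbo",
        "--save_data", str(DATA / "quantization_data"),
        "--num_data", str(num_data),
    ]


if __name__ == "__main__":
    print(" ".join(calibration_args()))
```

To execute it from this folder, pass the list to `subprocess.run(calibration_args(100), check=True)`.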

Generate QDQ models

olive run --config whisper_large_v3_turbo_encoder_qdq.json
olive run --config whisper_large_v3_turbo_decoder_qdq.json
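QDQ models represent quantization explicitly as QuantizeLinear/DequantizeLinear node pairs, and for static quantization the scale and zero point of each activation come from the calibration data generated earlier. The arithmetic behind one such pair is sketched below using asymmetric 8-bit (uint8) min/max calibration. This is an illustrative assumption: the actual data types and calibration settings for this recipe are defined in the QDQ JSON configs.

```python
def qparams(values, qmin=0, qmax=255):
    """Asymmetric scale/zero point from a calibration range (uint8 by default).

    The range is widened to include 0 so that zero is exactly representable.
    """
    rmin = min(min(values), 0.0)
    rmax = max(max(values), 0.0)
    scale = (rmax - rmin) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - rmin / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """QuantizeLinear: real value -> clamped integer (round half to even)."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """DequantizeLinear: integer -> approximate real value."""
    return (q - zero_point) * scale


if __name__ == "__main__":
    calib = [-1.0, -0.25, 0.0, 0.5, 2.0]  # stand-in for real activations
    s, zp = qparams(calib)
    for x in calib:
        q = quantize(x, s, zp)
        print(f"{x:+.3f} -> {q:3d} -> {dequantize(q, s, zp):+.4f}")
```

Values inside the calibrated range come back within half a quantization step; values outside it are clipped, which is why calibrating on real activations rather than random data matters.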

(Optional) Use whisper_large_v3_turbo_encoder_qdq_ctx.json and whisper_large_v3_turbo_decoder_qdq_ctx.json to create ONNX models with QNN context binaries embedded in them.

To transcribe a single sample:

python .\qnn_run.py --audio-path .\data\librispeech_asr_clean_test\1320-122617-0000.npy --encoder "models\whisper_encoder_qdq\model.onnx" --decoder "models\whisper_decoder_qdq\model.onnx" --model_id "openai/whisper-large-v3-turbo" --execution_provider QNNExecutionProvider --device_str npu
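To judge a transcription against its LibriSpeech reference, word error rate (WER) is the standard metric: the word-level edit distance divided by the number of reference words. The sketch below is a plain dynamic-programming implementation with a simple lowercase-and-strip-punctuation normalization. It is an assumption for illustration, not necessarily what `qnn_evaluate.py` does, and it is simpler than Whisper's own English text normalizer.

```python
import re


def normalize(text: str) -> list:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev holds the previous DP row: edit distances from ref[:i-1] to hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    print(word_error_rate("HELLO WORLD", "Hello, world."))
```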
