Unlock fast, on-device speech recognition with Ryzen AI and OpenAI's Whisper. This demo walks you through preparing and running OpenAI's Whisper models (base, small, medium) for fast, local ASR on the AMD NPU.
- 🚀 Download NPU-optimized Whisper ONNX models from Hugging Face
- ⚡ Run ASR locally on CPU or NPU
- 📊 Evaluate ASR on LibriSpeech samples and report WER/CER
- 🎧 Supports transcription of audio files and microphone input
- ⏱️ Reports performance using real-time factor (RTF) and time to first token (TTFT)
- Install Ryzen AI SDK: follow the Ryzen AI documentation to install the SDK and drivers.
- Activate the environment:

  ```bash
  conda activate ryzen-ai-<version>
  ```

- Clone the repository:

  ```bash
  git clone https://github.com/amd/RyzenAI-SW.git
  cd RyzenAI-SW/demo/ASR/Whisper
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Offloads compute from the CPU onto the NPU, freeing the CPU for other tasks.
- Delivers higher throughput and lower power consumption when running AI workloads.
- Optimized execution of Whisper's encoder and decoder models.
- Runs models with BFP16 precision for near-FP32 accuracy and INT8-like performance.
When running inference on the NPU, 100% of the encoder operators and 93.4% of the decoder operators are executed on the NPU, as shown in the Vitis AI EP logs below:
```
# encoder operations
[Vitis AI EP] No. of Operators : VAIML 225
[Vitis AI EP] No. of Subgraphs : VAIML 1

# decoder operations
[Vitis AI EP] No. of Operators : CPU 24 VAIML 341
[Vitis AI EP] No. of Subgraphs : VAIML 2
```
- Edit `config/model_config.json` to specify Execution Providers.
- For NPU:
  - Set `cache_key` and `cache_dir`
  - Use the corresponding `vitisai_config` from `config/`

Example:

```json
{
    "config_file": "config/vitisai_config_whisper_decoder.json",
    "cache_dir": "./cache",
    "cache_key": "whisper_medium_decoder"
}
```

When running whisper-medium on the NPU, it is recommended to add the following flags to `config/vitisai_config_whisper_encoder.json` in case of compilation issues:
"vaiml_config": {
"optimize_level": 3,
"aiecompiler_args": "--system-stack-size=512"
}These settings:
- `optimize_level=3`: Enables aggressive optimizations for larger models.
- `--system-stack-size=512`: Increases the AI Engine system stack size to handle Whisper-Medium's higher resource demand.
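For context, this is roughly how the fields in `model_config.json` map onto an ONNX Runtime session using the Vitis AI Execution Provider. A minimal sketch, assuming ONNX Runtime's Python API; the model path and provider-option key names are illustrative and may differ from what `run_whisper.py` actually uses:

```python
import onnxruntime as ort

# Hypothetical session setup for the decoder; the model path is a placeholder
# and the provider-option keys are assumptions based on the config above.
session = ort.InferenceSession(
    "models/whisper_medium_decoder.onnx",
    providers=["VitisAIExecutionProvider"],
    provider_options=[{
        "config_file": "config/vitisai_config_whisper_decoder.json",
        "cacheDir": "./cache",
        "cacheKey": "whisper_medium_decoder",
    }],
)
print(session.get_providers())  # confirms which EPs the session resolved to
```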
Use this to transcribe a pre-recorded `.wav` file into text using the Whisper model:
```bash
python run_whisper.py \
    --model-type <whisper-type> \
    --device npu \
    --input path/to/audio.wav
```
- Replace `<whisper-type>` with `whisper-base`, `whisper-small`, or `whisper-medium`.
- Replace `path/to/audio.wav` with your audio file.
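For a sense of what the script does internally, the sketch below runs just the encoder half: compute log-mel features, then execute the exported encoder session. The file names and the use of the Hugging Face `WhisperProcessor` are assumptions for illustration; `run_whisper.py` handles this wiring for you.

```python
import onnxruntime as ort
import soundfile as sf
from transformers import WhisperProcessor  # assumption: HF processor for log-mel features

# Audio must be 16 kHz mono (LibriSpeech clips already are).
audio, sr = sf.read("audio_files/sample.wav")

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
features = processor(audio, sampling_rate=sr, return_tensors="np").input_features

# Hypothetical encoder model path; the demo resolves this from model_config.json.
encoder = ort.InferenceSession(
    "models/whisper_medium_encoder.onnx",
    providers=["VitisAIExecutionProvider"],
)
(hidden_states,) = encoder.run(None, {encoder.get_inputs()[0].name: features})
print(hidden_states.shape)  # encoder output consumed by the decoder at each step
```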
For example, to run whisper-large-v3-turbo:

```bash
python run_whisper.py --model-type whisper-large-v3-turbo --device npu --input audio_files\1089-134686-0000.wav
```

Run real-time speech-to-text by capturing audio from your microphone. This allows you to speak and see live transcription:
```bash
python run_whisper.py \
    --model-type <whisper-type> \
    --device npu \
    --input mic \
    --duration 0
```
- `--duration 0` means continuous recording until stopped (Ctrl+C) or until silence is detected for a set duration.
- Ideal for demos and testing live ASR performance.
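A rough sketch of how continuous microphone capture can feed such a pipeline, assuming the `sounddevice` package (the demo's actual recording backend may differ):

```python
import queue

import numpy as np
import sounddevice as sd  # assumption: any 16 kHz-capable recording backend works

SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio
chunks = queue.Queue()

def on_audio(indata, frames, time_info, status):
    # Runs on the audio thread; just hand chunks to the main loop.
    chunks.put(indata.copy())

# Record until interrupted, analogous to --duration 0.
recorded = []
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=on_audio):
    try:
        while True:
            recorded.append(chunks.get())
    except KeyboardInterrupt:
        pass

waveform = np.concatenate(recorded).ravel()  # float32 waveform ready for transcription
```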
Run batch evaluation on a dataset (e.g., LibriSpeech samples) to measure model performance with metrics like WER, CER, and RTF:
```bash
python run_whisper.py \
    --model-type <whisper-type> \
    --device npu \
    --eval-dir eval_dataset/LibriSpeech-samples \
    --results-dir results
```
- `--eval-dir` specifies the dataset directory.
- `--results-dir` is where evaluation reports (WER, CER, TTFT, RTF) will be saved (see the metrics sketch after this list).
- Useful for benchmarking and validating models.
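For reference, this is how WER, CER, and RTF are typically computed for a single clip. A minimal sketch assuming the `jiwer` metrics package and a `transcribe` callable standing in for the demo's pipeline; neither is necessarily what `run_whisper.py` uses internally:

```python
import time

import jiwer            # assumption: a common ASR-metrics package
import soundfile as sf  # assumption: used only to get the clip duration

def evaluate_clip(transcribe, wav_path, reference):
    """Score one clip; `transcribe` is any callable returning hypothesis text."""
    audio, sr = sf.read(wav_path)
    duration = len(audio) / sr

    start = time.perf_counter()
    hypothesis = transcribe(wav_path)
    elapsed = time.perf_counter() - start

    return {
        "wer": jiwer.wer(reference, hypothesis),  # word error rate
        "cer": jiwer.cer(reference, hypothesis),  # character error rate
        "rtf": elapsed / duration,                # real-time factor: compute time / audio time
    }
```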
- First run on NPU may take ~15 min for model compilation.
- Ensure paths for encoder, decoder, and config files are correct.
- Supports CPU and NPU devices.
