Speech-to-text transcription using NVIDIA's Parakeet TDT 0.6B model, optimized for the AMD Ryzen AI NPU. Achieves 35-43x real-time transcription by running the Conformer encoder on the NPU, the LSTM decoder on the integrated Radeon GPU, and mel feature extraction on the CPU -- all three processors working in parallel.
Includes an OpenAI Whisper-compatible REST API, a CLI benchmark tool, and a real-time microphone transcription demo.
| Configuration | Speed | Hardware |
|---|---|---|
| CPU INT8 | 17-18x real-time | Zen 5 CPU only |
| NPU BF16 (default power) | 35x real-time | NPU + iGPU + CPU |
| NPU BF16 (performance mode) | 43x real-time | NPU + iGPU + CPU |
Tested on 16.5 minutes of audio (RTF=0.023-0.030). See OPTIMIZATION.md for the full optimization journey.
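RTF (real-time factor) is processing time divided by audio duration, so the speedup over real time is its reciprocal. A quick sanity check against the table above:

```python
def speedup(rtf: float) -> float:
    """Speedup over real time is the reciprocal of the real-time factor."""
    return 1.0 / rtf

print(f"{speedup(0.023):.0f}x")  # 43x -- matches NPU BF16 in performance mode
```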
To set NPU performance mode:

```
C:\Windows\System32\AMD\xrt-smi.exe configure --pmode performance
```
Download the models:

```
python download_models.py --precision fp32
```

Downloads FP32 models (~2.4GB) from HuggingFace. For INT8 (CPU-only, smaller):
```
python download_models.py
```

Preprocess the encoder for the NPU:

```
conda activate ryzen-ai-1.7.1
# Static shapes + NPU compiler fixes (Pad->Conv fuse, attention mask patch)
python preprocess_for_npu.py --precision fp32
```

Benchmark (NPU + iGPU):
```
conda activate ryzen-ai-1.7.1
python test_transcribe.py audio.wav --device npu --decoder-device gpu --runs 3
```

Live microphone transcription:
```
pip install sounddevice
python live_transcribe.py --device npu
```

API server:
```
pip install -r requirements.txt
python server.py --device npu
```

CPU-only (no Ryzen AI needed):
```
pip install onnxruntime
python test_transcribe.py audio.wav --device cpu
```

Note: The first NPU run triggers VAIML compilation, which is cached at `C:\temp\<user>\vaip\.cache\`. Subsequent runs load from the cache in ~4-6 seconds. The cache is keyed by model signature, so it is shared across directories that use the same model.
```
┌────────────────────────────────────────────────────────┐
│ Audio (WAV)                                            │
│   ↓                                                    │
│ Mel Filterbank (CPU, vectorized numpy)     ~25ms/chunk │
│   ↓                                                    │
│ Conformer Encoder (NPU, BF16)             ~300ms/chunk │
│   ↓                                                    │
│ TDT LSTM Decoder (iGPU, DirectML)     ~1.0ms/step ×188 │
│   ↓                                                    │
│ Text output                                            │
└────────────────────────────────────────────────────────┘
```
For multi-chunk audio, encoder and decoder run in parallel:
NPU encodes chunk N+1 while iGPU decodes chunk N
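A minimal sketch of how that overlap can be structured, with hypothetical `encode`/`decode` callables standing in for the NPU and iGPU sessions (the actual scheduling lives in transcriber.py):

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_chunks(chunks, encode, decode):
    """Pipeline: encode chunk N+1 on the NPU while decoding chunk N on the iGPU.

    ONNX Runtime releases the GIL during Run(), so the two sessions
    genuinely overlap even when driven from Python threads.
    """
    texts = []
    with ThreadPoolExecutor(max_workers=1) as npu:
        pending = npu.submit(encode, chunks[0])
        for nxt in chunks[1:]:
            encoded = pending.result()          # wait for the NPU
            pending = npu.submit(encode, nxt)   # start encoding the next chunk
            texts.append(decode(encoded))       # decode the current chunk on the iGPU
        texts.append(decode(pending.result()))  # drain the final chunk
    return texts
```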
```
server.py               FastAPI server (Whisper-compatible API)
test_transcribe.py      Benchmark with per-stage timing breakdown
live_transcribe.py      Real-time microphone transcription
benchmark_npu.py        Multi-config VAIML parameter sweep
inference/
  __init__.py
  transcriber.py        ONNX Runtime pipeline (NPU encoder + iGPU decoder)
  mel.py                Vectorized 128-bin mel filterbank (sketch below)
  audio.py              WAV parsing
preprocess_for_npu.py   Static shapes + NPU Pad/mask fixes (FP32 encoder -> .static.npu.onnx)
fuse_pads_direct.py     Optional: fuse Pad->Conv on a legacy .static.onnx only
optimize_model.py       Experimental ORT fold + fusion (needs unfused static.onnx)
fuse_attn_pads.py       Analyze attention Pad ops
models/
  vai_ep_config.json    VitisAI EP config (optimize_level=3)
  static_config.json    Static shape config (15s chunks)
  config.json           Model parameters
  vocab.txt             SentencePiece vocabulary (8193 tokens)
  encoder-model.fp32.static.npu.onnx    Static encoder (Pad-fused, for NPU)
  decoder_joint-model.fp32.static.onnx  Static decoder
```
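The mel stage is one strided framing plus one matrix multiply once the filterbank is precomputed. A minimal sketch of the vectorized approach (illustrative only; mel.py's exact window, frame sizes, and normalization may differ):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=128):
    """Triangular filters on a mel-spaced grid; shape (n_mels, n_fft // 2 + 1)."""
    fft_hz = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    lo, ctr, hi = mel_pts[:-2, None], mel_pts[1:-1, None], mel_pts[2:, None]
    rising = (fft_hz - lo) / (ctr - lo)    # upward slope of each triangle
    falling = (hi - fft_hz) / (hi - ctr)   # downward slope
    return np.maximum(0.0, np.minimum(rising, falling))

def log_mel(audio, sr=16000, n_fft=512, hop=160, fb=None):
    """Log-mel features for a mono float32 signal -- no Python loops."""
    fb = mel_filterbank(sr, n_fft) if fb is None else fb
    n_frames = 1 + (len(audio) - n_fft) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(n_fft)   # (frames, n_fft)
    spec = np.abs(np.fft.rfft(audio[idx] * np.hanning(n_fft), axis=1)) ** 2
    return np.log(spec @ fb.T + 1e-10)                            # (frames, n_mels)
```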
```
POST /v1/audio/transcriptions
Content-Type: multipart/form-data
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (WAV, max 25MB) |
| model | string | No | Model name (accepted but ignored) |
| language | string | No | ISO-639-1 code (default: en) |
| response_format | string | No | json, text, srt, vtt, verbose_json |
- `GET /v1/models` -- List models
- `GET /v1/info` -- Execution provider info
- `GET /health` -- Health check
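For example, posting a WAV file to a locally running server with Python's `requests` (assumes the default port 5092; the JSON response shape follows the Whisper API):

```python
import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:5092/v1/audio/transcriptions",
        files={"file": ("audio.wav", f, "audio/wav")},
        data={"response_format": "json"},
    )
resp.raise_for_status()
print(resp.json()["text"])  # Whisper-style response: {"text": "..."}
```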
```
python test_transcribe.py audio.wav [options]

--device {cpu,npu,gpu}           Encoder device (default: cpu)
--decoder-device {auto,cpu,gpu}  Decoder device (default: auto)
--models-dir DIR                 Models directory (default: ./models)
--runs N                         Benchmark runs (default: 1)
--debug                          Verbose logging
```

```
python live_transcribe.py [options]

--device {cpu,npu,gpu}  Execution device
--test-mic              Test microphone levels
--list-devices          Show audio devices
```

```
python server.py [options]

--device {cpu,npu}  Execution device
--port PORT         Server port (default: 5092)
--host HOST         Server host (default: 0.0.0.0)
```
CPU mode:
- Python 3.10+
- onnxruntime
- numpy, fastapi, uvicorn
NPU mode (Ryzen AI):
- AMD Ryzen AI processor (Strix/XDNA2)
- Windows 11
- Miniforge with a `ryzen-ai-1.7.0` or `ryzen-ai-1.7.1` conda environment
- onnxruntime-vitisai, flexml-lite (included in the Ryzen AI SDK)
- sounddevice (for live microphone mode)
VitisAI EP not available:

```
conda activate ryzen-ai-1.7.1
python -c "import onnxruntime; print(onnxruntime.get_available_providers())"
# Should show: ['VitisAIExecutionProvider', 'DmlExecutionProvider', 'CPUExecutionProvider']
```

NPU startup takes ~4-6 seconds: This is normal -- VAIML loads the compiled encoder from its cache at `C:\temp\<user>\vaip\.cache\`. If the cache is missing (first run or new model), compilation will take longer.
All ops falling back to CPU: If you see `unknown type 9` errors or `CPU 1434` in the log, the VAIML compiler failed to partition the model. Re-run `python preprocess_for_npu.py --precision fp32` so the encoder includes the Pad->Conv fusion and the VAIML 1.7.x attention-mask rewrite. This is currently tested on Strix NPUs; Strix Halo may have compatibility issues with the VAIML frontend.
`vaiml.dll` not found: Ensure flexml-lite is installed and the conda env is activated. The transcriber auto-discovers it via `sys.prefix`.
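To check that the encoder loads on the NPU outside the bundled scripts, a minimal session sketch (the options transcriber.py actually passes may differ; `config_file` is the standard VitisAI EP option for pointing at a vaip config such as the `models/vai_ep_config.json` shipped here):

```python
import onnxruntime as ort

# Create a session pinned to the VitisAI execution provider.
sess = ort.InferenceSession(
    "models/encoder-model.fp32.static.npu.onnx",
    providers=["VitisAIExecutionProvider"],
    provider_options=[{"config_file": "models/vai_ep_config.json"}],
)
print(sess.get_providers())  # VitisAIExecutionProvider should be listed first
```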
Audio format not supported: Convert with ffmpeg:
```
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```

Credits:

- NVIDIA -- Parakeet TDT 0.6B model
- Ivan Stupakov (@istupakov) -- ONNX conversion
- achetronic -- Original Go implementation
- AMD -- Ryzen AI NPU, VitisAI EP, DirectML EP