
Conversation

@iswaryaalex (Contributor) commented Jan 28, 2026

Core Update

This PR adds whispercpp NPU support to the lemonade server, leveraging key updates from Ryzen AI 1.7.

  • For the whispercpp NPU backend, automatically downloads its release binaries from the correct GitHub repository (NPU uses lemonade-sdk/whisper.cpp-npu, CPU uses ggml-org/whisper.cpp); the mechanism is extendable to other backends
  • The NPU backend automatically downloads the required .rai cache files from the AMD HuggingFace collection https://huggingface.co/collections/amd/ryzen-ai-17-whisper-npu-optimized-onnx-models and places them alongside the model checkpoints for the NPU runtime
  • Updated backend_versions.json to support per-backend versioning (see the sketch after this list)
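
To make the per-backend mapping concrete, here is a rough Python sketch of the idea; the repo names come from the bullets above and the NPU version appears in the logs below, but the structure, function name, and CPU version are illustrative assumptions, not the PR's actual implementation:

```python
# Hypothetical sketch of per-backend release resolution; the real
# backend_versions.json schema and download code in this PR may differ.
BACKEND_RELEASES = {
    # backend -> (GitHub repo hosting whisper-server binaries, pinned version)
    "cpu": ("ggml-org/whisper.cpp", "v0.0.0"),          # placeholder version
    "npu": ("lemonade-sdk/whisper.cpp-npu", "v1.8.2"),  # version seen in the logs below
}

def release_asset_url(backend: str, asset: str) -> str:
    """Build the GitHub release download URL for a backend's binary asset."""
    repo, version = BACKEND_RELEASES[backend]
    return f"https://github.com/{repo}/releases/download/{version}/{asset}"
```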

To test this PR

  • Start the Lemonade server with the whispercpp NPU backend:
    .\build\Release\lemonade-server.exe serve

  • Download a sample .wav file:
    curl -o test.wav "https://raw.githubusercontent.com/lemonade-sdk/assets/main/audio/test_speech.wav"

  • Load the NPU Whisper model:
    curl -X POST http://localhost:8000/api/v1/load -H "Content-Type: application/json" -d "{\"model_name\": \"Whisper-Tiny\", \"whispercpp_backend\": \"npu\"}"

  • Test transcription with Whisper-Tiny:
    curl -X POST http://localhost:8000/api/v1/audio/transcriptions -F "[email protected]" -F "model=Whisper-Tiny"

You can also test with other Whisper models (a scripted version of these steps follows the list):

  • Whisper-Tiny
  • Whisper-Base
  • Whisper-Small
  • Whisper-Medium
  • Whisper-Large-v3
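
For convenience, the manual steps above can also be scripted; this is a minimal sketch using Python's requests package, assuming the server is already running on localhost:8000 and test.wav has been downloaded to the working directory:

```python
# Sketch of the curl-based test steps above using requests; the endpoints and
# payloads mirror the PR's test commands, everything else is an assumption.
import requests

BASE = "http://localhost:8000/api/v1"

# Load the NPU Whisper model.
resp = requests.post(
    f"{BASE}/load",
    json={"model_name": "Whisper-Tiny", "whispercpp_backend": "npu"},
)
resp.raise_for_status()

# Transcribe the sample audio.
with open("test.wav", "rb") as audio:
    resp = requests.post(
        f"{BASE}/audio/transcriptions",
        files={"file": ("test.wav", audio, "audio/wav")},
        data={"model": "Whisper-Tiny"},
    )
resp.raise_for_status()
print(resp.json())  # expect an OpenAI-style body such as {"text": "..."}
```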

@iswaryaalex iswaryaalex marked this pull request as draft January 28, 2026 03:06
@iswaryaalex iswaryaalex marked this pull request as ready for review January 28, 2026 19:30
@iswaryaalex (Contributor, Author) commented Jan 30, 2026

@ramkrishna2910 Great feedback on catching partial downloads and model search naming.
I have addressed all the comments and ensured that every Whisper model has an associated .rai cache and runs at least 2-3x faster than on CPU.

There is one pending test that is never triggered (because I split that test into separate CPU and NPU tests). However, it still appears in the required checks, perhaps because main does not have my changes yet?
[screenshot: pending required check]

@jeremyfowers (Contributor) commented:

> There is one pending test that is never triggered (because I split that test into separate CPU and NPU tests). However, it still appears in the required checks, perhaps because main does not have my changes yet?

@iswaryaalex the test name is encoded in the GitHub settings, not in the git codebase itself. When we are about to merge this PR, I will change the GitHub settings to point to your new tests instead.

@ramkrishna2910 (Contributor) commented:

I am getting this error during the download of the rai cache:

map_rai_file: 32: Failed to open rai file 'C:\Users\ramkr\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\main\ggml-tiny-encoder-vitisai.rai'
whisper_vitisai_init: Failed to mmap rai file 'C:\Users\ramkr\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\main\ggml-tiny-encoder-vitisai.rai'
whisper_init_state: failed to load Vitis AI model from 'C:\Users\ramkr\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\main\ggml-tiny-encoder-vitisai.rai'
error: failed to initialize whisper context
[ERROR] whisper-server process has terminated with exit code: 3
[ERROR] This usually means:
  - Missing required drivers or dependencies
  - Incompatible model file
  - Try running the server manually to see the actual error
[WhisperServer] Stopping server (PID: 126960)
[Router] Retry also failed: whisper-server failed to start or become ready
[Router ERROR] Failed to load model: whisper-server failed to start or become ready
[Server ERROR] Failed to load model: whisper-server failed to start or become ready

@iswaryaalex (Contributor, Author) commented Feb 3, 2026

> I am getting this error during the download of the rai cache
> (full error log quoted above)

Can you check your server logs for me? This is what a successful load should look like:

[Router] Creating WhisperServer backend
[Router] Starting backend (this may take a moment)...
[WhisperServer] Loading model: Whisper-Tiny
[WhisperServer] Per-model settings: whispercpp_backend=npu
[WhisperServer] Using npu version from config: v1.8.2
[WhisperServer] Found whisper-server at: C:\Users\User\.cache\lemonade/bin\whisper\npu\whisper-server.exe
[WhisperServer] Using model: C:\Users\User\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\5359861c739e955e79d9a303bcbc70fb988958b1\ggml-tiny.bin
[WhisperServer] Using backend: npu
[WhisperServer] Using NPU cache from server_models.json: amd/whisper-tiny-onnx-npu / ggml-tiny-encoder-vitisai.rai
[WhisperServer] Downloading NPU compiled cache: ggml-tiny-encoder-vitisai.rai
[WhisperServer] From repository: amd/whisper-tiny-onnx-npu
[WhisperServer] Downloading from: https://huggingface.co/amd/whisper-tiny-onnx-npu/resolve/main/ggml-tiny-encoder-vitisai.rai
[Server PRE-ROUTE] GET /api/v1/models
[Server] GET /api/v1/models - 200
  Progress: 100% (0.0/0.0 MB)
[WhisperServer] NPU cache ready at: "C:\\Users\\User\\.cache\\huggingface\\hub/models--ggerganov--whisper.cpp\\snapshots\\5359861c739e955e79d9a303bcbc70fb988958b1\\ggml-tiny-encoder-vitisai.rai"
whisper-server will use port: 8001
[WhisperServer] Starting server on port 8001
[ProcessManager] Starting process with inherited output: "C:\Users\User\.cache\lemonade/bin\whisper\npu\whisper-server.exe" "-m" "C:\Users\User\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\5359861c739e955e79d9a303bcbc70fb988958b1\ggml-tiny.bin" "--port" "8001"
[ProcessManager] Process started successfully, PID: 51688
[WhisperServer] Process started with PID: 51688
Waiting for whisper-server to be ready...
whisper_init_from_file_with_params_no_state: loading model from 'C:\Users\User\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\5359861c739e955e79d9a303bcbc70fb988958b1\ggml-tiny.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:          CPU total size =    77.11 MB
whisper_model_load: model size    =   77.11 MB
whisper_backend_init_gpu: device 0: CPU (type: 0)
whisper_backend_init_gpu: no GPU found
whisper_init_state: kv self size  =    3.15 MB
whisper_init_state: kv cross size =    9.44 MB
whisper_init_state: kv pad  size  =    2.36 MB
whisper_init_state: Vitis AI model loaded
whisper_init_state: compute buffer (conv)   =    4.85 MB
whisper_init_state: compute buffer (cross)  =    3.89 MB
whisper_init_state: compute buffer (decode) =   95.91 MB
whisper-server is ready!
[WhisperServer] Server is ready!
[Router] Backend started successfully

Most likely the cache didn't download? If that was the case, it should have been logged.
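
If it helps with debugging, here is a small sketch (not part of this PR) that pulls the same cache file directly with huggingface_hub, using the repo and filename from the logs above, to confirm it downloads completely:

```python
# Debugging sketch (not part of this PR): fetch the NPU cache file directly
# and confirm it is non-empty; repo_id and filename are taken from the logs.
from pathlib import Path
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="amd/whisper-tiny-onnx-npu",
    filename="ggml-tiny-encoder-vitisai.rai",
)
size_mb = Path(path).stat().st_size / 1e6
print(f"downloaded to {path} ({size_mb:.1f} MB)")
assert size_mb > 0, "empty download - likely a partial or failed fetch"
```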

@ramkrishna2910 (Contributor) left a review

Tested! Works well!

@iswaryaalex iswaryaalex enabled auto-merge February 4, 2026 00:34
@iswaryaalex iswaryaalex added this pull request to the merge queue February 4, 2026
Merged via the queue into main with commit fad0dfe February 4, 2026
34 checks passed
@iswaryaalex iswaryaalex deleted the iswarya/whisper-cpp-multi-backend branch February 4, 2026 01:13