
Conversation

@iswaryaalex (Contributor) commented Jan 28, 2026

Core Update

This PR adds whispercpp NPU support to the lemonade server, leveraging key updates from Ryzen AI 1.7.

  • For the whispercpp NPU backend, automatically downloads its release binaries from the correct GitHub repository (NPU uses lemonade-sdk/whisper.cpp-npu, CPU uses ggml-org/whisper.cpp); the mechanism is extendable to other backends
  • The NPU backend automatically downloads the required .rai cache files from the AMD HuggingFace collection https://huggingface.co/collections/amd/ryzen-ai-17-whisper-npu-optimized-onnx-models and places them alongside the model checkpoints for the NPU runtime
  • Updated backend_versions.json to support per-backend versioning (see the sketch after this list)
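
To make the per-backend mapping concrete, here is a rough Python sketch of the idea; the repo names come from the bullets above and the NPU version appears in the logs below, but the structure, function name, and CPU version are illustrative assumptions, not the PR's actual implementation:

```python
# Hypothetical sketch of per-backend release resolution; the real
# backend_versions.json schema and download code in this PR may differ.
BACKEND_RELEASES = {
    # backend -> (GitHub repo hosting whisper-server binaries, pinned version)
    "cpu": ("ggml-org/whisper.cpp", "v0.0.0"),          # placeholder version
    "npu": ("lemonade-sdk/whisper.cpp-npu", "v1.8.2"),  # version seen in the logs below
}

def release_asset_url(backend: str, asset: str) -> str:
    """Build the GitHub release download URL for a backend's binary asset."""
    repo, version = BACKEND_RELEASES[backend]
    return f"https://github.com/{repo}/releases/download/{version}/{asset}"
```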

To test this PR

  • Start the Lemonade server with the whispercpp NPU backend:
    .\build\Release\lemonade-server.exe serve

  • Download a sample .wav file:
    curl -o test.wav "https://raw.githubusercontent.com/lemonade-sdk/assets/main/audio/test_speech.wav"

  • Load the NPU Whisper model:
    curl -X POST http://localhost:8000/api/v1/load -H "Content-Type: application/json" -d "{\"model_name\": \"Whisper-Tiny\", \"whispercpp_backend\": \"npu\"}"

  • Test transcription with Whisper-Tiny:
    curl -X POST http://localhost:8000/api/v1/audio/transcriptions -F "[email protected]" -F "model=Whisper-Tiny"

You can also test with other Whisper models (a scripted version of these steps follows the list):

  • Whisper-Tiny
  • Whisper-Base
  • Whisper-Small
  • Whisper-Medium
  • Whisper-Large-v3
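
For convenience, the manual steps above can also be scripted; this is a minimal sketch using Python's requests package, assuming the server is already running on localhost:8000 and test.wav has been downloaded to the working directory:

```python
# Sketch of the curl-based test steps above using requests; the endpoints and
# payloads mirror the PR's test commands, everything else is an assumption.
import requests

BASE = "http://localhost:8000/api/v1"

# Load the NPU Whisper model.
resp = requests.post(
    f"{BASE}/load",
    json={"model_name": "Whisper-Tiny", "whispercpp_backend": "npu"},
)
resp.raise_for_status()

# Transcribe the sample audio.
with open("test.wav", "rb") as audio:
    resp = requests.post(
        f"{BASE}/audio/transcriptions",
        files={"file": ("test.wav", audio, "audio/wav")},
        data={"model": "Whisper-Tiny"},
    )
resp.raise_for_status()
print(resp.json())  # expect an OpenAI-style body such as {"text": "..."}
```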

@iswaryaalex iswaryaalex marked this pull request as draft January 28, 2026 03:06
@iswaryaalex iswaryaalex marked this pull request as ready for review January 28, 2026 19:30
@iswaryaalex (Contributor, Author) commented Jan 30, 2026

@ramkrishna2910 Great feedback on catching partial downloads and model search naming.
I have addressed all the comments and ensured that every Whisper model has an associated .rai cache and runs at least 2-3x faster than on CPU.

There is one pending test that is never triggered (because I split that test into separate CPU and NPU tests). However, it still appears in the required checks, perhaps because main does not have my changes yet?
[screenshot: pending required check]

@jeremyfowers (Contributor) commented:

> There is one pending test that is never triggered (because I split that test into separate CPU and NPU tests). However, it still appears in the required checks, perhaps because main does not have my changes yet?

@iswaryaalex the test name is encoded in the GitHub settings, not in the git codebase itself. When we are about to merge this PR, I will change the GitHub settings to point to your new tests instead.

@ramkrishna2910 (Contributor) commented:

I am getting this error during the download of the rai cache:

map_rai_file: 32: Failed to open rai file 'C:\Users\ramkr\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\main\ggml-tiny-encoder-vitisai.rai'
whisper_vitisai_init: Failed to mmap rai file 'C:\Users\ramkr\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\main\ggml-tiny-encoder-vitisai.rai'
whisper_init_state: failed to load Vitis AI model from 'C:\Users\ramkr\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\main\ggml-tiny-encoder-vitisai.rai'
error: failed to initialize whisper context
[ERROR] whisper-server process has terminated with exit code: 3
[ERROR] This usually means:
  - Missing required drivers or dependencies
  - Incompatible model file
  - Try running the server manually to see the actual error
[WhisperServer] Stopping server (PID: 126960)
[Router] Retry also failed: whisper-server failed to start or become ready
[Router ERROR] Failed to load model: whisper-server failed to start or become ready
[Server ERROR] Failed to load model: whisper-server failed to start or become ready

@iswaryaalex (Contributor, Author) commented Feb 3, 2026

> I am getting this error during the download of the rai cache
> (full error log quoted above)

Can you check your server logs for me? This is what a successful load should look like:

[Router] Creating WhisperServer backend
[Router] Starting backend (this may take a moment)...
[WhisperServer] Loading model: Whisper-Tiny
[WhisperServer] Per-model settings: whispercpp_backend=npu
[WhisperServer] Using npu version from config: v1.8.2
[WhisperServer] Found whisper-server at: C:\Users\User\.cache\lemonade/bin\whisper\npu\whisper-server.exe
[WhisperServer] Using model: C:\Users\User\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\5359861c739e955e79d9a303bcbc70fb988958b1\ggml-tiny.bin
[WhisperServer] Using backend: npu
[WhisperServer] Using NPU cache from server_models.json: amd/whisper-tiny-onnx-npu / ggml-tiny-encoder-vitisai.rai
[WhisperServer] Downloading NPU compiled cache: ggml-tiny-encoder-vitisai.rai
[WhisperServer] From repository: amd/whisper-tiny-onnx-npu
[WhisperServer] Downloading from: https://huggingface.co/amd/whisper-tiny-onnx-npu/resolve/main/ggml-tiny-encoder-vitisai.rai
[Server PRE-ROUTE] GET /api/v1/models
[Server] GET /api/v1/models - 200
  Progress: 100% (0.0/0.0 MB)
[WhisperServer] NPU cache ready at: "C:\\Users\\User\\.cache\\huggingface\\hub/models--ggerganov--whisper.cpp\\snapshots\\5359861c739e955e79d9a303bcbc70fb988958b1\\ggml-tiny-encoder-vitisai.rai"
whisper-server will use port: 8001
[WhisperServer] Starting server on port 8001
[ProcessManager] Starting process with inherited output: "C:\Users\User\.cache\lemonade/bin\whisper\npu\whisper-server.exe" "-m" "C:\Users\User\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\5359861c739e955e79d9a303bcbc70fb988958b1\ggml-tiny.bin" "--port" "8001"
[ProcessManager] Process started successfully, PID: 51688
[WhisperServer] Process started with PID: 51688
Waiting for whisper-server to be ready...
whisper_init_from_file_with_params_no_state: loading model from 'C:\Users\User\.cache\huggingface\hub/models--ggerganov--whisper.cpp\snapshots\5359861c739e955e79d9a303bcbc70fb988958b1\ggml-tiny.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:          CPU total size =    77.11 MB
whisper_model_load: model size    =   77.11 MB
whisper_backend_init_gpu: device 0: CPU (type: 0)
whisper_backend_init_gpu: no GPU found
whisper_init_state: kv self size  =    3.15 MB
whisper_init_state: kv cross size =    9.44 MB
whisper_init_state: kv pad  size  =    2.36 MB
whisper_init_state: Vitis AI model loaded
whisper_init_state: compute buffer (conv)   =    4.85 MB
whisper_init_state: compute buffer (cross)  =    3.89 MB
whisper_init_state: compute buffer (decode) =   95.91 MB
whisper-server is ready!
[WhisperServer] Server is ready!
[Router] Backend started successfully

Most likely the cache didn't download? If that was the case, it should have been logged.
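
If it helps with debugging, here is a small sketch (not part of this PR) that pulls the same cache file directly with huggingface_hub, using the repo and filename from the logs above, to confirm it downloads completely:

```python
# Debugging sketch (not part of this PR): fetch the NPU cache file directly
# and confirm it is non-empty; repo_id and filename are taken from the logs.
from pathlib import Path
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="amd/whisper-tiny-onnx-npu",
    filename="ggml-tiny-encoder-vitisai.rai",
)
size_mb = Path(path).stat().st_size / 1e6
print(f"downloaded to {path} ({size_mb:.1f} MB)")
assert size_mb > 0, "empty download - likely a partial or failed fetch"
```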

@ramkrishna2910 (Contributor) left a review

Tested! Works well!

@iswaryaalex iswaryaalex enabled auto-merge February 4, 2026 00:34
@iswaryaalex iswaryaalex added this pull request to the merge queue February 4, 2026
Merged via the queue into main with commit fad0dfe February 4, 2026
34 checks passed
@iswaryaalex iswaryaalex deleted the iswarya/whisper-cpp-multi-backend branch February 4, 2026 01:13