Summary
Local NPU execution works with ONNX Runtime QNN on this machine, but Foundry Local NPU LLMs crash during text generation. The crash is reproducible with two different Foundry QNN NPU models:
- phi-3-mini-4k-instruct-qnn-npu:3
- qwen2.5-1.5b-instruct-qnn-npu:2
The failure is not a general NPU failure: a minimal ONNX model runs successfully on QNNExecutionProvider with device type NPU, and individual LLM ONNX components can run on NPU. The crash appears when the decoder context model is run with a prompt shorter than the fixed sliding window size of 64 tokens, and also when using onnxruntime_genai.Generator.append_tokens(...).
Environment
- Device: Surface Pro 11th Edition For Business, Model 2085
- CPU/SoC: Snapdragon X Elite, X1E80100
- OS: Windows 11 Enterprise ARM64, build 26200.8457
- RAM: 16 GB
- NPU device: Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Hexagon(TM) NPU
- NPU driver observed from PowerShell: 30.0.219.1000
- Foundry Local: Microsoft.FoundryLocal 0.8.119.102 arm64
- Python: official ARM64 Python 3.13
- Python packages: onnxruntime==1.24.4, onnxruntime-qnn==2.1.1, onnxruntime-genai==0.13.2
What works
Minimal QNN NPU model works
Using a tiny ONNX QLinearMatMul model with:
- onnxruntime-qnn
- QNNExecutionProvider
- QnnHtp.dll
- selected ORT EP device type: NPU
Result:
Selected device: npu
Creating InferenceSession...
Session created.
Running inference...
Output: [[128 128 128]]
Minimal QNN test completed.
This shows the NPU and the QNN plugin can execute at least a simple ONNX graph.
Foundry model ONNX components can load and partly run on NPU
For phi-3-mini-4k-instruct-qnn-npu:3:
- embeddings.onnx runs on NPU
- lm_head.onnx runs on NPU
- context_ctx.onnx runs on NPU when total_seq_len=64
- iterator_ctx.onnx runs on NPU after a 64-token context prefill
For qwen2.5-1.5b-instruct-qnn-npu:2:
- context_ctx.onnx runs on NPU when total_seq_len=64
What fails
ORT GenAI Generator crashes
For both models, using onnxruntime_genai.Generator crashes during:
generator.append_tokens(input_tokens)
The Python faulthandler output shows:
Windows fatal exception: access violation
Current thread:
File "...test_ort_genai_direct.py", line 171 in main
where line 171 is:
generator.append_tokens(input_tokens)
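For anyone reproducing this, the traceback above can be captured to a file with the standard-library faulthandler module. The following is a minimal sketch, not the contents of test_ort_genai_direct.py; the helper name enable_crash_log and the log file name are hypothetical.

```python
import faulthandler
import os
import tempfile


def enable_crash_log(log_path):
    """Route faulthandler tracebacks to log_path and return the open file."""
    # The file must stay open while faulthandler is enabled, because
    # faulthandler writes to the underlying file descriptor at crash time.
    log_file = open(log_path, "w", encoding="utf-8")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; call this before loading the model so that a
# native crash in any later call is recorded.
log_path = os.path.join(tempfile.gettempdir(), "example-faulthandler.log")
log_file = enable_crash_log(log_path)
```

Enabling the handler before the first native call matters here, since the process does not survive long enough to flush ordinary Python logging.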
Windows Event Viewer reports crashes such as:
Application: python.exe
Faulting module: ntdll.dll
Exception code: 0xc0000409
Foundry Local service/CLI crashes show the same class of failure when calling chat completions against NPU models.
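When comparing the two crash signatures, it may help to decode the numeric codes. In the Windows NTSTATUS list, 0xC0000409 is STATUS_STACK_BUFFER_OVERRUN (also used for fail-fast termination), while an access violation is 0xC0000005. The helper below is a small sketch for labelling process exit codes; describe_exit_code and the KNOWN_STATUS table are illustrative names and cover only these two values.

```python
# NTSTATUS values as defined in the Windows SDK header ntstatus.h.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(code: int) -> str:
    """Return the hex form and symbolic name of a Windows process exit code."""
    # Exit codes are sometimes reported as signed 32-bit integers,
    # so normalise to the unsigned representation first.
    unsigned = code & 0xFFFFFFFF
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08X} {name}"


print(describe_exit_code(0xC0000409))
print(describe_exit_code(-1073741819))  # signed form of 0xC0000005
```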
Direct context ONNX run crashes for short prompt length
For phi-3-mini-4k-instruct-qnn-npu:3:
- context_ctx.onnx with total_seq_len=64: succeeds
- context_ctx.onnx with total_seq_len=4: crashes with access violation
For qwen2.5-1.5b-instruct-qnn-npu:2:
- context_ctx.onnx with total_seq_len=64: succeeds
- context_ctx.onnx with total_seq_len=30: crashes with access violation
Example faulthandler output for Qwen:
Windows fatal exception: access violation
Current thread:
File "...onnxruntime_inference_collection.py", line 296 in run
File "<stdin>", line 32 in <module>
This points to a native crash inside InferenceSession.run(...).
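Because the crash kills the interpreter, each sequence length is easiest to test in its own child process. The sketch below shows one way to do that with the standard library. run_isolated and sweep are hypothetical helper names, and the snippet passed to sweep is a stand-in that only mimics a failure below 64; it does not load a model or touch the NPU.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 300.0):
    """Run a Python snippet in a child interpreter with faulthandler enabled.

    Returns (returncode, stderr). A native crash ends only the child process.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def sweep(snippet_template: str, lengths):
    """Run the snippet once per length and collect the child exit codes."""
    results = {}
    for n in lengths:
        returncode, _stderr = run_isolated(snippet_template.format(total_seq_len=n))
        results[n] = returncode
    return results


# Stand-in snippet (an assumption for illustration): replace it with code
# that runs context_ctx.onnx at the given total_seq_len.
PLACEHOLDER = "import sys; sys.exit(0 if {total_seq_len} >= 64 else 1)"
print(sweep(PLACEHOLDER, [4, 30, 64]))
```

With the real snippet in place, the exit code and the captured stderr for each length give one faulthandler trace per case without losing the driver process.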
Reproduction Notes
Both models use:
"sliding_window": {
  "window_size": 64,
  "pad_value": 0,
  "alignment": "left",
  "slide_key_value_cache": false
}
The crash occurs when the prompt/prefill length is shorter than the 64-token window, while a full 64-token context run succeeds.
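The relationship between the configured window and the failing lengths can be written down as a small check. This is a sketch only: it parses the fragment quoted above (wrapped in braces to make it valid JSON) and reports how far a prefill length falls short of window_size. The function name window_shortfall is hypothetical, and the check reflects the observed pattern, not confirmed runtime logic.

```python
import json

# The sliding_window fragment from the report, wrapped to form valid JSON.
SLIDING_WINDOW_JSON = """
{
  "sliding_window": {
    "window_size": 64,
    "pad_value": 0,
    "alignment": "left",
    "slide_key_value_cache": false
  }
}
"""


def window_shortfall(config: dict, prefill_len: int) -> int:
    """Return how many tokens prefill_len falls short of window_size (0 if none)."""
    window_size = config["sliding_window"]["window_size"]
    return max(0, window_size - prefill_len)


config = json.loads(SLIDING_WINDOW_JSON)
for prefill_len in (4, 30, 64):
    # Lengths 4 and 30 crashed in the direct runs; 64 succeeded.
    print(prefill_len, window_shortfall(config, prefill_len))
```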
Expected Behavior
Foundry Local / ORT GenAI should be able to run NPU LLM generation for ordinary short prompts, e.g. "ok" or "Rispondi solo: ok", without crashing the hosting process.
Actual Behavior
The hosting process terminates with a native crash during the first generation path. Python faulthandler reports an access violation, and Windows Event Viewer records exception code 0xc0000409:
- Foundry Local: service process terminates during /v1/chat/completions
- ORT GenAI direct: generator.append_tokens(...) crashes
- ONNX Runtime direct: context_ctx.onnx crashes when total_seq_len < 64 (tested with 4 and 30)
Local Repro Artifacts
Scripts used locally:
test_qnn_ep_minimal.py
test_ort_genai_direct.py
Relevant local logs:
ort-genai-qwen15-faulthandler.log
qwen15-context30-faulthandler.log
ort-genai-faulthandler.log
context-short-faulthandler.log