Foundry/QNN NPU LLM crash on Surface Pro 11 Snapdragon X Elite #719

@ingcasucci

Summary

Local NPU execution works with ONNX Runtime QNN on this machine, but Foundry Local NPU LLMs crash during text generation. The crash is reproducible with two different Foundry QNN NPU models:

  • phi-3-mini-4k-instruct-qnn-npu:3
  • qwen2.5-1.5b-instruct-qnn-npu:2

This is not a general NPU failure: a minimal ONNX model runs successfully on QNNExecutionProvider with device type NPU, and individual LLM ONNX components also run on the NPU. The crash occurs when the decoder context model is run with a prompt shorter than the fixed sliding-window size of 64 tokens, and when calling onnxruntime_genai.Generator.append_tokens(...).

Environment

  • Device: Surface Pro 11th Edition For Business, Model 2085
  • CPU/SoC: Snapdragon X Elite, X1E80100
  • OS: Windows 11 Enterprise ARM64, build 26200.8457
  • RAM: 16 GB
  • NPU device: Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Hexagon(TM) NPU
  • NPU driver observed from PowerShell: 30.0.219.1000
  • Foundry Local: Microsoft.FoundryLocal 0.8.119.102 arm64
  • Python: official ARM64 Python 3.13
  • Python packages:
    • onnxruntime==1.24.4
    • onnxruntime-qnn==2.1.1
    • onnxruntime-genai==0.13.2

What works

Minimal QNN NPU model works

Using a tiny ONNX QLinearMatMul model with:

  • onnxruntime-qnn
  • QNNExecutionProvider
  • QnnHtp.dll
  • selected ORT EP device type NPU

Result:

Selected device: npu
Creating InferenceSession...
Session created.
Running inference...
Output: [[128 128 128]]
Minimal QNN test completed.

This shows the NPU and the QNN plugin can execute at least a simple ONNX graph.

Foundry model ONNX components can load and partly run on NPU

For phi-3-mini-4k-instruct-qnn-npu:3:

  • embeddings.onnx runs on NPU
  • lm_head.onnx runs on NPU
  • context_ctx.onnx runs on NPU when total_seq_len=64
  • iterator_ctx.onnx runs on NPU after a 64-token context prefill

For qwen2.5-1.5b-instruct-qnn-npu:2:

  • context_ctx.onnx runs on NPU when total_seq_len=64

What fails

ORT GenAI Generator crashes

For both models, onnxruntime_genai.Generator crashes in the call:

generator.append_tokens(input_tokens)

The Python faulthandler output shows:

Windows fatal exception: access violation

Current thread:
  File "...test_ort_genai_direct.py", line 171 in main

where line 171 is:

generator.append_tokens(input_tokens)
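
For anyone reproducing this, a trace like the one above can be captured with Python's built-in faulthandler module. The following is a minimal sketch: the helper name enable_crash_log and the log path are hypothetical, and the original test_ort_genai_direct.py may set this up differently.

```python
import faulthandler


def enable_crash_log(path):
    """Send fatal-error tracebacks (e.g. access violations) to a log file.

    Hypothetical helper, not taken from the original scripts.
    """
    # The file must stay open for the life of the process, so return it
    # and keep a reference to it in the caller.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Call it before constructing the generator, for example log = enable_crash_log("ort-genai-faulthandler.log"), so that the native crash leaves the Python-level stack in the log.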

Windows Event Viewer reports crashes such as:

Application: python.exe
Faulting module: ntdll.dll
Exception code: 0xc0000409

Foundry Local service/CLI crashes show the same class of failure when calling chat completions against NPU models.
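
Note that the two reports use different codes: faulthandler reports an access violation (NTSTATUS 0xC0000005), while Event Viewer shows 0xc0000409, which is STATUS_STACK_BUFFER_OVERRUN, the code Windows also uses for fail-fast termination. A small lookup sketch for decoding child-process exit codes while triaging; the function name is hypothetical and the table deliberately lists only these two values:

```python
# Two well-known NTSTATUS values relevant to this report.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(code):
    """Return a readable name for a Windows process exit/exception code."""
    # Exit codes are sometimes reported as signed 32-bit integers, so
    # normalise to the unsigned form before the lookup.
    unsigned = code & 0xFFFFFFFF
    name = NTSTATUS_NAMES.get(unsigned, "unknown")
    return f"0x{unsigned:08X} ({name})"
```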

Direct context ONNX run crashes for short prompt length

For phi-3-mini-4k-instruct-qnn-npu:3:

  • context_ctx.onnx with total_seq_len=64: succeeds
  • context_ctx.onnx with total_seq_len=4: crashes with access violation

For qwen2.5-1.5b-instruct-qnn-npu:2:

  • context_ctx.onnx with total_seq_len=64: succeeds
  • context_ctx.onnx with total_seq_len=30: crashes with access violation

Example faulthandler output for Qwen:

Windows fatal exception: access violation

Current thread:
  File "...onnxruntime_inference_collection.py", line 296 in run
  File "<stdin>", line 32 in <module>

This points to a native crash inside InferenceSession.run(...).
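
Because the crash is native and takes the interpreter down with it, comparing sequence lengths is easier when each run happens in its own child process. A minimal sketch of that isolation, assuming a hypothetical helper run_isolated; the snippet passed to it would be the actual context_ctx.onnx run, which is not reproduced here:

```python
import subprocess
import sys


def run_isolated(snippet, timeout=300):
    """Run a Python snippet in a child interpreter.

    Returns (returncode, stderr). A native crash in the child then shows
    up as a nonzero return code instead of killing the test harness.
    """
    # -X faulthandler makes the child dump a traceback on fatal errors.
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr
```

Running the same snippet once per total_seq_len value and recording the return codes gives a compact pass/fail table without restarting the harness by hand.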

Reproduction Notes

Both models use this sliding-window configuration:

"sliding_window": {
    "window_size": 64,
    "pad_value": 0,
    "alignment": "left",
    "slide_key_value_cache": false
}

The crash occurs when the prompt/prefill length is shorter than the 64-token window, while a full 64-token context run succeeds.
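
To make the failing cases concrete, the sketch below computes how many pad tokens each prompt length would need to fill a whole window. It assumes prefill happens in fixed windows of window_size tokens; how the runtime actually pads internally, and whether manual padding avoids the crash, is not established in this report.

```python
WINDOW_SIZE = 64  # "window_size" from the sliding_window config


def padding_to_full_window(prompt_len, window_size=WINDOW_SIZE):
    """Number of pad tokens needed to round prompt_len up to a full window."""
    if prompt_len <= 0:
        raise ValueError("prompt_len must be positive")
    return (-prompt_len) % window_size


for n in (4, 30, 64):
    # Prints: 4 -> 60, 30 -> 34, 64 -> 0
    print(n, "->", padding_to_full_window(n))
```

The two crashing lengths (4 and 30) are exactly the ones with a nonzero remainder, while the succeeding length (64) needs no padding.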

Expected Behavior

Foundry Local / ORT GenAI should be able to run NPU LLM generation for ordinary short prompts, e.g. "ok" or "Rispondi solo: ok" (Italian for "Reply only: ok"), without crashing the hosting process.

Actual Behavior

The hosting process crashes with a native access violation (as reported by faulthandler) or exception code 0xc0000409 (as reported by Event Viewer) on the first generation call:

  • Foundry Local: service process terminates during /v1/chat/completions
  • ORT GenAI direct: generator.append_tokens(...) crashes
  • ONNX Runtime direct: context_ctx.onnx crashes when total_seq_len < 64

Local Repro Artifacts

Scripts used locally:

  • test_qnn_ep_minimal.py
  • test_ort_genai_direct.py

Relevant local logs:

  • context-short-faulthandler.log
  • ort-genai-faulthandler.log
  • ort-genai-qwen15-faulthandler.log
  • qwen15-context30-faulthandler.log
