Foundry/QNN NPU LLM crash on Surface Pro 11 Snapdragon X Elite

## Summary

Local NPU execution works with ONNX Runtime QNN on this machine, but Foundry Local NPU LLMs crash during text generation. The crash is reproducible with two different Foundry QNN NPU models:

- `phi-3-mini-4k-instruct-qnn-npu:3`
- `qwen2.5-1.5b-instruct-qnn-npu:2`

This is not a general NPU failure: a minimal ONNX model runs successfully on `QNNExecutionProvider` with device type `NPU`, and individual LLM ONNX components can run on the NPU. The crash appears when the decoder context model (`context_ctx.onnx`) is run with a prompt shorter than the fixed sliding-window size of 64 tokens, and also when calling `onnxruntime_genai.Generator.append_tokens(...)`.

## Environment

- Device: Surface Pro 11th Edition For Business, Model 2085
- CPU/SoC: Snapdragon X Elite, `X1E80100`
- OS: Windows 11 Enterprise ARM64, build `26200.8457`
- RAM: 16 GB
- NPU device: `Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Hexagon(TM) NPU`
- NPU driver observed from PowerShell: `30.0.219.1000`
- Foundry Local: `Microsoft.FoundryLocal 0.8.119.102 arm64`
- Python: official ARM64 Python 3.13
- Python packages:
  - `onnxruntime==1.24.4`
  - `onnxruntime-qnn==2.1.1`
  - `onnxruntime-genai==0.13.2`

## What works

### Minimal QNN NPU model works

Using a tiny ONNX `QLinearMatMul` model with:

- `onnxruntime-qnn`
- `QNNExecutionProvider`
- `QnnHtp.dll`
- selected ORT EP device type `NPU`

Result:

```text
Selected device: npu
Creating InferenceSession...
Session created.
Running inference...
Output: [[128 128 128]]
Minimal QNN test completed.
```

This shows the NPU and the QNN plugin can execute at least a simple ONNX graph.

### Foundry model ONNX components can load and partly run on NPU

For `phi-3-mini-4k-instruct-qnn-npu:3`:

- `embeddings.onnx` runs on NPU
- `lm_head.onnx` runs on NPU
- `context_ctx.onnx` runs on NPU when `total_seq_len=64`
- `iterator_ctx.onnx` runs on NPU after a 64-token context prefill

For `qwen2.5-1.5b-instruct-qnn-npu:2`:

- `context_ctx.onnx` runs on NPU when `total_seq_len=64`

## What fails

### ORT GenAI Generator crashes

For both models, using `onnxruntime_genai.Generator` crashes during:

```python
generator.append_tokens(input_tokens)
```

The Python faulthandler output shows:

```text
Windows fatal exception: access violation

Current thread:
  File "...test_ort_genai_direct.py", line 171 in main
```

where line 171 is:

```python
generator.append_tokens(input_tokens)
```
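For anyone reproducing this, a trace like the one above can be captured with Python's standard `faulthandler` module. The following is a minimal sketch, not the attached script: the helper name `enable_crash_traces` and the log path passed to it are hypothetical.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Send faulthandler output to log_path and return the open file.

    The file must stay open while faulthandler is enabled, because the
    traceback is written to its file descriptor at crash time.
    """
    log_file = open(log_path, "w", buffering=1)
    # all_threads=True dumps every Python thread, not only the faulting one.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traces("crash-trace.log")` before creating the generator means the Python-level stack is already on disk when the process dies inside native code.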

Windows Event Viewer reports crashes such as:

```text
Application: python.exe
Faulting module: ntdll.dll
Exception code: 0xc0000409
```
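Note that the two reports name different conditions: faulthandler prints "access violation" (NTSTATUS `0xC0000005`), while the Event Viewer code `0xc0000409` is `STATUS_STACK_BUFFER_OVERRUN`, which Windows also uses for fail-fast terminations. Why the two differ is not established here. The small decoder below is a sketch for labelling such codes when triaging logs; the helper name and the choice of table entries are assumptions for illustration, and the table is not exhaustive.

```python
# NTSTATUS values as defined in the Windows SDK header ntstatus.h.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000374: "STATUS_HEAP_CORRUPTION",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(code):
    """Return a readable label for a Windows exit or exception code."""
    unsigned = code & 0xFFFFFFFF  # normalise values reported as signed 32-bit
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"


print(describe_exit_code(0xC0000409))   # 0xc0000409 (STATUS_STACK_BUFFER_OVERRUN)
print(describe_exit_code(-1073741819))  # 0xc0000005 (STATUS_ACCESS_VIOLATION)
```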

The Foundry Local service and CLI show the same class of failure when chat completions are requested against NPU models.

### Direct context ONNX run crashes for short prompt length

For `phi-3-mini-4k-instruct-qnn-npu:3`:

- `context_ctx.onnx` with `total_seq_len=64`: succeeds
- `context_ctx.onnx` with `total_seq_len=4`: crashes with access violation

For `qwen2.5-1.5b-instruct-qnn-npu:2`:

- `context_ctx.onnx` with `total_seq_len=64`: succeeds
- `context_ctx.onnx` with `total_seq_len=30`: crashes with access violation

Example faulthandler output for Qwen:

```text
Windows fatal exception: access violation

Current thread:
  File "...onnxruntime_inference_collection.py", line 296 in run
  File "<stdin>", line 32 in <module>
```

This points to a native crash inside `InferenceSession.run(...)`.
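Because a native crash takes down the whole interpreter, checking several values of `total_seq_len` in one run needs process isolation. The sketch below runs one child process per length and records its exit code. It is not the harness used for this report: `sweep_lengths`, `build_command`, and `demo_command` are hypothetical names, and the demo child only simulates a failure with an exit code instead of loading `context_ctx.onnx`.

```python
import subprocess
import sys


def sweep_lengths(lengths, build_command, timeout_s=300):
    """Run one child process per length and collect its exit code.

    build_command(length) must return the argv list for the child.
    A native crash only kills that child, so the sweep continues.
    """
    results = {}
    for length in lengths:
        try:
            proc = subprocess.run(
                build_command(length),
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            results[length] = proc.returncode
        except subprocess.TimeoutExpired:
            results[length] = None  # child hung instead of exiting
    return results


def demo_command(length):
    # Stand-in child that exits with 1 for lengths below 64. A real child
    # would create the session and call InferenceSession.run(...) with
    # total_seq_len=length.
    code = f"import sys; sys.exit(0 if {length} >= 64 else 1)"
    return [sys.executable, "-c", code]


for length, returncode in sweep_lengths([4, 30, 64], demo_command).items():
    print(length, returncode)
```

With a real child script, a crashing length would be expected to show up as a non-zero exit code for that entry only, while the remaining lengths still get tested.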

## Reproduction Notes

Both models use the following `sliding_window` configuration:

```json
"sliding_window": {
    "window_size": 64,
    "pad_value": 0,
    "alignment": "left",
    "slide_key_value_cache": false
}
```

The crash occurs when the prompt/prefill length is shorter than the 64-token window, while a full 64-token context run succeeds.
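To make the window arithmetic concrete, the sketch below splits a prompt into 64-token chunks and pads the incomplete one with `pad_value`. It is an illustration only, not ORT GenAI's implementation: the function name `pad_to_window` and the `pad_side` parameter are hypothetical, how the config's `"alignment": "left"` maps to a padding side is an assumption left open here, and whether manually padded input avoids the crash has not been tested.

```python
def pad_to_window(tokens, window_size=64, pad_value=0, pad_side="right"):
    """Split tokens into window_size chunks, padding the incomplete one."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if pad_side not in ("left", "right"):
        raise ValueError("pad_side must be 'left' or 'right'")
    chunks = []
    for start in range(0, len(tokens), window_size):
        chunk = list(tokens[start:start + window_size])
        padding = [pad_value] * (window_size - len(chunk))
        # pad_side is an assumption; it is not derived from the model config.
        chunks.append(padding + chunk if pad_side == "left" else chunk + padding)
    return chunks


prompt = [101, 102, 103, 104]  # hypothetical token ids for a 4-token prompt
chunks = pad_to_window(prompt)
print(len(chunks), len(chunks[0]), chunks[0][:6])  # 1 64 [101, 102, 103, 104, 0, 0]
```

A 4-token prompt therefore occupies a single window with 60 padded positions, which is the case that crashes for `phi-3-mini-4k-instruct-qnn-npu:3`.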

## Expected Behavior

Foundry Local / ORT GenAI should be able to run NPU LLM generation for ordinary short prompts, e.g. `"ok"` or `"Rispondi solo: ok"`, without crashing the hosting process.

## Actual Behavior

The hosting process crashes with a native access violation (as reported by faulthandler) or exception code `0xc0000409` (as reported by Windows Event Viewer) on the first generation path:

- Foundry Local: service process terminates during `/v1/chat/completions`
- ORT GenAI direct: `generator.append_tokens(...)` crashes
- ONNX Runtime direct: `context_ctx.onnx` crashes when `total_seq_len < 64`

## Local Repro Artifacts

Scripts used locally:

- `test_qnn_ep_minimal.py`
- `test_ort_genai_direct.py`

Relevant local logs:

- `ort-genai-qwen15-faulthandler.log`
- `qwen15-context30-faulthandler.log`
- `ort-genai-faulthandler.log`

Attached files:

- [context-short-faulthandler.log](https://github.com/user-attachments/files/28112174/context-short-faulthandler.log)
- [ort-genai-faulthandler.log](https://github.com/user-attachments/files/28112173/ort-genai-faulthandler.log)
- [ort-genai-qwen15-faulthandler.log](https://github.com/user-attachments/files/28112175/ort-genai-qwen15-faulthandler.log)
- [qwen15-context30-faulthandler.log](https://github.com/user-attachments/files/28112176/qwen15-context30-faulthandler.log)
- [test_ort_genai_direct.py](https://github.com/user-attachments/files/28112172/test_ort_genai_direct.py)
- [test_qnn_ep_minimal.py](https://github.com/user-attachments/files/28112177/test_qnn_ep_minimal.py)
