|'0'|Disabled. Inference runs with fp32 precision if the model is fp32.|
|'1'|Default. Enable the float32 model to run inference with fp16 precision.|

|`"offload_graph_io_quantization"`|Description|
|---|---|
|'0'|Default. Disabled. QNN EP handles quantization and dequantization of graph I/O.|
|'1'|Enabled. Offload quantization and dequantization of graph I/O to the CPU EP.|
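
For reference, here is a minimal Python sketch of passing these options to QNN EP when creating a session. The fp16 option key (`enable_htp_fp16_precision`) and the HTP backend path are assumptions for an HTP setup and may differ on your platform.

```
import onnxruntime as ort

# Sketch only: option values follow the tables above; "backend_path" and the
# fp16 option key are assumptions for a Windows HTP setup.
qnn_options = {
    "backend_path": "QnnHtp.dll",            # assumed HTP backend library
    "enable_htp_fp16_precision": "1",        # run a float32 model with fp16 precision
    "offload_graph_io_quantization": "1",    # offload graph I/O quant/dequant to CPU EP
}

session = ort.InferenceSession(
    "model.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[qnn_options],
)
```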
## Supported ONNX operators
### Inference with QNN resource sharing workflow
The OnnxRuntime inference session needs resource sharing enabled (set the session option ep.share_ep_contexts to 1) to use the dumped QNN context model with weight sharing enabled. The workflow is as follows (see the code sketch after the steps):

- Create an OnnxRuntime inference session with ep.share_ep_contexts=1 and load the model1.onnx_ctx.onnx model.
  - The session loads the model1.onnx_ctx.onnx model.
  - The shared place is empty.
  - EPContext node1 in model1.onnx_ctx.onnx specifies that it uses Qnn_graph1.
  - QNN EP loads qnn_ctx.bin and deserializes the binary to get the Qnn graphs (Qnn_graph1, Qnn_graph2).
  - The session uses Qnn_graph1.
  - Qnn_graph2 is put into the shared place.
- Create an OnnxRuntime inference session with ep.share_ep_contexts=1 and load the model2.onnx_ctx.onnx model.
  - The session loads the model2.onnx_ctx.onnx model.
  - EPContext node2 in model2.onnx_ctx.onnx specifies that it uses Qnn_graph2.
  - The shared place already contains Qnn_graph2.
  - QNN EP skips loading qnn_ctx.bin since it gets what it needs from the shared place.
  - The session uses Qnn_graph2 from the shared place.
- To avoid issues while exiting, destroy the 2nd session first, then the 1st session.
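
The following is a minimal Python sketch of this two-session workflow, assuming the two EPContext models are in the working directory; the backend path is a placeholder.

```
import onnxruntime as ort

def make_session(model_path):
    # Enable resource sharing so both sessions can reuse the deserialized Qnn graphs.
    so = ort.SessionOptions()
    so.add_session_config_entry("ep.share_ep_contexts", "1")
    return ort.InferenceSession(
        model_path,
        sess_options=so,
        providers=["QNNExecutionProvider"],
        provider_options=[{"backend_path": "QnnHtp.dll"}],  # assumed HTP backend
    )

# Session 1 loads qnn_ctx.bin and puts Qnn_graph2 into the shared place.
session1 = make_session("model1.onnx_ctx.onnx")
# Session 2 picks Qnn_graph2 up from the shared place instead of reloading the binary.
session2 = make_session("model2.onnx_ctx.onnx")

# ... run inference with both sessions ...

# Destroy the 2nd session first, then the 1st, as noted in the last step above.
del session2
del session1
```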
QNN EP returns StatusCode::ENGINE_ERROR for QNN HTP SSR issues. The upper-level framework/application should recreate the OnnxRuntime session if this error is detected during session run.
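
As an illustration only, a hypothetical recovery loop in Python could look like the sketch below; the helper name and retry policy are not part of the QNN EP API.

```
def run_with_ssr_recovery(create_session, output_names, input_feed, max_retries=1):
    # Hypothetical helper: if Run() fails (for example due to a QNN HTP SSR
    # engine error), recreate the OnnxRuntime session and retry once.
    session = create_session()
    for attempt in range(max_retries + 1):
        try:
            return session.run(output_names, input_feed)
        except Exception:
            if attempt == max_retries:
                raise
            session = create_session()  # recreate the session, then retry
```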
## Add new operator support in QNN EP
To enable support for a new operator in QNN EP, the areas to visit are:

- Does the QDQ quantization script support this op? [code example](https://github.com/microsoft/onnxruntime/pull/14867/files#diff-b1ea073c326fef46054382117c256f106d39bd7c34539d44c6e6d9e9eacc059c)
- Does the OnnxRuntime QDQ node unit support this op? [code example](https://github.com/microsoft/onnxruntime/pull/14867/files#diff-ce0281aaf63e03ecadd592240e41f18742bf8eb095b3725c0e55e589c890946f)
Example error message: <lambda_acc29b18d21b7c13448c4952cd957a60>::operator ()] Model face_det_qdq failed to load:Fatal error: com.ms.internal.nhwc:BatchNormalization(9) is not a registered function/op
A mixed precision QDQ model consists of regions with different activation/weight quantization data types. The boundary between regions converts between activation quantization data types (e.g., uint8 to uint16) using a DQ to Q sequence.
The ability to specify regions with different quantization data types enables exploring the tradeoffs between accuracy and latency. A higher integer precision may improve accuracy at the expense of latency, so selectively promoting certain regions to a higher precision can aid in achieving a desirable balance in key metrics.
The following figure shows a model with a region that has been promoted to 16-bit from the default 8-bit activation type.
This model is quantized to uint8 precision, but tensor "Op4_out" is quantized to 16-bit. This can be achieved by specifying the following initial tensor quantization overrides:
```
# Op4_out could be an inaccurate tensor that should be upgraded to 16bit
init_overrides = {"Op4_out": [{"quant_type": QuantType.QUInt16}]}
```