
Commit d9dcc8e

add instruction to enable new Ops for QNN EP (#22647)
### Description
Add instruction to enable new Ops for QNN EP.
1 parent 3733e39 commit d9dcc8e

3 files changed (+83, -16 lines changed)

docs/execution-providers/QNN-ExecutionProvider.md

Lines changed: 83 additions & 16 deletions
@@ -124,8 +124,13 @@ Alternatively to setting profiling_level at compile time, profiling can be enabl

|`"enable_htp_fp16_precision"`|Description [Example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/c_cxx/QNN_EP/mobilenetv2_classification)|
|---|---|
-|'0'|default.|
-|'1'|Enable the float32 model to be inferenced with fp16 precision.|
+|'0'|Disabled. Inferenced with fp32 precision if it is an fp32 model.|
+|'1'|Default. Enables the float32 model to be inferenced with fp16 precision.|
+
+|`"offload_graph_io_quantization"`|Description|
+|---|---|
+|'0'|Default. Disabled. QNN EP handles the quantization and dequantization of graph I/O.|
+|'1'|Enabled. Offloads the quantization and dequantization of graph I/O to the CPU EP.|
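
For illustration, a minimal Python sketch of passing these provider options when creating a session; the model path and the `backend_path` value are placeholders, and the option values shown are the defaults from the tables above:

```
import onnxruntime as ort

# Placeholder model and backend paths; option values follow the tables above.
sess = ort.InferenceSession(
    "model.qdq.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{
        "backend_path": "QnnHtp.dll",          # HTP backend library (placeholder path)
        "enable_htp_fp16_precision": "1",      # '1' is the default: run an fp32 model at fp16 precision
        "offload_graph_io_quantization": "0",  # '0' is the default: QNN EP handles graph I/O (de)quantization
    }],
)
```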

## Supported ONNX operators

@@ -459,20 +464,20 @@ If user creates the QNN context binary .bin file weight sharing from QNN toolcha

### Inference with QNN resource sharing workflow
The OnnxRuntime inference session needs to have resource sharing enabled (set session option ep.share_ep_contexts to 1) to use the dumped Qnn context model with weight sharing enabled.
-1. Create OnnxRuuntime inference session with ep.share_ep_contexts=1, loads the model1.onnx_ctx.onnx model.
-1.1 The session loads the model1.onnx_ctx.onnx model.
-1.2 The shared place is empty.
-1.3 EPContext node1 in model1.onnx_ctx.onnx specifies that it uses Qnn_graph1
-1.4 QNN EP loads the qnn_ctx.bin and deserialize the binary to get Qnn graphs (Qnn_graph1, Qnn_graph2).
-1.5 Uses Qnn_graph1 for this OnnxRuntime session.
-1.6 Put the Qnn_graph2 into the shared place.
-2. Create OnnxRuuntime inference session with ep.share_ep_contexts=1, loads the model2.onnx_ctx.onnx model.
-2.1 The session loads the model2.onnx_ctx.onnx model.
-2.2 The EPContext node2 in model2.onnx_ctx.onnx specifies that it uses Qnn_graph2.
-2.3 The shared place has Qnn_graph2.
-2.4 QNN EP skips loading qnn_ctx.bin since it gets what it wants from the shared place.
-2.5 Uses Qnn_graph2 from the shared place for this session.
-3. To avoid issues while existing execution, user needs to destroy the 2nd session first, then the 1st session.
+- Create OnnxRuntime inference session with ep.share_ep_contexts=1, which loads the model1.onnx_ctx.onnx model.
+  - The session loads the model1.onnx_ctx.onnx model.
+  - The shared place is empty.
+  - EPContext node1 in model1.onnx_ctx.onnx specifies that it uses Qnn_graph1.
+  - QNN EP loads the qnn_ctx.bin and deserializes the binary to get the Qnn graphs (Qnn_graph1, Qnn_graph2).
+  - Uses Qnn_graph1 for this OnnxRuntime session.
+  - Puts Qnn_graph2 into the shared place.
+- Create OnnxRuntime inference session with ep.share_ep_contexts=1, which loads the model2.onnx_ctx.onnx model.
+  - The session loads the model2.onnx_ctx.onnx model.
+  - The EPContext node2 in model2.onnx_ctx.onnx specifies that it uses Qnn_graph2.
+  - The shared place has Qnn_graph2.
+  - QNN EP skips loading qnn_ctx.bin since it gets what it needs from the shared place.
+  - Uses Qnn_graph2 from the shared place for this session.
+- To avoid issues while exiting, the user needs to destroy the 2nd session first, then the 1st session.
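
For illustration, a minimal Python sketch of the workflow above (the linked code example below is in C++); the model file names follow the list above, and the `backend_path` value is a placeholder:

```
import onnxruntime as ort

qnn_options = [{"backend_path": "QnnHtp.dll"}]  # placeholder backend library

# 1st session: loads model1.onnx_ctx.onnx, deserializes qnn_ctx.bin,
# uses Qnn_graph1 and places Qnn_graph2 into the shared location.
so1 = ort.SessionOptions()
so1.add_session_config_entry("ep.share_ep_contexts", "1")
sess1 = ort.InferenceSession("model1.onnx_ctx.onnx", sess_options=so1,
                             providers=["QNNExecutionProvider"],
                             provider_options=qnn_options)

# 2nd session: finds Qnn_graph2 in the shared location, so it skips loading qnn_ctx.bin.
so2 = ort.SessionOptions()
so2.add_session_config_entry("ep.share_ep_contexts", "1")
sess2 = ort.InferenceSession("model2.onnx_ctx.onnx", sess_options=so2,
                             providers=["QNNExecutionProvider"],
                             provider_options=qnn_options)

# Destroy the 2nd session before the 1st, as noted above.
del sess2
del sess1
```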

[Code example](https://github.com/microsoft/onnxruntime/blob/291a5352b27ded5714e5748b381f2efb88f28fb9/onnxruntime/test/providers/qnn/qnn_ep_context_test.cc#L979-L992).

@@ -502,3 +507,65 @@ sess = ort.InferenceSession(model_path, providers=['QNNExecutionProvider'], prov
## Error handling
### HTP SubSystem Restart - [SSR](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#subsystem-restart-ssr-)
QNN EP returns StatusCode::ENGINE_ERROR for a QNN HTP SSR issue. The upper-level framework/application should recreate the Onnxruntime session if this error is detected during session run.
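
For illustration, a sketch of this recovery pattern in Python; the exact exception type surfaced for ENGINE_ERROR depends on the language binding, so the sketch matches on the error text, and the retry-once policy is an assumption:

```
import onnxruntime as ort

def run_with_ssr_recovery(model_path, inputs, providers, provider_options):
    # Run the model and retry once with a fresh session if a QNN HTP SSR
    # (StatusCode::ENGINE_ERROR) is reported during the run.
    sess = ort.InferenceSession(model_path, providers=providers,
                                provider_options=provider_options)
    try:
        return sess.run(None, inputs)
    except Exception as err:  # assumption: "ENGINE_ERROR" appears in the raised error text
        if "ENGINE_ERROR" not in str(err):
            raise
        del sess  # release the failed session, then recreate it
        sess = ort.InferenceSession(model_path, providers=providers,
                                    provider_options=provider_options)
        return sess.run(None, inputs)
```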

## Add new operator support in QNN EP
To enable support for a new operator in QNN EP, the areas to visit are:
- Does the QDQ script support this Op? [code example](https://github.com/microsoft/onnxruntime/pull/14867/files#diff-b1ea073c326fef46054382117c256f106d39bd7c34539d44c6e6d9e9eacc059c) (a rough check is sketched after this list)
- Does the Onnxruntime QDQ node unit support this Op? [code example](https://github.com/microsoft/onnxruntime/pull/14867/files#diff-ce0281aaf63e03ecadd592240e41f18742bf8eb095b3725c0e55e589c890946f)
- Is it a layout-sensitive operator?
  - Is it registered in the LayoutTransformer? [code example](https://github.com/microsoft/onnxruntime/blob/6d464748ba7fed2275ecba3a7406298cabc93438/onnxruntime/core/optimizer/transpose_optimizer/transpose_optimizer.cc#L2168)
  - Is the NHWC op schema registered? Example error message when it is not: `<lambda_acc29b18d21b7c13448c4952cd957a60>::operator ()] Model face_det_qdq failed to load: Fatal error: com.ms.internal.nhwc:BatchNormalization(9) is not a registered function/op` [Example PR](https://github.com/microsoft/onnxruntime/pull/15278)
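
For the first item, a rough way to check is to quantize a small model containing the new Op and inspect whether its activation inputs are fed by DequantizeLinear nodes. This is only a sketch; `new_op_model.onnx`, `MyNewOp`, and `data_reader` are hypothetical placeholders:

```
import onnx
from onnxruntime.quantization import QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config

# Quantize a small float model that contains the new Op ("MyNewOp").
qnn_config = get_qnn_qdq_config("new_op_model.onnx", data_reader,
                                activation_type=QuantType.QUInt8,
                                weight_type=QuantType.QUInt8)
quantize("new_op_model.onnx", "new_op_model.qdq.onnx", qnn_config)

# If the QDQ script supports the Op, its activation inputs should now be
# produced by DequantizeLinear nodes.
qdq_model = onnx.load("new_op_model.qdq.onnx")
producers = {out: node.op_type for node in qdq_model.graph.node for out in node.output}
for node in qdq_model.graph.node:
    if node.op_type == "MyNewOp":
        print([producers.get(i) for i in node.input])  # expect "DequantizeLinear" entries
```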

### Example PRs to enable new operators:
- Non-layout-sensitive operator: [Enable Hardsigmoid for QNN EP using direct SDK support](https://github.com/microsoft/onnxruntime/pull/20956)
- Layout-sensitive operator: [Add InstanceNormalization operator to QNN EP](https://github.com/microsoft/onnxruntime/pull/14867)

## Mixed precision support
The following figure demonstrates an example of a mixed precision model.
<p align="center"><img width="100%" src="../../images/quantization_mixed_precision_1.png" alt="mixed precision model"/></p>
A mixed precision QDQ model consists of regions with different activation/weight quantization data types. The boundary between regions converts between activation quantization data types (e.g., uint8 to uint16) using a DQ to Q sequence.

The ability to specify regions with different quantization data types enables exploring the tradeoffs between accuracy and latency. A higher integer precision may improve accuracy at the expense of latency, so selectively promoting certain regions to a higher precision can aid in achieving a desirable balance in key metrics.

The following figure shows a model with a region that has been promoted to 16-bit from the default 8-bit activation type.
<p align="center"><img width="60%" src="../../images/quantization_mixed_precision_2.png" alt="mixed precision layers"/></p>

This model is quantized to uint8 precision, but tensor "Op4_out" is quantized to 16-bit. This can be achieved by specifying the following initial tensor quantization overrides:

```
from onnxruntime.quantization import QuantType
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config

# Op4_out could be an inaccurate tensor that should be upgraded to 16bit
initial_overrides = {"Op4_out": [{"quant_type": QuantType.QUInt16}]}

qnn_config = get_qnn_qdq_config(
    float_model_path,
    data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    init_overrides=initial_overrides,  # These initial overrides will be "fixed"
)
```

The above snippet generates the following "fixed" overrides (retrieved via qnn_config.extra_options["TensorQuantOverrides"]):

```
overrides = {
    "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": QUInt16}],
    "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}]
}
```

After the override, the model works like this:

- Op2’s output is consumed by Op4, Op7, and Op8. Op4 consumes the converted u16 type, while Op7 and Op8 consume the original u8 type.
- Op3’s output is converted from u8 to u16. Op5 consumes the converted u16 type.
- Op4’s output is just u16 (not converted).
- Op5’s output is converted from u16 to u8. Op6 consumes the u8 type.
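
With the config prepared as above, the quantization itself could be run roughly as follows. This is a sketch: `float_model_path` and `qdq_model_path` are placeholders, and it assumes the generic onnxruntime.quantization.quantize entry point that accepts the config object:

```
from onnxruntime.quantization import quantize

# Inspect the "fixed" overrides generated from the initial overrides.
print(qnn_config.extra_options["TensorQuantOverrides"])

# Quantize the float model into a QDQ model using the mixed precision config.
quantize(float_model_path, qdq_model_path, qnn_config)
```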