28 changes: 21 additions & 7 deletions docs/execution-providers/EP-Context-Design.md
@@ -17,7 +17,7 @@ redirect_from: /docs/reference/execution-providers/EP-Context-Design

## Background

OnnxRuntime Execution Providers enable users to run inference with Onnx models on different kinds of hardware accelerators, empowered by backend SDKs (like QNN, OpenVINO, Vitis AI, etc.). The Execution Providers convert the Onnx model into the graph format required by the backend SDK, and compile it into the format required by the hardware. In the NPU world specifically, the converting and compiling process takes a long time to complete, especially for LLM models. In some cases session creation costs tens of minutes, which badly impacts the user experience.
To avoid the converting and compiling cost, most backend SDKs provide a feature to dump the pre-compiled model into a binary file. The pre-compiled model can be loaded by the backend SDK directly and executed on the target device, which greatly improves session creation time. To achieve this, OnnxRuntime defines a contrib op called EPContext in the MS domain.

## EPContext Op Schema
@@ -32,12 +32,13 @@ Attributes:
|main_context |int64 |1 (default): This node points to EP context content that contains the graph referred to by this node.<br/>0: The node does not point to any EP context content. Expect to get the graph from a node that has this field set to 1.<br/>Some EPs support a single context that contains multiple graphs. The EPContext node with main_context=1 refers to the real context, and the context contains graphs that are referred to by other nodes with main_context=0.|
|ep_cache_context |string |Payload of the EP context if embed_mode=1, or path to the context file if embed_mode=0.<br/>The path is relative to the Onnx model file. It can be a file name, or subfolder/filename.|
|embed_mode |int64 |1 (default): ep_cache_context contains the payload of the context content.<br/>0: ep_cache_context is the context binary file path.|
|ep_sdk_version |string |Optional. SDK version that was used to generate the node.|
|onnx_model_filename |string |Optional. Original Onnx model file name.|
|hardware_architecture|string |Optional. Hardware architecture.|
|partition_name |string |Optional. OnnxRuntime partitioned graph name.|
|source |string |Optional. The source used to generate the node. Should be a key identified by the EP so that OnnxRuntime can support multiple EPContext nodes running with different EPs. For example, QNN EP only accepts nodes with source=QNN or QnnExecutionProvider, and OpenVINO EP only accepts nodes with source=OpenVINOExecutionProvider.|
|notes |string |Optional. Additional information required by a specific EP.|
|max_size |int64 |Optional. Max size in the context. Usage depends on the EP. Defaults to 0.|

<p align="center"><img width="60%" src="../../images/EP_context_node.png" alt="EP Context node example"/></p>
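As an illustration only, the sketch below builds a single EPContext node by hand with the `onnx` Python helper API. The node name, tensor shapes, opset versions, payload bytes, and attribute values are placeholder assumptions, not output produced by any EP.

```python
# Illustrative only: hand-building an EPContext node with the onnx helper API.
# All names, shapes, versions, and the payload bytes below are placeholders.
import onnx
from onnx import TensorProto, helper

ep_context_node = helper.make_node(
    "EPContext",
    inputs=["input_0"],
    outputs=["output_0"],
    name="EPContext_0",
    domain="com.microsoft",                  # EPContext lives in the MS domain
    main_context=1,                          # this node carries the real context
    embed_mode=1,                            # payload embedded in ep_cache_context
    ep_cache_context=b"<pre-compiled blob>", # placeholder payload bytes
    source="QnnExecutionProvider",           # EP that should claim this node
    onnx_model_filename="model.onnx",
    ep_sdk_version="2.x",
)

graph = helper.make_graph(
    [ep_context_node],
    "ep_context_graph",
    [helper.make_tensor_value_info("input_0", TensorProto.FLOAT, [1, 3, 224, 224])],
    [helper.make_tensor_value_info("output_0", TensorProto.FLOAT, [1, 1000])],
)
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)],
)
onnx.save(model, "model_ctx.onnx")
```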

@@ -46,13 +47,14 @@ Attributes:
|Session option |Description |
|---------------------------|----------------------------------------------------------------------------------------------------------|
|ep.context_enable |Used for context model generation only.<br/>1: Enable OnnxRuntime to dump the context cache model.<br/>0 (default): disable.|
|ep.context_file_path |Specify the file path for the dumped model.<br/>Defaults to original_file_name_ctx.onnx for context model generation.<br/>For model inference, if the user loads the model from a memory buffer and the EP context binary is outside the Onnx model, the user needs to set this option. The OnnxRuntime EP uses this path to get the folder path, which is combined with ep_cache_context (which points to the context binary path) to build the absolute path of the context binary file.|
|ep.context_embed_mode |Used for context model generation only.<br/>1: dump the EP context content into the Onnx model, inside the ep_cache_context node attribute.<br/>0 (default): dump the EP context content into a separate file and keep the file name in the Onnx model. The file path is tracked in the ep_cache_context node attribute.|
|ep.context_node_name_prefix|Used for context model generation only.<br/>Specify the EPContext node name (also the partition_name attribute and internal graph name) prefix to make it unique across nodes, to avoid conflicts in case the user glues multiple EPContext nodes into one model.|
|ep.context_model_external_initializers_file_name|This is for the case where some nodes are partitioned onto the CPU EP and those nodes have external initializers. When generating the EP context model, the newly generated model should NOT depend on the old external data file used by the source Onnx model.<br/>Use this config when dumping the EP context model with an external initializers file. All initializers will be placed in the external data file if it is specified; otherwise they are all placed inside the generated Onnx file.<br/>It is not set by default, so all initializers will be inside the Onnx file.|
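
Putting the options above together, a minimal Python sketch of context cache model generation could look like the following. The model path, output path, and the QNN provider options are placeholder assumptions; other EPs that support EPContext take the same session options.

```python
# Sketch of generating an EP context cache model; paths and provider options
# are placeholders. Creating the session compiles the model and dumps the
# context cache model to ep.context_file_path.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("ep.context_enable", "1")                    # dump the context cache model
so.add_session_config_entry("ep.context_file_path", "./model_ctx.onnx")
so.add_session_config_entry("ep.context_embed_mode", "0")                # context binary in a separate file

ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=[("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"})],
)
```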

## EP Context cache model generation workflow

OnnxRuntime EPs should follow these rules to create the EP context cache model to maintain a unified user interface.
1. ep.context_enable
OnnxRuntime creates the EP context cache model if ep.context_enable = 1. Otherwise (ep.context_enable = 0, the default), it just follows the normal workflow.
2. ep.context_file_path
@@ -80,3 +82,15 @@ OnnxRuntime EPs which support loading from Onnx model with EPContext nodes should
c. If the user loads the model from a memory buffer, the user needs to provide the session option ep.context_file_path. The EP gets the folder path from ep.context_file_path and combines it with the relative path obtained in step a) to form the full path of the context binary file (see the sketch below).
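
As a minimal sketch of rule c. above (paths and the QNN provider options are placeholder assumptions), loading a context model from a memory buffer could look like this:

```python
# Sketch: inference with a context model loaded from a memory buffer while the
# EP context binary sits in a separate file on disk. Paths are placeholders.
import onnxruntime as ort

with open("./model_ctx.onnx", "rb") as f:
    model_bytes = f.read()

so = ort.SessionOptions()
# Required so the EP can resolve the relative ep_cache_context path.
so.add_session_config_entry("ep.context_file_path", "./model_ctx.onnx")

session = ort.InferenceSession(
    model_bytes,
    sess_options=so,
    providers=[("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"})],
)
```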

<p align="center"><img width="60%" src="../../images/EP_context_nodes_with_different_eps.png" alt="EP Context nodes with different EPs"/></p>

## New ExecutionProvider interface GetEpContextNodes() to help generate the EP Context cache model

It is hard for an Execution Provider to generate the partitioned graph within the Execution Provider code, since an Execution Provider does not have a good picture of the whole partitioned graph. The new ExecutionProvider interface GetEpContextNodes() is added to support this.

```
virtual const InlinedVector<const Node*> GetEpContextNodes() const {
return InlinedVector<const Node*>();
}
```

This API returns an array of pointers to the EPContext nodes. An Execution Provider needs to implement this interface if it needs to generate the context cache model; otherwise it can leave it unimplemented. It is the Execution Provider's responsibility to create the EPContext nodes together with their dependencies (like the context binary file if it's not embed_mode). The OnnxRuntime GraphPartitioner uses this interface to get the EPContext nodes and generate the partitioned Onnx model. [code details here](https://github.com/microsoft/onnxruntime/blob/544bdd60730270f49f6a5baafdff54065f626776/onnxruntime/core/framework/graph_partitioner.cc#L646-L750)
24 changes: 15 additions & 9 deletions docs/execution-providers/QNN-ExecutionProvider.md
@@ -124,13 +124,19 @@ Alternatively to setting profiling_level at compile time, profiling can be enabled

|`"enable_htp_fp16_precision"`|Description [Example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/c_cxx/QNN_EP/mobilenetv2_classification)|
|---|---|
|'0'|Disabled. Inference runs with fp32 precision if it is an fp32 model.|
|'1'|Default. Enables a float32 model to be inferenced with fp16 precision.|

|`"offload_graph_io_quantization"`|Description|
|---|---|
|'0'|Disabled. QNN EP will handle quantization and dequantization of graph I/O.|
|'1'|Default. Enabled. Offload quantization and dequantization of graph I/O to the CPU EP.|

|`"enable_htp_shared_memory_allocator"`|Description|
|---|---|
|'0'|Default. Disabled.|
|'1'|Enable the QNN HTP shared memory allocator. Requires libcdsprpc.so/dll to be available. [Code example](https://github.com/microsoft/onnxruntime/blob/544bdd60730270f49f6a5baafdff54065f626776/onnxruntime/test/shared_lib/test_inference.cc#L2262-L2354)|
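
Putting these options together, here is a minimal Python sketch of passing them as QNN provider options; the backend path and the chosen values are placeholder assumptions for the target setup.

```python
# Sketch: passing the QNN HTP options described above as provider options.
# backend_path and the chosen values are placeholders for the target setup.
import onnxruntime as ort

qnn_options = {
    "backend_path": "QnnHtp.dll",
    "enable_htp_fp16_precision": "1",           # run a float32 model with fp16 precision
    "offload_graph_io_quantization": "1",       # let the CPU EP handle graph I/O (de)quantization
    "enable_htp_shared_memory_allocator": "0",  # HTP shared memory allocator stays disabled
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", qnn_options)],
)
```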


## Supported ONNX operators

@@ -420,20 +426,20 @@ g_ort->AddSessionConfigEntry(session_options, kOrtSessionOptionEpContextFilePath
options.add_session_config_entry("ep.context_file_path", "./model_a_ctx.onnx")
```

### Enable the embed mode
The QNN context binary content is not embedded in the generated Onnx model by default; instead, a bin file is generated separately. The file name looks like [ctx.onnx]_QNNExecutionProvider_QNN_[hash_id]_x_x.bin. The name is provided by Ort and tracked in the generated Onnx model. Any changes to the bin file will cause problems, and the bin file needs to sit together with the generated Onnx file. The user can enable embed mode by setting "ep.context_embed_mode" to "1", in which case the content of the context binary is embedded inside the Onnx model.

```
// C++
so.AddConfigEntry(kOrtSessionOptionEpContextEmbedMode, "0");
so.AddConfigEntry(kOrtSessionOptionEpContextEmbedMode, "1");

// C
g_ort->AddSessionConfigEntry(session_options, kOrtSessionOptionEpContextEmbedMode, "0");
g_ort->AddSessionConfigEntry(session_options, kOrtSessionOptionEpContextEmbedMode, "1");
```

```python
# Python
options.add_session_config_entry("ep.context_embed_mode", "0")
options.add_session_config_entry("ep.context_embed_mode", "1")
```

## QNN EP weight sharing