diff --git a/docs/execution-providers/EP-Context-Design.md b/docs/execution-providers/EP-Context-Design.md
index 9a6578e4b23d5..85b26c003a7bd 100644
--- a/docs/execution-providers/EP-Context-Design.md
+++ b/docs/execution-providers/EP-Context-Design.md
@@ -55,35 +55,35 @@ Atrribures:
 ## EP Context cache model generation workflow
 OnnxRuntime EPs should follow these rules to create the EP context cache model to maintain a unified user interface.
-1. ep.context_enable
-   OnnxRuntime create the EP context cache model if ep.context_enable = 1. Otherwise, ep.context_enable = 0 (default), just do the normal workflow.
-2. ep.context_file_path
-   OnnxRuntime just append “_ctx.onnx” to the input file name as the output file name if no ep.context_file_path provided. Otherwise just use the user provided file path.
-   ep.context_file_path is required if user loads the model from memory buffer, since there’s no way for OnnxRuntime to get the input file path for this scenario.
-3. ep.context_embed_mode
-   1 (default): dump the EP context context content into the Onnx model.
-   0: dump the EP context content as a separate file. EP decides the file name and tracks the file name in EPContext node attribute ep_cache_context. The separate file should always at the same location as the dumped Onnx model file. And the file path tracked in EPContext node is a relative path to the Onnx model file. Note: subfolder is allowed.
-4. ep.context_node_name_prefix
-   In case the user wants to add special tag inside the EPContext node name (also the partition_name attribute, and graph name), EP should provide this capability when EP creates the EPContext nodes.
-   This is useful if the user wants to glue multiple EPContext nodes from multiple models into one model and there’s risk that node name (graph name) confliction happens across models. Dependes on EP implementation. QNN EP supports multiple EPContext nodes, so user can merge and re-connect EPContext nodes from different models.
+- ep.context_enable
+  - OnnxRuntime creates the EP context cache model if ep.context_enable = 1. Otherwise (ep.context_enable = 0, the default), it follows the normal workflow. See the sketch after this list.
+- ep.context_file_path
+  - If no ep.context_file_path is provided, OnnxRuntime derives the output file name from the original input file name by replacing ".onnx" with "_ctx.onnx". Otherwise it uses the user-provided file path.
+  - ep.context_file_path is required if the user loads the model from a memory buffer, since there is no way for OnnxRuntime to get the input file path in that scenario.
+- ep.context_embed_mode
+  - 1 (default): dump the EP context content into the Onnx model.
+  - 0: dump the EP context content as a separate file. The EP decides the file name and tracks it in the EPContext node attribute ep_cache_context. The separate file should always be at the same location as the dumped Onnx model file, and the file path tracked in the EPContext node is a relative path to the Onnx model file. Note: a subfolder is allowed.
+- ep.context_node_name_prefix
+  - In case the user wants to add a special tag inside the EPContext node name (also the partition_name attribute and graph name), the EP should provide this capability when it creates the EPContext nodes.
+  - This is useful if the user wants to glue EPContext nodes from multiple models into one model and there is a risk of node name (graph name) conflicts across models. Support depends on the EP implementation. QNN EP supports multiple EPContext nodes, so users can merge and re-connect EPContext nodes from different models.
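+
+For example, a minimal Python sketch that sets these session options to dump an EP context cache model. The file names and the QNN backend_path provider option are illustrative assumptions, not prescribed values:
+
+```
+# Python
+import onnxruntime as ort
+
+so = ort.SessionOptions()
+# Ask OnnxRuntime to dump the EP context cache model.
+so.add_session_config_entry("ep.context_enable", "1")
+# Optional: output path; defaults to the input name with ".onnx" replaced by "_ctx.onnx".
+so.add_session_config_entry("ep.context_file_path", "./model_a_ctx.onnx")
+# Optional: 0 keeps the EP context in a separate file next to the dumped model.
+so.add_session_config_entry("ep.context_embed_mode", "0")
+# Optional: tag node/graph names to avoid conflicts when merging models later.
+so.add_session_config_entry("ep.context_node_name_prefix", "model_a_")
+
+# Creating the session compiles the model and dumps ./model_a_ctx.onnx.
+session = ort.InferenceSession("./model_a.onnx", sess_options=so,
+                               providers=[("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"})])
+```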
 ## Inference from EP Context cache model workflow
 OnnxRuntime EPs which support loading from Onnx model with EPContext nodes should follow the workflow/rules for model inference.
-1. EP should be able to identify the model which has EPContext node.
-   a. EP follows its normal workflow if there’s no EPContext nodes inside the model.
-   b. If it is the Onnx model has EPContext nodes.
-      i. EP should check the source node attribute from all EPContext nodes to make sure there is any EPContext node for this EP (the source node attribute matches the key required by the EP).
-      ii. EP only partition in the EPContext nodes which has source node attribute matches the key required by the EP.
-      iii. EP loads from the cached context inside EPContext node
-2. If the context cache Onnx model is dumped with embed_mode = 1, so there is separate context binary file beside the Onnx model in the same folder.
-   a. OnnxRuntime EP gets the context binary file relative path from EPContext ep_cache_context node attribute.
-   b. If the user loads the model from a Onnx model file path, then EP should get the input model folder path, and combine it with the relative path got from step a) as the context binary file full path.
-   c. If the user loads the model from memory buffer, user needs to provide session option ep.context_file_path. EP gets the folder path from ep.context_file_path, and combines it with the relative path got from step a) as the context binary file full path.
+- The EP should be able to identify a model which has EPContext nodes.
+  - The EP follows its normal workflow if there are no EPContext nodes inside the model.
+  - If the Onnx model has EPContext nodes:
+    - The EP should check the source node attribute of all EPContext nodes to determine whether there is any EPContext node for this EP (i.e., the source attribute matches the key required by the EP).
+    - The EP only partitions in the EPContext nodes whose source attribute matches the key required by the EP.
+    - The EP loads from the cached context inside the EPContext node.
+- If the context cache Onnx model is dumped with embed_mode = 0, there is a separate context binary file beside the Onnx model in the same folder.
+  - The OnnxRuntime EP gets the context binary file's relative path from the EPContext node's ep_cache_context attribute.
+  - If the user loads the model from an Onnx model file path, the EP should get the input model's folder path and combine it with that relative path to form the context binary file's full path.
+  - If the user loads the model from a memory buffer, the user needs to provide the session option ep.context_file_path. The EP gets the folder path from ep.context_file_path and combines it with that relative path to form the context binary file's full path, as shown in the sketch below.
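+
+A minimal Python sketch of the memory-buffer case; the file name and QNN provider options are illustrative assumptions:
+
+```
+# Python
+import onnxruntime as ort
+
+with open("./model_a_ctx.onnx", "rb") as f:
+    model_bytes = f.read()
+
+so = ort.SessionOptions()
+# Required when loading from a buffer: gives the EP a folder against which
+# to resolve the relative context binary path (embed_mode = 0).
+so.add_session_config_entry("ep.context_file_path", "./model_a_ctx.onnx")
+
+session = ort.InferenceSession(model_bytes, sess_options=so,
+                               providers=[("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"})])
+```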

EP Context nodes with different EPs
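+
+To make the figure concrete, here is a hypothetical Python sketch that wraps a precompiled context blob in an EPContext node using the onnx helper API. The attribute values (the source key, the relative .bin path, and the partition name) are illustrative assumptions, not prescribed values:
+
+```
+# Python
+import onnx
+from onnx import TensorProto, helper
+
+# One EPContext node wrapping one precompiled graph. With embed_mode = 0,
+# ep_cache_context holds a path relative to the dumped Onnx model.
+ep_ctx_node = helper.make_node(
+    "EPContext", ["in0"], ["out0"],
+    name="EPContext_graph1",                  # illustrative node name
+    domain="com.microsoft",
+    embed_mode=0,
+    ep_cache_context="./ctx_bin/graph1.bin",  # relative path; subfolder allowed
+    source="ExampleEP",                       # key the matching EP checks
+    partition_name="graph1",
+)
+graph = helper.make_graph(
+    [ep_ctx_node], "ep_ctx_graph",
+    [helper.make_tensor_value_info("in0", TensorProto.FLOAT, [1, 3])],
+    [helper.make_tensor_value_info("out0", TensorProto.FLOAT, [1, 3])],
+)
+model = helper.make_model(
+    graph,
+    opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)],
+)
+onnx.save(model, "example_ctx.onnx")
+```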

-## New ExecutionProvider interface GetEpContextNodes() to help generate the EP Context cache model
+## ExecutionProvider interface GetEpContextNodes() to help generate the EP Context cache model
 It is hard for Execution Providers to generate the partitioned graph within the Execution Provider code since an Execution Provider does not have a good picture of the whole partitioned graph. New ExecutionProvider interface GetEpContextNodes() is added to support this.
diff --git a/docs/execution-providers/QNN-ExecutionProvider.md b/docs/execution-providers/QNN-ExecutionProvider.md
index 3b3827c2f2e32..e8dbcaf747d51 100644
--- a/docs/execution-providers/QNN-ExecutionProvider.md
+++ b/docs/execution-providers/QNN-ExecutionProvider.md
@@ -411,7 +411,7 @@ options.add_session_config_entry("ep.context_enable", "1")
 ```
 ### Configure the context binary file path
-The generated Onnx model with QNN context binary is default to [input_QDQ_model_path]_ctx.onnx in case user does not specify the path. User can to set the path in the session option with the key "ep.context_file_path". Example code below:
+The generated Onnx model with QNN context binary defaults to [input_QDQ_model_name]_ctx.onnx if the user does not specify the path. The user can set the path in the session option with the key "ep.context_file_path". Example code below:
 ```
 // C++
@@ -427,7 +427,7 @@ options.add_session_config_entry("ep.context_file_path", "./model_a_ctx.onnx")
 ```
 ### Enable the embed mode
-The QNN context binary content is not embedded in the generated Onnx model by default. A bin file will be generated separately. The file name looks like [ctx.onnx]_QNNExecutionProvider_QNN_[hash_id]_x_x.bin. The name is provided by Ort and tracked in the generated Onnx model. It will cause problems if any changes are made to the bin file. This bin file needs to sit together with the generated Onnx file. User can enable it by setting "ep.context_embed_mode" to "1". In that case the content of the context binary is embedded inside the Onnx model.
+The QNN context binary content is not embedded in the generated Onnx model by default. A bin file is generated separately, with a name like [input_model_file_name]_QNNExecutionProvider_QNN_[hash_id]_x_x.bin. The name is provided by Ort and tracked in the generated Onnx model; any changes made to the bin file will cause problems. This bin file needs to sit together with the generated Onnx file. The user can enable embedding by setting "ep.context_embed_mode" to "1", in which case the content of the context binary is embedded inside the Onnx model.
 ```
 // C++
@@ -462,24 +462,24 @@ The way OnnxRuntime to convert Onnx model with weight sharing to QNN context bin
 OnnxRuntime QNN EP provides [OnnxRuntime_qnn_ctx_gen](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/qnn_ctx_gen) tool to complete these steps. Example command line:
 ```
-./onnxruntime_qnn_ctx_gen -i "soc_model|60 htp_graph_finalization_optimization_mode|3" ./model1.onnx,./model2.onnx
+./ep_weight_sharing_ctx_gen -e qnn -i "soc_model|60 htp_graph_finalization_optimization_mode|3" ./model1.onnx,./model2.onnx
 ```
-It creates 2 Onnx model (model1.onnx_ctx.onnx, model2.onnx_ctx.onnx) and a QNN context binary file (model2.onnx_ctx.onnx_xxx.bin).
+It creates 2 Onnx models (model1_ctx.onnx, model2_ctx.onnx) and a QNN context binary file (model2_xxx.bin).

Weight sharing from Onnx to QNN

 If user creates the QNN context binary .bin file weight sharing from QNN toolchain (qnn-context-binary-generator). The context binary .bin file looks the same. User needs to create model1.onnx and model2.onnx with EPContext node which points to this .bin file. Each EPContext node should refer (node name and partition_name) to different Qnn graph names from the QNN context. Here’s an example script for reference [gen_qnn_ctx_onnx_model.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py) which wraps one single QNN graph into EPContext node.
 ### Inference with QNN resource sharing workflow
 OnnxRuntime inference session need to have resource sharing enabled (set session option ep.share_ep_contexts to 1) to use the dumped Qnn context model with weight sharing enabled.
-- Create OnnxRuntime inference session with ep.share_ep_contexts=1, loads the model1.onnx_ctx.onnx model.
-  - The session loads the model1.onnx_ctx.onnx model.
+- Create an OnnxRuntime inference session with ep.share_ep_contexts=1 that loads the model1_ctx.onnx model.
+  - The session loads the model1_ctx.onnx model.
   - The shared place is empty.
-  - EPContext node1 in model1.onnx_ctx.onnx specifies that it uses Qnn_graph1
+  - EPContext node1 in model1_ctx.onnx specifies that it uses Qnn_graph1.
   - QNN EP loads the qnn_ctx.bin and deserialize the binary to get Qnn graphs (Qnn_graph1, Qnn_graph2).
   - Uses Qnn_graph1 for this OnnxRuntime session.
   - Put the Qnn_graph2 into the shared place.
-- Create OnnxRuntime inference session with ep.share_ep_contexts=1, loads the model2.onnx_ctx.onnx model.
-  - The session loads the model2.onnx_ctx.onnx model.
-  - The EPContext node2 in model2.onnx_ctx.onnx specifies that it uses Qnn_graph2.
+- Create an OnnxRuntime inference session with ep.share_ep_contexts=1 that loads the model2_ctx.onnx model.
+  - The session loads the model2_ctx.onnx model.
+  - The EPContext node2 in model2_ctx.onnx specifies that it uses Qnn_graph2.
   - The shared place has Qnn_graph2.
   - QNN EP skips loading qnn_ctx.bin since it gets what it wants from the shared place.
   - Uses Qnn_graph2 from the shared place for this session.
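+
+A minimal Python sketch of the two-session flow above; the file names and the QNN backend_path provider option are illustrative assumptions:
+
+```
+# Python
+import onnxruntime as ort
+
+qnn = [("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"})]
+
+so = ort.SessionOptions()
+# Enable resource sharing across sessions.
+so.add_session_config_entry("ep.share_ep_contexts", "1")
+
+# Session 1 deserializes the shared .bin, uses Qnn_graph1 for itself, and
+# parks Qnn_graph2 in the shared place.
+session1 = ort.InferenceSession("./model1_ctx.onnx", sess_options=so, providers=qnn)
+
+# Session 2 takes Qnn_graph2 from the shared place and skips loading the .bin.
+session2 = ort.InferenceSession("./model2_ctx.onnx", sess_options=so, providers=qnn)
+```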