
[Feat] Support multi-device inference and OCR batch inference #3923

Open · wants to merge 49 commits into base `develop`
Commits (49)
1b6d11e
feat/parallel_computing
Bobholamovic Mar 20, 2025
ce01e2f
Rename functions and parallelize read operations
Bobholamovic Mar 20, 2025
2bdae90
Add full support for predictor processors
Bobholamovic Mar 20, 2025
6243f45
Merge branch 'develop' into feat/parallel_computing
Bobholamovic Mar 21, 2025
aa16d50
Optimize for OCR
Bobholamovic Mar 21, 2025
346cf5b
Merge remote-tracking branch 'official/develop' into feat/parallel_co…
Bobholamovic Mar 26, 2025
1dc9d5e
Remove optimization for rotate_image
Bobholamovic Mar 26, 2025
b5a8ecd
Fix bug
Bobholamovic Mar 26, 2025
c43df36
Merge remote-tracking branch 'official/develop' into feat/parallel_co…
Bobholamovic Mar 31, 2025
caf91c1
Merge remote-tracking branch 'official/develop' into feat/parallel_co…
Bobholamovic Mar 31, 2025
551e289
Merge branch 'feat/parallel_computing' of https://github.com/Bobholam…
Bobholamovic Mar 31, 2025
90c5170
Merge branch 'develop' into feat/parallel_computing
Bobholamovic Apr 8, 2025
a524e6a
Support multi-device inference and pipeline batch inference
Bobholamovic Apr 28, 2025
fdc923b
Fix hpip bug
Bobholamovic Apr 28, 2025
6e7e7fb
Fix hpip bug
Bobholamovic Apr 28, 2025
0e6382c
Fix hpip bug
Bobholamovic Apr 28, 2025
53f7d82
warning->info
Bobholamovic Apr 28, 2025
6d96879
Fix table recognition v2 bugs
Bobholamovic Apr 29, 2025
6698697
Merge branch 'feat/parallel_computing' into feat/optimize_ppstructurev3
Bobholamovic Apr 29, 2025
a40876c
Support seal recognition and PP-StructureV3
Bobholamovic Apr 29, 2025
983d87f
Update doc
Bobholamovic Apr 30, 2025
0b4621e
Fix
Bobholamovic Apr 30, 2025
d400339
PaddlePredictorOption supports copy
Bobholamovic Apr 30, 2025
c458d94
Fix OCR bug
Bobholamovic Apr 30, 2025
5b35046
No parallel if iterable has only one element
Bobholamovic Apr 30, 2025
3e9736a
Add quick return for text det resize
Bobholamovic Apr 30, 2025
c266c60
Fix xycuts bug
Bobholamovic Apr 30, 2025
9401d6d
Merge branch 'feat/optimize_ppstructurev3' of https://github.com/Bobh…
Bobholamovic Apr 30, 2025
25ab205
Revert "Merge branch 'feat/parallel_computing' into feat/optimize_pps…
Bobholamovic May 1, 2025
c7fa3dc
Fix table bug
Bobholamovic May 1, 2025
3bac9ea
Cancel execution permission for regular files in ultra-infer
Bobholamovic May 6, 2025
4f7d590
More pipelines support multi-gpu inference
Bobholamovic May 6, 2025
cbccae0
PP-StructureV3 by default uses bs8
Bobholamovic May 6, 2025
ba9ffe8
Reset ultra-infer
Bobholamovic May 6, 2025
a91c416
Add parallel inference docs
Bobholamovic May 6, 2025
6e0db1e
Add parallel inference doc to mkdocs.yml
Bobholamovic May 6, 2025
95b433a
Update doc
Bobholamovic May 6, 2025
dfc8833
Update opset_version description
Bobholamovic May 6, 2025
8c3bfb1
Fix serving docs
Bobholamovic May 7, 2025
b4d8ee9
PP-FormulaNet-L supports paddle_fp16
Bobholamovic May 8, 2025
40479d2
/workspace -> /app
Bobholamovic May 8, 2025
7bced71
Unset paddle_fp16
Bobholamovic May 8, 2025
4b8a6c2
Add FAQ in hpi doc
Bobholamovic May 9, 2025
3b488d7
Fix docs
Bobholamovic May 9, 2025
40e684c
Update docs
Bobholamovic May 9, 2025
0ba3194
Merge branch 'develop' into feat/optimize_ppstructurev3
Bobholamovic May 12, 2025
76d94a7
Update PP-StructureV3 interface
Bobholamovic May 12, 2025
10eb0aa
Add note on edge deployment
Bobholamovic May 12, 2025
0fc4f9a
Polish docs
Bobholamovic May 12, 2025
Files changed
4 changes: 2 additions & 2 deletions docs/installation/paddlepaddle_install.en.md
@@ -46,7 +46,7 @@ nvidia-docker run --name paddlex -v $PWD:/paddle --shm-size=8G --network=host -
To use [Paddle Inference TensorRT Subgraph Engine](https://www.paddlepaddle.org.cn/documentation/docs/en/install/pip/linux-pip_en.html#gpu), install TensorRT by executing the following instructions in the 'paddlex' container that has just been started

```bash
python -m pip install /usr/local/TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
python -m pip install /usr/local/TensorRT-*/python/tensorrt-*-cp310-none-linux_x86_64.whl
```

## Installing PaddlePaddle via pip
@@ -94,7 +94,7 @@ tar xvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
# Install TensorRT wheel package
python -m pip install TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
# Add the absolute path of TensorRT's `lib` directory to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:TensorRT-8.6.1.6/lib
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:TensorRT-8.6.1.6/lib"
```

> ❗ <b>Note</b>: If you encounter any issues during the installation process, feel free to [submit an issue](https://github.com/PaddlePaddle/Paddle/issues) in the Paddle repository.
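
Illustrative aside (not part of this PR's diff): after installing the wheel and updating `LD_LIBRARY_PATH` as above, a minimal Python sanity check might look like this.

```python
# Illustrative sanity check (not part of this diff): confirm that the TensorRT
# wheel installed above is importable and that the native libraries can be found.
import tensorrt as trt

print("TensorRT version:", trt.__version__)

# Creating a Builder forces the native TensorRT runtime to load; a failure here
# usually means LD_LIBRARY_PATH does not include the TensorRT `lib` directory.
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
print("TensorRT builder created:", builder is not None)
```
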
4 changes: 2 additions & 2 deletions docs/installation/paddlepaddle_install.md
@@ -47,7 +47,7 @@ nvidia-docker run --name paddlex -v $PWD:/paddle --shm-size=8G --network=host -i
在刚刚启动的 `paddlex` 容器中执行下面指令安装 TensorRT,即可使用 [Paddle Inference TensorRT 子图引擎](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/paddle_trt_cn.html):

```bash
python -m pip install /usr/local/TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
python -m pip install /usr/local/TensorRT-*/python/tensorrt-*-cp310-none-linux_x86_64.whl
```

## 基于 pip 安装飞桨
@@ -94,7 +94,7 @@ tar xvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
# 安装 TensorRT wheel 包
python -m pip install TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
# 添加 TensorRT 的 `lib` 目录的绝对路径到 LD_LIBRARY_PATH 中
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:TensorRT-8.6.1.6/lib
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:TensorRT-8.6.1.6/lib"
```

> ❗ <b>注</b>:如果在安装的过程中,出现任何问题,欢迎在Paddle仓库中[提Issue](https://github.com/PaddlePaddle/Paddle/issues)。
1 change: 0 additions & 1 deletion docs/module_usage/instructions/model_python_API.en.md
@@ -39,7 +39,6 @@ In short, just three steps:
* `use_hpip`:`bool` type, whether to enable the high-performance inference plugin;
* `hpi_config`:`dict | None` type, high-performance inference configuration;
* _`inference hyperparameters`_: used to set common inference hyperparameters. Please refer to specific model description document for details.
* Return Value: `BasePredictor` type.
Member Author (review comment): The actual return type is not `BasePredictor` but an internal class whose name starts with a single underscore, so this incorrect note is removed here.


### 2. Perform Inference Prediction by Calling the `predict()` Method of the Prediction Model Object

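Illustrative aside (not part of this PR's diff): a hedged sketch of the documented flow, with the model name, input path, and `hpi_config` contents assumed for the example.

```python
# Hedged sketch of the flow described above; the model name, input path, and
# hpi_config contents are illustrative assumptions, not values from this PR.
from paddlex import create_model

model = create_model(
    model_name="PP-OCRv4_mobile_det",       # any supported module model
    device="gpu:0",
    use_hpip=True,                          # enable the high-performance inference plugin
    hpi_config={"backend": "onnxruntime"},  # optional fine-grained HPI settings
)

# Treat the returned object as an opaque predictor (per the comment above, it is
# not literally `BasePredictor`); simply iterate over the predict() results.
for res in model.predict("test_image.jpg", batch_size=1):
    res.print()
    res.save_to_json("./output/")
```
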
1 change: 0 additions & 1 deletion docs/module_usage/instructions/model_python_API.md
@@ -40,7 +40,6 @@ for res in output:
* `use_hpip`:`bool` 类型,是否启用高性能推理插件;
* `hpi_config`:`dict | None` 类型,高性能推理配置;
* _`推理超参数`_:支持常见推理超参数的修改,具体参数说明详见具体模型文档;
* 返回值:`BasePredictor` 类型。

### 2. 调用预测模型对象的`predict()`方法进行推理预测

3 changes: 2 additions & 1 deletion docs/pipeline_deploy/edge_deploy.en.md
@@ -190,7 +190,7 @@ This guide applies to 8 models across 6 modules:
<b>Note</b>:
- `{Pipeline_Name}` and `{Demo_Name}` are placeholders. Refer to the table at the end of this section for specific values.
- `download.sh` and `run.sh` support passing in model names to specify models. If not specified, the default model will be used. Refer to the `Model_Name` column in the table at the end of this section for currently supported models.
- To use your own trained model, refer to the [Model Conversion Method](https://paddlepaddle.github.io/Paddle-Lite/develop/model_optimize_tool/) to obtain the `.nb` model, place it in the `PaddleX_Lite_Deploy/{Pipeline_Name}/assets/{Model_Name}` directory, where `{Model_Name}` is the model name, e.g., `PaddleX_Lite_Deploy/object_detection/assets/PicoDet-L`.
- To use your own trained model, refer to the [Model Conversion Method](https://paddlepaddle.github.io/Paddle-Lite/develop/model_optimize_tool/) to obtain the `.nb` model, place it in the `PaddleX_Lite_Deploy/{Pipeline_Name}/assets/{Model_Name}` directory, where `{Model_Name}` is the model name, e.g., `PaddleX_Lite_Deploy/object_detection/assets/PicoDet-L`. Please note that converting static graph models in `.json` format to `.nb` format is currently not supported. When exporting a static graph model using PaddleX, please set the environment variable `FLAGS_json_format_model` to `0`.
- Before running the `build.sh` script, change the path specified by `NDK_ROOT` to the actual installed NDK path.
- Keep ADB connected when running the `build.sh` script.
- On Windows systems, you can use Git Bash to execute the deployment steps.
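
Illustrative aside (not part of this PR's diff): a minimal sketch of the `FLAGS_json_format_model` note above; the actual export command is not shown.

```python
# Minimal sketch, assuming the goal is to export the static graph in the legacy
# .pdmodel format (rather than .json) so Paddle-Lite's opt tool can convert it
# to .nb. The export command itself is whatever you normally use.
import os

os.environ["FLAGS_json_format_model"] = "0"
# ...run the usual PaddleX static-graph export in this environment afterwards...
```
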
@@ -305,6 +305,7 @@ This section describes the deployment steps applicable to the demos listed in th
</table>

<b>Note</b>

- Currently, there is no demo for deploying the Layout Area Detection module on the edge, so the `picodet_detection` demo is reused to deploy the `PicoDet_layout_1x` model.

## Reference Materials
5 changes: 3 additions & 2 deletions docs/pipeline_deploy/edge_deploy.md
@@ -190,7 +190,7 @@ comments: true
<b>注意:</b>
- `Pipeline_Name` 和 `Demo_Name` 为占位符,具体值可参考本节最后的表格。
- `download.sh` 和 `run.sh` 支持传入模型名来指定模型,若不指定则使用默认模型。目前适配的模型可参考本节最后表格的 `Model_Name` 列。
- 若想使用自己训练的模型,参考 [模型转换方法](https://paddlepaddle.github.io/Paddle-Lite/develop/model_optimize_tool/) 得到 `.nb` 模型,放到`PaddleX_Lite_Deploy/{Pipeline_Name}/assets/{Model_Name}`目录下, `Model_Name`为模型名,例如 `PaddleX_Lite_Deploy/object_detection/assets/PicoDet-L`。
- 若想使用自己训练的模型,参考 [模型转换方法](https://paddlepaddle.github.io/Paddle-Lite/develop/model_optimize_tool/) 得到 `.nb` 模型,放到`PaddleX_Lite_Deploy/{Pipeline_Name}/assets/{Model_Name}`目录下, `Model_Name`为模型名,例如 `PaddleX_Lite_Deploy/object_detection/assets/PicoDet-L`。请注意,目前暂不支持将 `.json` 格式的静态图模型转换为 `.nb` 格式。在使用 PaddleX 导出静态图模型时,请设置环境变量 `FLAGS_json_format_model` 为 `0`。
- 在运行 `build.sh` 脚本前,需要更改 `NDK_ROOT` 指定的路径为实际安装的 NDK 路径。
- 在运行 `build.sh` 脚本时需保持 ADB 连接。
- 在 Windows 系统上可以使用 Git Bash 执行部署步骤。
@@ -307,7 +307,8 @@ detection, image size: 768, 576, detect object: dog, score: 0.731584, location:
</table>

<b>备注</b>
- 目前没有版面区域检测模块的端侧部署 demo,因此复用 `picodet_detection`demo 来部署`PicoDet_layout_1x`模型。

- 目前没有版面区域检测模块的端侧部署 demo,因此复用 `picodet_detection` demo 来部署 `PicoDet_layout_1x` 模型。

## 参考资料

18 changes: 14 additions & 4 deletions docs/pipeline_deploy/high_performance_inference.en.md
@@ -4,7 +4,7 @@ comments: true

# PaddleX High-Performance Inference Guide

In real production environments, many applications impose strict performance metrics—especially in response time—on deployment strategies to ensure system efficiency and a smooth user experience. To address this, PaddleX offers a high-performance inference plugin that, through automatic configuration and multi-backend inference capabilities, enables users to significantly accelerate model inference without concerning themselves with complex configurations and low-level details.
In real production environments, many applications impose strict performance metrics—especially in response time—on deployment strategies to ensure system efficiency and a smooth user experience. To address this, PaddleX offers a high-performance inference plugin that, through automatic configuration and multi-backend inference capabilities, enables users to significantly accelerate model inference without concerning themselves with complex configurations and low-level details. In addition to supporting inference acceleration on pipelines, the PaddleX high-performance inference plugin can also be used to accelerate inference when modules are used standalone.
Member Author (review comment): In general, the various deployment-related concepts apply to pipelines, but high-performance inference is special in that it also applies to individual modules, so that is called out explicitly here.


## Table of Contents

@@ -24,7 +24,7 @@ In real production environments, many applications impose strict performance met

Before using the high-performance inference plugin, please ensure that you have completed the PaddleX installation according to the [PaddleX Local Installation Tutorial](../installation/installation.en.md) and have run the quick inference using the PaddleX pipeline command line or the PaddleX pipeline Python script as described in the usage instructions.

The high-performance inference plugin supports handling multiple model formats, including **PaddlePaddle static graph (`.pdmodel`, `.json`)**, **ONNX (`.onnx`)** and **Huawei OM (`.om`)**, among others. For ONNX models, you can convert them using the [Paddle2ONNX Plugin](./paddle2onnx.en.md). If multiple model formats are present in the model directory, PaddleX will automatically choose the appropriate one as needed, and automatic model conversion may be performed. **It is recommended to install the Paddle2ONNX plugin first before installing the high-performance inference plugin, so that PaddleX can convert model formats when needed.**
The high-performance inference plugin supports handling multiple model formats, including **PaddlePaddle static graph (`.pdmodel`, `.json`)**, **ONNX (`.onnx`)** and **Huawei OM (`.om`)**, among others. For ONNX models, you can convert them using the [Paddle2ONNX Plugin](./paddle2onnx.en.md). If multiple model formats are present in the model directory, PaddleX will automatically choose the appropriate one as needed, and automatic model conversion may be performed.

### 1.1 Installing the High-Performance Inference Plugin

@@ -86,12 +86,14 @@ Refer to [Get PaddleX based on Docker](../installation/installation.en.md#21-obt
</tbody>
</table>

In the official PaddleX Docker image, TensorRT is installed by default. The high-performance inference plugin can then accelerate inference using the Paddle Inference TensorRT subgraph engine.
The official PaddleX Docker images come with the Paddle2ONNX plugin pre-installed, allowing PaddleX to convert model formats on demand. In addition, the GPU version of the image includes TensorRT, so the high-performance inference plugin can leverage the Paddle Inference TensorRT subgraph engine for accelerated inference.

**Please note that the aforementioned Docker image refers to the official PaddleX image described in [Get PaddleX via Docker](../installation/installation.en.md#21-get-paddlex-based-on-docker), rather than the PaddlePaddle official image described in [PaddlePaddle Local Installation Tutorial](../installation/paddlepaddle_install.en.md#installing-paddlepaddle-via-docker). For the latter, please refer to the local installation instructions for the high-performance inference plugin.**

#### 1.1.2 Installing the High-Performance Inference Plugin Locally

**It is recommended to install the Paddle2ONNX plugin first before installing the high-performance inference plugin, so that PaddleX can convert model formats when needed.**

**To install the CPU version of the high-performance inference plugin:**

Run:
@@ -322,7 +324,7 @@ The available configuration items for `backend_config` vary for different backen

### 2.3 Modifying the High-Performance Inference Configuration

Due to the diversity of actual deployment environments and requirements, the default configuration might not meet all needs. In such cases, manual adjustment of the high-performance inference configuration may be necessary. Users can modify the configuration by editing the **pipeline/module configuration file** or by passing the `hpi_config` field in the parameters via **CLI** or **Python API**. **Parameters passed via CLI or Python API will override the settings in the pipeline/module configuration file.** Different levels of configurations in the config file are automatically merged, and the deepest-level settings take the highest priority. The following examples illustrate how to modify the configuration.
When the model is initialized, the log will, by default, record the high-performance inference configuration that is about to be used. Due to the diversity of actual deployment environments and requirements, the default configuration might not meet all needs. In such cases, manual adjustment of the high-performance inference configuration may be necessary. Users can modify the configuration by editing the pipeline/module configuration file or by passing the `hpi_config` field in the parameters via CLI or Python API. Parameters passed via CLI or Python API will override the settings in the pipeline/module configuration file. Different levels of configurations in the config file are automatically merged, and the deepest-level settings take the highest priority. The following examples illustrate how to modify the configuration.

**For the general OCR pipeline, use the `onnxruntime` backend for all models:**
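
Illustrative aside (not part of this PR's diff): the documentation's own example for this case is collapsed here, so the sketch below is a stand-in showing how the setting might be passed through the Python API; the pipeline name and keys are assumptions.

```python
# Illustrative stand-in for the collapsed example: request the onnxruntime
# backend for the general OCR pipeline via hpi_config. The keys and values here
# are assumptions, not copied from the documentation.
from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="OCR",
    device="gpu:0",
    use_hpip=True,
    hpi_config={"backend": "onnxruntime"},  # overrides the pipeline config file
)

for res in pipeline.predict("general_ocr_002.png"):
    res.save_to_json("./output/")
```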

@@ -566,3 +568,11 @@ For the GPU version of the high-performance inference plugin, the official Paddl
**4. Why does the program freeze during runtime or display some "WARNING" and "ERROR" messages after using the high-performance inference feature? What should be done in such cases?**

When initializing the model, operations such as subgraph optimization may take longer and may generate some "WARNING" and "ERROR" messages. However, as long as the program does not exit automatically, it is recommended to wait patiently, as the program usually continues to run to completion.

**5. When using GPU for inference, enabling the high-performance inference plugin increases memory usage and causes OOM. How can this be resolved?**

Some acceleration methods trade off memory usage to support a broader range of inference scenarios. If memory becomes a bottleneck, consider the following optimization strategies:

* **Adjust pipeline configurations**: Disable unnecessary features to avoid loading redundant models. Appropriately reduce the batch size based on business requirements to balance throughput and memory usage.
* **Switch inference backends**: Different inference backends have varying memory management strategies. Try benchmarking various backends to compare memory usage and performance.
* **Optimize dynamic shape configurations**: For modules using TensorRT or Paddle Inference TensorRT subgraph engine, narrow the dynamic shape range based on the actual distribution of input data.
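
Illustrative aside (not part of this PR's diff): a hedged sketch of the last strategy; the exact `backend_config` keys depend on the chosen backend and are assumed here.

```python
# Hedged sketch of narrowing TensorRT dynamic-shape ranges through hpi_config.
# The backend_config structure and key names below are assumptions; consult the
# backend-specific configuration table in this document for authoritative names.
from paddlex import create_model

model = create_model(
    model_name="PP-OCRv4_mobile_det",
    device="gpu:0",
    use_hpip=True,
    hpi_config={
        "backend": "tensorrt",
        "backend_config": {
            # min / opt / max shapes matched to what the workload actually sees,
            # so TensorRT reserves less memory than a very wide range would.
            "dynamic_shapes": {
                "x": [[1, 3, 640, 640], [1, 3, 960, 960], [2, 3, 1280, 1280]]
            },
        },
    },
)
```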