
[Feat] Support multi-device inference and OCR batch inference #3923

Open · wants to merge 49 commits into base `develop`
Commits (49)
1b6d11e
feat/parallel_computing
Bobholamovic Mar 20, 2025
ce01e2f
Rename functions and parallelize read operations
Bobholamovic Mar 20, 2025
2bdae90
Add full support for predictor processors
Bobholamovic Mar 20, 2025
6243f45
Merge branch 'develop' into feat/parallel_computing
Bobholamovic Mar 21, 2025
aa16d50
Optimize for OCR
Bobholamovic Mar 21, 2025
346cf5b
Merge remote-tracking branch 'official/develop' into feat/parallel_co…
Bobholamovic Mar 26, 2025
1dc9d5e
Remove optimization for rotate_image
Bobholamovic Mar 26, 2025
b5a8ecd
Fix bug
Bobholamovic Mar 26, 2025
c43df36
Merge remote-tracking branch 'official/develop' into feat/parallel_co…
Bobholamovic Mar 31, 2025
caf91c1
Merge remote-tracking branch 'official/develop' into feat/parallel_co…
Bobholamovic Mar 31, 2025
551e289
Merge branch 'feat/parallel_computing' of https://github.com/Bobholam…
Bobholamovic Mar 31, 2025
90c5170
Merge branch 'develop' into feat/parallel_computing
Bobholamovic Apr 8, 2025
a524e6a
Support multi-device inference and pipeline batch inference
Bobholamovic Apr 28, 2025
fdc923b
Fix hpip bug
Bobholamovic Apr 28, 2025
6e7e7fb
Fix hpip bug
Bobholamovic Apr 28, 2025
0e6382c
Fix hpip bug
Bobholamovic Apr 28, 2025
53f7d82
warning->info
Bobholamovic Apr 28, 2025
6d96879
Fix table recognition v2 bugs
Bobholamovic Apr 29, 2025
6698697
Merge branch 'feat/parallel_computing' into feat/optimize_ppstructurev3
Bobholamovic Apr 29, 2025
a40876c
Support seal recognition and PP-StructureV3
Bobholamovic Apr 29, 2025
983d87f
Update doc
Bobholamovic Apr 30, 2025
0b4621e
Fix
Bobholamovic Apr 30, 2025
d400339
PaddlePredictorOption supports copy
Bobholamovic Apr 30, 2025
c458d94
Fix OCR bug
Bobholamovic Apr 30, 2025
5b35046
No parallel if iterable has only one element
Bobholamovic Apr 30, 2025
3e9736a
Add quick return for text det resize
Bobholamovic Apr 30, 2025
c266c60
Fix xycuts bug
Bobholamovic Apr 30, 2025
9401d6d
Merge branch 'feat/optimize_ppstructurev3' of https://github.com/Bobh…
Bobholamovic Apr 30, 2025
25ab205
Revert "Merge branch 'feat/parallel_computing' into feat/optimize_pps…
Bobholamovic May 1, 2025
c7fa3dc
Fix table bug
Bobholamovic May 1, 2025
3bac9ea
Cancel execution permission for regular files in ultra-infer
Bobholamovic May 6, 2025
4f7d590
More pipelines support multi-gpu inference
Bobholamovic May 6, 2025
cbccae0
PP-StructureV3 by default uses bs8
Bobholamovic May 6, 2025
ba9ffe8
Reset ultra-infer
Bobholamovic May 6, 2025
a91c416
Add parallel inference docs
Bobholamovic May 6, 2025
6e0db1e
Add parallel inference doc to mkdocs.yml
Bobholamovic May 6, 2025
95b433a
Update doc
Bobholamovic May 6, 2025
dfc8833
Update opset_version description
Bobholamovic May 6, 2025
8c3bfb1
Fix serving docs
Bobholamovic May 7, 2025
b4d8ee9
PP-FormulaNet-L supports paddle_fp16
Bobholamovic May 8, 2025
40479d2
/workspace -> /app
Bobholamovic May 8, 2025
7bced71
Unset paddle_fp16
Bobholamovic May 8, 2025
4b8a6c2
Add FAQ in hpi doc
Bobholamovic May 9, 2025
3b488d7
Fix docs
Bobholamovic May 9, 2025
40e684c
Update docs
Bobholamovic May 9, 2025
0ba3194
Merge branch 'develop' into feat/optimize_ppstructurev3
Bobholamovic May 12, 2025
76d94a7
Update PP-StructureV3 interface
Bobholamovic May 12, 2025
10eb0aa
Add note on edge deployment
Bobholamovic May 12, 2025
0fc4f9a
Polish docs
Bobholamovic May 12, 2025
Files changed
4 changes: 2 additions & 2 deletions docs/installation/paddlepaddle_install.en.md
@@ -46,7 +46,7 @@ nvidia-docker run --name paddlex -v $PWD:/paddle --shm-size=8G --network=host -
To use [Paddle Inference TensorRT Subgraph Engine](https://www.paddlepaddle.org.cn/documentation/docs/en/install/pip/linux-pip_en.html#gpu), install TensorRT by executing the following instructions in the 'paddlex' container that has just been started

```bash
python -m pip install /usr/local/TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
python -m pip install /usr/local/TensorRT-*/python/tensorrt-*-cp310-none-linux_x86_64.whl
```

## Installing PaddlePaddle via pip
@@ -94,7 +94,7 @@ tar xvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
# Install TensorRT wheel package
python -m pip install TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
# Add the absolute path of TensorRT's `lib` directory to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:TensorRT-8.6.1.6/lib
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:TensorRT-8.6.1.6/lib"
```

> ❗ <b>Note</b>: If you encounter any issues during the installation process, feel free to [submit an issue](https://github.com/PaddlePaddle/Paddle/issues) in the Paddle repository.
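
Illustrative aside (not part of this PR's diff): after installing the wheel and updating `LD_LIBRARY_PATH` as above, a minimal Python sanity check might look like this.

```python
# Illustrative sanity check (not part of this diff): confirm that the TensorRT
# wheel installed above is importable and that the native libraries can be found.
import tensorrt as trt

print("TensorRT version:", trt.__version__)

# Creating a Builder forces the native TensorRT runtime to load; a failure here
# usually means LD_LIBRARY_PATH does not include the TensorRT `lib` directory.
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
print("TensorRT builder created:", builder is not None)
```
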
4 changes: 2 additions & 2 deletions docs/installation/paddlepaddle_install.md
@@ -47,7 +47,7 @@ nvidia-docker run --name paddlex -v $PWD:/paddle --shm-size=8G --network=host -i
在刚刚启动的 `paddlex` 容器中执行下面指令安装 TensorRT,即可使用 [Paddle Inference TensorRT 子图引擎](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/paddle_trt_cn.html):

```bash
python -m pip install /usr/local/TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
python -m pip install /usr/local/TensorRT-*/python/tensorrt-*-cp310-none-linux_x86_64.whl
```

## 基于 pip 安装飞桨
@@ -94,7 +94,7 @@ tar xvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
# 安装 TensorRT wheel 包
python -m pip install TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
# 添加 TensorRT 的 `lib` 目录的绝对路径到 LD_LIBRARY_PATH 中
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:TensorRT-8.6.1.6/lib
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:TensorRT-8.6.1.6/lib"
```

> ❗ <b>注</b>:如果在安装的过程中,出现任何问题,欢迎在Paddle仓库中[提Issue](https://github.com/PaddlePaddle/Paddle/issues)。
1 change: 0 additions & 1 deletion docs/module_usage/instructions/model_python_API.en.md
@@ -39,7 +39,6 @@ In short, just three steps:
* `use_hpip`:`bool` type, whether to enable the high-performance inference plugin;
* `hpi_config`:`dict | None` type, high-performance inference configuration;
* _`inference hyperparameters`_: used to set common inference hyperparameters. Please refer to specific model description document for details.
* Return Value: `BasePredictor` type.
Member Author (review comment): The actual return type is not `BasePredictor` but an internal class whose name starts with a single underscore, so this incorrect note is removed here.


### 2. Perform Inference Prediction by Calling the `predict()` Method of the Prediction Model Object

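Illustrative aside (not part of this PR's diff): a hedged sketch of the documented flow, with the model name, input path, and `hpi_config` contents assumed for the example.

```python
# Hedged sketch of the flow described above; the model name, input path, and
# hpi_config contents are illustrative assumptions, not values from this PR.
from paddlex import create_model

model = create_model(
    model_name="PP-OCRv4_mobile_det",       # any supported module model
    device="gpu:0",
    use_hpip=True,                          # enable the high-performance inference plugin
    hpi_config={"backend": "onnxruntime"},  # optional fine-grained HPI settings
)

# Treat the returned object as an opaque predictor (per the comment above, it is
# not literally `BasePredictor`); simply iterate over the predict() results.
for res in model.predict("test_image.jpg", batch_size=1):
    res.print()
    res.save_to_json("./output/")
```
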
1 change: 0 additions & 1 deletion docs/module_usage/instructions/model_python_API.md
@@ -40,7 +40,6 @@ for res in output:
* `use_hpip`:`bool` 类型,是否启用高性能推理插件;
* `hpi_config`:`dict | None` 类型,高性能推理配置;
* _`推理超参数`_:支持常见推理超参数的修改,具体参数说明详见具体模型文档;
* 返回值:`BasePredictor` 类型。

### 2. 调用预测模型对象的`predict()`方法进行推理预测

3 changes: 2 additions & 1 deletion docs/pipeline_deploy/edge_deploy.en.md
@@ -190,7 +190,7 @@ This guide applies to 8 models across 6 modules:
<b>Note</b>:
- `{Pipeline_Name}` and `{Demo_Name}` are placeholders. Refer to the table at the end of this section for specific values.
- `download.sh` and `run.sh` support passing in model names to specify models. If not specified, the default model will be used. Refer to the `Model_Name` column in the table at the end of this section for currently supported models.
- To use your own trained model, refer to the [Model Conversion Method](https://paddlepaddle.github.io/Paddle-Lite/develop/model_optimize_tool/) to obtain the `.nb` model, place it in the `PaddleX_Lite_Deploy/{Pipeline_Name}/assets/{Model_Name}` directory, where `{Model_Name}` is the model name, e.g., `PaddleX_Lite_Deploy/object_detection/assets/PicoDet-L`.
- To use your own trained model, refer to the [Model Conversion Method](https://paddlepaddle.github.io/Paddle-Lite/develop/model_optimize_tool/) to obtain the `.nb` model, place it in the `PaddleX_Lite_Deploy/{Pipeline_Name}/assets/{Model_Name}` directory, where `{Model_Name}` is the model name, e.g., `PaddleX_Lite_Deploy/object_detection/assets/PicoDet-L`. Please note that converting static graph models in `.json` format to `.nb` format is currently not supported. When exporting a static graph model using PaddleX, please set the environment variable `FLAGS_json_format_model` to `0`.
- Before running the `build.sh` script, change the path specified by `NDK_ROOT` to the actual installed NDK path.
- Keep ADB connected when running the `build.sh` script.
- On Windows systems, you can use Git Bash to execute the deployment steps.
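
Illustrative aside (not part of this PR's diff): a minimal sketch of the `FLAGS_json_format_model` note above; the actual export command is not shown.

```python
# Minimal sketch, assuming the goal is to export the static graph in the legacy
# .pdmodel format (rather than .json) so Paddle-Lite's opt tool can convert it
# to .nb. The export command itself is whatever you normally use.
import os

os.environ["FLAGS_json_format_model"] = "0"
# ...run the usual PaddleX static-graph export in this environment afterwards...
```
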
@@ -305,6 +305,7 @@ This section describes the deployment steps applicable to the demos listed in th
</table>

<b>Note</b>

- Currently, there is no demo for deploying the Layout Area Detection module on the edge, so the `picodet_detection` demo is reused to deploy the `PicoDet_layout_1x` model.

## Reference Materials
5 changes: 3 additions & 2 deletions docs/pipeline_deploy/edge_deploy.md
@@ -190,7 +190,7 @@ comments: true
<b>注意:</b>
- `Pipeline_Name` 和 `Demo_Name` 为占位符,具体值可参考本节最后的表格。
- `download.sh` 和 `run.sh` 支持传入模型名来指定模型,若不指定则使用默认模型。目前适配的模型可参考本节最后表格的 `Model_Name` 列。
- 若想使用自己训练的模型,参考 [模型转换方法](https://paddlepaddle.github.io/Paddle-Lite/develop/model_optimize_tool/) 得到 `.nb` 模型,放到`PaddleX_Lite_Deploy/{Pipeline_Name}/assets/{Model_Name}`目录下, `Model_Name`为模型名,例如 `PaddleX_Lite_Deploy/object_detection/assets/PicoDet-L`。
- 若想使用自己训练的模型,参考 [模型转换方法](https://paddlepaddle.github.io/Paddle-Lite/develop/model_optimize_tool/) 得到 `.nb` 模型,放到`PaddleX_Lite_Deploy/{Pipeline_Name}/assets/{Model_Name}`目录下, `Model_Name`为模型名,例如 `PaddleX_Lite_Deploy/object_detection/assets/PicoDet-L`。请注意,目前暂不支持将 `.json` 格式的静态图模型转换为 `.nb` 格式。在使用 PaddleX 导出静态图模型时,请设置环境变量 `FLAGS_json_format_model` 为 `0`。
- 在运行 `build.sh` 脚本前,需要更改 `NDK_ROOT` 指定的路径为实际安装的 NDK 路径。
- 在运行 `build.sh` 脚本时需保持 ADB 连接。
- 在 Windows 系统上可以使用 Git Bash 执行部署步骤。
@@ -307,7 +307,8 @@ detection, image size: 768, 576, detect object: dog, score: 0.731584, location:
</table>

<b>备注</b>
- 目前没有版面区域检测模块的端侧部署 demo,因此复用 `picodet_detection`demo 来部署`PicoDet_layout_1x`模型。

- 目前没有版面区域检测模块的端侧部署 demo,因此复用 `picodet_detection` demo 来部署 `PicoDet_layout_1x` 模型。

## 参考资料

18 changes: 14 additions & 4 deletions docs/pipeline_deploy/high_performance_inference.en.md
@@ -4,7 +4,7 @@ comments: true

# PaddleX High-Performance Inference Guide

In real production environments, many applications impose strict performance metrics—especially in response time—on deployment strategies to ensure system efficiency and a smooth user experience. To address this, PaddleX offers a high-performance inference plugin that, through automatic configuration and multi-backend inference capabilities, enables users to significantly accelerate model inference without concerning themselves with complex configurations and low-level details.
In real production environments, many applications impose strict performance metrics—especially in response time—on deployment strategies to ensure system efficiency and a smooth user experience. To address this, PaddleX offers a high-performance inference plugin that, through automatic configuration and multi-backend inference capabilities, enables users to significantly accelerate model inference without concerning themselves with complex configurations and low-level details. In addition to supporting inference acceleration on pipelines, the PaddleX high-performance inference plugin can also be used to accelerate inference when modules are used standalone.
Member Author (review comment): In general, the various deployment-related concepts apply to pipelines, but high-performance inference is special in that it also applies to individual modules, so that is called out explicitly here.


## Table of Contents

@@ -24,7 +24,7 @@ In real production environments, many applications impose strict performance met

Before using the high-performance inference plugin, please ensure that you have completed the PaddleX installation according to the [PaddleX Local Installation Tutorial](../installation/installation.en.md) and have run the quick inference using the PaddleX pipeline command line or the PaddleX pipeline Python script as described in the usage instructions.

The high-performance inference plugin supports handling multiple model formats, including **PaddlePaddle static graph (`.pdmodel`, `.json`)**, **ONNX (`.onnx`)** and **Huawei OM (`.om`)**, among others. For ONNX models, you can convert them using the [Paddle2ONNX Plugin](./paddle2onnx.en.md). If multiple model formats are present in the model directory, PaddleX will automatically choose the appropriate one as needed, and automatic model conversion may be performed. **It is recommended to install the Paddle2ONNX plugin first before installing the high-performance inference plugin, so that PaddleX can convert model formats when needed.**
The high-performance inference plugin supports handling multiple model formats, including **PaddlePaddle static graph (`.pdmodel`, `.json`)**, **ONNX (`.onnx`)** and **Huawei OM (`.om`)**, among others. For ONNX models, you can convert them using the [Paddle2ONNX Plugin](./paddle2onnx.en.md). If multiple model formats are present in the model directory, PaddleX will automatically choose the appropriate one as needed, and automatic model conversion may be performed.

### 1.1 Installing the High-Performance Inference Plugin

@@ -86,12 +86,14 @@ Refer to [Get PaddleX based on Docker](../installation/installation.en.md#21-obt
</tbody>
</table>

In the official PaddleX Docker image, TensorRT is installed by default. The high-performance inference plugin can then accelerate inference using the Paddle Inference TensorRT subgraph engine.
The official PaddleX Docker images come with the Paddle2ONNX plugin pre-installed, allowing PaddleX to convert model formats on demand. In addition, the GPU version of the image includes TensorRT, so the high-performance inference plugin can leverage the Paddle Inference TensorRT subgraph engine for accelerated inference.

**Please note that the aforementioned Docker image refers to the official PaddleX image described in [Get PaddleX via Docker](../installation/installation.en.md#21-get-paddlex-based-on-docker), rather than the PaddlePaddle official image described in [PaddlePaddle Local Installation Tutorial](../installation/paddlepaddle_install.en.md#installing-paddlepaddle-via-docker). For the latter, please refer to the local installation instructions for the high-performance inference plugin.**

#### 1.1.2 Installing the High-Performance Inference Plugin Locally

**It is recommended to install the Paddle2ONNX plugin first before installing the high-performance inference plugin, so that PaddleX can convert model formats when needed.**

**To install the CPU version of the high-performance inference plugin:**

Run:
@@ -322,7 +324,7 @@ The available configuration items for `backend_config` vary for different backen

### 2.3 Modifying the High-Performance Inference Configuration

Due to the diversity of actual deployment environments and requirements, the default configuration might not meet all needs. In such cases, manual adjustment of the high-performance inference configuration may be necessary. Users can modify the configuration by editing the **pipeline/module configuration file** or by passing the `hpi_config` field in the parameters via **CLI** or **Python API**. **Parameters passed via CLI or Python API will override the settings in the pipeline/module configuration file.** Different levels of configurations in the config file are automatically merged, and the deepest-level settings take the highest priority. The following examples illustrate how to modify the configuration.
When the model is initialized, the log will, by default, record the high-performance inference configuration that is about to be used. Due to the diversity of actual deployment environments and requirements, the default configuration might not meet all needs. In such cases, manual adjustment of the high-performance inference configuration may be necessary. Users can modify the configuration by editing the pipeline/module configuration file or by passing the `hpi_config` field in the parameters via CLI or Python API. Parameters passed via CLI or Python API will override the settings in the pipeline/module configuration file. Different levels of configurations in the config file are automatically merged, and the deepest-level settings take the highest priority. The following examples illustrate how to modify the configuration.

**For the general OCR pipeline, use the `onnxruntime` backend for all models:**
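
Illustrative aside (not part of this PR's diff): the documentation's own example for this case is collapsed here, so the sketch below is a stand-in showing how the setting might be passed through the Python API; the pipeline name and keys are assumptions.

```python
# Illustrative stand-in for the collapsed example: request the onnxruntime
# backend for the general OCR pipeline via hpi_config. The keys and values here
# are assumptions, not copied from the documentation.
from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="OCR",
    device="gpu:0",
    use_hpip=True,
    hpi_config={"backend": "onnxruntime"},  # overrides the pipeline config file
)

for res in pipeline.predict("general_ocr_002.png"):
    res.save_to_json("./output/")
```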

@@ -566,3 +568,11 @@ For the GPU version of the high-performance inference plugin, the official Paddl
**4. Why does the program freeze during runtime or display some "WARNING" and "ERROR" messages after using the high-performance inference feature? What should be done in such cases?**

When initializing the model, operations such as subgraph optimization may take longer and may generate some "WARNING" and "ERROR" messages. However, as long as the program does not exit automatically, it is recommended to wait patiently, as the program usually continues to run to completion.

**5. When using GPU for inference, enabling the high-performance inference plugin increases memory usage and causes OOM. How can this be resolved?**

Some acceleration methods trade off memory usage to support a broader range of inference scenarios. If memory becomes a bottleneck, consider the following optimization strategies:

* **Adjust pipeline configurations**: Disable unnecessary features to avoid loading redundant models. Appropriately reduce the batch size based on business requirements to balance throughput and memory usage.
* **Switch inference backends**: Different inference backends have varying memory management strategies. Try benchmarking various backends to compare memory usage and performance.
* **Optimize dynamic shape configurations**: For modules using TensorRT or Paddle Inference TensorRT subgraph engine, narrow the dynamic shape range based on the actual distribution of input data.
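
Illustrative aside (not part of this PR's diff): a hedged sketch of the last strategy; the exact `backend_config` keys depend on the chosen backend and are assumed here.

```python
# Hedged sketch of narrowing TensorRT dynamic-shape ranges through hpi_config.
# The backend_config structure and key names below are assumptions; consult the
# backend-specific configuration table in this document for authoritative names.
from paddlex import create_model

model = create_model(
    model_name="PP-OCRv4_mobile_det",
    device="gpu:0",
    use_hpip=True,
    hpi_config={
        "backend": "tensorrt",
        "backend_config": {
            # min / opt / max shapes matched to what the workload actually sees,
            # so TensorRT reserves less memory than a very wide range would.
            "dynamic_shapes": {
                "x": [[1, 3, 640, 640], [1, 3, 960, 960], [2, 3, 1280, 1280]]
            },
        },
    },
)
```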