Closed as not planned
Labels
performance (issues related to performance regressions)
Description
Describe the issue
I am running a Swin Transformer backbone with the onnxruntime Python API. Inference latency is normal when using the sequential execution mode. After I change the execution mode to ORT_PARALLEL, inference becomes much slower than before.
The profiling shows that no operations are actually executed in parallel. Instead, operations are spread across different threads and a large amount of idle time is inserted between them.

Does anyone know what is causing this problem?
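For reference, these are the session options involved when switching between the two modes; a minimal sketch with illustrative thread counts that are not set in the actual repro script below:

import onnxruntime as ort

sess_options = ort.SessionOptions()
# ORT_PARALLEL lets independent graph nodes run concurrently on the inter-op thread pool
sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
# Illustrative values only; the repro script below leaves both at their defaults
sess_options.inter_op_num_threads = 4  # threads that run independent nodes in parallel
sess_options.intra_op_num_threads = 4  # threads used inside a single operator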
To reproduce
import time

import onnxruntime as ort
from mmengine.config import Config
from mmengine.runner import Runner

model_path = "backbone.onnx"
sess_options = ort.SessionOptions()
sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
session = ort.InferenceSession(model_path, sess_options, providers=["CUDAExecutionProvider"])

cfg = Config.fromfile("maskrcnn.py")
runner = Runner.from_cfg(cfg)

output_name = session.get_outputs()[0].name
data_iter = iter(runner.val_dataloader)
latencies = []
for i in range(200):
    print(i, 200)
    batch = next(data_iter)
    data = runner.model.data_preprocessor(batch, False)
    start = time.perf_counter()
    outputs = session.run([output_name], {'input': data['inputs'].cpu().numpy()})
    latencies.append(time.perf_counter() - start)  # per-inference latency in seconds
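For completeness, here is a minimal sketch (not part of the original run) that times both execution modes on the same model and saves the ONNX Runtime profile the observation above is based on; the dummy input shape is a placeholder for illustration only:

import time
import numpy as np
import onnxruntime as ort

def time_with_mode(model_path, mode, dummy_input, runs=50):
    opts = ort.SessionOptions()
    opts.execution_mode = mode
    opts.enable_profiling = True  # writes a Chrome-trace JSON with per-op timing and thread ids
    sess = ort.InferenceSession(model_path, opts, providers=["CUDAExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    output_name = sess.get_outputs()[0].name
    sess.run([output_name], {input_name: dummy_input})  # warm-up so CUDA init does not skew timing
    start = time.perf_counter()
    for _ in range(runs):
        sess.run([output_name], {input_name: dummy_input})
    avg = (time.perf_counter() - start) / runs
    return avg, sess.end_profiling()  # end_profiling() returns the path of the trace file

dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape, not the real input
for mode in (ort.ExecutionMode.ORT_SEQUENTIAL, ort.ExecutionMode.ORT_PARALLEL):
    avg, trace = time_with_mode("backbone.onnx", mode, dummy)
    print(mode, f"{avg * 1000:.1f} ms per inference, profile written to {trace}")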
Urgency
No response
Platform
Linux
OS Version
Red Hat Enterprise Linux release 8.10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8
Model File
No response
Is this a quantized model?
No