Description
In ONNXRuntime, the OpenVINO EP accepts configuration options to set the number of threads and the number of streams (documented in the ONNXRuntime OpenVINO EP docs), but these are ignored when passed to the EP in the Triton model config, for example:
optimization { execution_accelerators {
  cpu_execution_accelerator : [ {
    name : "openvino"
    parameters { key: "num_of_threads" value: "4" }
    parameters { key: "num_streams" value: "4" }
  } ]
}}
The threading configuration for the ONNXRuntime backend itself is also ignored once OpenVINO is enabled, which is expected:
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "2" } }
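For reference, a full model config combining both parameter styles might look like the sketch below. The model name and parameter values are placeholders; the OpenVINO accelerator parameters are the ones this issue reports as having no effect:

```
name: "my_onnx_model"   # placeholder
backend: "onnxruntime"

# ONNXRuntime's own thread pools (honored when OpenVINO is disabled)
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "2" } }

# OpenVINO EP options (reported here as ignored)
optimization { execution_accelerators {
  cpu_execution_accelerator : [ {
    name : "openvino"
    parameters { key: "num_of_threads" value: "4" }
    parameters { key: "num_streams" value: "4" }
  } ]
}}
```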
Triton Information
Last tested with the Triton container 24.05.
To Reproduce
Serving an ONNX model we observe:
- The intra_op_thread_count/inter_op_thread_count parameters affect the number of inference threads used when OpenVINO is disabled
- Enabling OpenVINO optimizations, CPU usage jumps to the default/max number of CPU threads
- Attempting to set num_of_threads or num_streams has no effect
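One way to observe this while reproducing is to count the OS threads of the server process. A Linux-only sketch (assumes procps is installed; the process name "tritonserver" is the default binary name):

```shell
# Count OS threads of a process (Linux, procps). Replace "tritonserver"
# with the target process name; falls back to this shell's PID so the
# snippet is runnable anywhere.
PID=$(pgrep -o tritonserver || echo $$)
ps -o nlwp= -p "$PID"   # "nlwp" = number of lightweight processes (threads)
```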
Expected behavior
Expected behaviour would be that the OpenVINO EP ignores intra_op_thread_count and inter_op_thread_count but obeys num_of_threads and num_streams.
Unless I missed something and the ORT backend with OpenVINO optimizations reads the OpenVINO backend parameters?