[Question] Multiple model inputs and GPU allocations #269

Open
@msyulia

Description

Hi!

I wasn't sure whether to file this as a bug or whether it works as intended.

I'm currently facing an issue where a model with up to a hundred inputs, deployed via the Triton ONNX Runtime backend, shows a relatively high nv_inference_compute_input_duration_us, which, from my understanding, also includes the time spent copying tensor data to the GPU. Is it possible that each input results in a separate GPU allocator call?

From what I see in ModelInstanceState::SetInputTensors (https://github.com/triton-inference-server/onnxruntime_backend/blob/main/src/onnxruntime.cc#L2273), inputs are processed sequentially and each input results in a call to CreateTensorWithDataAsOrtValue. Is it possible that this leads to a separate GPU allocation and copy per input, and therefore a long nv_inference_compute_input_duration_us? Or does copying tensor data to the GPU happen before a request is passed to the ONNX Runtime backend?
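For reference, here is a minimal sketch of the per-input pattern I mean, using the ONNX Runtime C API. The function name, buffer layout, and error handling are placeholders of my own, not the actual backend code:

```cpp
#include <onnxruntime_c_api.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: one OrtValue is created per model input, sequentially.
// CreateTensorWithDataAsOrtValue wraps an existing buffer in an OrtValue;
// my question is whether getting each of those ~100 buffers into GPU
// memory ends up as a separate allocation + copy per input.
void BuildInputTensors(
    const OrtApi* ort, const OrtMemoryInfo* memory_info,
    const std::vector<void*>& buffers,                // one buffer per input (placeholder)
    const std::vector<size_t>& byte_sizes,            // byte length of each buffer
    const std::vector<std::vector<int64_t>>& shapes,  // shape of each input
    std::vector<OrtValue*>* input_tensors) {
  for (size_t i = 0; i < buffers.size(); ++i) {
    OrtValue* tensor = nullptr;
    // Called once per input -- ~100 sequential calls for ~100 inputs.
    OrtStatus* status = ort->CreateTensorWithDataAsOrtValue(
        memory_info, buffers[i], byte_sizes[i],
        shapes[i].data(), shapes[i].size(),
        ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT, &tensor);
    if (status != nullptr) {
      ort->ReleaseStatus(status);  // real code would propagate the error
      return;
    }
    input_tensors->push_back(tensor);
  }
}
```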
