
[Performance] model inference in onnxruntime is toooooo slow #23282

@Tian14267

Description


Describe the issue

I converted the bge-reranker-v2-m3 model to ONNX and ran it on GPU, but ONNX inference is very slow.
Running the model in torch takes about 4 minutes for 10,000 sentence pairs.
Running it in ONNX takes almost 1 hour on the same data and the same server.
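For scale, the reported timings work out to roughly a 15x slowdown (a quick back-of-the-envelope calculation using the numbers above):

```python
# Throughput implied by the reported timings (10,000 sentence pairs).
pairs = 10_000
torch_seconds = 4 * 60      # ~4 minutes in torch
onnx_seconds = 60 * 60      # ~1 hour in ONNX Runtime

torch_throughput = pairs / torch_seconds   # ~41.7 pairs/s
onnx_throughput = pairs / onnx_seconds     # ~2.8 pairs/s
slowdown = onnx_seconds / torch_seconds    # 15.0x

print(f"torch: {torch_throughput:.1f} pairs/s, "
      f"onnx: {onnx_throughput:.1f} pairs/s, slowdown: {slowdown:.0f}x")
```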
(timing screenshot not included)

Here is the device utilization while running the ONNX model (CPU and GPU monitoring screenshots not included).

My GPU is an NVIDIA GeForce RTX 4090.

Here are the versions:

python  3.10

onnx                              1.17.0
onnx-graphsurgeon                 0.5.2
onnx-simplifier                   0.4.36
onnxruntime-gpu                   1.19.2
torch                             2.5.1

Why is the ONNX model so slow?
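One thing worth ruling out first (my suggestion, not part of the original report): if the CUDA provider cannot be loaded (for example, a CUDA/cuDNN version mismatch with onnxruntime-gpu), ONNX Runtime may end up running on the CPU, which would explain a slowdown of this magnitude. A quick sanity check:

```python
# Diagnostic sketch: confirm the CUDA execution provider is actually available.
# Guarded import so this also runs in environments without onnxruntime installed.
try:
    import onnxruntime as ort
    available = ort.get_available_providers()
except ImportError:
    available = []

print("CUDA available to ONNX Runtime:", "CUDAExecutionProvider" in available)

# For a live session, session.get_providers() shows what is actually in use, e.g.:
# session = ort.InferenceSession(model_path, providers=['CUDAExecutionProvider'])
# print(session.get_providers())
```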

To reproduce

Here is my inference code:

class OnnxInference():
    def __init__(self):
        import onnxruntime

        self.max_length = 4096
        device = 'gpu'
        model_path = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
        if device == "cpu":
            self.onnx_model = onnxruntime.InferenceSession(model_path)
        elif "gpu" in device:
            providers = ['CUDAExecutionProvider']
            self.onnx_model = onnxruntime.InferenceSession(model_path, providers=providers)
        ###########
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("../bge-reranker-v2-m3")

    def inference(self, input_data):
        # padding=True pads to the longest sequence in the batch (capped at max_length)
        inputs = self.tokenizer(input_data,
                                padding=True, truncation=True,
                                return_tensors='np', max_length=self.max_length)

        input_feed = {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"],
        }
        outs = self.onnx_model.run(["logits"], input_feed)

        return outs

Here is my conversion code (torch to ONNX):


import numpy as np
import torch


def convert_to_onnx():
    from transformers import AutoModelForSequenceClassification
    model_name_or_path = "/data/fffan/01_experiment/03_Bge/bge-reranker-v2-m3"

    device = 'cuda:0'
    # Dummy inputs at the maximum sequence length; both axes are marked dynamic below.
    input_ids = torch.zeros([1, 4096], dtype=torch.int64).to(device)
    attention_mask = torch.zeros([1, 4096], dtype=torch.int64).to(device)

    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
    model = model.to(device)
    model.eval()

    onnx_name = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
    torch.onnx.export(model,                        # model being run
                      (input_ids, attention_mask),  # model input (or a tuple for multiple inputs)
                      onnx_name,                    # where to save the model
                      opset_version=14,             # the ONNX opset version to export to
                      input_names=['input_ids', 'attention_mask'],
                      output_names=['logits'],
                      dynamic_axes={"input_ids": {0: "batch_size", 1: "max_length"},  # dynamic batch and sequence axes
                                    "attention_mask": {0: "batch_size", 1: "max_length"}
                                    }
                      )

    print("####  conversion finished")
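After exporting, a quick structural check of the saved graph can rule out a broken export (a sketch I am adding for completeness, assuming the `onnx` package and the output path used above):

```python
import os

# Post-export check: load the saved graph and run the ONNX checker on it.
model_path = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
exists = os.path.exists(model_path)
if exists:
    import onnx
    model = onnx.load(model_path)
    onnx.checker.check_model(model)  # raises if the graph is malformed
    print("export looks structurally valid")
else:
    print("model file not found; run the export first")
```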

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu 1.19.2

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.3

Model File

No response

Is this a quantized model?

No


Labels

model:transformer, performance, stale
