Closed
Labels
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.), performance (issues related to performance regressions), stale (issues that have not been addressed in a while; categorized by a bot)
Description
Describe the issue
I converted the bge-reranker-v2-m3 model to ONNX and ran it on GPU, but ONNX inference is far slower than expected.
Running this model in torch takes about 4 minutes for 10,000 sentence pairs; running the same data on the same server with ONNX takes almost 1 hour.

Here is the device info when running the ONNX model:
CPU: [screenshot not included]
GPU: NVIDIA GeForce RTX 4090
Here are the versions:
python 3.10
onnx 1.17.0
onnx-graphsurgeon 0.5.2
onnx-simplifier 0.4.36
onnxruntime-gpu 1.19.2
torch 2.5.1
Why is the ONNX model so slow?
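One thing worth ruling out first (a minimal sketch against the session setup below; as far as I know, onnxruntime silently falls back to CPUExecutionProvider when the CUDA libraries fail to load):

    import onnxruntime

    sess = onnxruntime.InferenceSession("./onnx_model/onnx_fp32/reranker_onnx.onnx",
                                        providers=['CUDAExecutionProvider'])
    # If CUDA initialization failed, this prints ['CPUExecutionProvider']
    # instead of ['CUDAExecutionProvider', 'CPUExecutionProvider']
    print(sess.get_providers())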
To reproduce
Here is my inference code:
class OnnxInference():
    def __init__(self):
        import os  # needed for os.path.join below
        import onnxruntime
        self.max_length = 4096
        device = 'gpu'
        model_path = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
        if device == "cpu":
            self.onnx_model = onnxruntime.InferenceSession(os.path.join(model_path))
        elif "gpu" in device:
            providers = ['CUDAExecutionProvider']
            self.onnx_model = onnxruntime.InferenceSession(os.path.join(model_path), providers=providers)
        ###########
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("../bge-reranker-v2-m3")

    def inference(self, input_data):
        # tokenize to numpy arrays so they can be fed to ONNX Runtime directly
        inputs = self.tokenizer(input_data, padding=True, truncation=True,
                                return_tensors='np', max_length=self.max_length)

        def get_input_feed(input_ids, attention_mask):
            input_feed = {}
            input_feed["input_ids"] = input_ids
            input_feed["attention_mask"] = attention_mask
            return input_feed

        input_feed = get_input_feed(inputs["input_ids"], inputs["attention_mask"])
        outs = self.onnx_model.run(["logits"], input_feed)
        return outs
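For reference, this is how the class gets called (the query/passage pair here is just a made-up example):

    ranker = OnnxInference()
    # each item is a [query, passage] pair, matching the bge reranker input format
    pairs = [["what is a panda?", "The giant panda is a bear species endemic to China."]]
    logits = ranker.inference(pairs)
    print(logits)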
Here is my conversion code (torch to ONNX):
def convert_to_onnx():
    import numpy as np
    import torch
    from transformers import AutoModelForSequenceClassification

    model_name_or_path = "/data/fffan/01_experiment/03_Bge/bge-reranker-v2-m3"
    device = 'cuda:0'

    # dummy inputs at the maximum sequence length
    input_ids_np = torch.from_numpy(np.zeros([1, 4096], dtype=np.int64))
    attention_mask_np = torch.from_numpy(np.zeros([1, 4096], dtype=np.int64))
    input_ids_tf = input_ids_np.type(torch.int64).to(device)
    attention_mask_tf = attention_mask_np.type(torch.int64).to(device)

    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
    model = model.to(device)
    model.eval()

    onnx_name = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
    torch.onnx.export(model,                              # model being run
                      (input_ids_tf, attention_mask_tf),  # model input (or a tuple for multiple inputs)
                      onnx_name,                          # where to save the model
                      opset_version=14,                   # the ONNX opset version to export the model to
                      input_names=['input_ids', 'attention_mask'],
                      output_names=['logits'],
                      dynamic_axes={"input_ids": {0: "batch_size", 1: "max_length"},  # batch axes
                                    "attention_mask": {0: "batch_size", 1: "max_length"}})
    print("#### conversion finished")
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu 1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.3
Model File
No response
Is this a quantized model?
No