Closed
Labels
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.), performance (issues related to performance regressions), stale (issues that have not been addressed in a while; categorized by a bot)
Description
Describe the issue
I converted the bge-reranker-v2-m3 model to ONNX and ran it on GPU, but ONNX inference is far slower than expected.
Running this model in torch takes about 4 minutes for 10,000 sentence pairs; running the same data on the same server with ONNX takes almost 1 hour.

Here is the device info when running the ONNX model:
CPU: [screenshot not included]
GPU: NVIDIA GeForce RTX 4090
Here are the versions:
python 3.10
onnx 1.17.0
onnx-graphsurgeon 0.5.2
onnx-simplifier 0.4.36
onnxruntime-gpu 1.19.2
torch 2.5.1
Why is the ONNX model so slow?
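One thing worth ruling out first (a minimal sketch against the session setup below; as far as I know, onnxruntime silently falls back to CPUExecutionProvider when the CUDA libraries fail to load):

    import onnxruntime

    sess = onnxruntime.InferenceSession("./onnx_model/onnx_fp32/reranker_onnx.onnx",
                                        providers=['CUDAExecutionProvider'])
    # If CUDA initialization failed, this prints ['CPUExecutionProvider']
    # instead of ['CUDAExecutionProvider', 'CPUExecutionProvider']
    print(sess.get_providers())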
To reproduce
Here is my inference code:
class OnnxInference():
    def __init__(self):
        import os  # needed for os.path.join below
        import onnxruntime
        self.max_length = 4096
        device = 'gpu'
        model_path = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
        if device == "cpu":
            self.onnx_model = onnxruntime.InferenceSession(os.path.join(model_path))
        elif "gpu" in device:
            providers = ['CUDAExecutionProvider']
            self.onnx_model = onnxruntime.InferenceSession(os.path.join(model_path), providers=providers)
        ###########
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("../bge-reranker-v2-m3")

    def inference(self, input_data):
        # tokenize to numpy arrays so they can be fed to ONNX Runtime directly
        inputs = self.tokenizer(input_data, padding=True, truncation=True,
                                return_tensors='np', max_length=self.max_length)

        def get_input_feed(input_ids, attention_mask):
            input_feed = {}
            input_feed["input_ids"] = input_ids
            input_feed["attention_mask"] = attention_mask
            return input_feed

        input_feed = get_input_feed(inputs["input_ids"], inputs["attention_mask"])
        outs = self.onnx_model.run(["logits"], input_feed)
        return outs
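For reference, this is how the class gets called (the query/passage pair here is just a made-up example):

    ranker = OnnxInference()
    # each item is a [query, passage] pair, matching the bge reranker input format
    pairs = [["what is a panda?", "The giant panda is a bear species endemic to China."]]
    logits = ranker.inference(pairs)
    print(logits)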
Here is my conversion code (torch to ONNX):
def convert_to_onnx():
    import numpy as np
    import torch
    from transformers import AutoModelForSequenceClassification

    model_name_or_path = "/data/fffan/01_experiment/03_Bge/bge-reranker-v2-m3"
    device = 'cuda:0'

    # dummy inputs at the maximum sequence length
    input_ids_np = torch.from_numpy(np.zeros([1, 4096], dtype=np.int64))
    attention_mask_np = torch.from_numpy(np.zeros([1, 4096], dtype=np.int64))
    input_ids_tf = input_ids_np.type(torch.int64).to(device)
    attention_mask_tf = attention_mask_np.type(torch.int64).to(device)

    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
    model = model.to(device)
    model.eval()

    onnx_name = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
    torch.onnx.export(model,                              # model being run
                      (input_ids_tf, attention_mask_tf),  # model input (or a tuple for multiple inputs)
                      onnx_name,                          # where to save the model
                      opset_version=14,                   # the ONNX opset version to export the model to
                      input_names=['input_ids', 'attention_mask'],
                      output_names=['logits'],
                      dynamic_axes={"input_ids": {0: "batch_size", 1: "max_length"},  # batch axes
                                    "attention_mask": {0: "batch_size", 1: "max_length"}})
    print("#### conversion finished")
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu 1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.3
Model File
No response
Is this a quantized model?
No