Description
I am using an Int8 quantized version of the BGE-reranker-base model converted to ONNX, and I process inputs in batches. With the original PyTorch model I see a latency of 20-30 seconds; with the int8 quantized, ONNX-optimized model the latency drops to 8-15 seconds, with all other configuration (hardware, batch size, batch processing logic, etc.) identical to what I used with the original torch model.
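For reference, this is a minimal sketch of the kind of batched inference setup I am describing. The model path, batch size, input names, and sequence length are placeholders for my actual configuration, and the exported model is assumed to take `input_ids`/`attention_mask` and return one logit per query-passage pair:

```python
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_PATH = "bge-reranker-base-int8.onnx"  # placeholder path to the quantized export

# Session options: cap intra-op threads at the physical core count (quad-core here)
# and enable full graph optimizations.
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(MODEL_PATH, sess_options, providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")

def rerank_batch(query, passages, batch_size=16):
    """Score (query, passage) pairs in fixed-size batches with the int8 ONNX model."""
    scores = []
    for i in range(0, len(passages), batch_size):
        chunk = passages[i:i + batch_size]
        enc = tokenizer([query] * len(chunk), chunk,
                        padding=True, truncation=True, max_length=512,
                        return_tensors="np")
        # Input names assumed from a standard export; adjust to match the actual graph.
        logits = session.run(None, {"input_ids": enc["input_ids"],
                                    "attention_mask": enc["attention_mask"]})[0]
        scores.extend(logits.reshape(-1).tolist())
    return scores
```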
I am using Flask as an API server, on a quad-core machine.
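The serving side looks roughly like the sketch below (endpoint name and payload shape are illustrative, not my exact code):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/rerank", methods=["POST"])
def rerank():
    payload = request.get_json()
    scores = rerank_batch(payload["query"], payload["passages"])
    return jsonify({"scores": scores})

if __name__ == "__main__":
    # Flask's built-in development server; in deployment this sits behind a WSGI server.
    app.run(host="0.0.0.0", port=5000)
```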
I want to further reduce the latency of the ONNX model. How can I do so?
Please also suggest anything else I can do on the deployment side.