
BGE Reranker / BERT Crossencoder Onnx model latency issue #2470


Description

@ojasDM

I am using an int8-quantized version of the BGE-reranker-base model converted to ONNX, and I process the inputs in batches. With the original PyTorch model I see a latency of 20-30 seconds per request. With the int8-quantized, ONNX-optimized model, the latency drops to 8-15 seconds, keeping all other configuration (hardware, batch processing, etc.) the same as with the original Torch model.
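For reference, a minimal sketch of the kind of inference setup I mean (assuming the onnxruntime Python API; the model path, thread count, and max length below are illustrative):

```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Session options that directly affect CPU latency on a quad-core machine.
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4  # match the number of physical cores
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(
    "bge-reranker-base-int8.onnx",  # illustrative path to the quantized model
    sess_options,
    providers=["CPUExecutionProvider"],
)

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")

def score_batch(pairs):
    """Score a batch of (query, passage) pairs in one forward pass."""
    enc = tokenizer(
        [q for q, _ in pairs],
        [p for _, p in pairs],
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="np",
    )
    # Feed only the inputs the exported graph actually declares.
    input_names = {i.name for i in session.get_inputs()}
    feed = {k: v for k, v in enc.items() if k in input_names}
    logits = session.run(None, feed)[0]
    return logits.squeeze(-1)  # one relevance score per pair
```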
I am serving the model behind a Flask API on a quad-core machine.
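The Flask side is essentially a thin wrapper around that batch call, something like this (endpoint and field names are illustrative):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/rerank", methods=["POST"])
def rerank():
    # Expects JSON like {"query": "...", "passages": ["...", "..."]}.
    body = request.get_json()
    pairs = [(body["query"], passage) for passage in body["passages"]]
    scores = score_batch(pairs)  # score_batch from the sketch above
    return jsonify({"scores": scores.tolist()})
```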
I want to further reduce the latency of the ONNX model. How can I do that?
Please also suggest anything else I can improve in the deployment.
