Description
I am using an Int8 quantized version of the BGE-reranker-base model converted to ONNX, and I process inputs in batches. With the original PyTorch model I see a latency of 20-30 seconds; with the int8 quantized, ONNX-optimized model the latency drops to 8-15 seconds, with all other configuration (hardware, batch size, batch processing logic, etc.) identical to what I used with the original torch model.
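For reference, this is a minimal sketch of the kind of batched inference setup I am describing. The model path, batch size, input names, and sequence length are placeholders for my actual configuration, and the exported model is assumed to take `input_ids`/`attention_mask` and return one logit per query-passage pair:

```python
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_PATH = "bge-reranker-base-int8.onnx"  # placeholder path to the quantized export

# Session options: cap intra-op threads at the physical core count (quad-core here)
# and enable full graph optimizations.
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(MODEL_PATH, sess_options, providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")

def rerank_batch(query, passages, batch_size=16):
    """Score (query, passage) pairs in fixed-size batches with the int8 ONNX model."""
    scores = []
    for i in range(0, len(passages), batch_size):
        chunk = passages[i:i + batch_size]
        enc = tokenizer([query] * len(chunk), chunk,
                        padding=True, truncation=True, max_length=512,
                        return_tensors="np")
        # Input names assumed from a standard export; adjust to match the actual graph.
        logits = session.run(None, {"input_ids": enc["input_ids"],
                                    "attention_mask": enc["attention_mask"]})[0]
        scores.extend(logits.reshape(-1).tolist())
    return scores
```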
I am using Flask as an API server, on a quad-core machine.
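The serving side looks roughly like the sketch below (endpoint name and payload shape are illustrative, not my exact code):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/rerank", methods=["POST"])
def rerank():
    payload = request.get_json()
    scores = rerank_batch(payload["query"], payload["passages"])
    return jsonify({"scores": scores})

if __name__ == "__main__":
    # Flask's built-in development server; in deployment this sits behind a WSGI server.
    app.run(host="0.0.0.0", port=5000)
```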
I want to further reduce the latency of the ONNX model. How can I do so?
Please also suggest anything else I can do on the deployment side.