Description
Situation
- I want to run multiple models on the same GPU, but onnxruntime-genai (`ort-genai`) does not release GPU memory after each inference. I want to find a way to free GPU memory after each inference turn.
I have tried
- Add the following after each inference, but it does not work; I guess the memory is held by `ort-genai`, not by torch:

  ```python
  if torch.cuda.is_available():
      torch.cuda.empty_cache()
      torch.cuda.ipc_collect()
  ```
- Research: in plain `onnxruntime` (without genai) there is a config option to shrink GPU memory after each run and release it, but I have not found any way to do the same in `ort-genai` (see the sketch right after this list).
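For reference, a minimal sketch of that plain `onnxruntime` option, assuming the CUDA execution provider; the model path, input name, and input shape below are placeholders:

```python
import numpy as np
import onnxruntime as ort

# Arena shrinkage is most effective when the arena only grows by what each
# run actually requests.
providers = [
    ("CUDAExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"}),
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path

run_opts = ort.RunOptions()
# Ask ORT to shrink the GPU arena for device 0 once this run finishes.
run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

# Placeholder input name and dummy data; replace with the model's real inputs.
outputs = sess.run(None, {"input": np.zeros((1, 8), dtype=np.int64)}, run_options=run_opts)
```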
Some of the possible ways I can think of:
- Run `ort-genai` in a thread (not sure how); see the sketch after this list.
- Use the `onnxruntime` config (I guess there should be some way to configure the parent class?)
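On the first idea: a thread would not help by itself, because the CUDA memory belongs to the process rather than to the thread that allocated it. A hedged variant is to run each inference turn in a short-lived child process, so the GPU memory is returned when that process exits. The sketch below assumes the `onnxruntime_genai` Python API (`og.Model`, `og.Tokenizer`, `og.GeneratorParams`, `og.Generator`); the exact generation loop differs between releases, and the model path and prompt are placeholders.

```python
import multiprocessing as mp

def generate_in_child(model_dir, prompt, queue):
    # Import inside the child so the parent process never touches CUDA.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.input_ids = tokenizer.encode(prompt)  # older-style API; newer releases use generator.append_tokens(...)
    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.compute_logits()               # dropped in newer releases, where generate_next_token() alone suffices
        generator.generate_next_token()
    queue.put(tokenizer.decode(generator.get_sequence(0)))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")                # keep CUDA state out of the parent process
    queue = ctx.Queue()
    proc = ctx.Process(target=generate_in_child, args=("path/to/model", "Hello", queue))
    proc.start()
    print(queue.get())
    proc.join()                                  # the child's GPU memory is released here
```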
In the best scenario, I want to release the memory while keeping the model loaded for later use; otherwise, I wonder if I can delete the `onnxruntime_genai.Model` instance to get the memory back.
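On that fallback, a minimal sketch of dropping the model, assuming no other object (generator, tokenizer, params) still references it; whether the native CUDA allocations are actually returned at that point depends on the `ort-genai` release, so this is an untested suggestion rather than confirmed behaviour:

```python
import gc
import onnxruntime_genai as og

model = og.Model("path/to/model")   # placeholder path
# ... run inference and keep only the decoded text ...

# Drop every Python reference that keeps the native model/session alive,
# then force a collection so the underlying objects can be destroyed.
del model
gc.collect()
```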
I hope to hear any suggestions on how to achieve this.
Thank you for your time!