
How to release GPU memory after each inference? #446

Open
@nguyenthekhoig7

Description


Situation

  • I want to run multiple models on the same GPU, but onnxruntime-genai (ort-genai) does not release GPU memory after each inference. I am looking for a way to free GPU memory after each inference turn.

I have tried

  1. Adding

import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()   # frees cached blocks held by PyTorch's CUDA allocator
    torch.cuda.ipc_collect()   # cleans up CUDA IPC memory handles

after each inference, but it does not help; I guess the memory is held by ort-genai, not by torch.

  2. Research: in plain onnxruntime (without genai), there is a run-time config option to shrink the GPU memory arena after each run, but I have not found any way to do the same in ort-genai (see the sketch below for what I mean).
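
For reference, this is the plain-onnxruntime mechanism I mean: a per-run config entry that shrinks the memory arena after the call. A minimal sketch, assuming the CUDA provider uses an arena that only grows as requested; the model path and input name/shape are placeholders:

import numpy as np
import onnxruntime as ort

# Grow the arena only as much as each run needs, so shrinkage can
# actually hand memory back afterwards.
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[("CUDAExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"})],
)

run_opts = ort.RunOptions()
run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

# The GPU arena is shrunk at the end of this call.
outputs = sess.run(None, {"input": np.zeros((1, 8), dtype=np.float32)}, run_options=run_opts)

I have not found an equivalent knob exposed through ort-genai's Python API.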

Some possible ways I can think of:

  • Run ort-genai in a separate thread or process (not sure how; a process-based sketch follows this list)
  • Use onnxruntime config options (I guess there should be some way to configure the underlying onnxruntime session that ort-genai wraps?)
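
On the first idea: a thread would still share the same CUDA context, so what I have in mind is more like a short-lived worker process, where the driver reclaims everything when the process exits. A rough, untested sketch; the model path is a placeholder and the actual generation loop is elided:

import multiprocessing as mp

def worker(model_dir: str, prompt: str, queue) -> None:
    import onnxruntime_genai as og  # import inside the worker so only this process touches the GPU
    model = og.Model(model_dir)     # loads the weights onto the GPU in this process
    # ... tokenizer / generator loop as usual, producing `text` ...
    text = "<generated text placeholder>"
    queue.put(text)
    # when the process exits, the driver reclaims every allocation it made

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # spawn is the safer start method with CUDA
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=("path/to/model", "Hello", q))
    p.start()
    print(q.get())
    p.join()                        # GPU memory should be back after this point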

In the best scenario, I would like to release the memory while keeping the model loaded for later use; otherwise, I wonder whether I can delete the onnxruntime_genai.Model instance to get the memory back (a sketch of what I mean follows).
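
A minimal sketch of the fallback, assuming ort-genai actually frees its device allocations when the Python object is destroyed (which I have not been able to confirm); the model path is a placeholder:

import gc
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder model folder
# ... run inference as usual ...

# Drop the last Python reference and force a collection pass; whether the
# GPU memory really comes back depends on the Model destructor.
del model
gc.collect()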

I would appreciate any suggestions on how to achieve this.
Thank you for your time!
