Description
Situation
- I want to run multiple models on the same GPU, but onnxruntime-genai (`ort-genai`) does not release GPU memory after each inference. I want to find a way to free GPU memory after each inference turn.
I have tried
- Add the following after each inference, but it does not work; I guess the memory is held by `ort-genai`, not by torch:

  ```python
  if torch.cuda.is_available():
      torch.cuda.empty_cache()
      torch.cuda.ipc_collect()
  ```
- Research: in plain `onnxruntime` (without genai) there is a config option to shrink GPU memory after each run and release it, but I have not found any way to do the same in `ort-genai` (see the sketch right after this list).
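For reference, a minimal sketch of that plain `onnxruntime` option, assuming the CUDA execution provider; the model path, input name, and input shape below are placeholders:

```python
import numpy as np
import onnxruntime as ort

# Arena shrinkage is most effective when the arena only grows by what each
# run actually requests.
providers = [
    ("CUDAExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"}),
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path

run_opts = ort.RunOptions()
# Ask ORT to shrink the GPU arena for device 0 once this run finishes.
run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

# Placeholder input name and dummy data; replace with the model's real inputs.
outputs = sess.run(None, {"input": np.zeros((1, 8), dtype=np.int64)}, run_options=run_opts)
```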
Some of the possible ways I can think of:
- Run `ort-genai` in a thread (not sure how); see the sketch after this list.
- Use the `onnxruntime` config (I guess there should be some way to configure the parent class?)
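On the first idea: a thread would not help by itself, because the CUDA memory belongs to the process rather than to the thread that allocated it. A hedged variant is to run each inference turn in a short-lived child process, so the GPU memory is returned when that process exits. The sketch below assumes the `onnxruntime_genai` Python API (`og.Model`, `og.Tokenizer`, `og.GeneratorParams`, `og.Generator`); the exact generation loop differs between releases, and the model path and prompt are placeholders.

```python
import multiprocessing as mp

def generate_in_child(model_dir, prompt, queue):
    # Import inside the child so the parent process never touches CUDA.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.input_ids = tokenizer.encode(prompt)  # older-style API; newer releases use generator.append_tokens(...)
    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.compute_logits()               # dropped in newer releases, where generate_next_token() alone suffices
        generator.generate_next_token()
    queue.put(tokenizer.decode(generator.get_sequence(0)))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")                # keep CUDA state out of the parent process
    queue = ctx.Queue()
    proc = ctx.Process(target=generate_in_child, args=("path/to/model", "Hello", queue))
    proc.start()
    print(queue.get())
    proc.join()                                  # the child's GPU memory is released here
```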
In the best scenario, I want to release the memory while keeping the model loaded for later use; otherwise, I wonder if I can delete the `onnxruntime_genai.Model` instance to get the memory back.
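On that fallback, a minimal sketch of dropping the model, assuming no other object (generator, tokenizer, params) still references it; whether the native CUDA allocations are actually returned at that point depends on the `ort-genai` release, so this is an untested suggestion rather than confirmed behaviour:

```python
import gc
import onnxruntime_genai as og

model = og.Model("path/to/model")   # placeholder path
# ... run inference and keep only the decoded text ...

# Drop every Python reference that keeps the native model/session alive,
# then force a collection so the underlying objects can be destroyed.
del model
gc.collect()
```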
I hope to hear any suggestions on how to achieve this.
Thank you for your time!