Load the model on demand #17
Unanswered
rockstar2020 asked this question in Q&A
Hi,
I have very limited VRAM that I'd like to run a few AI apps on.
Is there a way to load the model only when a transcription task is required, and then unload it after it's done?
I'm using the uv-gpu Docker image.
Thanks

Replies: 1 comment
https://github.com/mostlygeek/llama-swap should do that for you.
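
For context, llama-swap is a proxy that starts the backing inference server on the first request for a model and can stop it again after an idle timeout, which frees the VRAM between jobs. Below is a minimal config sketch: the `models`/`cmd`/`ttl` fields and the `${PORT}` macro follow the project's README as I understand it, but the model name and server command are purely illustrative, and it assumes your transcription server exposes an HTTP API llama-swap can proxy. Verify the exact syntax against the llama-swap docs.

```yaml
# llama-swap config.yaml (sketch; check field names against the README).
models:
  "whisper":  # hypothetical model name
    # Command llama-swap launches on the first request for "whisper".
    # ${PORT} is replaced by llama-swap with the upstream port it proxies to.
    cmd: whisper-server --port ${PORT} -m /models/ggml-large-v3.bin
    # Stop the process (freeing VRAM) after 120 seconds of inactivity.
    ttl: 120
```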
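If you would rather not run a proxy, you can get the same effect in application code with a plain load-on-demand pattern: construct the model only when a transcription job arrives, and drop every reference once it finishes so the GPU memory is returned. A minimal sketch, assuming faster-whisper as the backend (swap in whatever library your container actually uses):

```python
# Load-on-demand sketch; faster-whisper is an assumption, not necessarily
# the library the uv-gpu image ships with.
import gc

from faster_whisper import WhisperModel


def transcribe_once(audio_path: str) -> str:
    """Load the model, run one transcription, then free the VRAM."""
    model = WhisperModel("large-v3", device="cuda")  # VRAM allocated here
    segments, _info = model.transcribe(audio_path)
    # transcribe() is lazy: consume the generator before dropping the model.
    text = " ".join(segment.text for segment in segments)
    del model      # drop the only reference to the model...
    gc.collect()   # ...so CTranslate2 releases the GPU memory
    return text
```

The trade-off is a cold start on every request; llama-swap's `ttl` gives the same VRAM behaviour while keeping the model warm between closely spaced jobs.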