I have the following in my docker compose file:
```yaml
environment:
  - CHAT_COMPLETION_BASE_URL=http://192.168.11.27:11434/v1
  - CHAT_COMPLETION_API_KEY=xxx
  # Keep models in memory forever to stop Ollama from hogging all of the VRAM
  - STT_MODEL_TTL=-1
  - TTS_MODEL_TTL=-1
```
Yet when I make a request to the API, GPU VRAM usage drops back to zero a few minutes later, so the models still appear to be unloaded despite the `-1` TTL settings.
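One thing worth noting: the chat backend here is Ollama itself (port 11434), and Ollama unloads models after about five minutes by default, independently of the STT/TTS TTLs above. If the VRAM drop is coming from the Ollama side, a sketch of pinning its models too might look like this (assuming Ollama runs as its own Compose service; `OLLAMA_KEEP_ALIVE` is Ollama's documented keep-alive setting, and `-1` means keep loaded indefinitely):

```yaml
services:
  ollama:
    image: ollama/ollama
    environment:
      # Ollama's own unload timer; -1 keeps loaded models in VRAM indefinitely
      - OLLAMA_KEEP_ALIVE=-1
    ports:
      - "11434:11434"
```

Watching `nvidia-smi` (or Ollama's loaded-model list) while only one of the two services has a request in flight should show which side is releasing the memory.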