It would be very useful if lemonade-server could automatically unload models that haven't been used for a while, similar to what Ollama does. In fact, this feature is the only reason I still use Ollama at this point. It's super convenient to run the inference engine as a systemd service on boot and have it available whenever I need it, but having to manually unload the models or stop the service whenever I want to run a game or another VRAM-heavy application like Blender or ComfyUI gets annoying quickly. If I understand correctly, Lemonade already keeps track of model inactivity, but only uses it to unload inactive models when VRAM is needed to load another model in Lemonade.
It would be even nicer if the server only unloaded inactive models once some other application starts filling up VRAM. Best of both worlds: unused RAM is wasted RAM, after all.
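
To make the request a bit more concrete, here is a rough Python sketch of the two behaviours described above: a plain idle timeout, and an optional mode where idle models are only unloaded once free VRAM drops below a threshold. None of this is Lemonade's actual code or API. `IdleUnloader`, `touch`, `sweep`, the `unload` hook, and the `free_vram_mb` probe are all made-up names for illustration, and how free VRAM would actually be measured depends on the GPU vendor and is left as an injected callable.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class IdleUnloader:
    """Tracks when each model was last used and decides what to unload."""

    idle_timeout_s: float                     # hypothetical setting, e.g. 300
    unload: Callable[[str], None]             # hypothetical hook that frees a model
    free_vram_mb: Optional[Callable[[], float]] = None  # hypothetical VRAM probe
    min_free_vram_mb: float = 0.0             # only used when a probe is given
    clock: Callable[[], float] = time.monotonic
    _last_used: Dict[str, float] = field(default_factory=dict)

    def touch(self, model: str) -> None:
        """Record that a model was just loaded or served a request."""
        self._last_used[model] = self.clock()

    def sweep(self) -> List[str]:
        """Unload idle models and return their names."""
        # With a VRAM probe, keep idle models loaded while memory is plentiful.
        if self.free_vram_mb is not None:
            if self.free_vram_mb() >= self.min_free_vram_mb:
                return []
        now = self.clock()
        idle = [
            name
            for name, last in self._last_used.items()
            if now - last >= self.idle_timeout_s
        ]
        for name in idle:
            self.unload(name)
            del self._last_used[name]
        return idle
```

The server would call `touch()` on every request and run `sweep()` periodically from a timer or background thread. Leaving `free_vram_mb` unset gives the Ollama-style timeout; passing a probe gives the "only unload when something else needs the VRAM" variant.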