Support more remote deployment scenarios #326
Replies: 11 comments · 13 replies
-
Yes, I think we’ll need to think about how to separate everything that is OpenAI-compatible from /props, which is very specific to llama.cpp but still quite interesting. We could also keep in mind the possibility of having a model selector in the built-in llama.cpp UI, thanks to llama-swap’s strict OpenAI-compatible behavior. And maybe consider multimodality through /v1/models (as a simple cache on llama-swap?) instead of hitting /props directly. Lots of things to explore for the future!

I’m sharing some of my live experiments here: https://www.serveurperso.com/ia/ (three web interfaces connected to llama-swap: legacy + svelte + llama.ui). It’s just my dedicated Debian Netinst LLM server, running 24/7 only for live-testing the master branch plus patches. With the community’s steady stream of fixes and improvements, we’re getting pretty close to an all-in-one “LM Studio for the web”, and even a UI for editing config.yaml in llama-swap could be imagined down the road (though that’s another topic).
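As a toy illustration of the “/v1/models as a simple cache” idea, here is a minimal Go sketch (not llama-swap code; the upstream address and the 30-second TTL are assumptions): the proxy fetches the model list from the backend once, then serves it from memory until the entry expires.

```go
// Hypothetical sketch of caching /v1/models in a small proxy, so UIs can
// query model metadata repeatedly without hitting the backend each time.
// Upstream address and TTL are illustrative assumptions.
package main

import (
	"io"
	"net/http"
	"sync"
	"time"
)

type modelsCache struct {
	mu       sync.Mutex
	body     []byte
	fetched  time.Time
	ttl      time.Duration
	upstream string
}

func (c *modelsCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.body == nil || time.Since(c.fetched) > c.ttl {
		resp, err := http.Get(c.upstream + "/v1/models")
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		b, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		c.body, c.fetched = b, time.Now() // refresh the cached payload
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(c.body)
}

func main() {
	http.Handle("/v1/models", &modelsCache{
		ttl:      30 * time.Second,
		upstream: "http://127.0.0.1:8081", // assumed backend address
	})
	http.ListenAndServe(":8080", nil)
}
```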
-
Thanks to llama.cpp + llama-swap it’s already easy to:

- Share fresh models with friends (some of mine are paranoid and prefer using my server, knowing I run it myself and the UIs don’t log anything; just browser IndexedDB, with import/export possible)
- Quickly publish fresh GGUF builds right after a PR lands in llama.cpp
- Run a local LLM server for a dev team inside a company, for sensitive projects, and plug it into Continue / Cline / VS Code
- Access your LLM server from anywhere, which gets even more interesting when a proxy tool call enables home-automation actions or web-scraping tasks that ChatGPT and others would normally refuse to do

The only thing missing is a small baseUrl option in llama-swap to make it easier to share a server and observe its logs remotely. Of course it’s trivial to secure this with HTTPS authentication (just like llama.cpp already does with an API key). Even locally I put it behind a first proxy, but with baseUrl support I could pass through a second server and share llama-swap (I could easily cheat with URL rewriting, but it feels less clean!).
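As a rough sketch of the baseUrl idea (the prefix, ports and target address below are assumptions, not existing llama-swap options), an outer Go proxy can expose llama-swap under a path prefix by stripping it before forwarding. A native baseUrl setting would go further by letting the UI itself generate links with that prefix.

```go
// Minimal sketch: expose a llama-swap instance under /llama-swap/ on an
// outer reverse proxy. TLS termination and authentication are assumed to be
// handled by the front server; addresses and the prefix are made up.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	target, err := url.Parse("http://127.0.0.1:8080") // where llama-swap listens
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	// Strip the prefix before forwarding so llama-swap still sees clean paths.
	// This is the "URL rewriting" workaround; a built-in baseUrl option would
	// make the UI itself prefix-aware.
	http.Handle("/llama-swap/", http.StripPrefix("/llama-swap", proxy))
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```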
-
I plan to use llama-swap to serve a small team working with confidential data. It would be great to make it easier to protect the administration API and web interface from unauthorized access. I need developers to be able to use the regular API, but I don’t want them to see logs or access other admin features.
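One way to get this today, sketched below under the assumption that a front proxy sits in front of llama-swap (the /ui and /logs paths, port and credentials are only illustrative): require HTTP basic auth on the admin surface and pass everything else, including the OpenAI-compatible /v1 API, straight through. A native option in llama-swap’s config would of course be cleaner.

```go
// Hedged sketch of an external gate: basic auth on admin paths, open proxy
// for the regular API. Paths, port and credentials are assumptions.
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func requireAuth(next http.Handler, user, pass string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="llama-swap admin"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	target, err := url.Parse("http://127.0.0.1:8080") // llama-swap
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	admin := requireAuth(proxy, "admin", "change-me")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Gate the admin surface; let developers use the plain API untouched.
		if strings.HasPrefix(r.URL.Path, "/ui") || strings.HasPrefix(r.URL.Path, "/logs") {
			admin.ServeHTTP(w, r)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```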
-
When using only the single endpoint that powers the token counters in the llama-swap UI, and running only one model on the GPU at a time, I haven’t noticed any issue with conversations being cut off, even with two users talking to two different models. It only takes the time needed to load the weights; everything else works smoothly.
-
Hi mostlygeek! Speaking of remote deployment: I had to SSH in and restart manually, because I noticed that if llama-server crashes, there are cases where llama-swap gets stuck waiting indefinitely, and I have to kill -9 everything to recover!
-
@ServeurpersoCom Hi, in the UI on the models page, the Unload / Unload All buttons should stop/kill the process. These buttons trigger llama-swap to send a SIGINT; if that times out (5 seconds), llama-swap sends a SIGKILL. If that's not working, is there any debug output? Windows and container images are a bit different. Windows doesn't have signals, so by default llama-swap will make
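For anyone curious about the pattern being described, here is a small Unix-only Go sketch (not llama-swap’s actual source) of the same stop sequence: send SIGINT to the child process, wait up to five seconds, then fall back to SIGKILL if it has not exited.

```go
// Illustrative stop sequence: SIGINT first, SIGKILL after a 5 s timeout.
// Unix-only; Windows has no equivalent signal handling.
package main

import (
	"log"
	"os/exec"
	"syscall"
	"time"
)

func stopProcess(cmd *exec.Cmd) {
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	cmd.Process.Signal(syscall.SIGINT) // polite shutdown request
	select {
	case <-done:
		// the process exited on its own within the grace period
	case <-time.After(5 * time.Second):
		cmd.Process.Kill() // escalate to SIGKILL
		<-done
	}
}

func main() {
	cmd := exec.Command("sleep", "60") // stand-in for a managed child process
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	stopProcess(cmd)
	log.Println("child stopped")
}
```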
-
That’s exactly why I didn’t open an issue: there’s definitely an unhandled case somewhere, because it happens consistently whenever I trigger a crash with an experimental binary. For context, I’m running on Debian Netinstall (CLI only), on a server fully dedicated to llama.cpp and llama-swap: it does nothing else, just feeding my LLM addiction lol.
-
Hi @mostlygeek, at the moment I'm encountering the following crash on the latest main branch.
-
I’ve disabled my custom acceleration patch to make sure it wasn’t the cause: the robustness issue is definitely present on the current main branch. That said, the default behavior could probably be improved performance-wise with something along these lines:
-
It's cool :)
toolcall-sandbox-mcp-kubernetes.mp4
-
FYI, I've recently started using llama-swap and I was looking for a way to protect /ui access. Basic authentication with a simple configuration via a plain-text variable in config.yml would be interesting. Thanks!


Uh oh!
There was an error while loading. Please reload this page.
-
I saw this comment: ggml-org/llama.cpp#16255 (comment).
Plus, there have been comments here and there about allowing remote access to self-hosted chat UIs and APIs.
I’m starting this discussion to get some feedback from the community on use cases they’re solving for and what llama-swap changes could make that easier.
cc @ServeurpersoCom