Support more remote deployment scenarios #326
Replies: 11 comments · 13 replies
-
Yes, I think we’ll need to think about how to separate everything that is OpenAI-compatible from /props, which is very specific to llama.cpp but still quite interesting. We could also keep in mind the possibility of having a model selector in the built-in llama.cpp UI, thanks to llama-swap’s strict OpenAI-compatible behavior. And maybe consider multimodality through /v1/models (as a simple cache on llama-swap?) instead of hitting /props directly. Lots of things to explore for the future!

I’m sharing some of my live experiments here: https://www.serveurperso.com/ia/ (three web interfaces connected to llama-swap: legacy + svelte + llama.ui). It’s just my dedicated Debian Netinst LLM server, running 24/7 only for live-testing the master branch plus patches. With the community’s steady stream of fixes and improvements, we’re getting pretty close to an all-in-one “LM Studio for the web”, and even a UI for editing config.yaml in llama-swap could be imagined down the road (though that’s another topic).
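As a toy illustration of the “/v1/models as a simple cache” idea, here is a minimal Go sketch (not llama-swap code; the upstream address and the 30-second TTL are assumptions): the proxy fetches the model list from the backend once, then serves it from memory until the entry expires.

```go
// Hypothetical sketch of caching /v1/models in a small proxy, so UIs can
// query model metadata repeatedly without hitting the backend each time.
// Upstream address and TTL are illustrative assumptions.
package main

import (
	"io"
	"net/http"
	"sync"
	"time"
)

type modelsCache struct {
	mu       sync.Mutex
	body     []byte
	fetched  time.Time
	ttl      time.Duration
	upstream string
}

func (c *modelsCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.body == nil || time.Since(c.fetched) > c.ttl {
		resp, err := http.Get(c.upstream + "/v1/models")
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		b, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		c.body, c.fetched = b, time.Now() // refresh the cached payload
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(c.body)
}

func main() {
	http.Handle("/v1/models", &modelsCache{
		ttl:      30 * time.Second,
		upstream: "http://127.0.0.1:8081", // assumed backend address
	})
	http.ListenAndServe(":8080", nil)
}
```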
-
Thanks to llama.cpp + llama-swap it’s already easy to:

- Share fresh models with friends (some of mine are paranoid and prefer using my server, knowing I run it myself and the UIs don’t log anything; just browser IndexedDB, with import/export possible)
- Quickly publish fresh GGUF builds right after a PR lands in llama.cpp
- Run a local LLM server for a dev team inside a company, for sensitive projects, and plug it into Continue / Cline / VS Code
- Access your LLM server from anywhere, which gets even more interesting when a proxy tool call enables home-automation actions or web-scraping tasks that ChatGPT and others would normally refuse to do

The only thing missing is a small baseUrl option in llama-swap to make it easier to share a server and observe its logs remotely. Of course it’s trivial to secure this with HTTPS authentication (just like llama.cpp already does with an API key). Even locally I put it behind a first proxy, but with baseUrl support I could pass through a second server and share llama-swap (I could easily cheat with URL rewriting, but it feels less clean!).
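As a rough sketch of the baseUrl idea (the prefix, ports and target address below are assumptions, not existing llama-swap options), an outer Go proxy can expose llama-swap under a path prefix by stripping it before forwarding. A native baseUrl setting would go further by letting the UI itself generate links with that prefix.

```go
// Minimal sketch: expose a llama-swap instance under /llama-swap/ on an
// outer reverse proxy. TLS termination and authentication are assumed to be
// handled by the front server; addresses and the prefix are made up.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	target, err := url.Parse("http://127.0.0.1:8080") // where llama-swap listens
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	// Strip the prefix before forwarding so llama-swap still sees clean paths.
	// This is the "URL rewriting" workaround; a built-in baseUrl option would
	// make the UI itself prefix-aware.
	http.Handle("/llama-swap/", http.StripPrefix("/llama-swap", proxy))
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```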
-
I plan to use llama-swap to serve a small team working with confidential data. It would be great to make it easier to protect the administration API and web interface from unauthorized access. I need developers to be able to use the regular API, but I don’t want them to see logs or access other admin features.
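One way to get this today, sketched below under the assumption that a front proxy sits in front of llama-swap (the /ui and /logs paths, port and credentials are only illustrative): require HTTP basic auth on the admin surface and pass everything else, including the OpenAI-compatible /v1 API, straight through. A native option in llama-swap’s config would of course be cleaner.

```go
// Hedged sketch of an external gate: basic auth on admin paths, open proxy
// for the regular API. Paths, port and credentials are assumptions.
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func requireAuth(next http.Handler, user, pass string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="llama-swap admin"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	target, err := url.Parse("http://127.0.0.1:8080") // llama-swap
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	admin := requireAuth(proxy, "admin", "change-me")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Gate the admin surface; let developers use the plain API untouched.
		if strings.HasPrefix(r.URL.Path, "/ui") || strings.HasPrefix(r.URL.Path, "/logs") {
			admin.ServeHTTP(w, r)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```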
-
When using only the single endpoint that powers the token counters in the llama-swap UI, and running only one model on the GPU at a time, I haven’t noticed any issue with conversations being cut off, even with two users talking to two different models. It only takes the time needed to load the weights; everything else works smoothly.
-
Hi mostlygeek! Speaking of remote deployment: I had to SSH in and restart manually, because I noticed that if llama-server crashes, there are cases where llama-swap gets stuck waiting indefinitely, and I have to kill -9 everything to recover!
-
@ServeurpersoCom Hi, in the UI on the models page, the Unload / Unload All buttons should stop/kill the process. These buttons trigger llama-swap to send a SIGINT; if that times out (5 seconds), llama-swap sends a SIGKILL. If that's not working, is there any debug output? Windows and container images are a bit different. Windows doesn't have signals, so by default llama-swap will make
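For anyone curious about the pattern being described, here is a small Unix-only Go sketch (not llama-swap’s actual source) of the same stop sequence: send SIGINT to the child process, wait up to five seconds, then fall back to SIGKILL if it has not exited.

```go
// Illustrative stop sequence: SIGINT first, SIGKILL after a 5 s timeout.
// Unix-only; Windows has no equivalent signal handling.
package main

import (
	"log"
	"os/exec"
	"syscall"
	"time"
)

func stopProcess(cmd *exec.Cmd) {
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	cmd.Process.Signal(syscall.SIGINT) // polite shutdown request
	select {
	case <-done:
		// the process exited on its own within the grace period
	case <-time.After(5 * time.Second):
		cmd.Process.Kill() // escalate to SIGKILL
		<-done
	}
}

func main() {
	cmd := exec.Command("sleep", "60") // stand-in for a managed child process
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	stopProcess(cmd)
	log.Println("child stopped")
}
```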
-
That’s exactly why I didn’t open an issue: there’s definitely an unhandled case somewhere, because it happens consistently whenever I trigger a crash with an experimental binary. For context, I’m running on Debian Netinstall (CLI only), on a server fully dedicated to llama.cpp and llama-swap: it does nothing else, just feeding my LLM addiction lol.
-
Hi @mostlygeek, at the moment I'm encountering the following crash on the latest main branch.
-
I’ve disabled my custom acceleration patch to make sure it wasn’t the cause: the robustness issue is definitely present on the current main branch. That said, the default behavior could probably be improved performance-wise with something along these lines:
-
It's cool :)
toolcall-sandbox-mcp-kubernetes.mp4
-
FYI, I've recently started using llama-swap and I was looking for a way to protect /ui access. Basic authentication with a simple configuration via a plain-text variable in config.yml would be interesting. Thanks!


Uh oh!
There was an error while loading. Please reload this page.
-
I saw this comment: ggml-org/llama.cpp#16255 (comment).
Plus, there have been comments here and there about allowing remote access to self-hosted chat UIs and APIs.
I’m starting this discussion to get some feedback from the community on use cases they’re solving for and what llama-swap changes could make that easier.
cc @ServeurpersoCom