Replies: 7 comments 2 replies
What is the proper way to run multiple commands in succession? I've tried several config.yaml variants. One of them appears to concatenate the two commands and throws `curl: option --port: is unknown`. With another approach the previous model gets unloaded and the new one gets loaded, but the third iteration no longer works (loading yet another model does not unload the one that is already loaded).
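A minimal sketch of one way to do it, with placeholder paths, ports and model names (none of them from the thread): wrapping both commands in `/bin/sh -c` lets the shell split them, so llama-server's `--port` flag never bleeds into curl's arguments. If llama-swap's cmd parsing doesn't keep the quoted string intact, the same two lines can go into a small wrapper script instead (see the sketch at the end of the thread).

```yaml
models:
  "some-gguf-model":          # hypothetical model entry
    # The shell runs the two commands as separate processes; the Tabby
    # unload call here omits any auth header (see the curl example below).
    cmd: >
      /bin/sh -c "curl -s -X POST http://127.0.0.1:5000/v1/model/unload;
      exec llama-server -m /models/some_model.gguf --port 9503"
    proxy: "http://127.0.0.1:9503"
```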
Is llama-swap not able to unload the Tabby model for you?

No, llama-swap doesn't unload TabbyAPI: Tabby needs a curl POST unload call, as in my example above.
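For reference, a minimal form of that unload call, assuming TabbyAPI's default port and that the admin key travels in an `x-admin-key` header (both are assumptions, not confirmed in the thread):

```sh
# Hypothetical host/port; the admin key header name is an assumption.
curl -s -X POST http://127.0.0.1:5000/v1/model/unload \
  -H "x-admin-key: $TABBY_ADMIN_KEY"
```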
I'm confused. Are you starting TabbyAPI with llama-swap? It sounds like you're running it outside of llama-swap and trying to send a signal to stop it.

We're talking in two places; I just saw #58 (comment). Can you share your whole llama-swap config? It may shed some light on what's going on and why Tabby can't be stopped by llama-swap.

Sure, here's the current config.yaml:

Yes, I saw those discussions, and it seems that cmdStop-ing a container might work. I will probably have to consider this "nuclear option" if there's no hope of sending a clean unload API call. Is there? :-)
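For comparison, a sketch of what that "nuclear option" could look like, assuming TabbyAPI runs in a container; the image, container name and ports are placeholders:

```yaml
models:
  "tabby-exl2":                 # hypothetical model entry
    # Start TabbyAPI in a named container so it can be stopped by name.
    cmd: docker run --rm --name tabbyapi -p 5000:5000 my-tabbyapi-image
    proxy: "http://127.0.0.1:5000"
    # cmdStop runs when llama-swap swaps this model out; stopping the
    # container is the blunt alternative to a clean /v1/model/unload call.
    cmdStop: docker stop tabbyapi
```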
I use TabbyAPI for exl2 models and llama-server for GGUF. Currently I need to manually unload a Tabby model before I can load a llama-server one.

Is it possible to configure llama-swap to send a `POST /v1/model/unload` request before it sends the request that starts a new `llama-server -m some_model.gguf`?
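One way to get that behaviour, sketched here with placeholder host, port, paths and key: point the GGUF model's `cmd` at a small wrapper script that first fires the unload request at TabbyAPI and then execs llama-server.

```sh
#!/bin/sh
# start-gguf.sh -- hypothetical wrapper used as the model's cmd in
# llama-swap's config.yaml. Host, port, model path and key are placeholders.

# Ask TabbyAPI to drop whatever it has loaded; ignore failures so the
# GGUF model still starts when Tabby isn't running.
curl -s -X POST http://127.0.0.1:5000/v1/model/unload \
  -H "x-admin-key: $TABBY_ADMIN_KEY" || true

# Replace the shell with llama-server so llama-swap supervises it directly.
exec llama-server -m /models/some_model.gguf --port 9503
```

Using `exec` keeps llama-swap supervising llama-server itself rather than an intermediate shell, which makes stopping the model on the next swap straightforward.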