Using llama-swap in kubernetes (and probably docker) to manage backends in neighboring containers #305

Jordanb716 · 2025-09-18T06:27:43Z


My use case was to run llama-swap in Kubernetes with vllm-gfx906 as a sidecar, and use llama-swap to orchestrate the model swapping. After struggling with it for a while, I got a system working that uses a shared volume, some named pipes, and a script on each side of the connection so llama-swap can remote-control the other container.

Here are the scripts:

run.sh

#!/bin/bash

# Run command on listening container
# Modified from https://gitlab.com/-/snippets/2500587

MYDIR="$(dirname "$(readlink -f "$0")")"
cd "$MYDIR" || exit 1  # Quote the path and stop if the directory is missing

shutdown() {
  echo "sending __shutdown__ command..."
  echo "__shutdown__" > stdin
  exit 0
}

trap "shutdown" EXIT

(
  echo "$@"
  if read -t 0 ; then
    cat - 2> /dev/null
  fi
) | cat > stdin

cat stderr 1>&2 &
cat stdout

wait

listen.sh

#!/bin/bash

# Set up a pipe and listen for commands on it.
# Modified from https://gitlab.com/-/snippets/2500587

MYDIR="$(dirname "$(readlink -f "$0")")"
cd "$MYDIR" || exit 1  # Quote the path and stop if the directory is missing

echo "Creating pipes"
for pipe in stdin stdout stderr; do
  [ -p "$pipe" ] || mkfifo "$pipe"  # Reuse pipes left over from a previous run
done

echo "Listening..."

while true; do
  read -r line < stdin
  if [ -z "$line" ]; then # Skip blank lines
    continue
  fi

  echo "Received \"$line\""

  if [[ "$line" == "__shutdown__" ]]; then
    echo "Terminating running commands..."
    trap '' SIGTERM
    kill -s SIGTERM 0
    echo "Done. Listening for new commands..."

  else
    echo "$line" | bash -s 2> stderr > stdout &

  fi

done
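
To see the mechanism in isolation, here is a stripped-down sketch of the handshake the two scripts perform over named pipes. It is not the scripts above: it runs both sides in one shell, handles a single command, and skips stderr and the shutdown message. The temporary directory and the variable names are made up for the demo.

```shell
#!/bin/bash
# Minimal named-pipe round trip (illustrative only, both sides in one shell).
workdir=$(mktemp -d)
mkfifo "$workdir/stdin" "$workdir/stdout"

# "Backend" side: block until a command line arrives, then run it and
# send its output back through the second pipe.
(
  read -r line < "$workdir/stdin"
  bash -c "$line" > "$workdir/stdout"
) &

# "llama-swap" side: send one command, then read whatever comes back.
echo 'echo hello from the other side' > "$workdir/stdin"
reply=$(cat "$workdir/stdout")
echo "$reply"

wait
rm -rf "$workdir"
```

Opening a FIFO blocks until the other end is opened too, which is why the listener can sit idle on `read` until a command shows up.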

The general setup is to have the llama-swap container and the container for whatever backend you want to run side by side. I'm currently using vllm-gfx906, but it should work for basically anything. Both containers mount a shared volume that contains the two scripts and a set of named pipes (which listen.sh creates on startup). On the non-llama-swap container, you change the startup command to just run the listen script.

Example:
command: ["/tunnel/listen.sh"]
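
For a fuller picture, a Pod for this layout might look roughly like the sketch below, written out by a small shell script. Everything except the two mount paths and the `command` line is an assumption: the pod name, the image references, and the `tunnel-scripts` PersistentVolumeClaim (assumed to already hold run.sh and listen.sh) are placeholders, not values from this thread. Containers in one pod share a network namespace, so llama-swap can reach the backend on localhost.

```shell
#!/bin/bash
# Write a hypothetical Pod manifest; names and images are placeholders.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: llama-swap-sidecar          # hypothetical name
spec:
  volumes:
    - name: tunnel
      persistentVolumeClaim:
        claimName: tunnel-scripts   # hypothetical PVC holding both scripts
  containers:
    - name: llama-swap
      image: example.invalid/llama-swap:latest    # placeholder image
      volumeMounts:
        - name: tunnel
          mountPath: /vllm          # lets the config call /vllm/run.sh
    - name: backend
      image: example.invalid/vllm-gfx906:latest   # placeholder image
      command: ["/tunnel/listen.sh"]
      volumeMounts:
        - name: tunnel
          mountPath: /tunnel        # same volume, different mount path
EOF
echo "wrote $manifest"
# kubectl apply -f "$manifest"     # run this on a machine with cluster access
```

Mounting the same volume at `/vllm` on one side and `/tunnel` on the other is what allows the two example paths in this post to differ.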

llama-swap runs normally, and in the config, you just write whatever command you would want to run to start the LLM server normally, but pass it to the run.sh script as arguments, as shown in the example below.

Example:

models:
  "Test":
    cmd: /vllm/run.sh vllm serve --port ${PORT} --served-model-name Test

If anyone has some wisdom to share on how to get this working more cleanly, I'd love to hear about it, as I would if anyone has suggestions for improvements in general. (Or to tell me there's a much simpler/better way to do this whole thing, but I might be a bit grumpy about that one after all this work...)

I originally posted this trying to get help with a termination signal issue that was preventing me from getting the system working cleanly. I managed to work around it, but I'm still a bit confused about why the original wasn't working, so I'll leave the problem description here.

Now to the problem I wanted to address. According to the llama-swap documentation, the default when a model is unloaded is to send SIGTERM to the running command. Either this is not actually happening, my script is wrong, or something funky is going on that I can't figure out. The original run.sh script set up a trap for SIGTERM and SIGINT, and on receiving either one it passed a special shutdown command along the pipe to the actual backend before terminating.

The problem is that when llama-swap tries to switch models, the script never actually gets that signal according to the logs. Nothing happens for five seconds, and then the script is forcibly killed by llama-swap, leaving the actual backend still running. If I run the exact same command that I put in the llama-swap config, in the same container, and then kill the running script from the shell, it works perfectly. Manually sending either SIGTERM or SIGINT by putting kill -SIGwhatever ${PID} as the cmdStop didn't seem to do anything, nor did a myriad of other commands to kill that PID, all PIDs, or all child PIDs. Literally the only command that I could get to work is the one given in the example above. It kicks off some errors in the log, but at least it works.
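
One possible explanation for the five-second stall, offered as a hypothesis rather than something confirmed in this thread: the bash manual states that when bash is waiting on a foreground command and receives a signal that has a trap set, the trap does not run until that command finishes. A script that traps SIGTERM and then blocks in a foreground `cat` on a pipe would therefore look as if it never got the signal. The sketch below uses `sleep` as a stand-in for the blocking command and times how long a child takes to react to SIGTERM in the foreground case versus when it is blocked in the `wait` builtin. The `measure` helper and the variable names are illustrative only.

```shell
#!/bin/bash
# Compare how quickly a SIGTERM trap fires in two situations.

measure() {
  # $1 = body of a child script; prints whole seconds until the child exits.
  local start=$SECONDS
  bash -c "$1" > /dev/null &
  local pid=$!
  sleep 1                 # give the child time to install its trap
  kill -TERM "$pid"       # signal only the child shell, not its own children
  wait "$pid"
  echo $(( SECONDS - start ))
}

# Trap is set, but bash is busy waiting on a foreground command.
foreground='trap "exit 0" TERM; sleep 4; true'
# Same trap, but the command runs in the background and bash sits in `wait`.
background='trap "exit 0" TERM; sleep 4 & wait $!; true'

deferred=$(measure "$foreground")
prompt=$(measure "$background")

echo "foreground command: child exited after about ${deferred}s"
echo "wait builtin:       child exited after about ${prompt}s"
```

If this is what happened here, running the blocking command in the background and calling `wait` would be one way to let a signal trap fire promptly.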

EDIT: I got this working pretty smoothly, so I cleaned up some of the "help requesty" bits and replaced the scripts with newer, cleaner versions.

mostlygeek · 2025-09-18T14:30:08Z

Maintainer

Have you seen https://github.com/mostlygeek/llama-swap/wiki/Docker-in-Docker-with-llama%E2%80%90swap-guide?

I have little experience with k8s so can’t really help much there.

2 replies

Jordanb716 Sep 20, 2025
Author

> Have you seen https://github.com/mostlygeek/llama-swap/wiki/Docker-in-Docker-with-llama%E2%80%90swap-guide?

I hadn't, thank you for linking it! I think my solution is a bit better, partly because it's simpler architecturally, and partly because I think that solution would let a malicious script run in the llama-swap container trivially get root access to the entire host system, whereas the worst mine can do is execute commands in another container. Plus that one wouldn't really work in k8s, whereas mine should work just fine in Docker as well as k8s. It's nice to see someone got something like a DinD setup working though; I had dismissed that route myself.

> I have little experience with k8s so can’t really help much there.

It should work the same in Docker as long as the two containers share a network. (I haven't verified this though.)

Thankfully I was able to fix the script and have it work without a special cmdStop by trapping EXIT instead of a specific signal like SIGTERM, so I'm pretty happy with the script now. I'll update the main post with the updated version after I trim things down a bit more now that I have something I'm happy with.

Are you sure llama-swap actually sends a SIGTERM? I tried echoing $? in the trap, which supposedly should contain the signal that triggered the trap, and it printed out "0", so I'm not sure what to make of that. The only place I could find in llama-swap that sets the default shutdown command is here and it only sets it for Windows, unless I'm missing something? I only see it being set for Linux in what looks like test code.
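
A side note on the `$?` check: in bash, `$?` holds the exit status of the most recent command, so inside a trap it does not identify which signal fired. One way to find out is to register a separate trap per signal and pass the name along, as in this sketch. The `on_signal` function and the `script` and `trap_report` variables are made-up names for the demo, and the script signals itself to trigger the trap.

```shell
#!/bin/bash
# Identify the triggering signal by passing its name to the handler.
script='
on_signal() {
  local status=$?   # exit status of the last command, not a signal number
  echo "caught SIG$1 with \$?=$status"
}
for sig in TERM INT HUP; do
  trap "on_signal $sig" "$sig"
done
kill -s TERM $$     # send SIGTERM to this shell
true                # the pending trap runs before the next command starts
'
trap_report=$(bash -c "$script")
echo "$trap_report"
```

The same per-signal handlers could log to stderr in run.sh to check what llama-swap actually delivers.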

mostlygeek Sep 20, 2025
Maintainer

llama-swap uses Go’s CommandContext now. It should send a SIGTERM or a SIGINT. After a timeout it should send a SIGKILL to force-terminate the process. I used to do this directly, but using the Go stdlib gets a code path that’s more developed and reliable.

Windows doesn’t have any signals so it gets its own special cmdStop.
