Using llama-swap in kubernetes (and probably docker) to manage backends in neighboring containers #305
Jordanb716 started this conversation in Show and tell
Replies: 1 comment 2 replies
Have you seen https://github.com/mostlygeek/llama-swap/wiki/Docker-in-Docker-with-llama%E2%80%90swap-guide? I have little experience with k8s so can’t really help much there.
My use case was to run llama-swap in Kubernetes, run vllm-gfx906 as a sidecar, and use llama-swap to orchestrate the model swapping. After struggling with it for a while, I got a system working that uses a shared volume, some named pipes, and a script on each side of the connection so llama-swap can remote-control the other container.
Here's the scripts:
run.sh
listen.sh
The general setup is to have the llama-swap container and the container for whatever backend you want to run side by side. I'm currently using vllm-gfx906, but it shooould work for basically anything. Both containers mount a shared volume that contains the two scripts and a set of named pipes (which the scripts create on startup). On the non-llama-swap container, you change the startup command to just run the listen script.
Example:
command: ["/tunnel/listen.sh"]
llama-swap runs normally, and in the config, you just write whatever command you would want to run to start the LLM server normally, but pass it to the run.sh script as arguments, as shown in the example below.
Example:
If anyone has some wisdom to share on how to get this working more cleanly, I'd love to hear about it, as I would if anyone has suggestions for improvements in general. (Or to tell me there's a much simpler/better way to do this whole thing, but I might be a bit grumpy about that one after all this work...)
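For readers who want a feel for the pipe handoff described above, here is a minimal sketch. It is not the contents of the real run.sh or listen.sh: the function names, the `__stop__` sentinel, and the single-pipe layout are all assumptions made for illustration. The listening side keeps the named pipe open read-write so that lines written in quick succession are not lost, starts each received command in the background, and stops the current one when it sees the sentinel.

```shell
#!/bin/sh
# Hypothetical sketch of a pipe-based tunnel between two containers.
# Names and the "__stop__" sentinel are illustrative assumptions.

# Backend side: read command lines from the named pipe and run them.
listen_loop() {
    pipe=$1
    child=""
    [ -p "$pipe" ] || mkfifo "$pipe"
    # Open read-write so the FIFO stays open between writers and
    # read never sees end-of-file.
    exec 3<>"$pipe"
    while IFS= read -r line <&3; do
        case $line in
            __stop__)
                # Stop the current backend, if any, then keep listening.
                if [ -n "$child" ]; then
                    kill "$child" 2>/dev/null
                    wait "$child" 2>/dev/null
                    child=""
                fi
                ;;
            *)
                # exec replaces the helper shell, so $! is the backend's
                # own PID. This only suits a single simple command.
                sh -c "exec $line" 3<&- &
                child=$!
                ;;
        esac
    done
}

# llama-swap side: send the arguments as one command line.
send_command() {
    pipe=$1
    shift
    printf '%s\n' "$*" > "$pipe"
}
```

With a shared volume mounted at /tunnel, the backend container would run something like `listen_loop /tunnel/cmd.pipe`, and the llama-swap side would call `send_command /tunnel/cmd.pipe <server command and arguments>`. The pipe file name here is made up; only the /tunnel mount point comes from the example above.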
I originally posted this trying to get help with a termination signal issue that was preventing me from getting the system working cleanly. I managed to work around it, but am still a bit confused why the original wasn't working so I'll leave the problem description here.
Now to the problem I wanted to address. According to the llama-swap documentation, the default when a model is unloaded is for SIGTERM to be sent to the command being run. Either this is not actually happening, my script is wrong, or something funky is going on that I can't figure out. The run.sh script sets up a trap for SIGTERM and SIGINT, and on getting either of those, it passes a special shutdown command along the pipe to the actual backend before terminating.
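The trap-and-forward idea can be sketched as below. This is a hedged illustration, not the actual run.sh: the function name `run_and_forward` and the `__stop__` sentinel are hypothetical. One detail the sketch relies on is standard shell behavior: a trap only runs once the current foreground command returns, whereas the `wait` builtin is interrupted as soon as a trapped signal arrives. The wrapper therefore sleeps in the background and waits on it, rather than blocking in a foreground command.

```shell
#!/bin/sh
# Hypothetical wrapper: forward a command over the pipe, then idle until
# SIGTERM or SIGINT arrives and forward a shutdown sentinel before exiting.
run_and_forward() {
    pipe=$1
    shift
    # Install the trap before sending anything, so a signal that arrives
    # right after startup is still handled.
    trap 'printf "%s\n" "__stop__" > "$pipe"; exit 0' TERM INT
    printf '%s\n' "$*" > "$pipe"
    # Wait on a background sleep: "wait" returns as soon as a trapped
    # signal is delivered, so the trap fires promptly.
    while true; do
        sleep 1 &
        wait $!
    done
}
```

Called as `run_and_forward /tunnel/cmd.pipe <server command and arguments>`, this writes the command line to the pipe and then writes `__stop__` when it is terminated. The pipe needs a reader on the other end, otherwise the writes block.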
The problem is, when llama-swap tries to switch models, the script never actually gets that signal according to the logs, so nothing happens for five seconds, and then the script is forcibly killed by llama-swap, leaving the actual backend still running. If I run the exact same command that I put in the llama-swap config, in the same container, and then kill the running script in the shell, it works perfectly. Manually sending either SIGTERM or SIGINT by putting kill -SIGwhatever ${PID} as the cmdStop didn't seem to do anything, nor did a myriad of other commands to kill either that PID, all PIDs, or all child PIDs. Literally the only command that I could get to work is the one given in the example above. It kicks off some errors in the log, but at least it works.
EDIT: I got this working pretty smoothly, so I cleaned up some of the "help requesty" bits, and replaced the scripts with new, better, cleaner versions.