llama.cpp now supports a Router Mode (https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#using-multiple-models) that manages multiple model processes via a single entry point/llama-server process.
Launching a llama-server process in router mode uses a single process to manage requests to multiple models that can be loaded simultaneously. Lemonade currently allows users to load multiple models through setting a max_loaded_models value > 1. Doing this spins up multiple llama-server instances (1 per model) which introduces some overhead and a completely different port to reach each model at.
Router mode also lets you point to a models preset through the --models-preset flag and a .ini configuration file. This becomes especially useful when you want to have multiple models loaded (each having an agent or subagent point back at them).
It even enables the case for having subagents living on discrete GPUs like in the example .ini configuration file below.
; models.ini
version = 1
[*]
; Global defaults for all agents
n-gpu-layers = all
ctx-size = 16384
parallel = 1
load-on-startup = true
[build-agent]
model = /path/to/your/model.gguf
device = Vulkan0
[plan-agent]
model = /path/to/your/model.gguf
device = Vulkan1
[reviewer-agent]
model = /path/to/your/model.gguf
device = Vulkan2
All of these "agents" would be accessible at the same endpoint allowing for easy configuration in OpenCode for example.
{
"$schema": "https://opencode.ai/config.json",
"agent": {
"build": {
"mode": "primary",
"model": "Lemonade/build-agent",
"tools": { "write": true, "edit": true, "bash": true }
},
"plan": {
"mode": "primary",
"model": "Lemonade/plan-agent",
"tools": { "write": false, "edit": false, "bash": false }
},
"code-reviewer": {
"mode": "subagent",
"model": "Lemonade/reviewer-agent",
"tools": { "write": false, "edit": false }
}
}
}
In order to replicate this setup without router mode and with Lemonade's current architecture, we would have to spin up multiple llama-server processes each with their own args (specifying device to use as well as pointing to the model themselves).
Additionally, on the opencode side, multiple Lemonade providers would have to be configured (one for each subagent since they live on different ports), this would lead to a lot of repetitive config which is simply not a clean solution.
Adding support for router mode could be as simple as adding a --router-mode flag to lemond.
llama.cpp now supports a Router Mode (https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#using-multiple-models) that manages multiple model processes via a single entry point/
llama-serverprocess.Launching a
llama-serverprocess in router mode uses a single process to manage requests to multiple models that can be loaded simultaneously. Lemonade currently allows users to load multiple models through setting amax_loaded_modelsvalue > 1. Doing this spins up multiplellama-serverinstances (1 per model) which introduces some overhead and a completely different port to reach each model at.Router mode also lets you point to a models preset through the
--models-presetflag and a.iniconfiguration file. This becomes especially useful when you want to have multiple models loaded (each having an agent or subagent point back at them).It even enables the case for having subagents living on discrete GPUs like in the example
.iniconfiguration file below.All of these "agents" would be accessible at the same endpoint allowing for easy configuration in OpenCode for example.
{ "$schema": "https://opencode.ai/config.json", "agent": { "build": { "mode": "primary", "model": "Lemonade/build-agent", "tools": { "write": true, "edit": true, "bash": true } }, "plan": { "mode": "primary", "model": "Lemonade/plan-agent", "tools": { "write": false, "edit": false, "bash": false } }, "code-reviewer": { "mode": "subagent", "model": "Lemonade/reviewer-agent", "tools": { "write": false, "edit": false } } } }In order to replicate this setup without router mode and with Lemonade's current architecture, we would have to spin up multiple llama-server processes each with their own args (specifying device to use as well as pointing to the model themselves).
Additionally, on the opencode side, multiple Lemonade providers would have to be configured (one for each subagent since they live on different ports), this would lead to a lot of repetitive config which is simply not a clean solution.
Adding support for router mode could be as simple as adding a
--router-modeflag tolemond.