Support llama.cpp Router Mode for isolated multi-GPU subagents

llama.cpp now supports a Router Mode (https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#using-multiple-models) that manages multiple model processes via a single entry point/`llama-server` process. 

Launching a `llama-server` process in router mode uses a single process to manage requests to multiple models that can be loaded simultaneously. Lemonade currently allows users to load multiple models through setting a `max_loaded_models` value > 1. Doing this spins up multiple `llama-server` instances (1 per model) which introduces some overhead and a completely different port to reach each model at.

Router mode also lets you point to a models preset through the `--models-preset` flag and a `.ini` configuration file. This becomes especially useful when you want to have multiple models loaded (each having an agent or subagent point back at them). 

It even enables the case for having subagents living on discrete GPUs like in the example `.ini` configuration file below.

```ini
; models.ini
version = 1

[*]
; Global defaults for all agents
n-gpu-layers = all
ctx-size = 16384
parallel = 1
load-on-startup = true

[build-agent]
model = /path/to/your/model.gguf
device = Vulkan0

[plan-agent]
model = /path/to/your/model.gguf
device = Vulkan1

[reviewer-agent]
model = /path/to/your/model.gguf
device = Vulkan2
```
All of these "agents" would be accessible at the same endpoint allowing for easy configuration in OpenCode for example.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "agent": {
    "build": {
      "mode": "primary",
      "model": "Lemonade/build-agent",
      "tools": { "write": true, "edit": true, "bash": true }
    },
    "plan": {
      "mode": "primary",
      "model": "Lemonade/plan-agent",
      "tools": { "write": false, "edit": false, "bash": false }
    },
    "code-reviewer": {
      "mode": "subagent",
      "model": "Lemonade/reviewer-agent",
      "tools": { "write": false, "edit": false }
    }
  }
}
```

In order to replicate this setup without router mode and with Lemonade's current architecture, we would have to spin up multiple llama-server processes each with their own args (specifying device to use as well as pointing to the model themselves). 

Additionally, on the opencode side, multiple Lemonade providers would have to be configured (one for each subagent since they live on different ports), this would lead to a lot of repetitive config which is simply not a clean solution.

Adding support for router mode could be as simple as adding a `--router-mode` flag to `lemond`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support llama.cpp Router Mode for isolated multi-GPU subagents #1547

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Support llama.cpp Router Mode for isolated multi-GPU subagents #1547

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions