
feat(agents): multi-model parallelism — small classifier + large writer via Lemonade multi-model loading #1000

@itomek

Description


Problem

GAIA agents today bind to a single LLM per agent instance (LemonadeClient is single-model). That's fine for a chat-only flow, but it forces every cognitive sub-task — fast classification, slow reasoning, structured-data emission, prose summarization — onto the same model. In practice we already see this break:

Concrete trigger surfaced in #995: the EmailTriageAgent's pre_scan_inbox returns a {"kind": "email_pre_scan", ...} envelope; the system prompt asks the LLM to echo it verbatim inside fenced code blocks so the frontend's EmailPreScanCard can mount. On Gemma-4-E4B (a chat-tuned model), the LLM reliably paraphrases the JSON into prose instead — its RLHF reward explicitly disprefers regurgitating structured input. The card never mounts. We landed a backend-injected fence as a deterministic workaround in src/gaia/ui/sse_handler.py (search for _pending_render_payloads and _capture_render_payload).
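For readers without PR #995 open, the workaround is shaped roughly like the sketch below. This is a paraphrase, not the actual sse_handler.py code: only _pending_render_payloads and _capture_render_payload are real names; the SSEHandler class shape, _RENDER_KINDS, and the emit callback are hypothetical.

```python
import json

_RENDER_KINDS = {"email_pre_scan"}  # hypothetical: envelope kinds the frontend renders as cards

class SSEHandler:  # hypothetical class shape; only the underscored names are real
    def __init__(self):
        self._pending_render_payloads: list[dict] = []

    def _capture_render_payload(self, tool_result: dict) -> None:
        # When a tool returns a structured envelope the frontend can render,
        # stash it instead of trusting the chat model to echo it verbatim.
        if tool_result.get("kind") in _RENDER_KINDS:
            self._pending_render_payloads.append(tool_result)

    def on_stream_end(self, emit) -> None:
        # Deterministically append the fenced JSON the chat-tuned model tends
        # to paraphrase away, so the EmailPreScanCard always mounts.
        for payload in self._pending_render_payloads:
            emit("```json\n" + json.dumps(payload) + "\n```")
        self._pending_render_payloads.clear()
```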

The hack works but is the wrong shape long-term: as we add more structured-render tools (calendar cards, doc-summary cards, code-diff cards), the SSE handler accumulates more bypass paths around the LLM. The right architecture is two models running in parallel — one tool-use-tuned for structured emission, one chat-tuned for prose — routed per cognitive role.

Why this matters now

This unblocks several things simultaneously, not just email-triage card rendering:

| Use case | Small fast model | Large smart model | Win |
| --- | --- | --- | --- |
| Tier-1 triage in triage_heuristics.py | 0.6B classifier | 35B reasoning for confident=False cases | Bounds the 56s pre_scan_inbox latency — currently the dominant Adrian/Ramin demo blocker |
| Ambient mailbox watcher (planned for the daily-driver vision) | 0.6B–1B classifier | 35B for triggers | Background polling at near-zero cost |
| Speculative decoding (Lemonade-native) | Draft model | Target model | 2–3× throughput on the same hardware |
| Reply draft + tone classifier | Tone model | Draft writer | Quality + speed |
| Structured-emission split (this issue's named driver) | Tool-use-tuned (Hermes-3, Qwen-Coder) | Gemma / Qwen chat | Removes the src/gaia/ui/sse_handler.py _pending_render_payloads hack |

Proposed scope (v1)

  • LemonadeMultiClient (or extend LemonadeClient with model_for_role: dict[Role, str]) under src/gaia/llm/ (see the routing sketch after this list).
  • Per-role routing inside the agent loop: structured/tool-use turns → small instruct-tuned model; conversational turns → chat-tuned model.
  • Memory budget management — two large models loaded at once on Strix Halo's 32 GB require explicit pinning plus a fallback (e.g. unload the chat model while the classifier is hot if RAM is tight).
  • Eval harness extension: per-role model selection in src/gaia/eval/ so we can A/B model pairings against today's single-model baseline.
  • Concrete first agent migration: EmailTriageAgent (this issue's named consumer) — see acceptance criteria.
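A minimal sketch of the routing shape. Assumptions are loud here: Role, the base client's complete(model=..., messages=...) signature, and the example model IDs are illustrative, not Lemonade's confirmed API; the open memory-pinning question would hook into complete().

```python
from enum import Enum

class Role(Enum):
    CLASSIFY = "classify"      # cheap tier-1 triage
    STRUCTURED = "structured"  # tool-use / fenced-JSON emission turns
    CHAT = "chat"              # prose summaries, conversation

class LemonadeMultiClient:
    """Route each cognitive role to its own model on one Lemonade server.

    Sketch only: assumes Lemonade honors a per-request model name, which is
    the client-side reading in the effort estimate below.
    """

    def __init__(self, base_client, model_for_role: dict[Role, str]):
        self._client = base_client              # the existing single-model client
        self._model_for_role = model_for_role

    def complete(self, role: Role, messages: list[dict], **kwargs) -> str:
        # Per-request routing: set model= for the chosen role. Pin/unload
        # logic for the Strix Halo 32 GB budget would live here.
        return self._client.complete(
            model=self._model_for_role[role], messages=messages, **kwargs
        )

# One pairing from the table above (model IDs illustrative):
# client = LemonadeMultiClient(lemonade, {
#     Role.CLASSIFY: "small-0.6b-instruct",
#     Role.STRUCTURED: "hermes-3-tool-use",
#     Role.CHAT: "gemma-chat",
# })
```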

Acceptance criteria

  • LemonadeMultiClient lands with role-routed dispatch and at least two models loaded concurrently on Strix Halo (32 GB).
  • EmailTriageAgent consumes the new client: tool-use-tuned model handles pre_scan_inbox → structured emission; chat-tuned model handles prose summary.
  • Remove the SSE backend-fence hack added in PR #995. Specifically, delete _pending_render_payloads / _capture_render_payload / _drain_render_payloads / the _RENDER_TOOL_TO_LANG map from src/gaia/ui/sse_handler.py; the SSE handler should pass the LLM's output through unmodified for structured payloads. The EmailTriageAgent system prompt is re-tightened to instruct the structured-emission model to emit the fence itself (with a worked example).
  • Eval harness reports card-mount success rate for pre_scan_inbox over 20 trials × 3 model pairings, with today's hack (deterministic, 100% by construction) included as the baseline.
  • Tier-1 triage wired in triage_heuristics.py: a 0.6B classifier handles heuristic confident=False cases so that only genuinely ambiguous emails fall through to the heavy 35B model (see the sketch after this list).
  • Docs: a short note in docs/sdk/sdks/llm.mdx on the role-routing pattern + memory implications.
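Building on the LemonadeMultiClient sketch above, the tier-1 wiring could look like this. heuristic_scan, HeuristicResult, and the label prompt are placeholders; triage_heuristics.py's real interface is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class HeuristicResult:  # placeholder for whatever triage_heuristics.py returns
    label: str
    confident: bool

def triage(subject: str, body: str, client: "LemonadeMultiClient", heuristic_scan) -> str:
    """Heuristics first, 0.6B classifier second, 35B only for the leftovers."""
    result: HeuristicResult = heuristic_scan(subject, body)
    if result.confident:
        return result.label

    # Cheap second opinion: the 0.6B model resolves most confident=False cases.
    label = client.complete(
        Role.CLASSIFY,
        [{"role": "user",
          "content": f"Answer with one word (urgent, normal, or ignore):\n{subject}"}],
    ).strip().lower()
    if label in {"urgent", "normal", "ignore"}:
        return label

    # Only genuinely ambiguous mail pays for the heavy reasoning model.
    return client.complete(
        Role.CHAT,
        [{"role": "user", "content": f"Triage this email:\n{body}"}],
    ).strip()
```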

Effort estimate

1–2 weeks including the eval-harness extension and the model A/B work to pick the right tool-use-tuned default. Lemonade-side work (confirming concurrent model load + per-request routing) needs to be checked against https://lemonade-server.ai/models.html — at first glance this is a client-side responsibility (set model= per request), but memory pinning is the open question.

Refs

  • PR #995 — adds the hack this issue is about removing.
  • #645 — Email Triage Agent umbrella; this is one of its dependencies.
