Problem
GAIA agents today bind to a single LLM per agent instance (`LemonadeClient` is single-model). That's fine for a chat-only flow, but it forces every cognitive sub-task — fast classification, slow reasoning, structured-data emission, prose summarization — onto the same model. In practice we already see this break:
Concrete trigger surfaced in #995: the `EmailTriageAgent`'s `pre_scan_inbox` returns a `{"kind": "email_pre_scan", ...}` envelope; the system prompt asks the LLM to echo it verbatim inside fenced code blocks so the frontend's `EmailPreScanCard` can mount. On Gemma-4-E4B (a chat-tuned model), the LLM reliably paraphrases the JSON into prose instead — its RLHF reward explicitly disprefers regurgitating structured input. The card never mounts. We landed a backend-injected fence as a deterministic workaround in src/gaia/ui/sse_handler.py (search for `_pending_render_payloads` and `_capture_render_payload`).
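For orientation, the workaround has roughly this shape (a paraphrase under assumed signatures, not the actual code from #995; see src/gaia/ui/sse_handler.py for the real thing):

```python
# Paraphrase of the #995 workaround's shape; signatures are assumptions.
import json

FENCE = "`" * 3  # triple backtick, built here to keep this snippet renderable

_pending_render_payloads: list[dict] = []

def _capture_render_payload(tool_result: dict) -> None:
    # Tool-dispatch path: stash payloads the frontend must receive verbatim,
    # because the chat-tuned model won't echo them.
    if tool_result.get("kind") == "email_pre_scan":
        _pending_render_payloads.append(tool_result)

def _drain_render_payloads(llm_text: str) -> str:
    # End of the LLM turn: append each captured payload as a fenced block
    # so EmailPreScanCard mounts no matter what the model wrote.
    fences = [f"{FENCE}json\n{json.dumps(p)}\n{FENCE}" for p in _pending_render_payloads]
    _pending_render_payloads.clear()
    return "\n\n".join([llm_text, *fences]) if fences else llm_text
```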
The hack works but is the wrong shape long-term: as we add more structured-render tools (calendar cards, doc-summary cards, code-diff cards), the SSE handler accumulates more bypass paths around the LLM. The right architecture is two models running in parallel — one tool-use-tuned for structured emission, one chat-tuned for prose — routed per cognitive role.
Why this matters now
This unblocks several things simultaneously, not just email-triage card rendering:
| Use case | Small fast model | Large smart model | Win |
| --- | --- | --- | --- |
| Tier-1 triage in `triage_heuristics.py` | 0.6B classifier | 35B reasoning for `confident=False` cases | Bounds the 56s `pre_scan_inbox` latency — currently the dominant Adrian/Ramin demo blocker |
| Ambient mailbox watcher (planned for the daily-driver vision) | 0.6B–1B classifier | 35B for triggers | Background polling at near-zero cost |
| Speculative decoding (Lemonade-native) | Draft model | Target model | 2–3× throughput on the same hardware |
| Reply draft + tone classifier | Tone model | Draft writer | Quality + speed |
| Structured-emission split (this issue's named driver) | Tool-use-tuned (Hermes-3, Qwen-Coder) | Gemma / Qwen chat | Removes the `_pending_render_payloads` hack in src/gaia/ui/sse_handler.py |
Proposed scope (v1)
- `LemonadeMultiClient` (or extend `LemonadeClient` with `model_for_role: dict[Role, str]`) under src/gaia/llm/ (see the sketch after this list).
- Per-role routing inside the agent loop: structured/tool-use turns → small instruct-tuned model; conversational turns → chat-tuned model.
- Memory budget management — two large models loaded at once on Strix Halo's 32 GB requires explicit pinning + fallback (e.g. unload the chat model when the classifier is hot if RAM is tight).
- Eval harness extension: per-role model selection in src/gaia/eval/ so we can A/B model pairings against today's single-model baseline.
- Concrete first agent migration: `EmailTriageAgent` (this issue's named consumer) — see acceptance criteria.
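A possible shape for the client, assuming Lemonade accepts a per-request model ID on its OpenAI-compatible endpoint (the `Role` values, method names, and model IDs below are illustrative, not a settled API):

```python
# Sketch only: Role values, method names, model IDs, and the eviction
# policy are assumptions for illustration, not the final API.
from enum import Enum

class Role(Enum):
    STRUCTURED = "structured"  # tool-use / fenced-JSON emission turns
    CHAT = "chat"              # conversational prose turns
    CLASSIFY = "classify"      # tier-1 triage

class LemonadeMultiClient:
    def __init__(self, base_client, model_for_role: dict[Role, str]):
        self._client = base_client          # today's single-model client
        self._model_for_role = model_for_role

    def complete(self, role: Role, messages: list[dict], **kwargs) -> str:
        # Per-role routing: same server, different model= per request.
        # Memory budget (32 GB Strix Halo) may force an unload/reload
        # fallback here when both large models can't stay resident.
        return self._client.complete(
            model=self._model_for_role[role], messages=messages, **kwargs
        )

# Example pairing for EmailTriageAgent (placeholder model IDs):
# agent_llm = LemonadeMultiClient(lemonade, {
#     Role.STRUCTURED: "Hermes-3-Llama-3.1-8B",   # tool-use-tuned
#     Role.CHAT:       "Qwen2.5-7B-Instruct",     # chat-tuned
#     Role.CLASSIFY:   "Qwen2.5-0.5B-Instruct",   # tier-1 triage
# })
```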
Acceptance criteria
- `LemonadeMultiClient` lands with role-routed dispatch and at-least-2-model concurrent loading on Strix Halo (32 GB).
- `EmailTriageAgent` consumes the new client: the tool-use-tuned model handles `pre_scan_inbox` → structured emission; the chat-tuned model handles the prose summary.
- Remove `_pending_render_payloads` / `_capture_render_payload` / `_drain_render_payloads` / the `_RENDER_TOOL_TO_LANG` map from src/gaia/ui/sse_handler.py; the SSE handler should pass the LLM's output through unmodified for structured payloads. The system prompt for `EmailTriageAgent` re-tightens to instruct the structured-emission model to emit the fence (with a worked example).
- Eval harness reports structured-emission success on `pre_scan_inbox` over 20 trials × 3 model pairings; the deterministic-100% baseline (today's hack) is included for comparison (sketch below).
- Tier-1 triage escalates only `confident=False` cases instead of falling through to the heavy 35B model.
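The eval bullet could reduce to a loop of roughly this shape (a sketch; `run_pre_scan_turn` and the harness hooks in src/gaia/eval/ are assumptions):

```python
# Sketch of the fence-emission A/B; run_pre_scan_turn is an assumed hook
# that executes one pre_scan_inbox turn and returns the raw LLM output.
import re

FENCE = "`" * 3  # triple backtick, built here to keep this snippet renderable
PATTERN = re.compile(
    re.escape(FENCE)
    + r"(?:json)?\s*\{.*\"kind\"\s*:\s*\"email_pre_scan\".*?\}\s*"
    + re.escape(FENCE),
    re.DOTALL,
)

def fence_echo_rate(run_pre_scan_turn, trials: int = 20) -> float:
    # Fraction of trials where the payload survived verbatim in a fence.
    hits = sum(bool(PATTERN.search(run_pre_scan_turn())) for _ in range(trials))
    return hits / trials

# 3 model pairings x 20 trials each, with today's backend-injected fence
# (deterministically 100%) as the comparison baseline:
# for name, pairing in pairings.items():
#     print(name, fence_echo_rate(lambda: run_agent_turn(pairing)))
```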
Effort estimate
1–2 weeks with eval harness, including the model A/B work to pick the right tool-use-tuned default. Lemonade-side work (ensuring concurrent model load + per-request routing) needs to be confirmed against https://lemonade-server.ai/models.html — at first glance this is a client-side responsibility (set `model=` per request), but memory pinning is the open question.
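If per-request routing works the way a stock OpenAI-compatible server handles it (to be confirmed against the Lemonade docs above; base URL, port, and model IDs here are placeholders), the client side is just:

```python
# Two models, one server, routed per request. Base URL, port, and model
# IDs are placeholders; confirm against the lemonade-server.ai docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="unused")

structured = client.chat.completions.create(
    model="Hermes-3-Llama-3.1-8B",   # tool-use-tuned
    messages=[{"role": "user", "content": "Emit the email_pre_scan fence."}],
)
prose = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",     # chat-tuned
    messages=[{"role": "user", "content": "Summarize the inbox."}],
)
# Open question: does the server keep both models resident across these
# calls, or reload on each switch? That's the memory-pinning unknown.
```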
Refs
- PR #995 — adds the hack this issue is about removing.
- #645 — Email Triage Agent umbrella; this is one of its dependencies.