
feat(agents): multi-model parallelism — small classifier + large writer via Lemonade multi-model loading #1000

@itomek

Description


Problem

GAIA agents today bind to a single LLM per agent instance (LemonadeClient is single-model). That's fine for a chat-only flow, but it forces every cognitive sub-task — fast classification, slow reasoning, structured-data emission, prose summarization — onto the same model. In practice we already see this break:

Concrete trigger surfaced in #995: the EmailTriageAgent's pre_scan_inbox returns a {"kind": "email_pre_scan", ...} envelope; the system prompt asks the LLM to echo it verbatim inside fenced code blocks so the frontend's EmailPreScanCard can mount. On Gemma-4-E4B (a chat-tuned model), the LLM reliably paraphrases the JSON into prose instead — its RLHF reward explicitly disprefers regurgitating structured input. The card never mounts. We landed a backend-injected fence as a deterministic workaround in src/gaia/ui/sse_handler.py (search for _pending_render_payloads and _capture_render_payload).
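For readers without PR #995 open, the workaround is shaped roughly like the sketch below. This is a paraphrase, not the actual sse_handler.py code: only _pending_render_payloads and _capture_render_payload are real names; the SSEHandler class shape, _RENDER_KINDS, and the emit callback are hypothetical.

```python
import json

_RENDER_KINDS = {"email_pre_scan"}  # hypothetical: envelope kinds the frontend renders as cards

class SSEHandler:  # hypothetical class shape; only the underscored names are real
    def __init__(self):
        self._pending_render_payloads: list[dict] = []

    def _capture_render_payload(self, tool_result: dict) -> None:
        # When a tool returns a structured envelope the frontend can render,
        # stash it instead of trusting the chat model to echo it verbatim.
        if tool_result.get("kind") in _RENDER_KINDS:
            self._pending_render_payloads.append(tool_result)

    def on_stream_end(self, emit) -> None:
        # Deterministically append the fenced JSON the chat-tuned model tends
        # to paraphrase away, so the EmailPreScanCard always mounts.
        for payload in self._pending_render_payloads:
            emit("```json\n" + json.dumps(payload) + "\n```")
        self._pending_render_payloads.clear()
```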

The hack works but is the wrong shape long-term: as we add more structured-render tools (calendar cards, doc-summary cards, code-diff cards), the SSE handler accumulates more bypass paths around the LLM. The right architecture is two models running in parallel — one tool-use-tuned for structured emission, one chat-tuned for prose — routed per cognitive role.

Why this matters now

This unblocks several things simultaneously, not just email-triage card rendering:

| Use case | Small fast model | Large smart model | Win |
| --- | --- | --- | --- |
| Tier-1 triage in triage_heuristics.py | 0.6B classifier | 35B reasoning for confident=False cases | Bounds the 56s pre_scan_inbox latency — currently the dominant Adrian/Ramin demo blocker |
| Ambient mailbox watcher (planned for the daily-driver vision) | 0.6B–1B classifier | 35B for triggers | Background polling at near-zero cost |
| Speculative decoding (Lemonade-native) | Draft model | Target model | 2–3× throughput on the same hardware |
| Reply draft + tone classifier | Tone model | Draft writer | Quality + speed |
| Structured-emission split (this issue's named driver) | Tool-use-tuned (Hermes-3, Qwen-Coder) | Gemma / Qwen chat | Removes the src/gaia/ui/sse_handler.py _pending_render_payloads hack |

Proposed scope (v1)

  • LemonadeMultiClient (or extend LemonadeClient with model_for_role: dict[Role, str]) under src/gaia/llm/ (see the routing sketch after this list).
  • Per-role routing inside the agent loop: structured/tool-use turns → small instruct-tuned model; conversational turns → chat-tuned model.
  • Memory budget management — two large models loaded at once on Strix Halo's 32 GB require explicit pinning plus a fallback (e.g. unload the chat model while the classifier is hot if RAM is tight).
  • Eval harness extension: per-role model selection in src/gaia/eval/ so we can A/B model pairings against today's single-model baseline.
  • Concrete first agent migration: EmailTriageAgent (this issue's named consumer) — see acceptance criteria.
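A minimal sketch of the routing shape. Assumptions are loud here: Role, the base client's complete(model=..., messages=...) signature, and the example model IDs are illustrative, not Lemonade's confirmed API; the open memory-pinning question would hook into complete().

```python
from enum import Enum

class Role(Enum):
    CLASSIFY = "classify"      # cheap tier-1 triage
    STRUCTURED = "structured"  # tool-use / fenced-JSON emission turns
    CHAT = "chat"              # prose summaries, conversation

class LemonadeMultiClient:
    """Route each cognitive role to its own model on one Lemonade server.

    Sketch only: assumes Lemonade honors a per-request model name, which is
    the client-side reading in the effort estimate below.
    """

    def __init__(self, base_client, model_for_role: dict[Role, str]):
        self._client = base_client              # the existing single-model client
        self._model_for_role = model_for_role

    def complete(self, role: Role, messages: list[dict], **kwargs) -> str:
        # Per-request routing: set model= for the chosen role. Pin/unload
        # logic for the Strix Halo 32 GB budget would live here.
        return self._client.complete(
            model=self._model_for_role[role], messages=messages, **kwargs
        )

# One pairing from the table above (model IDs illustrative):
# client = LemonadeMultiClient(lemonade, {
#     Role.CLASSIFY: "small-0.6b-instruct",
#     Role.STRUCTURED: "hermes-3-tool-use",
#     Role.CHAT: "gemma-chat",
# })
```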

Acceptance criteria

  • LemonadeMultiClient lands with role-routed dispatch and at least two models loaded concurrently on Strix Halo (32 GB).
  • EmailTriageAgent consumes the new client: tool-use-tuned model handles pre_scan_inbox → structured emission; chat-tuned model handles prose summary.
  • Remove the SSE backend-fence hack added in PR #995. Specifically, delete _pending_render_payloads / _capture_render_payload / _drain_render_payloads / the _RENDER_TOOL_TO_LANG map from src/gaia/ui/sse_handler.py; the SSE handler should pass the LLM's output through unmodified for structured payloads. The EmailTriageAgent system prompt is re-tightened to instruct the structured-emission model to emit the fence itself (with a worked example).
  • Eval harness reports card-mount success rate for pre_scan_inbox over 20 trials × 3 model pairings, with today's hack (deterministic, 100% by construction) included as the baseline.
  • Tier-1 triage wired in triage_heuristics.py: a 0.6B classifier handles heuristic confident=False cases so that only genuinely ambiguous emails fall through to the heavy 35B model (see the sketch after this list).
  • Docs: a short note in docs/sdk/sdks/llm.mdx on the role-routing pattern + memory implications.
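Building on the LemonadeMultiClient sketch above, the tier-1 wiring could look like this. heuristic_scan, HeuristicResult, and the label prompt are placeholders; triage_heuristics.py's real interface is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class HeuristicResult:  # placeholder for whatever triage_heuristics.py returns
    label: str
    confident: bool

def triage(subject: str, body: str, client: "LemonadeMultiClient", heuristic_scan) -> str:
    """Heuristics first, 0.6B classifier second, 35B only for the leftovers."""
    result: HeuristicResult = heuristic_scan(subject, body)
    if result.confident:
        return result.label

    # Cheap second opinion: the 0.6B model resolves most confident=False cases.
    label = client.complete(
        Role.CLASSIFY,
        [{"role": "user",
          "content": f"Answer with one word (urgent, normal, or ignore):\n{subject}"}],
    ).strip().lower()
    if label in {"urgent", "normal", "ignore"}:
        return label

    # Only genuinely ambiguous mail pays for the heavy reasoning model.
    return client.complete(
        Role.CHAT,
        [{"role": "user", "content": f"Triage this email:\n{body}"}],
    ).strip()
```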

Effort estimate

1–2 weeks including the eval-harness extension and the model A/B work to pick the right tool-use-tuned default. Lemonade-side work (confirming concurrent model load + per-request routing) needs to be checked against https://lemonade-server.ai/models.html — at first glance this is a client-side responsibility (set model= per request), but memory pinning is the open question.

Refs

  • PR #995 — adds the hack this issue is about removing.
  • #645 — Email Triage Agent umbrella; this is one of its dependencies.
