feat(email): on-device E2B (NPU/FLM) model integration (#1282) #1433
Gemma-4-E2B-it-GGUF was already used in the email benchmark baseline (tests/fixtures/email/baseline_accuracy_e2b.json, accuracy=0.43) but was absent from MODELS, so callers couldn't select it by key. This adds the catalog entry and an `email` AgentProfile that lists E2B first, making the smaller on-device model an explicit option for email triage without touching DEFAULT_MODEL_NAME or forcing any download at install time — the existing lazy first-use path in `_ensure_model_loaded` / `_preload_on_idle_server` is unchanged and asserted in the new tests.
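The lazy first-use path this commit leaves untouched can be illustrated with a minimal sketch. Only the name `_ensure_model_loaded` comes from the text; `LazyModelClient`, its `loader` callback and `complete` are hypothetical stand-ins, not GAIA's actual classes. The point is that constructing a client (or importing its module) costs nothing, and the download/load happens once, on the first real request.

```python
from typing import Callable, Optional


class LazyModelClient:
    """Hypothetical client: the model is fetched on first use, never at construction or import."""

    def __init__(self, model_id: str, loader: Callable[[str], None]) -> None:
        self.model_id = model_id
        self._loader = loader  # assumed download + load routine, injected for illustration
        self._loaded: Optional[str] = None

    def _ensure_model_loaded(self) -> None:
        # Only the first call pays the download/load cost; later calls are no-ops.
        if self._loaded != self.model_id:
            self._loader(self.model_id)
            self._loaded = self.model_id

    def complete(self, prompt: str) -> str:
        self._ensure_model_loaded()
        return f"[{self.model_id}] {prompt}"  # placeholder for a real completion call
```

For example, `LazyModelClient("gemma4-it-e2b-FLM", calls.append)` records nothing until `complete()` is called, and records the model id exactly once however many completions follow.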
The entry registered Gemma-4-E2B-it-GGUF (llama.cpp), which does not run on the NPU. Hardware validation on a Strix Halo box confirmed the NPU-native build is gemma4-it-e2b-FLM (checkpoint gemma4-it:e2b, recipe=flm, device=npu) served at ctx 4096. Align model_id, min_ctx_size, and the unit tests with what the NPU actually serves.
…eal (#1282)

The `email` AGENT_PROFILE was dead config and actively harmful: EmailTriageAgent picks its model via `config.model_id or DEFAULT_MODEL_NAME` and never reads AGENT_PROFILES, so the profile changed nothing. But `get_required_models("all")` iterates every profile, so `gaia init --profile all` would schedule the multi-GB FLM weights for download on every machine, including non-NPU x86 boxes that can't run FLM. Removing the profile restores the #1282 AC "no large download in the critical install path". The MODELS catalog entry stays, keeping `gemma4-it-e2b-FLM` selectable via `EmailAgentConfig(model_id=...)`.

Also replace the vacuous import-guard test: it patched the module's own `load_model`, then reimported via importlib, which builds a fresh class object the patch never touched, so it passed regardless of any import-time download. The new test patches the actual network/subprocess chokepoints (`requests.adapters.HTTPAdapter.send`, `subprocess.Popen`, `subprocess.run`) before reimporting and asserts none fire. Verified non-vacuous: injecting an import-time `requests.get` makes it fail.
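The reasoning for dropping the profile (a catalog entry is selectable but inert, while any profile entry is pulled by `--profile all`) can be modelled in a few lines. This is a simplified sketch, not GAIA's code: `ModelEntry`, the `chat` profile, the placeholder value of `DEFAULT_MODEL_NAME`, and the extra `profiles` argument to `get_required_models` are all hypothetical; only the names `MODELS`, `get_required_models`, `DEFAULT_MODEL_NAME` and the `config.model_id or DEFAULT_MODEL_NAME` rule come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Hypothetical placeholder: the PR does not show the real default's value.
DEFAULT_MODEL_NAME = "default-model"


@dataclass(frozen=True)
class ModelEntry:
    model_id: str
    min_ctx_size: int
    tool_calling: bool


# Catalog: selectable by key, but nothing here schedules a download.
MODELS: Dict[str, ModelEntry] = {
    "gemma4-it-e2b-FLM": ModelEntry("gemma4-it-e2b-FLM", min_ctx_size=4096, tool_calling=False),
}

# Two hypothetical profile tables: before and after the `email` profile was removed.
PROFILES_WITH_EMAIL: Dict[str, List[str]] = {
    "chat": [DEFAULT_MODEL_NAME],
    "email": ["gemma4-it-e2b-FLM"],
}
PROFILES_WITHOUT_EMAIL: Dict[str, List[str]] = {"chat": [DEFAULT_MODEL_NAME]}


def get_required_models(profile: str, profiles: Dict[str, List[str]]) -> Set[str]:
    # "all" unions every profile, so a model listed in any profile is
    # downloaded on every machine that runs `gaia init --profile all`.
    if profile == "all":
        return {m for models in profiles.values() for m in models}
    return set(profiles.get(profile, []))


def select_model(config_model_id: Optional[str]) -> str:
    # Mirrors `config.model_id or DEFAULT_MODEL_NAME`: profiles are never read.
    return config_model_id or DEFAULT_MODEL_NAME
```

With `PROFILES_WITH_EMAIL`, `get_required_models("all", ...)` includes the FLM weights; with `PROFILES_WITHOUT_EMAIL` it does not, yet `select_model("gemma4-it-e2b-FLM")` still resolves the model from the catalog.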
Review: feat(email): on-device E2B (NPU/FLM) model integration (#1282)

Approve with suggestions. This is a tight, additive change: one new …

Summary
The catalog entry is correct, the lazy-download guard test is genuinely well-constructed (patches …

Issues
🟡 …
🟢 Several tests mirror the catalog literals rather than behavior (…
🟢 PR description slightly overstates "never enters the install path" (description, not code). The "not pulled by …

Strengths
…

Verdict
Approve with suggestions: merge once the FLM tool-calling behavior is confirmed (🟡). The change is additive, well-tested, and consistent with the surrounding code; the only real risk is the tool-calling assumption on the NPU engine, which a single hardware triage call would settle.
…1282)

The lazy-download import guard popped lemonade_client and re-imported it in-process, rebuilding its classes and corrupting module identity (e.g. HardwareRequirementError) for every later test in the session, which turned the unit suite red (test_hardware_selection and others). Run the import probe in an isolated subprocess instead: no pollution, and a stronger guard (a fresh interpreter with requests/subprocess instrumented to fail).

Also: hardware-verified that the FLM/NPU build returns HTTP 500 errors on a native OpenAI 'tools' payload, so the catalog entry now declares tool_calling=False. The agent uses the embedded-JSON tool path for this model; email triage parses a JSON object from a plain completion (no native tool calls) and is unaffected.
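A sketch of the subprocess-isolated import probe described above, assuming the goal is only to detect `subprocess` and `requests` traffic during import. `probe_import`, `_PROBE` and the exit-code convention (0 clean, 2 forbidden call, 1 any other failure) are hypothetical; the PR's actual test is not shown. Because the child is a fresh interpreter, nothing it imports or patches can leak into the parent test session.

```python
import os
import subprocess
import sys
from typing import Optional

# Runs inside a fresh interpreter: side-effect entry points are replaced with a
# function that raises, then the target module is imported.
_PROBE = r"""
import importlib
import subprocess
import sys


class ForbiddenCall(RuntimeError):
    pass


def _forbid(*args, **kwargs):
    raise ForbiddenCall("side effect during import")


subprocess.Popen = _forbid
subprocess.run = _forbid
try:  # requests is optional here; patch its transport only if installed
    import requests.adapters
    requests.adapters.HTTPAdapter.send = _forbid
except ImportError:
    pass

try:
    importlib.import_module(sys.argv[1])
except ForbiddenCall:
    sys.exit(2)
"""


def probe_import(module_name: str, extra_path: Optional[str] = None) -> int:
    """Return 0 if importing module_name is side-effect free, 2 if it attempted a
    forbidden call, and 1 for any other import failure (e.g. module not found)."""
    env = dict(os.environ)
    if extra_path:
        paths = [extra_path] + ([env["PYTHONPATH"]] if env.get("PYTHONPATH") else [])
        env["PYTHONPATH"] = os.pathsep.join(paths)
    result = subprocess.run(
        [sys.executable, "-c", _PROBE, module_name],
        env=env, capture_output=True, text=True, timeout=60,
    )
    return result.returncode
```

In a test this reduces to `assert probe_import("some.package.module") == 0`; a module that shells out or hits the network at import time returns 2 instead.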
kovtcharov-amd left a comment
Approving. Clean first-class catalog entry. The lazy-download guard (re-import crosses no requests/subprocess boundary) and the AC3 cloud-base_url block even when E2B is explicitly requested are well-targeted tests. The one thing not exercisable in CI is tool_calling=True on the FLM build — you've validated that on the Strix Halo box (device=npu/recipe=flm), so good.
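The AC3 check the reviewer mentions (blocking a cloud `base_url` even when E2B is explicitly requested) is not shown in the PR. The sketch below assumes a loopback-only policy; `enforce_local_base_url` and `CloudEndpointError` are hypothetical names, and the real enforcement may accept other hosts or raise a different error.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse


class CloudEndpointError(ValueError):
    """Hypothetical error: a non-local endpoint was configured for a local-only agent."""


def enforce_local_base_url(base_url: str, model_id: Optional[str] = None) -> str:
    """Return base_url unchanged if it points at this machine; raise otherwise.

    model_id is accepted but deliberately ignored: requesting a specific
    on-device model (e.g. E2B) must not loosen the endpoint check.
    """
    host = urlparse(base_url).hostname
    if host == "localhost":
        return base_url
    if host is not None:
        try:
            if ipaddress.ip_address(host).is_loopback:
                return base_url
        except ValueError:
            pass  # a hostname rather than an IP literal, e.g. a cloud API domain
    raise CloudEndpointError(
        f"local-only agent refused base_url {base_url!r} (model_id={model_id!r})"
    )
```

Keeping `model_id` out of the decision is the design point the reviewer singles out: the endpoint check cannot be bypassed by asking for a particular model.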
Thanks for the thorough review: addressed the feedback and fixed the CI red.

🟡 …
🟢 "never enters the install path": narrowed to …
CI failure: this was test pollution, not the catalog change. The lazy-download guard popped + re-imported …
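Why popping and re-importing a module pollutes the rest of a test session can be reproduced with a throwaway module. In this sketch `reimport_demo_mod` and its contents are hypothetical stand-ins for `lemonade_client`; only the class name `HardwareRequirementError` comes from the thread. It shows both failure modes described in the commits: a patch applied to the first module object is invisible after the re-import, and an `except` clause holding the old exception class no longer catches what the re-executed module raises.

```python
import importlib
import os
import sys
import tempfile

_SOURCE = '''
class HardwareRequirementError(Exception):
    pass

def load_model():
    raise HardwareRequirementError("no NPU")
'''


def demonstrate_reimport_pollution():
    """Import a throwaway module, pop it from sys.modules, and import it again."""
    name = "reimport_demo_mod"  # hypothetical throwaway module
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, name + ".py"), "w") as f:
            f.write(_SOURCE)
        sys.path.insert(0, d)
        try:
            first = importlib.import_module(name)
            OldError = first.HardwareRequirementError  # what earlier tests captured
            first.load_model = lambda: "patched"        # patch on the first module object

            sys.modules.pop(name)                   # what the old guard did
            second = importlib.import_module(name)  # re-executes the source: new objects

            patch_survived = second.load_model is first.load_model
            same_class = second.HardwareRequirementError is OldError
            caught_by_old_except = False
            try:
                second.load_model()
            except OldError:
                caught_by_old_except = True
            except second.HardwareRequirementError:
                pass  # only the fresh class matches
            return patch_survived, same_class, caught_by_old_except
        finally:
            sys.path.remove(d)
            sys.modules.pop(name, None)


print(demonstrate_reimport_pollution())  # → (False, False, False)
```

Running the probe in a separate interpreter avoids all three problems, because the parent's `sys.modules` is never touched.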
Closes #1282
Users on Ryzen AI NPU hardware couldn't pick the lighter, faster on-device model for email triage; only the larger E4B was in the catalog. Now `gemma4-it-e2b-FLM` (the NPU-native FastFlowLM build) is a first-class catalog model, selectable via `EmailAgentConfig(model_id="gemma4-it-e2b-FLM")`. Validated on real Strix Halo NPU hardware: `device=npu`, `recipe=flm`, served at `:13305`, ~23 tok/s decode / ~1 s TTFT. The model downloads lazily on first use, so `gaia init --profile all` never pulls it (it is pulled by `--profile npu`, by design).

Two hardware findings shaped the entry: the FLM build serves a 4096-token window (not 32K), and it returns HTTP 500 errors on a native OpenAI `tools` payload, so the entry declares `tool_calling=False` and the agent uses the embedded-JSON tool path. Email triage itself parses a JSON object from a plain completion (no native tool calls), so it's unaffected.

Test plan
- `python -m pytest tests/unit/agents/test_email_agent_local_llm_enforcement.py -q`: catalog entry (FLM id, ctx 4096, `tool_calling=False`), local-only (AC3) enforcement, and a subprocess-isolated no-import-time-download guard
- … `gemma4-it-e2b-FLM` via Lemonade; confirm `device=npu`/`recipe=flm` in `/api/v1/health`; run `LEMONADE_MODEL=gemma4-it-e2b-FLM python -m pytest tests/integration/test_email_bench_throughput.py`
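The description notes that email triage parses a JSON object out of a plain completion instead of relying on native tool calls, which is why `tool_calling=False` does not affect it. A minimal standard-library sketch of that parsing step follows; `extract_json_object` and the `category`/`reply_needed` keys in the usage example are hypothetical, not GAIA's actual parser or triage schema.

```python
import json
from typing import Any, Dict, Optional


def extract_json_object(completion: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object embedded in a plain-text completion, or None.

    Models often wrap JSON in prose or code fences, so each "{" is tried as a
    candidate start and the first one that decodes cleanly wins.
    """
    decoder = json.JSONDecoder()
    start = completion.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores any trailing text.
            obj, _end = decoder.raw_decode(completion[start:])
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = completion.find("{", start + 1)
    return None
```

Usage: `extract_json_object('Sure!\n{"category": "urgent", "reply_needed": true}')` returns `{"category": "urgent", "reply_needed": True}`, and a completion with no decodable object returns `None`, leaving the caller to retry or fall back.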