feat(email): on-device E2B (NPU/FLM) model integration (#1282) by itomek · Pull Request #1433 · amd/gaia

itomek · 2026-06-04T14:24:11Z

Users on Ryzen AI NPU hardware couldn't pick the lighter, faster on-device model for email triage — only the larger E4B was in the catalog. Now gemma4-it-e2b-FLM (the NPU-native FastFlowLM build) is a first-class catalog model, selectable via EmailAgentConfig(model_id="gemma4-it-e2b-FLM"). Validated on real Strix Halo NPU hardware: device=npu, recipe=flm, served at :13305, ~23 tok/s decode / ~1s TTFT. The model downloads lazily on first use, so gaia init --profile all never pulls it (it is pulled by --profile npu, by design).

Two hardware findings shaped the entry: the FLM build serves a 4096-token window (not 32K), and it 500-errors on a native OpenAI tools payload — so the entry declares tool_calling=False and the agent uses the embedded-JSON tool path. Email triage itself parses a JSON object from a plain completion (no native tool calls), so it's unaffected.

Test plan

python -m pytest tests/unit/agents/test_email_agent_local_llm_enforcement.py -q — catalog entry (FLM id, ctx 4096, tool_calling=False), local-only (AC3) enforcement, and a subprocess-isolated no-import-time-download guard
On a Strix Halo NPU box: load gemma4-it-e2b-FLM via Lemonade; confirm device=npu / recipe=flm in /api/v1/health; run LEMONADE_MODEL=gemma4-it-e2b-FLM python -m pytest tests/integration/test_email_bench_throughput.py

Gemma-4-E2B-it-GGUF was already used in the email benchmark baseline (tests/fixtures/email/baseline_accuracy_e2b.json, accuracy=0.43) but was absent from MODELS, so callers couldn't select it by key. This adds the catalog entry and an `email` AgentProfile that lists E2B first, making the smaller on-device model an explicit option for email triage without touching DEFAULT_MODEL_NAME or forcing any download at install time — the existing lazy first-use path in `_ensure_model_loaded` / `_preload_on_idle_server` is unchanged and asserted in the new tests.

The entry registered Gemma-4-E2B-it-GGUF (llama.cpp), which does not run on the NPU. Hardware validation on a Strix Halo box confirmed the NPU-native build is gemma4-it-e2b-FLM (checkpoint gemma4-it:e2b, recipe=flm, device=npu) served at ctx 4096. Align model_id, min_ctx_size, and the unit tests with what the NPU actually serves.

…eal (#1282) The `email` AGENT_PROFILE was dead config and actively harmful: EmailTriageAgent picks its model via `config.model_id or DEFAULT_MODEL_NAME` and never reads AGENT_PROFILES, so the profile changed nothing — but `get_required_models("all")` iterates every profile, so `gaia init --profile all` would schedule the multi-GB FLM weights for download on every machine, including non-NPU x86 boxes that can't run FLM. Removing the profile restores the #1282 AC "no large download in the critical install path". The MODELS catalog entry stays, keeping `gemma4-it-e2b-FLM` selectable via `EmailAgentConfig(model_id=...)`. Also replace the vacuous import-guard test: it patched the module's own `load_model`, then reimported via importlib — which builds a fresh class object the patch never touched, so it passed regardless of any import-time download. The new test patches the actual network/subprocess chokepoints (`requests.adapters.HTTPAdapter.send`, `subprocess.Popen`, `subprocess.run`) before reimporting and asserts none fire. Verified non-vacuous: injecting an import-time `requests.get` makes it fail.

github-actions · 2026-06-04T15:29:48Z

Review: feat(email) — on-device E2B (NPU/FLM) model integration (#1282)

Approve with suggestions. This is a tight, additive change: one new MODELS catalog entry plus a thorough test class, no production code paths altered. The entry is internally consistent with the rest of the codebase (init_command.py:101, registry.py:277, and the NPU docs all use the same gemma4-it-e2b-FLM id and ctx_size=4096), and I confirmed the headline safety claim: gemma-4-e2b is not referenced by any AGENT_PROFILE, so get_required_models("all") never resolves it — gaia init --profile all will not pull the weights. The one thing worth confirming before merge is whether tool_calling=True actually holds for the FLM build (see below), since email triage depends on it.

Summary

The catalog entry is correct, the lazy-download guard test is genuinely well-constructed (patches requests.adapters.HTTPAdapter.send + subprocess at the real chokepoints before re-import — and lemonade_client.py does route all HTTP through requests, so the patch target is right), and the AC3 cloud-base_url rejection test reflects real EmailAgentConfig.validate() behavior. No security concerns, no silent fallbacks, no breaking changes.

Issues

🟡 tool_calling=True is asserted for an FLM build, but the field's own contract scopes it to GGUF (src/gaia/llm/lemonade_client.py:226)

ModelRequirement.tool_calling's default comment reads "True for GGUF models via Lemonade --jinja (Tier 0 empirical)" — but this entry is the FastFlowLM/NPU build, a different inference engine. The test test_e2b_tool_calling_enabled only asserts the literal True we set; it doesn't prove FLM-on-NPU actually serves native tool calls. The docstring claims the email agent "relies on native tool calls for triage, organize, and reply" — if FLM doesn't support --jinja-style tool calling, NPU email triage breaks at runtime for exactly the users this PR targets. The PR says the model was hardware-validated (device=npu, ~24 tok/s), but doesn't state that tool calling specifically was exercised there. Please confirm tool calls round-trip on the FLM build (e.g. the unchecked test_email_bench_throughput.py run, or a single triage call on the Strix Halo box) before merge. If it's verified, a one-line note on the entry distinguishing it from the GGUF --jinja assumption would save the next reader the same question.

🟢 Several tests mirror the catalog literals rather than behavior (tests/unit/agents/test_email_agent_local_llm_enforcement.py:66-111)

test_e2b_model_id_is_flm_npu_build, test_e2b_is_llm_type, test_e2b_tool_calling_enabled, and test_e2b_display_name_set each re-assert a value set three lines apart in the source — they're change-detectors that fail only if someone edits the entry, not tests of behavior. They're cheap and the failure messages are good, so this isn't worth removing; just flagging that the real coverage value lives in the two EmailAgentConfig integration tests and the lazy-download guard. No action needed.

🟢 PR description slightly overstates "never enters the install path" (description, not code)

The "not pulled by gaia init --profile all" claim is correct and I verified it. But gaia init --profile npu does pull gemma4-it-e2b-FLM (init_command.py:101) — which is the intended behavior, not a bug. Consider narrowing the wording to "--profile all" so it doesn't read as "no init profile ever pulls it."

Strengths

The import-time side-effect guard is the right test, done right — popping the module, patching the actual requests/subprocess boundaries before re-import, and the comment correctly explaining why patching load_model would be vacuous under importlib. This is the kind of test that catches a real future regression (an import-time weight pull), not boilerplate.
Cross-file consistency is clean — model id, recipe (flm), and ctx_size=4096 match across the catalog, init profile, agent registry, and NPU docs. Docs were already updated in prior feat(email): primary on-device model integration (gemma4-it-e2b-FLM) #1282 work, so there's no doc gap here.
Fails loud, stays local — the AC3 test confirms requesting the E2B model can't be used as a side door to a cloud base_url.

Verdict

Approve with suggestions — merge once the FLM tool-calling behavior is confirmed (🟡). The change is additive, well-tested, and consistent with the surrounding code; the only real risk is the tool-calling assumption on the NPU engine, which a single hardware triage call would settle.

…1282) The lazy-download import-guard popped and re-imported lemonade_client in-process, rebuilding its classes and corrupting module identity (e.g. HardwareRequirementError) for every later test in the session — red-listing the unit suite (test_hardware_selection and others). Run the import probe in an isolated subprocess instead: zero pollution, and a stronger guard (a fresh interpreter with requests/subprocess instrumented to fail). Also: hardware-verified that the FLM/NPU build 500-errors on a native OpenAI 'tools' payload, so the catalog entry now declares tool_calling=False. The agent uses the embedded-JSON tool path for this model; email triage parses a JSON object from a plain completion (no native tool calls) and is unaffected.

kovtcharov-amd

Approving. Clean first-class catalog entry. The lazy-download guard (re-import crosses no requests/subprocess boundary) and the AC3 cloud-base_url block even when E2B is explicitly requested are well-targeted tests. The one thing not exercisable in CI is tool_calling=True on the FLM build — you've validated that on the Strix Halo box (device=npu/recipe=flm), so good.

itomek · 2026-06-04T15:50:16Z

Thanks for the thorough review — addressed the feedback and fixed the CI red.

🟡 tool_calling on the FLM build — good catch, and it does not hold. Verified directly on the Strix Halo NPU box: the FLM server 500-errors on a native OpenAI tools payload ("type must be string, but is object"), while plain completions work fine (~23 tok/s). The GGUF --jinja tool-calling assumption does not carry over to the FastFlowLM build. Changed the entry to tool_calling=False — the agent then uses the embedded-JSON tool path for this model, and email triage (which parses a JSON object from a plain completion, no native tool calls) is unaffected. The entry now carries a comment distinguishing it from the GGUF builds, and the test asserts False with that rationale.

🟢 "never enters the install path" — narrowed to --profile all in the description (you're right that --profile npu pulls it, by design).

CI failure — this was test pollution, not the catalog change. The lazy-download guard popped + re-imported lemonade_client in-process, rebuilding its classes and corrupting module identity (e.g. HardwareRequirementError) for later tests (test_hardware_selection and others), which red-listed the unit suite. Rewrote the guard to run its import probe in an isolated subprocess — no pollution, and a stronger guard (fresh interpreter, requests/subprocess instrumented to fail). Confirmed via a full-suite diff that the fixed branch introduces zero new failures vs main.

itomek added 3 commits June 4, 2026 09:23

github-actions Bot added llm LLM backend changes tests Test changes performance Performance-critical changes labels Jun 4, 2026

itomek self-assigned this Jun 4, 2026

itomek marked this pull request as ready for review June 4, 2026 15:24

itomek requested a review from kovtcharov-amd as a code owner June 4, 2026 15:24

itomek mentioned this pull request Jun 4, 2026

feat(email): auto-select gemma4-it-e2b-FLM on NPU hardware for the email agent #1439

Open

3 tasks

itomek enabled auto-merge June 4, 2026 15:31

kovtcharov-amd approved these changes Jun 4, 2026

View reviewed changes

itomek added this pull request to the merge queue Jun 4, 2026

Merged via the queue into main with commit c8b9aef Jun 4, 2026
29 of 30 checks passed

itomek deleted the feat/email-e2b-model-1282 branch June 4, 2026 15:49

itomek mentioned this pull request Jun 4, 2026

ci(windows): Windows Full Integration fails on stale tests/test_summarizer.py path (removed in #1407) #1444

Closed

2 tasks

kovtcharov-amd mentioned this pull request Jun 10, 2026

Feature Request: Native FLM format and Ryzen AI NPU support via dedicated Agent/Profile #1220

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(email): on-device E2B (NPU/FLM) model integration (#1282)#1433

feat(email): on-device E2B (NPU/FLM) model integration (#1282)#1433
itomek merged 4 commits into
mainfrom
feat/email-e2b-model-1282

itomek commented Jun 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

kovtcharov-amd left a comment

Uh oh!

Uh oh!

itomek commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

itomek commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

github-actions Bot commented Jun 4, 2026

Review: feat(email) — on-device E2B (NPU/FLM) model integration (#1282)

Summary

Issues

Strengths

Verdict

Uh oh!

kovtcharov-amd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

itomek commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

itomek commented Jun 4, 2026 •

edited

Loading