Skip to content

feat(email): on-device E2B (NPU/FLM) model integration (#1282)#1433

Merged
itomek merged 4 commits into
mainfrom
feat/email-e2b-model-1282
Jun 4, 2026
Merged

feat(email): on-device E2B (NPU/FLM) model integration (#1282)#1433
itomek merged 4 commits into
mainfrom
feat/email-e2b-model-1282

Conversation

@itomek

@itomek itomek commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Closes #1282

Users on Ryzen AI NPU hardware couldn't pick the lighter, faster on-device model for email triage — only the larger E4B was in the catalog. Now gemma4-it-e2b-FLM (the NPU-native FastFlowLM build) is a first-class catalog model, selectable via EmailAgentConfig(model_id="gemma4-it-e2b-FLM"). Validated on real Strix Halo NPU hardware: device=npu, recipe=flm, served at :13305, ~23 tok/s decode / ~1s TTFT. The model downloads lazily on first use, so gaia init --profile all never pulls it (it is pulled by --profile npu, by design).

Two hardware findings shaped the entry: the FLM build serves a 4096-token window (not 32K), and it 500-errors on a native OpenAI tools payload — so the entry declares tool_calling=False and the agent uses the embedded-JSON tool path. Email triage itself parses a JSON object from a plain completion (no native tool calls), so it's unaffected.

Test plan

  • python -m pytest tests/unit/agents/test_email_agent_local_llm_enforcement.py -q — catalog entry (FLM id, ctx 4096, tool_calling=False), local-only (AC3) enforcement, and a subprocess-isolated no-import-time-download guard
  • On a Strix Halo NPU box: load gemma4-it-e2b-FLM via Lemonade; confirm device=npu / recipe=flm in /api/v1/health; run LEMONADE_MODEL=gemma4-it-e2b-FLM python -m pytest tests/integration/test_email_bench_throughput.py

itomek added 3 commits June 4, 2026 09:23
Gemma-4-E2B-it-GGUF was already used in the email benchmark baseline
(tests/fixtures/email/baseline_accuracy_e2b.json, accuracy=0.43) but
was absent from MODELS, so callers couldn't select it by key. This adds
the catalog entry and an `email` AgentProfile that lists E2B first,
making the smaller on-device model an explicit option for email triage
without touching DEFAULT_MODEL_NAME or forcing any download at install
time — the existing lazy first-use path in `_ensure_model_loaded` /
`_preload_on_idle_server` is unchanged and asserted in the new tests.
The entry registered Gemma-4-E2B-it-GGUF (llama.cpp), which does not run
on the NPU. Hardware validation on a Strix Halo box confirmed the
NPU-native build is gemma4-it-e2b-FLM (checkpoint gemma4-it:e2b,
recipe=flm, device=npu) served at ctx 4096. Align model_id, min_ctx_size,
and the unit tests with what the NPU actually serves.
…eal (#1282)

The `email` AGENT_PROFILE was dead config and actively harmful:
EmailTriageAgent picks its model via `config.model_id or
DEFAULT_MODEL_NAME` and never reads AGENT_PROFILES, so the profile
changed nothing — but `get_required_models("all")` iterates every
profile, so `gaia init --profile all` would schedule the multi-GB FLM
weights for download on every machine, including non-NPU x86 boxes that
can't run FLM. Removing the profile restores the #1282 AC "no large
download in the critical install path". The MODELS catalog entry stays,
keeping `gemma4-it-e2b-FLM` selectable via
`EmailAgentConfig(model_id=...)`.

Also replace the vacuous import-guard test: it patched the module's own
`load_model`, then reimported via importlib — which builds a fresh class
object the patch never touched, so it passed regardless of any
import-time download. The new test patches the actual network/subprocess
chokepoints (`requests.adapters.HTTPAdapter.send`, `subprocess.Popen`,
`subprocess.run`) before reimporting and asserts none fire. Verified
non-vacuous: injecting an import-time `requests.get` makes it fail.
@github-actions github-actions Bot added llm LLM backend changes tests Test changes performance Performance-critical changes labels Jun 4, 2026
@itomek itomek self-assigned this Jun 4, 2026
@itomek itomek marked this pull request as ready for review June 4, 2026 15:24
@itomek itomek requested a review from kovtcharov-amd as a code owner June 4, 2026 15:24
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review: feat(email) — on-device E2B (NPU/FLM) model integration (#1282)

Approve with suggestions. This is a tight, additive change: one new MODELS catalog entry plus a thorough test class, no production code paths altered. The entry is internally consistent with the rest of the codebase (init_command.py:101, registry.py:277, and the NPU docs all use the same gemma4-it-e2b-FLM id and ctx_size=4096), and I confirmed the headline safety claim: gemma-4-e2b is not referenced by any AGENT_PROFILE, so get_required_models("all") never resolves it — gaia init --profile all will not pull the weights. The one thing worth confirming before merge is whether tool_calling=True actually holds for the FLM build (see below), since email triage depends on it.

Summary

The catalog entry is correct, the lazy-download guard test is genuinely well-constructed (patches requests.adapters.HTTPAdapter.send + subprocess at the real chokepoints before re-import — and lemonade_client.py does route all HTTP through requests, so the patch target is right), and the AC3 cloud-base_url rejection test reflects real EmailAgentConfig.validate() behavior. No security concerns, no silent fallbacks, no breaking changes.

Issues

🟡 tool_calling=True is asserted for an FLM build, but the field's own contract scopes it to GGUF (src/gaia/llm/lemonade_client.py:226)

ModelRequirement.tool_calling's default comment reads "True for GGUF models via Lemonade --jinja (Tier 0 empirical)" — but this entry is the FastFlowLM/NPU build, a different inference engine. The test test_e2b_tool_calling_enabled only asserts the literal True we set; it doesn't prove FLM-on-NPU actually serves native tool calls. The docstring claims the email agent "relies on native tool calls for triage, organize, and reply" — if FLM doesn't support --jinja-style tool calling, NPU email triage breaks at runtime for exactly the users this PR targets. The PR says the model was hardware-validated (device=npu, ~24 tok/s), but doesn't state that tool calling specifically was exercised there. Please confirm tool calls round-trip on the FLM build (e.g. the unchecked test_email_bench_throughput.py run, or a single triage call on the Strix Halo box) before merge. If it's verified, a one-line note on the entry distinguishing it from the GGUF --jinja assumption would save the next reader the same question.

🟢 Several tests mirror the catalog literals rather than behavior (tests/unit/agents/test_email_agent_local_llm_enforcement.py:66-111)

test_e2b_model_id_is_flm_npu_build, test_e2b_is_llm_type, test_e2b_tool_calling_enabled, and test_e2b_display_name_set each re-assert a value set three lines apart in the source — they're change-detectors that fail only if someone edits the entry, not tests of behavior. They're cheap and the failure messages are good, so this isn't worth removing; just flagging that the real coverage value lives in the two EmailAgentConfig integration tests and the lazy-download guard. No action needed.

🟢 PR description slightly overstates "never enters the install path" (description, not code)

The "not pulled by gaia init --profile all" claim is correct and I verified it. But gaia init --profile npu does pull gemma4-it-e2b-FLM (init_command.py:101) — which is the intended behavior, not a bug. Consider narrowing the wording to "--profile all" so it doesn't read as "no init profile ever pulls it."

Strengths

  • The import-time side-effect guard is the right test, done right — popping the module, patching the actual requests/subprocess boundaries before re-import, and the comment correctly explaining why patching load_model would be vacuous under importlib. This is the kind of test that catches a real future regression (an import-time weight pull), not boilerplate.
  • Cross-file consistency is clean — model id, recipe (flm), and ctx_size=4096 match across the catalog, init profile, agent registry, and NPU docs. Docs were already updated in prior feat(email): primary on-device model integration (gemma4-it-e2b-FLM) #1282 work, so there's no doc gap here.
  • Fails loud, stays local — the AC3 test confirms requesting the E2B model can't be used as a side door to a cloud base_url.

Verdict

Approve with suggestions — merge once the FLM tool-calling behavior is confirmed (🟡). The change is additive, well-tested, and consistent with the surrounding code; the only real risk is the tool-calling assumption on the NPU engine, which a single hardware triage call would settle.

@itomek itomek enabled auto-merge June 4, 2026 15:31
…1282)

The lazy-download import-guard popped and re-imported lemonade_client
in-process, rebuilding its classes and corrupting module identity (e.g.
HardwareRequirementError) for every later test in the session — red-listing
the unit suite (test_hardware_selection and others). Run the import probe in
an isolated subprocess instead: zero pollution, and a stronger guard (a fresh
interpreter with requests/subprocess instrumented to fail).

Also: hardware-verified that the FLM/NPU build 500-errors on a native OpenAI
'tools' payload, so the catalog entry now declares tool_calling=False. The
agent uses the embedded-JSON tool path for this model; email triage parses a
JSON object from a plain completion (no native tool calls) and is unaffected.

@kovtcharov-amd kovtcharov-amd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Clean first-class catalog entry. The lazy-download guard (re-import crosses no requests/subprocess boundary) and the AC3 cloud-base_url block even when E2B is explicitly requested are well-targeted tests. The one thing not exercisable in CI is tool_calling=True on the FLM build — you've validated that on the Strix Halo box (device=npu/recipe=flm), so good.

@itomek itomek added this pull request to the merge queue Jun 4, 2026
Merged via the queue into main with commit c8b9aef Jun 4, 2026
29 of 30 checks passed
@itomek itomek deleted the feat/email-e2b-model-1282 branch June 4, 2026 15:49
@itomek

itomek commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for the thorough review — addressed the feedback and fixed the CI red.

🟡 tool_calling on the FLM build — good catch, and it does not hold. Verified directly on the Strix Halo NPU box: the FLM server 500-errors on a native OpenAI tools payload ("type must be string, but is object"), while plain completions work fine (~23 tok/s). The GGUF --jinja tool-calling assumption does not carry over to the FastFlowLM build. Changed the entry to tool_calling=False — the agent then uses the embedded-JSON tool path for this model, and email triage (which parses a JSON object from a plain completion, no native tool calls) is unaffected. The entry now carries a comment distinguishing it from the GGUF builds, and the test asserts False with that rationale.

🟢 "never enters the install path" — narrowed to --profile all in the description (you're right that --profile npu pulls it, by design).

CI failure — this was test pollution, not the catalog change. The lazy-download guard popped + re-imported lemonade_client in-process, rebuilding its classes and corrupting module identity (e.g. HardwareRequirementError) for later tests (test_hardware_selection and others), which red-listed the unit suite. Rewrote the guard to run its import probe in an isolated subprocess — no pollution, and a stronger guard (fresh interpreter, requests/subprocess instrumented to fail). Confirmed via a full-suite diff that the fixed branch introduces zero new failures vs main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

llm LLM backend changes performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(email): primary on-device model integration (gemma4-it-e2b-FLM)

2 participants