
feat(llm): add Gemma 4 E4B as default and native tool_calls priority #865

Merged
kovtcharov merged 19 commits into main from claude/sad-matsumoto-fd179f
Apr 26, 2026

Conversation

@itomek (Collaborator) commented Apr 24, 2026

Summary

Gemma-4-E4B-it-GGUF becomes GAIA's default model for all roles (LLM, VLM, installer profiles, CLI, Agent UI, eval, EMR). The PR simultaneously inverts the tool-call priority chain so that native OpenAI tool_calls is the primary path, keeping the embedded-JSON format only as a fallback for legacy non-tool-calling models. It also bumps the minimum Lemonade version to v10.1.0, the release that moved the default port from 8000 → 13305 and added Gemma 4 support.

This ships on top of the existing UI model-resolution fixes (#841, #842). Resolves #863.

What changed and why

  • Universal Gemma default — Gemma 4 E4B is natively multimodal (~4.5B effective params, 128K context, Apache 2.0), making it the right single default across the LLM/VLM split that previously required two different models. Footprint drops 19.7 GB → 5 GB.
  • Native tool_calls path (Lemonade v10.1.0+ --jinja) — GAIA now passes tools=[...] to Lemonade for tool-capable models. The response comes back as native tool_calls; LemonadeProvider.chat() encodes them as a sentinel JSON string ({"__tool_calls__": ...}) so no caller needs a type change. _parse_llm_response detects the sentinel and returns the unified {"tool": ..., "tool_args": ...} dict (sketch after this list).
  • System-prompt gating — The embedded-JSON format block (_PLANNING_FORMAT/_CONVERSATIONAL_FORMAT) is excluded from the composed system prompt for tool-calling models; it actively prevented native tool_calls in prior testing.
  • Startup validator — _validate_profile_model_registry() raises at import time if any AGENT_PROFILES entry references a model key not in MODELS.
  • Lemonade v10.1.0+ / port 13305 — DEFAULT_PORT flipped from 8000 to 13305 (Lemonade's spring-cleaning release changed the default). 75 files updated (agents, UI, MCP bridge, RAG SDK, VLM, CLI, tests, docs). min_lemonade_version = 10.1.0 everywhere INIT_PROFILES is declared.
  • Eval baselines — Pre-swap Qwen3.5-35B baseline at commit 3b51ca92 and post-swap Gemma-4-E4B baseline both committed under tests/fixtures/eval_baselines/; Gemma outperforms Qwen, passing 14/15 scenarios vs 13/15 (see comment below for the per-scenario breakdown).
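The sentinel round-trip from the tool_calls bullet above, as a minimal sketch — the sentinel key and the unified return dict come from this PR, but the function bodies and the single-call handling are illustrative:

```python
import json

TOOL_CALLS_SENTINEL_KEY = "__tool_calls__"

def encode_tool_calls(response_message: dict) -> str:
    """Provider side (cf. LemonadeProvider.chat): fold native OpenAI
    tool_calls into a plain string so the return type stays str."""
    tool_calls = response_message.get("tool_calls")
    if tool_calls:
        return json.dumps({TOOL_CALLS_SENTINEL_KEY: tool_calls})
    return response_message.get("content") or ""

def parse_llm_response(text: str) -> dict:
    """Agent side (cf. _parse_llm_response): detect the sentinel and
    return the unified tool/tool_args dict."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict) and TOOL_CALLS_SENTINEL_KEY in payload:
        call = payload[TOOL_CALLS_SENTINEL_KEY][0]  # first call only, for brevity
        return {
            "tool": call["function"]["name"],
            "tool_args": json.loads(call["function"]["arguments"]),
        }
    return {"response": text}  # legacy/plain path unchanged
```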

Test plan

  • python -m pytest tests/unit/ --ignore=tests/unit/chat/ui/ -q → 928 passed, 16 skipped
  • python -m pytest tests/unit/test_tool_call_priority.py -v → 23 passed (sentinel detection, native branch parsing, edge cases, prompt gating, startup validator)
  • python util/lint.py --black --isort --flake8 → all pass
  • Eval against Gemma-4-E4B on Lemonade v10.2.0, Sonnet judge → 14/15 scenarios pass, beats Qwen baseline (see comment)
  • Verified claude -p --model claude-sonnet-4-6 was actually the judge (not Opus) via modelUsage in test subprocess

Open follow-ups (not blockers for this PR)

  • tool_selection/known_path_read regression: Gemma doesn't discover the indexed-internal-copy fallback path in Turn 1 after an Access-Denied error on the original path. Prompt-engineering candidate.
  • /api/system/status reports the catalog ctx_size even when Lemonade loaded the model with a smaller window. Surface a warning when they diverge; a whole eval run was wasted because the mismatch was masked.

itomek and others added 6 commits April 20, 2026 18:50

Previously, _chat_helpers.py always passed model_id=<session model> explicitly
to registry.create_agent(), defeating kwargs.setdefault("model_id", ...) in
custom agents — which only fires when the key is absent.

Fix: build create_kwargs conditionally, omitting model_id when the session is
at the DB default so the agent's __init__ setdefault governs. Also use
agent.model_id (post-construction) for both _store_agent cache key and the
pre-flight _maybe_load_expected_model call.

Three-branch precedence: custom_model setting > session-explicit > omit kwarg.

Closes #841
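
A sketch of that three-branch precedence — the create_kwargs shape and the setdefault interplay come from the commit; the helper signature is illustrative:

```python
def _build_create_kwargs(session_model_id, custom_model, session_default):
    """custom_model setting > session-explicit > omit kwarg entirely."""
    create_kwargs = {}
    if custom_model:
        # Branch 1: an explicit custom_model setting always wins.
        create_kwargs["model_id"] = custom_model
    elif session_model_id != session_default:
        # Branch 2: the session deliberately picked a non-default model.
        create_kwargs["model_id"] = session_model_id
    # Branch 3: session is at the DB default -> omit model_id so the
    # agent's own kwargs.setdefault("model_id", ...) governs.
    return create_kwargs
```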
…N_DEFAULT_MODEL

Addresses code review feedback on PR #842:

- Export SESSION_DEFAULT_MODEL from database.py (single source of truth)
  instead of duplicating the string literal in _chat_helpers.py
- Extract _build_create_kwargs() helper to eliminate the duplicate three-branch
  create_kwargs logic across non-streaming and streaming code paths
- Extract _effective_model() helper using explicit None check (not `or`)
  to safely read agent.model_id post-construction without treating empty
  string as missing
- Fix static regression guard regex to use [^()]* so nested helper calls
  inside create_agent() are not falsely flagged
- Update unit test to import SESSION_DEFAULT_MODEL instead of hardcoding
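
The explicit-None check matters because `or` would treat an agent that legitimately sets model_id = "" as having no model at all. A sketch under that reading:

```python
def _effective_model(agent, requested_model_id):
    """Post-construction model read: only a missing attribute or None
    falls back to the requested id; an empty string is respected."""
    model_id = getattr(agent, "model_id", None)
    return requested_model_id if model_id is None else model_id
```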
…ion (#842)

_store_agent was changed by the #842 fix to use _effective_model(agent,
model_id) as the cache key — the post-construction value set by kwargs.setdefault.
_get_cached_agent still looks up using the pre-construction model_id variable.
For custom agents whose setdefault model differs from the session model, the
keys never match and the agent is rebuilt on every turn.

Revert the two _store_agent call sites to use model_id (the pre-construction
intent key), matching what the lookup uses. _effective_model stays at the two
_maybe_load_expected_model sites (Lemonade pre-flight needs the actual model)
and in log statements (observability).

Add two regression guards:
- test_cache_hit_on_second_turn_for_setdefault_agent: two-turn cache-hit test
  with four assertions (call count, object identity, stored-key equality,
  agent.model_id). Covers the builder/template.py setdefault pattern.
- test_no_effective_model_in_store_agent_calls: static grep guard that asserts
  _store_agent never receives _effective_model(...) as a positional arg,
  preventing this pattern from silently returning in a future cleanup pass.
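
The static guard described in the last bullet could look like this sketch (the file path and regex here are illustrative, not the committed test):

```python
import pathlib
import re

def test_no_effective_model_in_store_agent_calls():
    # Hypothetical path to the helpers module under test.
    source = pathlib.Path("src/gaia/chat/_chat_helpers.py").read_text()
    # [^()]* keeps the match inside one argument list, so unrelated
    # nested calls elsewhere in the file aren't falsely flagged.
    assert not re.search(r"_store_agent\([^()]*_effective_model\(", source)
```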
#817)

## Summary

One-line fix: swap the failing `www.cs.cmu.edu/~tom/EMNLP2004_final.pdf`
URL in `docs/plans/email-triage-agent.mdx:2601` for the canonical ACL
Anthology record at [W04-3240](https://aclanthology.org/W04-3240/). The
CMU URL fails DNS resolution in CI (see [recent
run](https://github.com/amd/gaia/actions/runs/24595902571/job/72072156929)),
breaking the ``Verify external URLs`` check for every open PR that
touches docs. ACL Anthology is the permanent archive for ACL/EMNLP
papers — stable URL, no more link rot.

Also restored the paper's actual full title ("Learning to Classify Email
into 'Speech Acts'") for consistency with the other full-title citations
in the same references list.

## Test plan

- [x] `curl -sI https://aclanthology.org/W04-3240/` returns 200
- [ ] After merge, `Verify external URLs` check should go green on
downstream PRs
Gemma-4-E4B-it-GGUF becomes the default model for all GAIA roles (LLM,
VLM, installer, CLI, UI, eval, EMR), replacing the Qwen family defaults.
Simultaneously inverts the tool-call priority chain: for tool-calling
models, GAIA now passes `tools=[...]` to Lemonade and handles native
OpenAI `tool_calls` as the primary path, falling back to embedded-JSON
format only for legacy non-tool-calling models.

Key changes:
- lemonade_client.py: adds `tool_calling` field to ModelRequirement,
  new `is_tool_calling_model()` helper (optimistic default for unknown
  GGUFs), startup `_validate_profile_model_registry()` validator,
  gemma-4-e4b entry in MODELS, AGENT_PROFILES all referencing gemma-4-e4b
- providers/lemonade.py: surfaces native tool_calls as a sentinel JSON
  string so the response type stays `str` throughout the call chain;
  forces non-streaming when tools are provided to a tool-capable model
- agents/base/agent.py: native tool_calls branch in _parse_llm_response,
  _build_openai_tool_schemas + _openai_tools property, system-prompt gating
  (excludes embedded-JSON format template for tool-calling models)
- chat/sdk.py: threads `tools` kwarg through send_messages/stream

Includes pre-swap eval baseline for Qwen3.5-35B at commit 3b51ca9 and
23 new unit tests covering the full tool_calls priority chain.
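
A sketch of the two lemonade_client.py additions named above; the registry shapes are assumed plain dicts for brevity:

```python
MODELS = {"gemma-4-e4b": {"tool_calling": True}}   # illustrative shape
AGENT_PROFILES = {"chat": "gemma-4-e4b"}           # illustrative shape

def is_tool_calling_model(model_key: str) -> bool:
    """Optimistic default: unknown GGUFs are assumed tool-capable, so
    new models get the native tool_calls path until proven otherwise."""
    entry = MODELS.get(model_key)
    return True if entry is None else entry.get("tool_calling", True)

def _validate_profile_model_registry() -> None:
    """Import-time guard: every AGENT_PROFILES entry must name a model
    that actually exists in MODELS."""
    unknown = {p: m for p, m in AGENT_PROFILES.items() if m not in MODELS}
    if unknown:
        raise ValueError(f"AGENT_PROFILES references unknown models: {unknown}")

_validate_profile_model_registry()  # raises at import time, per the PR
```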
@itomek requested a review from kovtcharov-amd as a code owner April 24, 2026 18:55
@github-actions Bot added labels: documentation, dependencies, agents, chat, mcp, llm, cli, eval, tests, performance Apr 24, 2026
@itomek marked this pull request as draft April 24, 2026 18:56
@itomek self-assigned this Apr 24, 2026
@itomek linked an issue Apr 24, 2026 that may be closed by this pull request
itomek added 2 commits April 24, 2026 15:08
Keep Gemma-4-E4B-it-GGUF as SESSION_DEFAULT_MODEL per this branch's intent.
Lemonade v10.1.0 ("spring cleaning" release, 2026-04-06) changed its
default port from 8000 to 13305 and added Gemma 4 on GPU support.
Migration guide: https://github.com/lemonade-sdk/lemonade/wiki/Migration#v10x---v101

Changes:
- DEFAULT_PORT in lemonade_client.py flipped to 13305 (with a comment
  pointing at the migration guide)
- min_lemonade_version in every INIT_PROFILES entry bumped 10.0.0 -> 10.1.0
- LEMONADE_VERSION constant bumped 10.0.0 -> 10.1.0
- All agent base_url fallbacks, UI helpers, MCP bridge, RAG/VLM SDKs, and
  CLI kill/port defaults updated to 13305
- Test fixtures and mocked URLs updated to 13305; version-policy tests
  re-based on the new 10.1.0 minimum
- Docs under docs/ (spec, SDK, guides, plans) refreshed; docs/releases/
  left untouched since those are historical

This pairs with the Gemma-4-E4B swap already on this branch: Gemma 4 E4B
is only available on Lemonade v10.1.0+, so requiring the newer version
closes the gap that would otherwise make `gaia init` succeed but model
loads fail.
@github-actions Bot added the rag label Apr 24, 2026
itomek added 2 commits April 24, 2026 18:36
…6 caveat)

Captured the post-swap eval baseline for Gemma-4-E4B-it-GGUF running on
ngrok Lemonade v10.2.0, judged by claude-sonnet-4-6 — same judge model and
three categories as the pre-swap Qwen3.5-35B baseline at 3b51ca9.

Headline deltas vs Qwen baseline:
  tool_selection:     50% ->  50%  ( +0 pp, avg 7.64 -> 7.77)
  rag_quality:       100% ->  86%  (-14 pp, avg 9.47 -> 8.17)
  context_retention: 100% ->  75%  (-25 pp, avg 9.20 -> 8.55)

Per-scenario regressions:
  tool_selection/known_path_read     PASS -> FAIL — agent didn't discover
    indexed-internal-copy fallback path in Turn 1 after Access-Denied on
    original path. Turn 2 succeeded; real single-turn quality gap.
  tool_selection/smart_discovery     FAIL -> INFRA_ERROR — Gemma loaded
    with ctx_size=4096; GAIA system prompt (~16K tokens) exceeds window.
    NOT a model regression — configuration bug on Lemonade side.
  rag_quality/budget_query           PASS -> FAIL (1.15/10) — same
    ctx_size=4096 issue; exceed_context_size_error before any output.
    NOT a model regression.
  context_retention/cross_turn_file_recall   PASS -> FAIL — agent
    anchored on Turn 1 summary and ignored pricing data retrieved in
    Turn 2 chunks. Real model quality gap (prompt-engineering candidate).

The four regressions thus fall into two buckets:
  - Two are fixable by reloading Gemma with --ctx-size 32768 on Lemonade
    (matches GAIA's DEFAULT_CONTEXT_SIZE), not code changes on our side.
  - Two are legitimate model-quality gaps worth follow-up issues.

Hardware: AMD Ryzen 9 9950X + gfx1036 iGPU (2 GB VRAM, 5 GB model on disk).
~110 tok/s throughout. Total cost: ~$7 judge (Sonnet, 15 scenarios).

The initial baseline was captured with Gemma loaded on Lemonade at its
ctx_size=4096, which is below GAIA's ~16K system-prompt requirement. After
reloading with ctx_size=32768, 3 of the 4 "regressions" recovered:

  budget_query              FAIL 1.15 -> PASS 9.95   (ctx-overflow masked)
  smart_discovery     INFRA_ERROR 0.0 -> PASS 9.45   (ctx-overflow masked)
  cross_turn_file_recall    FAIL 7.52 -> PASS 8.78   (ctx-overflow masked)

The 4th regression (known_path_read) was re-run too and still FAILs at
5.6/10 — confirmed real model-quality gap, not ctx-masked.

Revised Gemma vs Qwen (at ctx=32768):
  tool_selection:    Qwen 50%  (7.64)  vs Gemma  75%  (8.36)   +25pp, +0.72
  rag_quality:       Qwen 100% (9.47)  vs Gemma 100% (9.43)    flat
  context_retention: Qwen 100% (9.20)  vs Gemma 100% (9.25)    +0.05

Net: Gemma 14/15 pass, Qwen 13/15 pass. Gemma trades one scenario loss
(known_path_read) for two scenario wins on tool_selection. Throughput is
~110 tok/s on an iGPU and footprint shrinks from 19.7GB to 5GB.

Only open follow-up is known_path_read (PASS 9.28 -> FAIL 6.67/5.6),
where Gemma doesn't discover the indexed-internal-copy fallback path
in Turn 1 after an Access-Denied error on the original path. Turn 2
recovers. This is a prompt-engineering candidate, not a blocker.
@itomek (Collaborator, Author) commented Apr 24, 2026

Eval results — Gemma-4-E4B vs Qwen3.5-35B baseline

Ran the same three-category eval suite against Gemma-4-E4B-it-GGUF on Lemonade v10.2.0, judge claude-sonnet-4-6, matching the Qwen3.5-35B-A3B-GGUF pre-swap baseline at commit 3b51ca92. Full scorecards in tests/fixtures/eval_baselines/gemma-4-e4b-d71cd914/.

Headline

| Category | Qwen pass | Gemma pass | Qwen avg | Gemma avg |
| --- | --- | --- | --- | --- |
| tool_selection | 2/4 (50%) | 3/4 (75%) | 7.64 | 8.36 |
| rag_quality | 7/7 (100%) | 7/7 (100%) | 9.47 | 9.43 |
| context_retention | 4/4 (100%) | 4/4 (100%) | 9.20 | 9.25 |
| Total | 13/15 (87%) | 14/15 (93%) | 8.82 | 8.98 |

Gemma passes more scenarios than Qwen (14 vs 13) at a fraction of the footprint (5 GB vs 19.7 GB) and equal throughput (~110 tok/s on a gfx1036 iGPU).

Wins

  • tool_selection/multi_step_plan: Qwen FAIL 6.33 → Gemma PASS 7.35
  • tool_selection/no_tools_needed: 9.95 → 9.98
  • tool_selection/smart_discovery: Qwen FAIL 5.35 → Gemma PASS 9.45
  • rag_quality/csv_analysis: 9.15 → 9.82
  • rag_quality/table_extraction: 9.27 → 9.67
  • context_retention/conversation_summary: 9.51 → 9.80

The one real regression

tool_selection/known_path_read (Qwen PASS 9.28 → Gemma FAIL 6.67; confirmed at 5.6 on a second run). When the original file path hits Access-Denied, Qwen discovers the indexed-internal-copy fallback in Turn 1; Gemma takes until Turn 2. Candidate for prompt-engineering, not a blocker.

Infra issue caught by the eval (worth a follow-up)

Initial run had 3 "regressions" that turned out to be ctx-overflow failures: Lemonade loaded Gemma with its default ctx_size=4096, below GAIA's ~16K system-prompt requirement. After reloading with ctx_size=32768, all 3 recovered (budget_query, smart_discovery, cross_turn_file_recall).

Critically, /api/system/status reported model_context_size: 32768 (the catalog value) while the loaded model was actually at 4096 — the masking cost an entire run. Follow-up: make _build_system_status read the loaded ctx_size from Lemonade's /health endpoint and warn when it diverges from the catalog default.
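
A minimal sketch of that follow-up, assuming a hypothetical ctx_size field on the /health payload:

```python
import logging

logger = logging.getLogger(__name__)

def effective_ctx_size(catalog_ctx_size: int, health: dict) -> int:
    """Prefer the ctx_size Lemonade actually loaded; warn on divergence
    instead of silently reporting the catalog value."""
    loaded = health.get("ctx_size")
    if loaded is not None and loaded != catalog_ctx_size:
        logger.warning(
            "Loaded ctx_size %s diverges from catalog default %s; "
            "system status and eval results may be misleading",
            loaded,
            catalog_ctx_size,
        )
        return loaded
    return catalog_ctx_size
```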

Artifacts

  • Pre-swap baseline: tests/fixtures/eval_baselines/qwen-3.5-35b-3b51ca92/ (committed with this PR)
  • Post-swap baseline: tests/fixtures/eval_baselines/gemma-4-e4b-d71cd914/ (committed in 6864cc37)
  • Port bump: commit d71cd914 flips DEFAULT_PORT = 8000 → 13305 and min_lemonade_version = 10.1.0 across 75 files (required for Gemma availability; tracks Lemonade v10.1.0 "spring cleaning" release)

@github-actions Bot added the cpp label Apr 24, 2026
@itomek added this pull request to the merge queue Apr 24, 2026
@itomek removed this pull request from the merge queue due to a manual request Apr 24, 2026
…->10.2.0

CI workflows still started Lemonade on the old default port 8000, but the GAIA
client now defaults to 13305 (v10.1.0+ migration). Update test_api,
test_embeddings, test_gaia_cli_{linux,windows}, test_agent_sdk, test_rag,
test_lemonade_server, and build_cpp to start Lemonade on 13305 to match.

Three unit tests hardcoded the old default model name (Qwen3.5-35B-A3B-GGUF)
and two hardcoded version strings around 10.1.0 — bump them to match the
Gemma-4-E4B/v10.2.0 floors set elsewhere in this PR.
@github-actions Bot added the devops label Apr 25, 2026
itomek added 6 commits April 25, 2026 10:26
The previous commit updated health-check URLs to :13305 but left the
server *start* commands still using port 8000 (explicit -Port 8000 in
start-lemonade.ps1 calls; implicit default on older lemonade-server
installs). This caused five jobs to time out waiting for 13305 while
the server was listening on 8000.

Changes:
- build_cpp.yml, test_lemonade_server.yml (STX): -Port 8000 → -Port 13305
- test_agent_sdk.yml, test_gaia_cli_windows.yml: add --port 13305 to
  lemonade-server serve invocations
- test_gaia_cli_linux.yml: Python lemonade-server-dev defaults to 8000;
  revert health/models URLs to :8000 and export LEMONADE_BASE_URL so
  gaia CLI connects to the correct port
CDN-protected sites (e.g. electron.build via Cloudflare) actively reset
connections from CI runner IPs. This is bot protection, not a dead URL.
Treat ConnectionResetError (errno 104) and ConnectionRefusedError (111)
as warnings — the same treatment given to timeouts and 429s — so valid
URLs behind strict anti-bot guards don't fail the external URL check.
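
The commit's policy, distilled into a sketch (the real checker's structure isn't shown in this PR; the classification values are illustrative):

```python
import urllib.error
import urllib.request

def classify_url(url: str) -> str:
    """'warning' never fails the check; 'broken' does."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        urllib.request.urlopen(req, timeout=10)
        return "ok"
    except (ConnectionResetError, ConnectionRefusedError):
        return "warning"  # errno 104/111: CDN bot protection, not link rot
    except TimeoutError:
        return "warning"
    except urllib.error.HTTPError as exc:
        return "warning" if exc.code == 429 else "broken"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (ConnectionResetError, ConnectionRefusedError)):
            return "warning"  # same resets, wrapped by urllib
        return "broken"       # genuine DNS/connect failures still fail
```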
Lemonade Server v10.1.0 changed its default port from 8000 to 13305.
The test file still used PORT = 8000, causing the Windows CI integration
tests to look for a server on :8000 (finding none), try to auto-launch
one there (failing), and assert the wrong default in the unit test.

- tests/test_lemonade_client.py: PORT constant and assertEqual assertion
- tests/test_lemonade_health.py: LEMONADE_PORT default
- tests/conftest.py: stale docstring
The Agent SDK integration test hardcoded DEFAULT_MODEL_NAME (now
Gemma-4-E4B-it-GGUF) but the GitHub-hosted Windows CI runner only
pulls Llama-3.2-3B-Instruct-Hybrid, causing HTTP 422 from the server.

- tests/test_agent_sdk.py: read model from GAIA_TEST_MODEL env var,
  falling back to DEFAULT_MODEL_NAME so local runs still work
- test_agent_sdk.yml: set GAIA_TEST_MODEL=Llama-3.2-3B-Instruct-Hybrid
  before running the test suite to match the pulled model
… var

lemonade-server-dev (Python) always binds to port 8000 and has no
--port flag, while C++ lemonade-server v10.1.0+ defaults to 13305.

The integration test was using PORT=13305 unconditionally, causing
is_server_running(localhost, 13305) to return False on Linux, then
auto-starting a new server on 13305 which also failed.

Fix: make PORT read LEMONADE_PORT env var (defaults to 13305), and
pass LEMONADE_PORT=8000 inline when running Linux integration tests.
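
The env-var fallback pattern shared by the last two fixes, as a sketch (constant names match the commit messages; the model fallback value reflects this PR's default):

```python
import os

# C++ lemonade-server v10.1.0+ defaults to 13305, but the Python
# lemonade-server-dev used on Linux CI always binds 8000 — so let the
# environment pick, defaulting to the new port.
PORT = int(os.environ.get("LEMONADE_PORT", "13305"))

# Same pattern for the model: CI pins whatever model the runner pulled,
# local runs fall through to the repo default.
MODEL = os.environ.get("GAIA_TEST_MODEL", "Gemma-4-E4B-it-GGUF")
```

Linux integration runs then pass LEMONADE_PORT=8000 inline, and the Windows Agent SDK job exports GAIA_TEST_MODEL=Llama-3.2-3B-Instruct-Hybrid.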
@kovtcharov added this pull request to the merge queue Apr 26, 2026
Merged via the queue into main with commit 5d37771 Apr 26, 2026
69 of 72 checks passed
@kovtcharov deleted the claude/sad-matsumoto-fd179f branch April 26, 2026 02:50
kovtcharov added a commit that referenced this pull request Apr 26, 2026
Qwen3-4B-Instruct-2507 was a reasonable interim default but Google's
Gemma 4 E4B is a better fit for the lite agent's mission: ~4B effective
params, 128K context, natively multimodal, Apache 2.0. Same memory
footprint (~2.7 GB Q4 weights, 5 GB total) — strictly better trade-off
on quality, license, and forward-compatibility with PR #865's universal
Gemma 4 transition.

Fallback: Gemma-3-4b-it-GGUF for Lemonade catalogs that haven't picked
up the Gemma 4 drop yet.

Side fix while in this code: factory's setdefault now reads the primary
from the registration's models list (single source of truth) rather
than hardcoding the same string twice. Without this, the "Fallback: ..."
comment on the registration was a lie at runtime — if the primary
wasn't in the catalog, the factory still hardcoded it instead of
walking the list.

Test plan: tests/unit/agents/test_registry.py covers the 4B-class
invariant (case-insensitive now to accommodate Gemma 3's lowercase
"4b" naming), the factory preset, and the chat-lite legacy alias path.
kovtcharov added a commit that referenced this pull request Apr 26, 2026
Brings in PR #865 (Gemma 4 E4B universal default + native tool_calls
priority + Lemonade v10.1.0 port flip 8000→13305) on top of our
existing gaia-lite work. The two PRs are complementary: #865 sets the
global LLM/VLM default; this branch sets gaia-lite's preset.

Conflict resolution
-------------------

src/gaia/llm/lemonade_manager.py
  Both branches changed the module header.  Ours added a re-export
  pattern (__all__ + ``from lemonade_client import DEFAULT_CONTEXT_SIZE``)
  so DEFAULT_CONTEXT_SIZE has a single source of truth.  Main flipped
  the default port 8000→13305 for Lemonade v10.1.0.  Resolution keeps
  both: the re-export pattern is preserved and the port is updated.

Test updates required by stricter pre-flight semantics
------------------------------------------------------

This branch's _maybe_load_expected_model + _ensure_model_loaded changes
tightened two behaviours that the existing test suite expressed against
the old loose semantics:

  tests/unit/test_chat_preflight.py
    Pre-flight now requires the EXPECTED model to be active (not just
    any LLM) and its ctx_size ≥ 32K.  Updated _model() helper to accept
    name + ctx_size kwargs; the four "skips load" tests now load a
    health entry whose model_name matches the expected model_id and
    whose ctx_size is at the 32K floor.  Without this, the test fixture
    "test-llm" mismatched the expected "Qwen3.5-35B-A3B-GGUF" and tripped
    the wrong-model reload branch our PR added.

  tests/unit/test_lemonade_model_loading.py
    _ensure_model_loaded now defaults ctx_size to DEFAULT_CONTEXT_SIZE
    (32768) for unknown models, fixing the silent-empty-stream regression
    where Lemonade's default 4096 ctx truncated ChatAgent's >7K-token
    system prompt.  Two assertions updated from ctx_size=None → 32768.

All affected tests pass; lint clean.
kovtcharov added a commit that referenced this pull request Apr 26, 2026
These were marked "known flakies, pre-existing on main" in the merge
PR, but every one was a real test bug worth nailing down rather than
papering over.  All three reproduced on bare main HEAD.

test_sse_confirmation (3 tests)
  Polled ``handler._confirm_result is None`` to detect when the worker
  thread had registered itself.  But _confirm_result is initialised to
  ``False`` (not None), so the polling loop exited immediately — resolve
  fired before the worker's confirm_tool_execution set up _confirm_event,
  and the worker's own setdefault then overwrote the resolved state with
  a fresh unset event.  Net result: the worker waited for an event that
  no one would ever set, hit the internal 90 s confirmation timeout, and
  the test failed with "thread still alive".

  Fix: poll ``handler._confirm_event is None`` instead.  _confirm_event
  starts as None and only becomes non-None inside confirm_tool_execution,
  so it correctly tracks the registration moment.

test_semaphore_exhausted_returns_429
  Created a SECOND asyncio event loop with ``asyncio.new_event_loop()``
  and acquired the semaphore on it, then handed the half-locked semaphore
  to TestClient (which runs on its OWN loop).  ``asyncio.Semaphore``
  doesn't promise cross-loop sanity — the waiter list is loop-bound, so
  acquire() on TestClient's loop saw inconsistent state under contention.

  Fix: use ``Semaphore(0)`` — exhausted from birth, no second loop.  Plus
  patch ``asyncio.wait_for`` to a 0.2 s timeout in the chat router so
  the test goes from 60 s → 0.6 s.

test_llm_command_with_server
  Health check accepted any 200, even when ``all_models_loaded == []``.
  Worse: even with a model loaded, ``gaia llm`` defaults to whatever the
  global default is — post-PR-#865 that's Gemma-4-E4B-it-GGUF.  CI runners
  almost never have Gemma preloaded, so Lemonade returned 500, the OpenAI
  client retried with exponential backoff, and the subprocess timed out
  at 60 s.

  Fix: extend the health check to require at least one ``llm``/``vlm``
  in ``all_models_loaded`` and return that model's name.  The test then
  passes ``--model <loaded_one>`` so we don't trip the auto-load on a
  model the runner doesn't have.

Verified: full unit suite 1630 passed / 0 failed / 15 skipped.
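
The Semaphore(0) fix, distilled into a self-contained sketch — not GAIA's actual chat router; the app, route, and timeout here are illustrative:

```python
import asyncio

from fastapi import FastAPI, HTTPException
from fastapi.testclient import TestClient

app = FastAPI()
# Exhausted from birth: no second event loop, so the semaphore's waiter
# list lives entirely on TestClient's loop.
chat_semaphore = asyncio.Semaphore(0)

@app.post("/chat")
async def chat():
    try:
        # Short timeout keeps the test at ~0.2 s instead of 60 s.
        await asyncio.wait_for(chat_semaphore.acquire(), timeout=0.2)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=429, detail="server busy")
    try:
        return {"ok": True}
    finally:
        chat_semaphore.release()

def test_semaphore_exhausted_returns_429():
    assert TestClient(app).post("/chat").status_code == 429
```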
itomek pushed 4 commits that referenced this pull request Apr 29, 2026
@github-actions Bot mentioned this pull request May 1, 2026
Development

Successfully merging this pull request may close these issues:

feat: Switch default agent model to Gemma 4 (26B-A4B)