
feat(llm): add Gemma 4 E4B as default and native tool_calls priority #865

Merged
kovtcharov merged 19 commits into main from claude/sad-matsumoto-fd179f
Apr 26, 2026

Conversation

@itomek (Collaborator) commented Apr 24, 2026

Summary

Gemma-4-E4B-it-GGUF becomes GAIA's default model for all roles (LLM, VLM, installer profiles, CLI, Agent UI, eval, EMR). The PR simultaneously inverts the tool-call priority chain so that native OpenAI tool_calls is the primary path, keeping the embedded-JSON format only as a fallback for legacy non-tool-calling models. It also bumps the minimum Lemonade version to v10.1.0, the release that moved the default port from 8000 → 13305 and added Gemma 4 support.

This ships on top of the existing UI model-resolution fixes (#841, #842). Resolves #863.

What changed and why

  • Universal Gemma default — Gemma 4 E4B is natively multimodal (~4.5B effective params, 128K context, Apache 2.0), making it the right single default across the LLM/VLM split that previously required two different models. Footprint drops 19.7 GB → 5 GB.
  • Native tool_calls path (Lemonade v10.1.0+ --jinja) — GAIA now passes tools=[...] to Lemonade for tool-capable models. The response comes back as native tool_calls; LemonadeProvider.chat() encodes them as a sentinel JSON string ({"__tool_calls__": ...}) so no caller needs a type change. _parse_llm_response detects the sentinel and returns the unified {"tool": ..., "tool_args": ...} dict (sketch after this list).
  • System-prompt gating — The embedded-JSON format block (_PLANNING_FORMAT/_CONVERSATIONAL_FORMAT) is excluded from the composed system prompt for tool-calling models; it actively prevented native tool_calls in prior testing.
  • Startup validator — _validate_profile_model_registry() raises at import time if any AGENT_PROFILES entry references a model key not in MODELS.
  • Lemonade v10.1.0+ / port 13305 — DEFAULT_PORT flipped from 8000 to 13305 (Lemonade's spring-cleaning release changed the default). 75 files updated (agents, UI, MCP bridge, RAG SDK, VLM, CLI, tests, docs). min_lemonade_version = 10.1.0 everywhere INIT_PROFILES is declared.
  • Eval baselines — Pre-swap Qwen3.5-35B baseline at commit 3b51ca92 and post-swap Gemma-4-E4B baseline both committed under tests/fixtures/eval_baselines/; Gemma outperforms Qwen, passing 14/15 scenarios vs 13/15 (see comment below for the per-scenario breakdown).
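The sentinel round-trip from the tool_calls bullet above, as a minimal sketch — the sentinel key and the unified return dict come from this PR, but the function bodies and the single-call handling are illustrative:

```python
import json

TOOL_CALLS_SENTINEL_KEY = "__tool_calls__"

def encode_tool_calls(response_message: dict) -> str:
    """Provider side (cf. LemonadeProvider.chat): fold native OpenAI
    tool_calls into a plain string so the return type stays str."""
    tool_calls = response_message.get("tool_calls")
    if tool_calls:
        return json.dumps({TOOL_CALLS_SENTINEL_KEY: tool_calls})
    return response_message.get("content") or ""

def parse_llm_response(text: str) -> dict:
    """Agent side (cf. _parse_llm_response): detect the sentinel and
    return the unified tool/tool_args dict."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict) and TOOL_CALLS_SENTINEL_KEY in payload:
        call = payload[TOOL_CALLS_SENTINEL_KEY][0]  # first call only, for brevity
        return {
            "tool": call["function"]["name"],
            "tool_args": json.loads(call["function"]["arguments"]),
        }
    return {"response": text}  # legacy/plain path unchanged
```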

Test plan

  • python -m pytest tests/unit/ --ignore=tests/unit/chat/ui/ -q → 928 passed, 16 skipped
  • python -m pytest tests/unit/test_tool_call_priority.py -v → 23 passed (sentinel detection, native branch parsing, edge cases, prompt gating, startup validator)
  • python util/lint.py --black --isort --flake8 → all pass
  • Eval against Gemma-4-E4B on Lemonade v10.2.0, Sonnet judge → 14/15 scenarios pass, beats Qwen baseline (see comment)
  • Verified claude -p --model claude-sonnet-4-6 was actually the judge (not Opus) via modelUsage in test subprocess

Open follow-ups (not blockers for this PR)

  • tool_selection/known_path_read regression: Gemma doesn't discover the indexed-internal-copy fallback path in Turn 1 after an Access-Denied error on the original path. Prompt-engineering candidate.
  • /api/system/status reports the catalog ctx_size even when Lemonade loaded the model with a smaller window. Surface a warning when they diverge; a whole eval run was wasted because the mismatch was masked.

itomek and others added 6 commits April 20, 2026 18:50

Previously, _chat_helpers.py always passed model_id=<session model> explicitly
to registry.create_agent(), defeating kwargs.setdefault("model_id", ...) in
custom agents — which only fires when the key is absent.

Fix: build create_kwargs conditionally, omitting model_id when the session is
at the DB default so the agent's __init__ setdefault governs. Also use
agent.model_id (post-construction) for both _store_agent cache key and the
pre-flight _maybe_load_expected_model call.

Three-branch precedence: custom_model setting > session-explicit > omit kwarg.

Closes #841
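
A sketch of that three-branch precedence — the create_kwargs shape and the setdefault interplay come from the commit; the helper signature is illustrative:

```python
def _build_create_kwargs(session_model_id, custom_model, session_default):
    """custom_model setting > session-explicit > omit kwarg entirely."""
    create_kwargs = {}
    if custom_model:
        # Branch 1: an explicit custom_model setting always wins.
        create_kwargs["model_id"] = custom_model
    elif session_model_id != session_default:
        # Branch 2: the session deliberately picked a non-default model.
        create_kwargs["model_id"] = session_model_id
    # Branch 3: session is at the DB default -> omit model_id so the
    # agent's own kwargs.setdefault("model_id", ...) governs.
    return create_kwargs
```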
…N_DEFAULT_MODEL

Addresses code review feedback on PR #842:

- Export SESSION_DEFAULT_MODEL from database.py (single source of truth)
  instead of duplicating the string literal in _chat_helpers.py
- Extract _build_create_kwargs() helper to eliminate the duplicate three-branch
  create_kwargs logic across non-streaming and streaming code paths
- Extract _effective_model() helper using explicit None check (not `or`)
  to safely read agent.model_id post-construction without treating empty
  string as missing
- Fix static regression guard regex to use [^()]* so nested helper calls
  inside create_agent() are not falsely flagged
- Update unit test to import SESSION_DEFAULT_MODEL instead of hardcoding
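
The explicit-None check matters because `or` would treat an agent that legitimately sets model_id = "" as having no model at all. A sketch under that reading:

```python
def _effective_model(agent, requested_model_id):
    """Post-construction model read: only a missing attribute or None
    falls back to the requested id; an empty string is respected."""
    model_id = getattr(agent, "model_id", None)
    return requested_model_id if model_id is None else model_id
```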
…ion (#842)

_store_agent was changed by the #842 fix to use _effective_model(agent,
model_id) as the cache key — the post-construction value set by kwargs.setdefault.
_get_cached_agent still looks up using the pre-construction model_id variable.
For custom agents whose setdefault model differs from the session model, the
keys never match and the agent is rebuilt on every turn.

Revert the two _store_agent call sites to use model_id (the pre-construction
intent key), matching what the lookup uses. _effective_model stays at the two
_maybe_load_expected_model sites (Lemonade pre-flight needs the actual model)
and in log statements (observability).

Add two regression guards:
- test_cache_hit_on_second_turn_for_setdefault_agent: two-turn cache-hit test
  with four assertions (call count, object identity, stored-key equality,
  agent.model_id). Covers the builder/template.py setdefault pattern.
- test_no_effective_model_in_store_agent_calls: static grep guard that asserts
  _store_agent never receives _effective_model(...) as a positional arg,
  preventing this pattern from silently returning in a future cleanup pass.
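
The static guard described in the last bullet could look like this sketch (the file path and regex here are illustrative, not the committed test):

```python
import pathlib
import re

def test_no_effective_model_in_store_agent_calls():
    # Hypothetical path to the helpers module under test.
    source = pathlib.Path("src/gaia/chat/_chat_helpers.py").read_text()
    # [^()]* keeps the match inside one argument list, so unrelated
    # nested calls elsewhere in the file aren't falsely flagged.
    assert not re.search(r"_store_agent\([^()]*_effective_model\(", source)
```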
#817)

## Summary

One-line fix: swap the failing `www.cs.cmu.edu/~tom/EMNLP2004_final.pdf`
URL in `docs/plans/email-triage-agent.mdx:2601` for the canonical ACL
Anthology record at [W04-3240](https://aclanthology.org/W04-3240/). The
CMU URL fails DNS resolution in CI (see [recent
run](https://github.com/amd/gaia/actions/runs/24595902571/job/72072156929)),
breaking the ``Verify external URLs`` check for every open PR that
touches docs. ACL Anthology is the permanent archive for ACL/EMNLP
papers — stable URL, no more link rot.

Also restored the paper's actual full title ("Learning to Classify Email
into 'Speech Acts'") for consistency with the other full-title citations
in the same references list.

## Test plan

- [x] `curl -sI https://aclanthology.org/W04-3240/` returns 200
- [ ] After merge, `Verify external URLs` check should go green on
downstream PRs
Gemma-4-E4B-it-GGUF becomes the default model for all GAIA roles (LLM,
VLM, installer, CLI, UI, eval, EMR), replacing the Qwen family defaults.
Simultaneously inverts the tool-call priority chain: for tool-calling
models, GAIA now passes `tools=[...]` to Lemonade and handles native
OpenAI `tool_calls` as the primary path, falling back to embedded-JSON
format only for legacy non-tool-calling models.

Key changes:
- lemonade_client.py: adds `tool_calling` field to ModelRequirement,
  new `is_tool_calling_model()` helper (optimistic default for unknown
  GGUFs), startup `_validate_profile_model_registry()` validator,
  gemma-4-e4b entry in MODELS, AGENT_PROFILES all referencing gemma-4-e4b
- providers/lemonade.py: surfaces native tool_calls as a sentinel JSON
  string so the response type stays `str` throughout the call chain;
  forces non-streaming when tools are provided to a tool-capable model
- agents/base/agent.py: native tool_calls branch in _parse_llm_response,
  _build_openai_tool_schemas + _openai_tools property, system-prompt gating
  (excludes embedded-JSON format template for tool-calling models)
- chat/sdk.py: threads `tools` kwarg through send_messages/stream

Includes pre-swap eval baseline for Qwen3.5-35B at commit 3b51ca9 and
23 new unit tests covering the full tool_calls priority chain.
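
A sketch of the two lemonade_client.py additions named above; the registry shapes are assumed plain dicts for brevity:

```python
MODELS = {"gemma-4-e4b": {"tool_calling": True}}   # illustrative shape
AGENT_PROFILES = {"chat": "gemma-4-e4b"}           # illustrative shape

def is_tool_calling_model(model_key: str) -> bool:
    """Optimistic default: unknown GGUFs are assumed tool-capable, so
    new models get the native tool_calls path until proven otherwise."""
    entry = MODELS.get(model_key)
    return True if entry is None else entry.get("tool_calling", True)

def _validate_profile_model_registry() -> None:
    """Import-time guard: every AGENT_PROFILES entry must name a model
    that actually exists in MODELS."""
    unknown = {p: m for p, m in AGENT_PROFILES.items() if m not in MODELS}
    if unknown:
        raise ValueError(f"AGENT_PROFILES references unknown models: {unknown}")

_validate_profile_model_registry()  # raises at import time, per the PR
```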
@itomek requested a review from kovtcharov-amd as a code owner April 24, 2026 18:55
@github-actions Bot added labels: documentation, dependencies, agents, chat, mcp, llm, cli, eval, tests, performance Apr 24, 2026
@itomek marked this pull request as draft April 24, 2026 18:56
@itomek self-assigned this Apr 24, 2026
@itomek linked an issue Apr 24, 2026 that may be closed by this pull request
itomek added 2 commits April 24, 2026 15:08
Keep Gemma-4-E4B-it-GGUF as SESSION_DEFAULT_MODEL per this branch's intent.
Lemonade v10.1.0 ("spring cleaning" release, 2026-04-06) changed its
default port from 8000 to 13305 and added Gemma 4 on GPU support.
Migration guide: https://github.com/lemonade-sdk/lemonade/wiki/Migration#v10x---v101

Changes:
- DEFAULT_PORT in lemonade_client.py flipped to 13305 (with a comment
  pointing at the migration guide)
- min_lemonade_version in every INIT_PROFILES entry bumped 10.0.0 -> 10.1.0
- LEMONADE_VERSION constant bumped 10.0.0 -> 10.1.0
- All agent base_url fallbacks, UI helpers, MCP bridge, RAG/VLM SDKs, and
  CLI kill/port defaults updated to 13305
- Test fixtures and mocked URLs updated to 13305; version-policy tests
  re-based on the new 10.1.0 minimum
- Docs under docs/ (spec, SDK, guides, plans) refreshed; docs/releases/
  left untouched since those are historical

This pairs with the Gemma-4-E4B swap already on this branch: Gemma 4 E4B
is only available on Lemonade v10.1.0+, so requiring the newer version
closes the gap that would otherwise make `gaia init` succeed but model
loads fail.
@github-actions Bot added the rag label Apr 24, 2026
itomek added 2 commits April 24, 2026 18:36
…6 caveat)

Captured the post-swap eval baseline for Gemma-4-E4B-it-GGUF running on
ngrok Lemonade v10.2.0, judged by claude-sonnet-4-6 — same judge model and
three categories as the pre-swap Qwen3.5-35B baseline at 3b51ca9.

Headline deltas vs Qwen baseline:
  tool_selection:     50% ->  50%  ( +0 pp, avg 7.64 -> 7.77)
  rag_quality:       100% ->  86%  (-14 pp, avg 9.47 -> 8.17)
  context_retention: 100% ->  75%  (-25 pp, avg 9.20 -> 8.55)

Per-scenario regressions:
  tool_selection/known_path_read     PASS -> FAIL — agent didn't discover
    indexed-internal-copy fallback path in Turn 1 after Access-Denied on
    original path. Turn 2 succeeded; real single-turn quality gap.
  tool_selection/smart_discovery     FAIL -> INFRA_ERROR — Gemma loaded
    with ctx_size=4096; GAIA system prompt (~16K tokens) exceeds window.
    NOT a model regression — configuration bug on Lemonade side.
  rag_quality/budget_query           PASS -> FAIL (1.15/10) — same
    ctx_size=4096 issue; exceed_context_size_error before any output.
    NOT a model regression.
  context_retention/cross_turn_file_recall   PASS -> FAIL — agent
    anchored on Turn 1 summary and ignored pricing data retrieved in
    Turn 2 chunks. Real model quality gap (prompt-engineering candidate).

The four regressions thus fall into two buckets:
  - Two are fixable by reloading Gemma with --ctx-size 32768 on Lemonade
    (matches GAIA's DEFAULT_CONTEXT_SIZE), not code changes on our side.
  - Two are legitimate model-quality gaps worth follow-up issues.

Hardware: AMD Ryzen 9 9950X + gfx1036 iGPU (2 GB VRAM, 5 GB model on disk).
~110 tok/s throughout. Total cost: ~$7 judge (Sonnet, 15 scenarios).

The initial baseline was captured with Gemma loaded on Lemonade at its
ctx_size=4096, which is below GAIA's ~16K system-prompt requirement. After
reloading with ctx_size=32768, 3 of the 4 "regressions" recovered:

  budget_query              FAIL 1.15 -> PASS 9.95   (ctx-overflow masked)
  smart_discovery     INFRA_ERROR 0.0 -> PASS 9.45   (ctx-overflow masked)
  cross_turn_file_recall    FAIL 7.52 -> PASS 8.78   (ctx-overflow masked)

The 4th regression (known_path_read) was re-run too and still FAILs at
5.6/10 — confirmed real model-quality gap, not ctx-masked.

Revised Gemma vs Qwen (at ctx=32768):
  tool_selection:    Qwen 50%  (7.64)  vs Gemma  75%  (8.36)   +25pp, +0.72
  rag_quality:       Qwen 100% (9.47)  vs Gemma 100% (9.43)    flat
  context_retention: Qwen 100% (9.20)  vs Gemma 100% (9.25)    +0.05

Net: Gemma 14/15 pass, Qwen 13/15 pass. Gemma trades one scenario loss
(known_path_read) for two scenario wins on tool_selection. Throughput is
~110 tok/s on an iGPU and footprint shrinks from 19.7GB to 5GB.

Only open follow-up is known_path_read (PASS 9.28 -> FAIL 6.67/5.6),
where Gemma doesn't discover the indexed-internal-copy fallback path
in Turn 1 after an Access-Denied error on the original path. Turn 2
recovers. This is a prompt-engineering candidate, not a blocker.
@itomek (Collaborator, Author) commented Apr 24, 2026

Eval results — Gemma-4-E4B vs Qwen3.5-35B baseline

Ran the same three-category eval suite against Gemma-4-E4B-it-GGUF on Lemonade v10.2.0, judge claude-sonnet-4-6, matching the Qwen3.5-35B-A3B-GGUF pre-swap baseline at commit 3b51ca92. Full scorecards in tests/fixtures/eval_baselines/gemma-4-e4b-d71cd914/.

Headline

| Category | Qwen pass | Gemma pass | Qwen avg | Gemma avg |
| --- | --- | --- | --- | --- |
| tool_selection | 2/4 (50%) | 3/4 (75%) | 7.64 | 8.36 |
| rag_quality | 7/7 (100%) | 7/7 (100%) | 9.47 | 9.43 |
| context_retention | 4/4 (100%) | 4/4 (100%) | 9.20 | 9.25 |
| Total | 13/15 (87%) | 14/15 (93%) | 8.82 | 8.98 |

Gemma passes more scenarios than Qwen (14 vs 13) at a fraction of the footprint (5 GB vs 19.7 GB) and equal throughput (~110 tok/s on a gfx1036 iGPU).

Wins

  • tool_selection/multi_step_plan: Qwen FAIL 6.33 → Gemma PASS 7.35
  • tool_selection/no_tools_needed: 9.95 → 9.98
  • tool_selection/smart_discovery: Qwen FAIL 5.35 → Gemma PASS 9.45
  • rag_quality/csv_analysis: 9.15 → 9.82
  • rag_quality/table_extraction: 9.27 → 9.67
  • context_retention/conversation_summary: 9.51 → 9.80

The one real regression

tool_selection/known_path_read (Qwen PASS 9.28 → Gemma FAIL 6.67; confirmed at 5.6 on a second run). When the original file path hits Access-Denied, Qwen discovers the indexed-internal-copy fallback in Turn 1; Gemma takes until Turn 2. Candidate for prompt-engineering, not a blocker.

Infra issue caught by the eval (worth a follow-up)

Initial run had 3 "regressions" that turned out to be ctx-overflow failures: Lemonade loaded Gemma with its default ctx_size=4096, below GAIA's ~16K system-prompt requirement. After reloading with ctx_size=32768, all 3 recovered (budget_query, smart_discovery, cross_turn_file_recall).

Critically, /api/system/status reported model_context_size: 32768 (the catalog value) while the loaded model was actually at 4096 — the masking cost an entire run. Follow-up: make _build_system_status read the loaded ctx_size from Lemonade's /health endpoint and warn when it diverges from the catalog default.
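
A minimal sketch of that follow-up, assuming a hypothetical ctx_size field on the /health payload:

```python
import logging

logger = logging.getLogger(__name__)

def effective_ctx_size(catalog_ctx_size: int, health: dict) -> int:
    """Prefer the ctx_size Lemonade actually loaded; warn on divergence
    instead of silently reporting the catalog value."""
    loaded = health.get("ctx_size")
    if loaded is not None and loaded != catalog_ctx_size:
        logger.warning(
            "Loaded ctx_size %s diverges from catalog default %s; "
            "system status and eval results may be misleading",
            loaded,
            catalog_ctx_size,
        )
        return loaded
    return catalog_ctx_size
```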

Artifacts

  • Pre-swap baseline: tests/fixtures/eval_baselines/qwen-3.5-35b-3b51ca92/ (committed with this PR)
  • Post-swap baseline: tests/fixtures/eval_baselines/gemma-4-e4b-d71cd914/ (committed in 6864cc37)
  • Port bump: commit d71cd914 flips DEFAULT_PORT = 8000 → 13305 and min_lemonade_version = 10.1.0 across 75 files (required for Gemma availability; tracks Lemonade v10.1.0 "spring cleaning" release)

@github-actions Bot added the cpp label Apr 24, 2026
@itomek added this pull request to the merge queue Apr 24, 2026
@itomek removed this pull request from the merge queue due to a manual request Apr 24, 2026
…->10.2.0

CI workflows still started Lemonade on the old default port 8000, but the GAIA
client now defaults to 13305 (v10.1.0+ migration). Update test_api,
test_embeddings, test_gaia_cli_{linux,windows}, test_agent_sdk, test_rag,
test_lemonade_server, and build_cpp to start Lemonade on 13305 to match.

Three unit tests hardcoded the old default model name (Qwen3.5-35B-A3B-GGUF)
and two hardcoded version strings around 10.1.0 — bump them to match the
Gemma-4-E4B/v10.2.0 floors set elsewhere in this PR.
@github-actions Bot added the devops label Apr 25, 2026
itomek added 6 commits April 25, 2026 10:26
The previous commit updated health-check URLs to :13305 but left the
server *start* commands still using port 8000 (explicit -Port 8000 in
start-lemonade.ps1 calls; implicit default on older lemonade-server
installs). This caused five jobs to time out waiting for 13305 while
the server was listening on 8000.

Changes:
- build_cpp.yml, test_lemonade_server.yml (STX): -Port 8000 → -Port 13305
- test_agent_sdk.yml, test_gaia_cli_windows.yml: add --port 13305 to
  lemonade-server serve invocations
- test_gaia_cli_linux.yml: Python lemonade-server-dev defaults to 8000;
  revert health/models URLs to :8000 and export LEMONADE_BASE_URL so
  gaia CLI connects to the correct port
CDN-protected sites (e.g. electron.build via Cloudflare) actively reset
connections from CI runner IPs. This is bot protection, not a dead URL.
Treat ConnectionResetError (errno 104) and ConnectionRefusedError (111)
as warnings — the same treatment given to timeouts and 429s — so valid
URLs behind strict anti-bot guards don't fail the external URL check.
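
The commit's policy, distilled into a sketch (the real checker's structure isn't shown in this PR; the classification values are illustrative):

```python
import urllib.error
import urllib.request

def classify_url(url: str) -> str:
    """'warning' never fails the check; 'broken' does."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        urllib.request.urlopen(req, timeout=10)
        return "ok"
    except (ConnectionResetError, ConnectionRefusedError):
        return "warning"  # errno 104/111: CDN bot protection, not link rot
    except TimeoutError:
        return "warning"
    except urllib.error.HTTPError as exc:
        return "warning" if exc.code == 429 else "broken"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (ConnectionResetError, ConnectionRefusedError)):
            return "warning"  # same resets, wrapped by urllib
        return "broken"       # genuine DNS/connect failures still fail
```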
Lemonade Server v10.1.0 changed its default port from 8000 to 13305.
The test file still used PORT = 8000, causing the Windows CI integration
tests to look for a server on :8000 (finding none), try to auto-launch
one there (failing), and assert the wrong default in the unit test.

- tests/test_lemonade_client.py: PORT constant and assertEqual assertion
- tests/test_lemonade_health.py: LEMONADE_PORT default
- tests/conftest.py: stale docstring
The Agent SDK integration test hardcoded DEFAULT_MODEL_NAME (now
Gemma-4-E4B-it-GGUF) but the GitHub-hosted Windows CI runner only
pulls Llama-3.2-3B-Instruct-Hybrid, causing HTTP 422 from the server.

- tests/test_agent_sdk.py: read model from GAIA_TEST_MODEL env var,
  falling back to DEFAULT_MODEL_NAME so local runs still work
- test_agent_sdk.yml: set GAIA_TEST_MODEL=Llama-3.2-3B-Instruct-Hybrid
  before running the test suite to match the pulled model
… var

lemonade-server-dev (Python) always binds to port 8000 and has no
--port flag, while C++ lemonade-server v10.1.0+ defaults to 13305.

The integration test was using PORT=13305 unconditionally, causing
is_server_running(localhost, 13305) to return False on Linux, then
auto-starting a new server on 13305 which also failed.

Fix: make PORT read LEMONADE_PORT env var (defaults to 13305), and
pass LEMONADE_PORT=8000 inline when running Linux integration tests.
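
The env-var fallback pattern shared by the last two fixes, as a sketch (constant names match the commit messages; the model fallback value reflects this PR's default):

```python
import os

# C++ lemonade-server v10.1.0+ defaults to 13305, but the Python
# lemonade-server-dev used on Linux CI always binds 8000 — so let the
# environment pick, defaulting to the new port.
PORT = int(os.environ.get("LEMONADE_PORT", "13305"))

# Same pattern for the model: CI pins whatever model the runner pulled,
# local runs fall through to the repo default.
MODEL = os.environ.get("GAIA_TEST_MODEL", "Gemma-4-E4B-it-GGUF")
```

Linux integration runs then pass LEMONADE_PORT=8000 inline, and the Windows Agent SDK job exports GAIA_TEST_MODEL=Llama-3.2-3B-Instruct-Hybrid.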
@kovtcharov added this pull request to the merge queue Apr 26, 2026
Merged via the queue into main with commit 5d37771 Apr 26, 2026
69 of 72 checks passed
@kovtcharov deleted the claude/sad-matsumoto-fd179f branch April 26, 2026 02:50
kovtcharov added a commit that referenced this pull request Apr 26, 2026
Qwen3-4B-Instruct-2507 was a reasonable interim default but Google's
Gemma 4 E4B is a better fit for the lite agent's mission: ~4B effective
params, 128K context, natively multimodal, Apache 2.0. Same memory
footprint (~2.7 GB Q4 weights, 5 GB total) — strictly better trade-off
on quality, license, and forward-compatibility with PR #865's universal
Gemma 4 transition.

Fallback: Gemma-3-4b-it-GGUF for Lemonade catalogs that haven't picked
up the Gemma 4 drop yet.

Side fix while in this code: factory's setdefault now reads the primary
from the registration's models list (single source of truth) rather
than hardcoding the same string twice. Without this, the "Fallback: ..."
comment on the registration was a lie at runtime — if the primary
wasn't in the catalog, the factory still hardcoded it instead of
walking the list.

Test plan: tests/unit/agents/test_registry.py covers the 4B-class
invariant (case-insensitive now to accommodate Gemma 3's lowercase
"4b" naming), the factory preset, and the chat-lite legacy alias path.
kovtcharov added a commit that referenced this pull request Apr 26, 2026
Brings in PR #865 (Gemma 4 E4B universal default + native tool_calls
priority + Lemonade v10.1.0 port flip 8000→13305) on top of our
existing gaia-lite work. The two PRs are complementary: #865 sets the
global LLM/VLM default; this branch sets gaia-lite's preset.

Conflict resolution
-------------------

src/gaia/llm/lemonade_manager.py
  Both branches changed the module header.  Ours added a re-export
  pattern (__all__ + ``from lemonade_client import DEFAULT_CONTEXT_SIZE``)
  so DEFAULT_CONTEXT_SIZE has a single source of truth.  Main flipped
  the default port 8000→13305 for Lemonade v10.1.0.  Resolution keeps
  both: the re-export pattern is preserved and the port is updated.

Test updates required by stricter pre-flight semantics
------------------------------------------------------

This branch's _maybe_load_expected_model + _ensure_model_loaded changes
tightened two behaviours that the existing test suite expressed against
the old loose semantics:

  tests/unit/test_chat_preflight.py
    Pre-flight now requires the EXPECTED model to be active (not just
    any LLM) and its ctx_size ≥ 32K.  Updated _model() helper to accept
    name + ctx_size kwargs; the four "skips load" tests now load a
    health entry whose model_name matches the expected model_id and
    whose ctx_size is at the 32K floor.  Without this, the test fixture
    "test-llm" mismatched the expected "Qwen3.5-35B-A3B-GGUF" and tripped
    the wrong-model reload branch our PR added.

  tests/unit/test_lemonade_model_loading.py
    _ensure_model_loaded now defaults ctx_size to DEFAULT_CONTEXT_SIZE
    (32768) for unknown models, fixing the silent-empty-stream regression
    where Lemonade's default 4096 ctx truncated ChatAgent's >7K-token
    system prompt.  Two assertions updated from ctx_size=None → 32768.

All affected tests pass; lint clean.
kovtcharov added a commit that referenced this pull request Apr 26, 2026
These were marked "known flakies, pre-existing on main" in the merge
PR, but every one was a real test bug worth nailing down rather than
papering over.  All three reproduced on bare main HEAD.

test_sse_confirmation (3 tests)
  Polled ``handler._confirm_result is None`` to detect when the worker
  thread had registered itself.  But _confirm_result is initialised to
  ``False`` (not None), so the polling loop exited immediately — resolve
  fired before the worker's confirm_tool_execution set up _confirm_event,
  and the worker's own setdefault then overwrote the resolved state with
  a fresh unset event.  Net result: the worker waited for an event that
  no one would ever set, hit the internal 90 s confirmation timeout, and
  the test failed with "thread still alive".

  Fix: poll ``handler._confirm_event is None`` instead.  _confirm_event
  starts as None and only becomes non-None inside confirm_tool_execution,
  so it correctly tracks the registration moment.

test_semaphore_exhausted_returns_429
  Created a SECOND asyncio event loop with ``asyncio.new_event_loop()``
  and acquired the semaphore on it, then handed the half-locked semaphore
  to TestClient (which runs on its OWN loop).  ``asyncio.Semaphore``
  doesn't promise cross-loop sanity — the waiter list is loop-bound, so
  acquire() on TestClient's loop saw inconsistent state under contention.

  Fix: use ``Semaphore(0)`` — exhausted from birth, no second loop.  Plus
  patch ``asyncio.wait_for`` to a 0.2 s timeout in the chat router so
  the test goes from 60 s → 0.6 s.

test_llm_command_with_server
  Health check accepted any 200, even when ``all_models_loaded == []``.
  Worse: even with a model loaded, ``gaia llm`` defaults to whatever the
  global default is — post-PR-#865 that's Gemma-4-E4B-it-GGUF.  CI runners
  almost never have Gemma preloaded, so Lemonade returned 500, the OpenAI
  client retried with exponential backoff, and the subprocess timed out
  at 60 s.

  Fix: extend the health check to require at least one ``llm``/``vlm``
  in ``all_models_loaded`` and return that model's name.  The test then
  passes ``--model <loaded_one>`` so we don't trip the auto-load on a
  model the runner doesn't have.

Verified: full unit suite 1630 passed / 0 failed / 15 skipped.
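
The Semaphore(0) fix, distilled into a self-contained sketch — not GAIA's actual chat router; the app, route, and timeout here are illustrative:

```python
import asyncio

from fastapi import FastAPI, HTTPException
from fastapi.testclient import TestClient

app = FastAPI()
# Exhausted from birth: no second event loop, so the semaphore's waiter
# list lives entirely on TestClient's loop.
chat_semaphore = asyncio.Semaphore(0)

@app.post("/chat")
async def chat():
    try:
        # Short timeout keeps the test at ~0.2 s instead of 60 s.
        await asyncio.wait_for(chat_semaphore.acquire(), timeout=0.2)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=429, detail="server busy")
    try:
        return {"ok": True}
    finally:
        chat_semaphore.release()

def test_semaphore_exhausted_returns_429():
    assert TestClient(app).post("/chat").status_code == 429
```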
itomek pushed 4 commits that referenced this pull request Apr 29, 2026
@github-actions Bot mentioned this pull request May 1, 2026
Development

Successfully merging this pull request may close these issues:

feat: Switch default agent model to Gemma 4 (26B-A4B)