feat(llm): add Gemma 4 E4B as default and native tool_calls priority #865
kovtcharov merged 19 commits into main
Conversation
Previously, _chat_helpers.py always passed model_id=<session model> explicitly to registry.create_agent(), defeating kwargs.setdefault("model_id", ...) in custom agents — which only fires when the key is absent. Fix: build create_kwargs conditionally, omitting model_id when the session is at the DB default, so the agent's __init__ setdefault governs. Also use agent.model_id (post-construction) for both the _store_agent cache key and the pre-flight _maybe_load_expected_model call. Three-branch precedence: custom_model setting > session-explicit > omit kwarg. Closes #841
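The conditional-kwargs fix is easiest to see in miniature. A sketch under assumed names (SESSION_DEFAULT_MODEL and the helper name are stand-ins; the real logic lives in _chat_helpers.py):

```python
SESSION_DEFAULT_MODEL = "db-default"  # stand-in for the DB default constant


def build_create_kwargs(custom_model, session_model):
    """Three-branch precedence: custom_model setting > session-explicit > omit kwarg."""
    if custom_model is not None:
        return {"model_id": custom_model}
    if session_model != SESSION_DEFAULT_MODEL:
        return {"model_id": session_model}
    # Session is at the DB default: omit model_id entirely so a custom agent's
    # kwargs.setdefault("model_id", ...) in __init__ can govern.
    return {}


class CustomAgent:
    """Illustrative custom agent using the setdefault pattern from the commit."""

    def __init__(self, **kwargs):
        kwargs.setdefault("model_id", "agent-preferred-model")
        self.model_id = kwargs["model_id"]
```

With the key omitted, `CustomAgent(**build_create_kwargs(None, SESSION_DEFAULT_MODEL))` ends up on its own preferred model instead of the session default.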
…N_DEFAULT_MODEL

Addresses code review feedback on PR #842:
- Export SESSION_DEFAULT_MODEL from database.py (single source of truth) instead of duplicating the string literal in _chat_helpers.py
- Extract a _build_create_kwargs() helper to eliminate the duplicated three-branch create_kwargs logic across the non-streaming and streaming code paths
- Extract an _effective_model() helper using an explicit None check (not `or`) to safely read agent.model_id post-construction without treating an empty string as missing
- Fix the static regression-guard regex to use [^()]* so nested helper calls inside create_agent() are not falsely flagged
- Update the unit test to import SESSION_DEFAULT_MODEL instead of hardcoding it
…ion (#842)

_store_agent was changed by the #842 fix to use _effective_model(agent, model_id) as the cache key — the post-construction value set by kwargs.setdefault. _get_cached_agent still looks up using the pre-construction model_id variable. For custom agents whose setdefault model differs from the session model, the keys never match and the agent is rebuilt on every turn.

Revert the two _store_agent call sites to use model_id (the pre-construction intent key), matching what the lookup uses. _effective_model stays at the two _maybe_load_expected_model sites (the Lemonade pre-flight needs the actual model) and in log statements (observability).

Add two regression guards:
- test_cache_hit_on_second_turn_for_setdefault_agent: two-turn cache-hit test with four assertions (call count, object identity, stored-key equality, agent.model_id). Covers the builder/template.py setdefault pattern.
- test_no_effective_model_in_store_agent_calls: static grep guard asserting that _store_agent never receives _effective_model(...) as a positional arg, preventing this pattern from silently returning in a future cleanup pass.
#817)

## Summary

One-line fix: swap the failing `www.cs.cmu.edu/~tom/EMNLP2004_final.pdf` URL in `docs/plans/email-triage-agent.mdx:2601` for the canonical ACL Anthology record at [W04-3240](https://aclanthology.org/W04-3240/). The CMU URL fails DNS resolution in CI (see [recent run](https://github.com/amd/gaia/actions/runs/24595902571/job/72072156929)), breaking the `Verify external URLs` check for every open PR that touches docs. ACL Anthology is the permanent archive for ACL/EMNLP papers — stable URL, no more link rot.

Also restored the paper's actual full title ("Learning to Classify Email into 'Speech Acts'") for consistency with the other full-title citations in the same references list.

## Test plan

- [x] `curl -sI https://aclanthology.org/W04-3240/` returns 200
- [ ] After merge, the `Verify external URLs` check should go green on downstream PRs
Gemma-4-E4B-it-GGUF becomes the default model for all GAIA roles (LLM, VLM, installer, CLI, UI, eval, EMR), replacing the Qwen family defaults. Simultaneously inverts the tool-call priority chain: for tool-calling models, GAIA now passes `tools=[...]` to Lemonade and handles native OpenAI `tool_calls` as the primary path, falling back to the embedded-JSON format only for legacy non-tool-calling models.

Key changes:
- lemonade_client.py: adds a `tool_calling` field to ModelRequirement, a new `is_tool_calling_model()` helper (optimistic default for unknown GGUFs), a startup `_validate_profile_model_registry()` validator, a gemma-4-e4b entry in MODELS, and AGENT_PROFILES all referencing gemma-4-e4b
- providers/lemonade.py: surfaces native tool_calls as a sentinel JSON string so the response type stays `str` throughout the call chain; forces non-streaming when tools are provided to a tool-capable model
- agents/base/agent.py: native tool_calls branch in _parse_llm_response, _build_openai_tool_schemas + an _openai_tools property, system-prompt gating (excludes the embedded-JSON format template for tool-calling models)
- chat/sdk.py: threads the `tools` kwarg through send_messages/stream

Includes the pre-swap eval baseline for Qwen3.5-35B at commit 3b51ca9 and 23 new unit tests covering the full tool_calls priority chain.
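A sketch of the sentinel mechanism described above. The `{"__tool_calls__": ...}` key is taken from the PR description; the function names and payload shapes are simplified stand-ins for the provider and agent code:

```python
import json

TOOL_CALLS_SENTINEL_KEY = "__tool_calls__"


def encode_tool_calls(tool_calls):
    """Provider side: wrap native tool_calls in a sentinel JSON string so the
    response type stays `str` through the whole call chain."""
    return json.dumps({TOOL_CALLS_SENTINEL_KEY: tool_calls})


def parse_llm_response(text):
    """Agent side: native tool_calls branch first, plain text otherwise."""
    try:
        payload = json.loads(text)
    except (ValueError, TypeError):
        return {"response": text}
    if isinstance(payload, dict) and TOOL_CALLS_SENTINEL_KEY in payload:
        # OpenAI-style tool call: {"function": {"name": ..., "arguments": "<json>"}}
        call = payload[TOOL_CALLS_SENTINEL_KEY][0]["function"]
        return {"tool": call["name"], "tool_args": json.loads(call["arguments"])}
    return {"response": text}
```

The sentinel round-trips through any string-typed plumbing unchanged, and ordinary prose responses fall through to the plain-text branch.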
Keep Gemma-4-E4B-it-GGUF as SESSION_DEFAULT_MODEL per this branch's intent.
Lemonade v10.1.0 ("spring cleaning" release, 2026-04-06) changed its
default port from 8000 to 13305 and added Gemma 4 on GPU support.
Migration guide: https://github.com/lemonade-sdk/lemonade/wiki/Migration#v10x---v101
Changes:
- DEFAULT_PORT in lemonade_client.py flipped to 13305 (with a comment
pointing at the migration guide)
- min_lemonade_version in every INIT_PROFILES entry bumped 10.0.0 -> 10.1.0
- LEMONADE_VERSION constant bumped 10.0.0 -> 10.1.0
- All agent base_url fallbacks, UI helpers, MCP bridge, RAG/VLM SDKs, and
CLI kill/port defaults updated to 13305
- Test fixtures and mocked URLs updated to 13305; version-policy tests
re-based on the new 10.1.0 minimum
- Docs under docs/ (spec, SDK, guides, plans) refreshed; docs/releases/
left untouched since those are historical
This pairs with the Gemma-4-E4B swap already on this branch: Gemma 4 E4B
is only available on Lemonade v10.1.0+, so requiring the newer version
closes the gap that would otherwise make `gaia init` succeed but model
loads fail.
…6 caveat)

Captured the post-swap eval baseline for Gemma-4-E4B-it-GGUF running on ngrok Lemonade v10.2.0, judged by claude-sonnet-4-6 — same judge model and three categories as the pre-swap Qwen3.5-35B baseline at 3b51ca9.

Headline deltas vs the Qwen baseline:
  tool_selection:     50% -> 50%  (+0 pp,  avg 7.64 -> 7.77)
  rag_quality:       100% -> 86%  (-14 pp, avg 9.47 -> 8.17)
  context_retention: 100% -> 75%  (-25 pp, avg 9.20 -> 8.55)

Per-scenario regressions:
- tool_selection/known_path_read PASS -> FAIL — agent didn't discover the indexed-internal-copy fallback path in Turn 1 after an Access-Denied on the original path. Turn 2 succeeded; real single-turn quality gap.
- tool_selection/smart_discovery FAIL -> INFRA_ERROR — Gemma loaded with ctx_size=4096; the GAIA system prompt (~16K tokens) exceeds the window. NOT a model regression — configuration bug on the Lemonade side.
- rag_quality/budget_query PASS -> FAIL (1.15/10) — same ctx_size=4096 issue; exceed_context_size_error before any output. NOT a model regression.
- context_retention/cross_turn_file_recall PASS -> FAIL — agent anchored on the Turn 1 summary and ignored pricing data retrieved in Turn 2 chunks. Real model quality gap (prompt-engineering candidate).

The four regressions thus fall into two buckets:
- Two are fixable by reloading Gemma with --ctx-size 32768 on Lemonade (matches GAIA's DEFAULT_CONTEXT_SIZE), not code changes on our side.
- Two are legitimate model-quality gaps worth follow-up issues.

Hardware: AMD Ryzen 9 9950X + gfx1036 iGPU (2 GB VRAM, 5 GB model on disk). ~110 tok/s throughout. Total cost: ~$7 judge (Sonnet, 15 scenarios).
The initial baseline was captured with Gemma loaded on Lemonade at its default ctx_size=4096, which is below GAIA's ~16K system-prompt requirement. After reloading with ctx_size=32768, 3 of the 4 "regressions" recovered:
  budget_query            FAIL 1.15       -> PASS 9.95  (ctx-overflow masked)
  smart_discovery         INFRA_ERROR 0.0 -> PASS 9.45  (ctx-overflow masked)
  cross_turn_file_recall  FAIL 7.52       -> PASS 8.78  (ctx-overflow masked)

The 4th regression (known_path_read) was re-run too and still FAILs at 5.6/10 — a confirmed real model-quality gap, not ctx-masked.

Revised Gemma vs Qwen (at ctx=32768):
  tool_selection:    Qwen 50% (7.64)  vs Gemma 75% (8.36)   +25 pp, +0.72
  rag_quality:       Qwen 100% (9.47) vs Gemma 100% (9.43)  flat
  context_retention: Qwen 100% (9.20) vs Gemma 100% (9.25)  +0.05

Net: Gemma 14/15 pass, Qwen 13/15. Gemma trades one scenario loss (known_path_read) for two scenario wins on tool_selection. Throughput is ~110 tok/s on an iGPU, and the footprint shrinks from 19.7 GB to 5 GB.

The only open follow-up is known_path_read (PASS 9.28 -> FAIL 6.67/5.6), where Gemma doesn't discover the indexed-internal-copy fallback path in Turn 1 after an Access-Denied error on the original path. Turn 2 recovers. This is a prompt-engineering candidate, not a blocker.
Eval results — Gemma-4-E4B vs Qwen3.5-35B baseline

Ran the same three-category eval suite against Gemma-4-E4B-it-GGUF on Lemonade v10.2.0, same judge as the pre-swap Qwen3.5-35B baseline.

Headline: Gemma passes more scenarios than Qwen (14 vs 13) at a fraction of the footprint (5 GB vs 19.7 GB) and equal throughput (~110 tok/s on a gfx1036 iGPU).

The one real regression: tool_selection/known_path_read — Gemma doesn't discover the indexed-internal-copy fallback path in Turn 1 after an Access-Denied error; Turn 2 recovers.

Infra issue caught by the eval (worth a follow-up): the initial run had 3 "regressions" that turned out to be ctx-overflow failures — Lemonade had loaded Gemma with its default ctx_size=4096.
…->10.2.0
CI workflows still started Lemonade on the old default port 8000, but the GAIA
client now defaults to 13305 (v10.1.0+ migration). Update test_api,
test_embeddings, test_gaia_cli_{linux,windows}, test_agent_sdk, test_rag,
test_lemonade_server, and build_cpp to start Lemonade on 13305 to match.
Three unit tests hardcoded the old default model name (Qwen3.5-35B-A3B-GGUF)
and two hardcoded version strings around 10.1.0 — bump them to match the
Gemma-4-E4B/v10.2.0 floors set elsewhere in this PR.
The previous commit updated health-check URLs to :13305 but left the server *start* commands still using port 8000 (explicit -Port 8000 in start-lemonade.ps1 calls; implicit default on older lemonade-server installs). This caused five jobs to time out waiting on 13305 while the server was listening on 8000.

Changes:
- build_cpp.yml, test_lemonade_server.yml (STX): -Port 8000 → -Port 13305
- test_agent_sdk.yml, test_gaia_cli_windows.yml: add --port 13305 to lemonade-server serve invocations
- test_gaia_cli_linux.yml: the Python lemonade-server-dev defaults to 8000; revert health/models URLs to :8000 and export LEMONADE_BASE_URL so the gaia CLI connects to the correct port
CDN-protected sites (e.g. electron.build via Cloudflare) actively reset connections from CI runner IPs. This is bot protection, not a dead URL. Treat ConnectionResetError (errno 104) and ConnectionRefusedError (111) as warnings — the same treatment given to timeouts and 429s — so valid URLs behind strict anti-bot guards don't fail the external URL check.
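A sketch of the warn-vs-fail classification (the function name and return convention are illustrative, not the checker's actual API; only the errno 104/111 treatment is from the commit):

```python
import errno

# Failure classes treated as warnings rather than hard errors, mirroring the
# existing treatment of timeouts and HTTP 429s: connection resets (errno 104)
# and refusals (errno 111) are typical CDN bot-protection behaviour, not dead URLs.
SOFT_FAIL_ERRNOS = {errno.ECONNRESET, errno.ECONNREFUSED}


def classify_url_failure(exc):
    """Return 'warning' for anti-bot/transient connection failures, 'error' otherwise."""
    if isinstance(exc, (ConnectionResetError, ConnectionRefusedError)):
        return "warning"
    if isinstance(exc, OSError) and exc.errno in SOFT_FAIL_ERRNOS:
        return "warning"
    return "error"
```

Other OSErrors (DNS failure, network unreachable) still fail the check, so genuinely dead URLs are not masked.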
Lemonade Server v10.1.0 changed its default port from 8000 to 13305. The test file still used PORT = 8000, causing the Windows CI integration tests to look for a server on :8000 (finding none), try to auto-launch one there (failing), and assert the wrong default in the unit test.

- tests/test_lemonade_client.py: PORT constant and assertEqual assertion
- tests/test_lemonade_health.py: LEMONADE_PORT default
- tests/conftest.py: stale docstring
The Agent SDK integration test hardcoded DEFAULT_MODEL_NAME (now Gemma-4-E4B-it-GGUF), but the GitHub-hosted Windows CI runner only pulls Llama-3.2-3B-Instruct-Hybrid, causing HTTP 422 from the server.

- tests/test_agent_sdk.py: read the model from the GAIA_TEST_MODEL env var, falling back to DEFAULT_MODEL_NAME so local runs still work
- test_agent_sdk.yml: set GAIA_TEST_MODEL=Llama-3.2-3B-Instruct-Hybrid before running the test suite to match the pulled model
… var lemonade-server-dev (Python) always binds to port 8000 and has no --port flag, while C++ lemonade-server v10.1.0+ defaults to 13305. The integration test was using PORT=13305 unconditionally, causing is_server_running(localhost, 13305) to return False on Linux, then auto-starting a new server on 13305 which also failed. Fix: make PORT read LEMONADE_PORT env var (defaults to 13305), and pass LEMONADE_PORT=8000 inline when running Linux integration tests.
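The env-var pattern described in the last two commits, in miniature (the helper name and injectable `env` parameter are illustrative; the real tests read module-level constants):

```python
import os


def resolve_lemonade_port(env=None):
    """Resolve the Lemonade port for tests.

    Defaults to 13305 (C++ lemonade-server v10.1.0+). Linux CI exports
    LEMONADE_PORT=8000 because the Python lemonade-server-dev always binds
    to 8000 and has no --port flag.
    """
    env = os.environ if env is None else env
    return int(env.get("LEMONADE_PORT", "13305"))


def resolve_test_model(default_model_name, env=None):
    """Resolve the test model: GAIA_TEST_MODEL wins, else the GAIA default."""
    env = os.environ if env is None else env
    return env.get("GAIA_TEST_MODEL", default_model_name)
```

With this shape, local runs need no configuration, and CI pins both knobs per runner.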
Qwen3-4B-Instruct-2507 was a reasonable interim default, but Google's Gemma 4 E4B is a better fit for the lite agent's mission: ~4B effective params, 128K context, natively multimodal, Apache 2.0. Same memory footprint (~2.7 GB Q4 weights, 5 GB total) — a strictly better trade-off on quality, license, and forward-compatibility with PR #865's universal Gemma 4 transition. Fallback: Gemma-3-4b-it-GGUF for Lemonade catalogs that haven't picked up the Gemma 4 drop yet.

Side fix while in this code: the factory's setdefault now reads the primary from the registration's models list (single source of truth) rather than hardcoding the same string twice. Without this, the "Fallback: ..." comment on the registration was a lie at runtime — if the primary wasn't in the catalog, the factory still hardcoded it instead of walking the list.

Test plan: tests/unit/agents/test_registry.py covers the 4B-class invariant (now case-insensitive to accommodate Gemma 3's lowercase "4b" naming), the factory preset, and the chat-lite legacy alias path.
Brings in PR #865 (Gemma 4 E4B universal default + native tool_calls priority + Lemonade v10.1.0 port flip 8000→13305) on top of our existing gaia-lite work. The two PRs are complementary: #865 sets the global LLM/VLM default; this branch sets gaia-lite's preset.

Conflict resolution
-------------------
src/gaia/llm/lemonade_manager.py
Both branches changed the module header. Ours added a re-export pattern (__all__ + ``from lemonade_client import DEFAULT_CONTEXT_SIZE``) so DEFAULT_CONTEXT_SIZE has a single source of truth. Main flipped the default port 8000→13305 for Lemonade v10.1.0. The resolution keeps both: the re-export pattern is preserved and the port is updated.

Test updates required by stricter pre-flight semantics
------------------------------------------------------
This branch's _maybe_load_expected_model + _ensure_model_loaded changes tightened two behaviours that the existing test suite expressed against the old loose semantics:

tests/unit/test_chat_preflight.py
Pre-flight now requires the EXPECTED model to be active (not just any LLM) and its ctx_size ≥ 32K. Updated the _model() helper to accept name + ctx_size kwargs; the four "skips load" tests now load a health entry whose model_name matches the expected model_id and whose ctx_size is at the 32K floor. Without this, the test fixture "test-llm" mismatched the expected "Qwen3.5-35B-A3B-GGUF" and tripped the wrong-model reload branch our PR added.

tests/unit/test_lemonade_model_loading.py
_ensure_model_loaded now defaults ctx_size to DEFAULT_CONTEXT_SIZE (32768) for unknown models, fixing the silent-empty-stream regression where Lemonade's default 4096 ctx truncated ChatAgent's >7K-token system prompt. Two assertions updated from ctx_size=None → 32768.

All affected tests pass; lint clean.
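The _ensure_model_loaded ctx_size behaviour in miniature (the registry contents and helper name are hypothetical; only the 32768 fallback for unknown models is from the commit):

```python
DEFAULT_CONTEXT_SIZE = 32768  # GAIA's floor; Lemonade's own 4096 default would
                              # truncate ChatAgent's >7K-token system prompt

# Hypothetical registry excerpt: known models may declare their own window.
MODEL_CTX_SIZES = {"gemma-4-e4b": 131072}


def ctx_size_for(model_id):
    """Never return None for an unknown model: fall back to the 32K floor so
    Lemonade is always given an explicit ctx_size at load time."""
    return MODEL_CTX_SIZES.get(model_id, DEFAULT_CONTEXT_SIZE)
```

Returning an explicit value for every model is what closes the silent-empty-stream failure mode described above.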
These were marked "known flakies, pre-existing on main" in the merge PR, but every one was a real test bug worth nailing down rather than papering over. All three reproduced on bare main HEAD.

test_sse_confirmation (3 tests)
Polled ``handler._confirm_result is None`` to detect when the worker thread had registered itself. But _confirm_result is initialised to ``False`` (not None), so the polling loop exited immediately — resolve fired before the worker's confirm_tool_execution set up _confirm_event, and the worker's own setdefault then overwrote the resolved state with a fresh unset event. Net result: the worker waited for an event that no one would ever set, hit the internal 90 s confirmation timeout, and the test failed with "thread still alive". Fix: poll ``handler._confirm_event is None`` instead. _confirm_event starts as None and only becomes non-None inside confirm_tool_execution, so it correctly tracks the registration moment.

test_semaphore_exhausted_returns_429
Created a SECOND asyncio event loop with ``asyncio.new_event_loop()`` and acquired the semaphore on it, then handed the half-locked semaphore to TestClient (which runs on its OWN loop). ``asyncio.Semaphore`` doesn't promise cross-loop sanity — the waiter list is loop-bound, so acquire() on TestClient's loop saw inconsistent state under contention. Fix: use ``Semaphore(0)`` — exhausted from birth, no second loop. Plus patch ``asyncio.wait_for`` to a 0.2 s timeout in the chat router so the test goes from 60 s to 0.6 s.

test_llm_command_with_server
The health check accepted any 200, even when ``all_models_loaded == []``. Worse: even with a model loaded, ``gaia llm`` defaults to whatever the global default is — post-PR-#865 that's Gemma-4-E4B-it-GGUF. CI runners almost never have Gemma preloaded, so Lemonade returned 500, the OpenAI client retried with exponential backoff, and the subprocess timed out at 60 s. Fix: extend the health check to require at least one ``llm``/``vlm`` in ``all_models_loaded`` and return that model's name. The test then passes ``--model <loaded_one>`` so we don't trip the auto-load on a model the runner doesn't have.

Verified: full unit suite 1630 passed / 0 failed / 15 skipped.
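The Semaphore(0) fix in miniature: a sketch (router and handler details omitted, names illustrative) showing why an exhausted-from-birth semaphore needs no second event loop.

```python
import asyncio


async def try_acquire_with_timeout(sem, timeout):
    """Return False (which the router would map to HTTP 429) when the
    semaphore cannot be acquired within the timeout."""
    try:
        await asyncio.wait_for(sem.acquire(), timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def main():
    # Semaphore(0) starts with no permits: exhausted from birth, created and
    # awaited on the SAME loop, so no cross-loop waiter-list inconsistency.
    exhausted = asyncio.Semaphore(0)
    assert await try_acquire_with_timeout(exhausted, 0.05) is False

    free = asyncio.Semaphore(1)
    assert await try_acquire_with_timeout(free, 0.05) is True


asyncio.run(main())
```

Compare with the broken original: acquiring on a throwaway `asyncio.new_event_loop()` binds the waiter bookkeeping to a loop TestClient never runs.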
Summary
Gemma-4-E4B-it-GGUF becomes GAIA's default model for all roles (LLM, VLM, installer profiles, CLI, Agent UI, eval, EMR). Simultaneously inverts the tool-call priority chain so native OpenAI `tool_calls` is the primary path, with the embedded-JSON format falling back only for legacy non-tool-calling models. Also bumps the minimum Lemonade version to v10.1.0 (which moved its default port from 8000 → 13305 and is where Gemma 4 support was added).

This ships on top of the existing UI model-resolution fixes (#841, #842). Resolves #863.
What changed and why
- …`--jinja`) — GAIA now passes `tools=[...]` to Lemonade for tool-capable models. The response comes back as native `tool_calls`; `LemonadeProvider.chat()` encodes them as a sentinel JSON string (`{"__tool_calls__": ...}`) so no callers need a type change. `_parse_llm_response` detects the sentinel and returns the unified `{"tool": ..., "tool_args": ...}` dict.
- The embedded-JSON format template (`_PLANNING_FORMAT`/`_CONVERSATIONAL_FORMAT`) is excluded from the composed system prompt for tool-calling models; it actively prevented native `tool_calls` in prior testing.
- `_validate_profile_model_registry()` raises at import time if any `AGENT_PROFILES` entry references a model key not in `MODELS`.
- `DEFAULT_PORT` flipped from 8000 to 13305 (Lemonade's spring-cleaning release changed the default). 75 files updated (agents, UI, MCP bridge, RAG SDK, VLM, CLI, tests, docs). `min_lemonade_version = 10.1.0` everywhere `INIT_PROFILES` is declared.
- Pre-swap Qwen baseline at `3b51ca92` and post-swap Gemma-4-E4B baseline both committed under `tests/fixtures/eval_baselines/`; Gemma outperforms Qwen 14/15 vs 13/15 (see the comment below for the per-scenario breakdown).

Test plan
- `python -m pytest tests/unit/ --ignore=tests/unit/chat/ui/ -q` → 928 passed, 16 skipped
- `python -m pytest tests/unit/test_tool_call_priority.py -v` → 23 passed (sentinel detection, native branch parsing, edge cases, prompt gating, startup validator)
- `python util/lint.py --black --isort --flake8` → all pass
- Verified `claude -p --model claude-sonnet-4-6` was actually the judge (not Opus) via `modelUsage` in the test subprocess

Open follow-ups (not blockers for this PR)
- `tool_selection/known_path_read` regression: Gemma doesn't discover the indexed-internal-copy fallback path in Turn 1 after an Access-Denied on the original. Prompt-engineering candidate.
- `/api/system/status` reports the catalog `ctx_size` even when Lemonade loaded the model with a smaller window. Surface a warning when they diverge; a whole eval run was wasted due to this mask.