fix(installer): _verify_model passes for thinking-mode models #920

bong-water-water-bong wants to merge 4 commits into amd:main
Conversation
@claude review this PR.
Review

🟢 LGTM — small, surgical fix for a real false-failure that was telling users to delete healthy ~22 GB models. Approving with one optional follow-up.

What works

Observations (non-blocking)

🟢 Minor — soft-pass robustness. A model emitting
🟢 Minor — no test added.
🟢 Nit —

Verification

PR description shows before/after on Strix Halo with all four chat-profile models passing. The repro is precise (model IDs, Lemonade version, llama.cpp lane), and the manual

Suggested follow-up (separate PR)

Add

That locks in the three-way contract this PR establishes. Ready to merge from my read. cc @kovtcharov-amd for final approval.
itomek left a comment
Thanks for digging into this @bong-water-water-bong, the diagnosis is solid.
Before merging I want to flag something I ran into when I tested the patch end-to-end. On main today the chat profile no longer ships Qwen3 — it defaults to Gemma-4-E4B-it-GGUF (see INIT_PROFILES). So the verifier's primary target right now is Gemma, not Qwen3, and Gemma reacts badly to the /no_think token.
Reproducible against a live Lemonade instance with this PR applied:
| Model | Before | After |
|---|---|---|
| Gemma-4-E4B-it-GGUF (chat default) | True · 0.22s · content="ok" | True · 3.87s · empty content, 241 chars reasoning_content, finish_reason="length" — passes only via the new soft-pass branch |
| LFM2-1.2B-GGUF | True · content="Okay." | True · content="Okay" |
| LFM2-8B-A1B-GGUF | True · content="Ok." | True · content="ok \nno_think" (echoing /no_think as literal text) |
Gemma's reasoning trace on Say 'ok' /no_think:
Thinking Process:
- Analyze the Request: The user has provided the input "Say ok /no_think". This is a command instructing the AI to output a specific phrase or choice.
- Identify the Options: The options are "ok" and "/no_think"…
It's parsing /no_think as a multiple-choice option and burning the full 64-token budget. Verification passes only via the soft-pass branch — the signal drops from "model produced a coherent reply" to "model produced any tokens." Same prompt without /no_think still gets a clean 2-token "ok".
So the patch doesn't break verification, but it's ~17× slower for Gemma and the health signal is weaker.
A few shapes that keep the Qwen3 fix without regressing Gemma:
1. Gate `/no_think` on Qwen3-family ids — `if "qwen3" in model_id.lower(): prompt += " /no_think"`. Keeps the fix where it's needed.
2. Drop `/no_think` entirely and rely on the `max_tokens=64` bump + soft-pass alone. Qwen3 finishes a short `<think>` well under 64 tokens, so this should be enough.
3. Tighten the soft-pass to also require `usage.completion_tokens > 0` (and non-empty `reasoning_content` when present), so a model that genuinely streams nothing still fails. A sketch combining options 1 and 3 follows this list.
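A minimal sketch of options 1 and 3 together, assuming an OpenAI-style response object (`choices[0].message`, `usage.completion_tokens`); the attribute names are assumptions, not LemonadeClient's actual payload shape:

```python
# Sketch only: response attribute names are assumptions about an
# OpenAI-style payload, not the real LemonadeClient object shape.

def build_verification_prompt(model_id: str) -> str:
    prompt = "Say 'ok'"
    # Option 1: only Qwen3-family ids get the /no_think directive,
    # so Gemma never sees the token it mis-parses.
    if "qwen3" in model_id.lower():
        prompt += " /no_think"
    return prompt


def response_shows_life(response) -> bool:
    choice = response.choices[0]
    content = (choice.message.content or "").strip()
    if content:
        return True  # normal pass: a real answer surfaced
    # Option 3: tightened soft-pass. A length cutoff only counts if the
    # model actually streamed tokens, and any reasoning_content it reports
    # is non-empty.
    if choice.finish_reason != "length":
        return False
    completion_tokens = getattr(response.usage, "completion_tokens", 0) or 0
    if completion_tokens <= 0:
        return False
    reasoning = getattr(choice.message, "reasoning_content", None)
    return reasoning is None or bool(reasoning.strip())
```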
Do you specifically need Qwen3 catalog support? If yes, option 1; if not, option 2 + 3. Also worth a unit test in tests/unit/ mocking LemonadeClient.chat_completions with the three response shapes so this doesn't silently regress.
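For instance, a test along these lines, assuming `_verify_model` lives in gaia.installer.init_command, takes a model id, and returns a truthy result on success (all assumptions for illustration):

```python
# Sketch of the suggested unit test. The import path, _verify_model signature,
# and the patched attribute are assumptions, not the real interfaces.
from types import SimpleNamespace
from unittest.mock import patch

from gaia.installer.init_command import _verify_model  # assumed location


def fake_response(content, reasoning, finish_reason, completion_tokens):
    message = SimpleNamespace(content=content, reasoning_content=reasoning)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    usage = SimpleNamespace(completion_tokens=completion_tokens)
    return SimpleNamespace(choices=[choice], usage=usage)


@patch("gaia.installer.init_command.LemonadeClient.chat_completions")
def test_verify_model_handles_three_response_shapes(mock_chat):
    # 1. Plain short answer -> pass
    mock_chat.return_value = fake_response("ok", "", "stop", 2)
    assert _verify_model("LFM2-1.2B-GGUF")

    # 2. Thinking model hits the length cap with empty content -> soft pass
    mock_chat.return_value = fake_response("", "<think>...", "length", 64)
    assert _verify_model("Qwen3-0.6B-GGUF")

    # 3. Model that streams nothing at all -> should still fail
    #    (this assertion assumes the tightened soft-pass from option 3)
    mock_chat.return_value = fake_response("", "", "length", 0)
    assert not _verify_model("Qwen3-0.6B-GGUF")
```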
Requesting changes for now so we don't merge a Gemma regression in service of a Qwen fix.
Verification step in `gaia init` was failing with "Empty response" for modern thinking-mode chat models (Qwen3 family, etc.).

Three small changes in `_verify_model`:

1. Append `/no_think` to the verification prompt — explicitly suppress thinking mode where the model supports the tag.
2. Bump max_tokens from 10 to 64 — gives short thinking blocks room to finish before content tokens start, even when the tag isn't honored.
3. Treat `finish_reason == "length"` with empty `content` as a soft pass. If the model was producing tokens (length cutoff) it's clearly alive; the budget was just too tight for it to surface a final answer. False-failing on this case marks otherwise-working models as broken and prompts users to delete + re-download (~22 GB for Qwen3.5-35B-A3B).

Repro before this patch:

    gaia init --profile chat
    # ❌ Qwen3.5-35B-A3B-GGUF - Empty response
    # ❌ Qwen3-0.6B-GGUF - Empty response
    # (manual chat with both models works fine via /v1/chat/completions)

After this patch all four models in the chat profile verify cleanly on a Strix Halo box (Lemonade Server 10.2.0, llama.cpp b1231 lane).
Three independent bugs combined to keep the desktop window from ever rendering on a Linux Wayland/X11 session — only the tray icon and Python backend ever came up. Each fix is independently necessary:

- assets/icon.png: 4096×4096 → 512×512. The icon decoded to RGBA was exactly 67,108,864 bytes; Electron tries to send it to the X server in a single request, X11 rejects with "Cannot send request of length 67108896", and the renderer dies before ready-to-show. 512×512 is the standard high-DPI Linux icon size and well under the X11 limit.

- main.cjs window-all-closed: during bootstrap, the install-progress dialog is the only window in flight and closes before the main window is created. With trayManager not yet initialized, the default handler called app.quit(), will-quit started cleanup, and the whenReady continuation raced into createWindow() during teardown — V8 crashed with DisallowJavascriptExecutionScope on Wayland (silent exit on X11). A new startupInProgress flag suppresses the quit until the main window's first ready-to-show.

- main.cjs autoUpdater.init: now gated on GAIA_ENABLE_AUTOUPDATE=1. By default the updater points at upstream amd/gaia's release channel, and a successful "update" overwrites a forked binary with upstream's, silently destroying any divergence (mobile-QR pairing, packaging tweaks, anything). Forks must opt in explicitly.

Verified end-to-end: window now opens on default Wayland, stable for 40+ seconds (was crashing ≤20 s before), agent session created and a chat message round-tripped through the builder agent.
The Agent UI's WelcomeScreen reads /api/system/status, which checks for ~/.gaia/chat/initialized (src/gaia/ui/routers/system.py:270) to decide whether to show the "First time? Run gaia init --profile chat" banner. Nothing in the codebase ever writes that marker, so the banner was permanently sticky no matter how many times init ran or how cleanly it succeeded — verified by grep across src/.

Touch the marker at the end of run() right before _print_completion, so any successful gaia init (any profile) clears the banner. Failure to write is non-fatal — we warn and proceed, since init really did succeed and a stale banner is a softer failure than aborting.
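A minimal sketch of that marker write, assuming a small helper called near the end of run() and a standard logger; the helper name and surrounding structure are illustrative, not the actual init_command code:

```python
# Illustrative sketch: helper name and logger are assumptions; only the
# marker path and placement follow the commit description.
from pathlib import Path

CHAT_INIT_MARKER = Path.home() / ".gaia" / "chat" / "initialized"


def _mark_initialized(log) -> None:
    """Touch the marker that /api/system/status checks for."""
    try:
        CHAT_INIT_MARKER.parent.mkdir(parents=True, exist_ok=True)
        CHAT_INIT_MARKER.touch()
    except OSError as exc:
        # Non-fatal: init itself succeeded, and a stale banner is the
        # softer failure compared to aborting here.
        log.warning("Could not write %s: %s", CHAT_INIT_MARKER, exc)
```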
Adds a Mobile Access gateway in the Agent UI that exposes the
backend to the user's phone via a copyable URL + QR code. The
sidebar toggle (re)starts the tunnel in local mode every click —
fresh URL every time — so an expired pairing can be refreshed
without restarting the desktop app.
- frontend: App.tsx handleMobileToggle deduplicated to a single
startTunnel({ mode: 'local' }) path; MobileAccessModal renders
QR + URL; api.ts + types add the mode parameter
- backend: ui/tunnel.py grows the local-mode adapter,
ui/routers/tunnel.py exposes it, ui/server.py wires the router
- cli: gaia tunnel surface for testing without the UI
Fork divergence over upstream amd/gaia — see
project_1bit_gaia_mobile_qr in the project memory.
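For a rough sense of the backend shape this describes, a minimal FastAPI-style sketch; the endpoint path, request/response fields, and the LocalTunnel adapter are assumptions for illustration, not the actual ui/routers/tunnel.py code:

```python
# Illustrative sketch only: names and shapes are assumptions, not gaia's API.
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter(prefix="/api/tunnel", tags=["tunnel"])


class StartTunnelRequest(BaseModel):
    mode: str = "local"  # the sidebar toggle always sends mode='local' today


class StartTunnelResponse(BaseModel):
    url: str  # copyable URL; the frontend also renders it as a QR code


class LocalTunnel:
    """Hypothetical stand-in for the local-mode adapter in ui/tunnel.py."""

    def restart(self) -> str:
        # (Re)start the tunnel and hand back a fresh URL so an expired
        # pairing can be refreshed without restarting the desktop app.
        return "http://192.168.1.20:8000"  # placeholder LAN URL


_tunnel = LocalTunnel()


@router.post("/start", response_model=StartTunnelResponse)
def start_tunnel(req: StartTunnelRequest) -> StartTunnelResponse:
    # Every click restarts the tunnel in the requested mode, so the URL is
    # always fresh; req.mode is 'local' for now, other modes could dispatch
    # to different adapters.
    url = _tunnel.restart()
    return StartTunnelResponse(url=url)
```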
e4b59d4 to ae783d8
Problem
`gaia init --profile chat` flags every modern thinking-mode chat model as broken:

    ❌ Qwen3.5-35B-A3B-GGUF - Empty response
    ❌ Qwen3-0.6B-GGUF - Empty response

Both flagged models work cleanly when hit directly via `/v1/chat/completions` outside the `gaia init` flow. The problem is the verification function's prompt + budget combination, not the models.
Root cause

`_verify_model` in src/gaia/installer/init_command.py runs a short chat-completion check against each downloaded model.

Qwen3-family (and other thinking-mode) chat models emit `<think>...</think>` tokens before they emit content. With `max_tokens=10` and no `/no_think` directive, the model burns the full budget inside the thinking block, the response surfaces with empty `content` and `finish_reason="length"`, and the verifier reports "Empty response". The user is then told the model is corrupt and prompted to delete + re-download (which for Qwen3.5-35B-A3B is ~22 GB).
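A minimal sketch of the kind of check this describes, assuming an OpenAI-style client call and response shape (the real `_verify_model` differs in construction, prompt text, and error handling):

```python
# Pre-patch shape of the verification (illustrative, not the actual code).
def verify_model_before_patch(client, model_id: str) -> tuple[bool, str]:
    response = client.chat_completions(
        model=model_id,
        messages=[{"role": "user", "content": "Say 'ok'"}],
        max_tokens=10,  # thinking models spend all 10 tokens inside <think>...</think>
    )
    content = (response.choices[0].message.content or "").strip()
    if not content:
        # finish_reason is "length" here for thinking models, but the old
        # check only inspected content, so healthy models were flagged.
        return False, "Empty response"
    return True, content
```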
Patch

Three small changes, all in `_verify_model` (sketched below):

1. Append `/no_think` to the verification prompt — explicitly suppress thinking mode where the tag is honored.
2. `max_tokens=10 → 64` — give short thinking blocks room to complete even when the tag isn't honored.
3. Treat `finish_reason="length"` with empty `content` as a soft pass — if the model was producing tokens (length cutoff) it's clearly alive; the budget was just too tight for a final answer to surface. False-failing this case prompts users to nuke working models.
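Under the same assumptions as the sketch above, the patched check looks roughly like this:

```python
# Patched shape of the verification (illustrative; mirrors changes 1-3 above).
def verify_model_after_patch(client, model_id: str) -> tuple[bool, str]:
    response = client.chat_completions(
        model=model_id,
        messages=[{"role": "user", "content": "Say 'ok' /no_think"}],  # change 1
        max_tokens=64,  # change 2: room for a short <think> block plus the answer
    )
    choice = response.choices[0]
    content = (choice.message.content or "").strip()
    if content:
        return True, content
    # Change 3: a length cutoff with empty content means the model was
    # generating tokens; treat it as alive instead of corrupt.
    if choice.finish_reason == "length":
        return True, "soft pass (length cutoff while thinking)"
    return False, "Empty response"
```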
Repro

Before the patch (Lemonade 10.2.0, llama.cpp b1231 lane, Strix Halo), `gaia init --profile chat` reports "Empty response" for both Qwen3 models. Manual verification of the same models from the same shell, hitting `/v1/chat/completions` directly, returns normal replies.
After the patch all four chat-profile models pass verification.
Scope
Single function. No behavior change for embedding models, image-generation models, or non-thinking LLMs. No new dependencies.
`/no_think` is ignored by older models (it just becomes part of the prompt), so the patch is backward-compatible across the catalog.