
fix(installer): _verify_model passes for thinking-mode models #920

Open
bong-water-water-bong wants to merge 4 commits into amd:main from bong-water-water-bong:fix/verify-thinking-models

Conversation

@bong-water-water-bong

Problem

gaia init --profile chat flags every modern thinking-mode chat model as broken:

❌ Qwen3.5-35B-A3B-GGUF - Empty response
❌ Qwen3-0.6B-GGUF - Empty response
✓  Qwen3-VL-4B-Instruct-GGUF - OK
✓  nomic-embed-text-v2-moe-GGUF - OK

Both flagged models work cleanly when hit directly via /v1/chat/completions outside the gaia init flow. The problem is the verification function's prompt + budget combination, not the models.

Root cause

_verify_model in src/gaia/installer/init_command.py runs:

response = client.chat_completions(
    model=model_id,
    messages=[{"role": "user", "content": "Say 'ok'"}],
    max_tokens=10,
    temperature=0,
)

Qwen3-family (and other thinking-mode) chat models emit <think>...</think> tokens before they emit content. With max_tokens=10 and no /no_think directive, the model burns the full budget inside the thinking block, the response surfaces with empty content and finish_reason="length", and the verifier reports Empty response. The user is then told the model is corrupt and prompted to delete + re-download (which for Qwen3.5-35B-A3B is ~22 GB).
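For illustration, the response the verifier sees in this failure mode has roughly the following shape (OpenAI-style /v1/chat/completions fields; the payload below is an assumption for clarity, not captured output):

# Hypothetical response shape in the failure case; field names follow the
# OpenAI-style chat-completions schema, values are illustrative only.
failing_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "",  # nothing surfaced outside the <think> block
                "reasoning_content": "<think>Okay, the user wants me to say",  # budget spent here
            },
            "finish_reason": "length",  # cut off by max_tokens=10, not a dead model
        }
    ]
}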

Patch

Three small changes, all in _verify_model (a sketch of the combined check follows the list):

  1. Append /no_think to the verification prompt — explicitly suppress thinking mode where the tag is honored.
  2. Bump max_tokens=10 → 64 — give short thinking blocks room to complete even when the tag isn't honored.
  3. Treat finish_reason="length" with empty content as a soft pass — if the model was producing tokens (length cutoff) it's clearly alive; the budget was just too tight for a final answer to surface. False-failing this case prompts users to nuke working models.
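A minimal sketch of how the three changes could fit together; the helper name, dict-style access, and the (passed, error) return convention are assumptions for illustration, not the actual diff:

def _verify_chat_model(client, model_id):
    """Hypothetical sketch of the patched verification logic (not the real diff)."""
    response = client.chat_completions(
        model=model_id,
        # (1) /no_think suppresses thinking mode where the tag is honored; older
        #     models just treat it as extra prompt text.
        messages=[{"role": "user", "content": "Say 'ok' /no_think"}],
        # (2) room for a short <think> block plus a one-word answer
        max_tokens=64,
        temperature=0,
    )
    choice = response["choices"][0]
    content = (choice["message"].get("content") or "").strip()
    if content:
        return True, None
    # (3) soft pass: the model was actively generating but hit the token budget
    if choice.get("finish_reason") == "length":
        return True, None
    return False, "Empty response"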

Repro

Before the patch (Lemonade 10.2.0, llama.cpp b1231 lane, Strix Halo):

gaia init --profile chat --yes
# ❌ Qwen3.5-35B-A3B-GGUF - Empty response
# ❌ Qwen3-0.6B-GGUF - Empty response
# ⚠  Models verified: 2/4 passed

Manual verification of the same models from the same shell:

$ curl -s :13305/v1/chat/completions -d '{"model":"Qwen3-0.6B-GGUF","messages":[{"role":"user","content":"Say ok /no_think"}],"max_tokens":20}' | jq -r '.choices[0].message.content'
ready

After the patch all four chat-profile models pass verification.

Scope

Single function. No behavior change for embedding models, image-generation models, or non-thinking LLMs. No new dependencies. /no_think is ignored by older models (it just becomes part of the prompt), so the patch is backward-compatible across the catalog.

@kovtcharov
Collaborator

@claude review this PR.

@github-actions
Contributor

Review

🟢 LGTM — small, surgical fix for a real false-failure that was telling users to delete healthy ~22 GB models. Approving with one optional follow-up.

What works

  • Scope is tight. One function in src/gaia/installer/init_command.py:1386-1408. No collateral changes to embedding/SD/non-thinking paths.
  • Backward-compatible. /no_think is honored by Qwen3+ and is benign prompt content for everything else, so the existing pass set doesn't regress.
  • Three independent mitigations stack correctly. Even if /no_think isn't honored by a future thinking model, the max_tokens=10 → 64 bump plus the finish_reason="length" soft-pass keeps the verifier from false-failing it.
  • Consistent with CLAUDE.md "fail loudly." The soft-pass is not a silent fallback — empty content with finish_reason ≠ "length", missing choices, or an exception still fails loudly with an actionable message. Only the specific "model was actively generating, budget was tight" case is reclassified.
  • Comments earn their place. The two new comment blocks explain why (thinking-mode tokens, budget exhaustion) — exactly the non-obvious context a future reader needs.
  • finish_reason access path is correct. Matches the schema at src/gaia/llm/lemonade_client.py:1281 (choices[0].finish_reason).

Observations (non-blocking)

🟢 Minor — soft-pass robustness. A model emitting max_tokens of garbage tokens with no <think> close and no surfaced content would now pass verification. In practice this is far less harmful than the current bug (false-failing a working model and prompting a 22 GB redownload), and the curated catalog is unlikely to contain such a model. Worth a one-liner note inline, not a blocker.

🟢 Minor — no test added. _verify_model has no unit coverage today (grep _verify_model returns just the implementation). A small mocked test exercising the three LLM branches — content present, finish_reason="length" + empty, fully empty/exception — would lock the contract so a future "make verification stricter" refactor doesn't silently re-introduce the regression. Nice-to-have follow-up, not a blocker for this fix.

🟢 Nit — max_tokens=64 choice. Reasonable for "Say 'ok'"-class prompts. If reports surface of a thinking model whose <think> block routinely exceeds 64 tokens even on this trivial input, bumping to ~128 is cheap. Premature to do now.

Verification

PR description shows before/after on Strix Halo with all four chat-profile models passing. The repro is precise (model IDs, Lemonade version, llama.cpp lane), and the manual curl against /v1/chat/completions is a clean control proving the models themselves were never the problem.

Suggested follow-up (separate PR)

Add tests/unit/test_init_verify.py with three cases against a mocked LemonadeClient.chat_completions:

  1. {"choices":[{"message":{"content":"ok"}, "finish_reason":"stop"}]}(True, None)
  2. {"choices":[{"message":{"content":""}, "finish_reason":"length"}]}(True, None) (the new soft-pass)
  3. {"choices":[{"message":{"content":""}, "finish_reason":"stop"}]}(False, "Empty response")

That locks in the three-way contract this PR establishes.
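A rough shape for that test, assuming _verify_model is importable at module level, takes a client plus a model id, and returns a (passed, error) tuple; the signature and import path are assumptions:

from unittest.mock import MagicMock

import pytest

from gaia.installer.init_command import _verify_model  # assumed import path

# The three response shapes from the review above.
CASES = [
    ({"choices": [{"message": {"content": "ok"}, "finish_reason": "stop"}]}, (True, None)),
    ({"choices": [{"message": {"content": ""}, "finish_reason": "length"}]}, (True, None)),
    ({"choices": [{"message": {"content": ""}, "finish_reason": "stop"}]}, (False, "Empty response")),
]

@pytest.mark.parametrize("response, expected", CASES)
def test_verify_model_contract(response, expected):
    client = MagicMock()
    client.chat_completions.return_value = response
    assert _verify_model(client, "Qwen3-0.6B-GGUF") == expected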


Ready to merge from my read. cc @kovtcharov-amd for final approval.

Collaborator

@itomek left a comment


Thanks for digging into this @bong-water-water-bong, the diagnosis is solid.

Before merging I want to flag something I ran into when I tested the patch end-to-end. On main today the chat profile no longer ships Qwen3 — it defaults to Gemma-4-E4B-it-GGUF (see INIT_PROFILES). So the verifier's primary target right now is Gemma, not Qwen3, and Gemma reacts badly to the /no_think token.

Reproducible against a live Lemonade instance with this PR applied:

| Model | Before | After |
| --- | --- | --- |
| Gemma-4-E4B-it-GGUF (chat default) | True · 0.22s · content="ok" | True · 3.87s · empty content, 241 chars reasoning_content, finish_reason="length" — passes only via the new soft-pass branch |
| LFM2-1.2B-GGUF | True · content="Okay." | True · content="Okay" |
| LFM2-8B-A1B-GGUF | True · content="Ok." | True · content="ok \nno_think" (echoing /no_think as literal text) |

Gemma's reasoning trace on Say 'ok' /no_think:

Thinking Process:

  1. Analyze the Request: The user has provided the input "Say ok /no_think". This is a command instructing the AI to output a specific phrase or choice.
  2. Identify the Options: The options are "ok" and "/no_think"…

It's parsing /no_think as a multiple-choice option and burning the full 64-token budget. Verification passes only via the soft-pass branch — the signal drops from "model produced a coherent reply" to "model produced any tokens." Same prompt without /no_think still gets a clean 2-token "ok".

So the patch doesn't break verification, but it's ~17× slower for Gemma and the health signal is weaker.

A few shapes that keep the Qwen3 fix without regressing Gemma:

  1. Gate /no_think on Qwen3-family IDs (if "qwen3" in model_id.lower(): prompt += " /no_think"). Keeps the fix where it's needed.
  2. Drop /no_think entirely and rely on the max_tokens=64 bump + soft-pass alone. Qwen3 finishes a short <think> well under 64 tokens, so this should be enough.
  3. Tighten the soft-pass to also require usage.completion_tokens > 0 (and non-empty reasoning_content when present), so a model that genuinely streams nothing still fails; a rough sketch follows this list.
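For option 3, the tightened check could look something like this (the usage and reasoning_content field names are assumed from the responses quoted in this thread):

def _length_soft_pass(choice, usage):
    """Hypothetical tightened soft-pass check; field names are assumptions."""
    if choice.get("finish_reason") != "length":
        return False
    # require evidence that tokens were actually streamed before the cutoff
    if (usage or {}).get("completion_tokens", 0) <= 0:
        return False
    message = choice.get("message", {})
    if "reasoning_content" in message and not (message.get("reasoning_content") or "").strip():
        return False
    return True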

Do you specifically need Qwen3 catalog support? If yes, option 1; if not, option 2 + 3. Also worth a unit test in tests/unit/ mocking LemonadeClient.chat_completions with the three response shapes so this doesn't silently regress.

Requesting changes for now so we don't merge a Gemma regression in service of a Qwen fix.

bong-water-water-bong and others added 4 commits May 2, 2026 18:50
Verification step in `gaia init` was failing with "Empty response" for
modern thinking-mode chat models (Qwen3 family, etc.). Three small
changes in `_verify_model`:

1. Append `/no_think` to the verification prompt — explicitly suppress
   thinking mode where the model supports the tag.
2. Bump max_tokens from 10 to 64 — gives short thinking blocks room to
   finish before content tokens start, even when the tag isn't honored.
3. Treat `finish_reason == "length"` with empty `content` as a soft
   pass. If the model was producing tokens (length cutoff) it's clearly
   alive; the budget was just too tight for it to surface a final
   answer. False-failing on this case marks otherwise-working models as
   broken and prompts users to delete + re-download (~22 GB for
   Qwen3.5-35B-A3B).

Repro before this patch:
  gaia init --profile chat
  # ❌ Qwen3.5-35B-A3B-GGUF - Empty response
  # ❌ Qwen3-0.6B-GGUF - Empty response
  # (manual chat with both models works fine via /v1/chat/completions)

After this patch all four models in the chat profile verify cleanly on
a Strix Halo box (Lemonade Server 10.2.0, llama.cpp b1231 lane).

Three independent bugs combined to keep the desktop window from
ever rendering on a Linux Wayland/X11 session — only the tray icon
and Python backend ever came up. Each fix is independently necessary:

- assets/icon.png: 4096×4096 → 512×512. The icon decoded to RGBA was
  exactly 67,108,864 bytes; Electron tries to send it to the X server
  in a single request, X11 rejects with "Cannot send request of length
  67108896", and the renderer dies before ready-to-show. 512×512 is
  the standard high-DPI Linux icon size and well under the X11 limit.

- main.cjs window-all-closed: during bootstrap, the install-progress
  dialog is the only window in flight and closes before the main
  window is created. With trayManager not yet initialized, the
  default handler called app.quit(), will-quit started cleanup, and
  the whenReady continuation raced into createWindow() during
  teardown — V8 crashed with DisallowJavascriptExecutionScope on
  Wayland (silent exit on X11). New startupInProgress flag suppresses
  the quit until the main window's first ready-to-show.

- main.cjs autoUpdater.init: now gated on GAIA_ENABLE_AUTOUPDATE=1.
  By default the updater points at upstream amd/gaia's release
  channel and a successful "update" overwrites a forked binary with
  upstream's, silently destroying any divergence (mobile-QR pairing,
  packaging tweaks, anything). Forks must opt in explicitly.

Verified end-to-end: window now opens on default Wayland, stable for
40+ seconds (was crashing ≤20 s before), agent session created and
a chat message round-tripped through the builder agent.

The Agent UI's WelcomeScreen reads /api/system/status, which checks
for ~/.gaia/chat/initialized (src/gaia/ui/routers/system.py:270) to
decide whether to show the "First time? Run gaia init --profile chat"
banner. Nothing in the codebase ever writes that marker, so the
banner was permanently sticky no matter how many times init ran or
how cleanly it succeeded — verified by grep across src/.

Touch the marker at the end of run() right before _print_completion,
so any successful gaia init (any profile) clears the banner. Failure
to write is non-fatal — we warn and proceed, since init really did
succeed and a stale banner is a softer failure than aborting.

Adds a Mobile Access gateway in the Agent UI that exposes the
backend to the user's phone via a copyable URL + QR code. The
sidebar toggle (re)starts the tunnel in local mode every click —
fresh URL every time — so an expired pairing can be refreshed
without restarting the desktop app.

- frontend: App.tsx handleMobileToggle deduplicated to a single
  startTunnel({ mode: 'local' }) path; MobileAccessModal renders
  QR + URL; api.ts + types add the mode parameter
- backend: ui/tunnel.py grows the local-mode adapter,
  ui/routers/tunnel.py exposes it, ui/server.py wires the router
- cli: gaia tunnel surface for testing without the UI

Fork divergence over upstream amd/gaia — see
project_1bit_gaia_mobile_qr in the project memory.
@bong-water-water-bong force-pushed the fix/verify-thinking-models branch from e4b59d4 to ae783d8 on May 2, 2026 at 21:55.
@github-actions bot added the cli (CLI changes) label on May 2, 2026.

Labels

cli CLI changes
