
fix(installer): _verify_model passes for thinking-mode models #920

Open
bong-water-water-bong wants to merge 4 commits into amd:main from bong-water-water-bong:fix/verify-thinking-models

Conversation

@bong-water-water-bong

Problem

gaia init --profile chat flags every modern thinking-mode chat model as broken:

❌ Qwen3.5-35B-A3B-GGUF - Empty response
❌ Qwen3-0.6B-GGUF - Empty response
✓  Qwen3-VL-4B-Instruct-GGUF - OK
✓  nomic-embed-text-v2-moe-GGUF - OK

Both flagged models work cleanly when hit directly via /v1/chat/completions outside the gaia init flow. The problem is the verification function's prompt + budget combination, not the models.

Root cause

_verify_model in src/gaia/installer/init_command.py runs:

response = client.chat_completions(
    model=model_id,
    messages=[{"role": "user", "content": "Say 'ok'"}],
    max_tokens=10,
    temperature=0,
)

Qwen3-family (and other thinking-mode) chat models emit <think>...</think> tokens before they emit content. With max_tokens=10 and no /no_think directive, the model burns the full budget inside the thinking block, the response surfaces with empty content and finish_reason="length", and the verifier reports Empty response. The user is then told the model is corrupt and prompted to delete + re-download (which for Qwen3.5-35B-A3B is ~22 GB).
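For illustration, the response the verifier sees in this failure mode has roughly the following shape (OpenAI-style /v1/chat/completions fields; the payload below is an assumption for clarity, not captured output):

# Hypothetical response shape in the failure case; field names follow the
# OpenAI-style chat-completions schema, values are illustrative only.
failing_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "",  # nothing surfaced outside the <think> block
                "reasoning_content": "<think>Okay, the user wants me to say",  # budget spent here
            },
            "finish_reason": "length",  # cut off by max_tokens=10, not a dead model
        }
    ]
}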

Patch

Three small changes, all in _verify_model (a sketch of the combined check follows the list):

  1. Append /no_think to the verification prompt — explicitly suppress thinking mode where the tag is honored.
  2. Bump max_tokens=10 → 64 — give short thinking blocks room to complete even when the tag isn't honored.
  3. Treat finish_reason="length" with empty content as a soft pass — if the model was producing tokens (length cutoff) it's clearly alive; the budget was just too tight for a final answer to surface. False-failing this case prompts users to nuke working models.
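A minimal sketch of how the three changes could fit together; the helper name, dict-style access, and the (passed, error) return convention are assumptions for illustration, not the actual diff:

def _verify_chat_model(client, model_id):
    """Hypothetical sketch of the patched verification logic (not the real diff)."""
    response = client.chat_completions(
        model=model_id,
        # (1) /no_think suppresses thinking mode where the tag is honored; older
        #     models just treat it as extra prompt text.
        messages=[{"role": "user", "content": "Say 'ok' /no_think"}],
        # (2) room for a short <think> block plus a one-word answer
        max_tokens=64,
        temperature=0,
    )
    choice = response["choices"][0]
    content = (choice["message"].get("content") or "").strip()
    if content:
        return True, None
    # (3) soft pass: the model was actively generating but hit the token budget
    if choice.get("finish_reason") == "length":
        return True, None
    return False, "Empty response"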

Repro

Before the patch (Lemonade 10.2.0, llama.cpp b1231 lane, Strix Halo):

gaia init --profile chat --yes
# ❌ Qwen3.5-35B-A3B-GGUF - Empty response
# ❌ Qwen3-0.6B-GGUF - Empty response
# ⚠  Models verified: 2/4 passed

Manual verification of the same models from the same shell:

$ curl -s :13305/v1/chat/completions -d '{"model":"Qwen3-0.6B-GGUF","messages":[{"role":"user","content":"Say ok /no_think"}],"max_tokens":20}' | jq -r '.choices[0].message.content'
ready

After the patch all four chat-profile models pass verification.

Scope

Single function. No behavior change for embedding models, image-generation models, or non-thinking LLMs. No new dependencies. /no_think is ignored by older models (it just becomes part of the prompt), so the patch is backward-compatible across the catalog.

@kovtcharov
Collaborator

@claude review this PR.

@github-actions
Contributor

Review

🟢 LGTM — small, surgical fix for a real false-failure that was telling users to delete healthy ~22 GB models. Approving with one optional follow-up.

What works

  • Scope is tight. One function in src/gaia/installer/init_command.py:1386-1408. No collateral changes to embedding/SD/non-thinking paths.
  • Backward-compatible. /no_think is honored by Qwen3+ and is benign prompt content for everything else, so the existing pass set doesn't regress.
  • Three independent mitigations stack correctly. Even if /no_think isn't honored by a future thinking model, the max_tokens=10 → 64 bump plus the finish_reason="length" soft-pass keeps the verifier from false-failing it.
  • Consistent with CLAUDE.md "fail loudly." The soft-pass is not a silent fallback — empty content with finish_reason ≠ "length", missing choices, or an exception still fails loudly with an actionable message. Only the specific "model was actively generating, budget was tight" case is reclassified.
  • Comments earn their place. The two new comment blocks explain why (thinking-mode tokens, budget exhaustion) — exactly the non-obvious context a future reader needs.
  • finish_reason access path is correct. Matches the schema at src/gaia/llm/lemonade_client.py:1281 (choices[0].finish_reason).

Observations (non-blocking)

🟢 Minor — soft-pass robustness. A model emitting max_tokens of garbage tokens with no <think> close and no surfaced content would now pass verification. In practice this is far less harmful than the current bug (false-failing a working model and prompting a 22 GB redownload), and the curated catalog is unlikely to contain such a model. Worth a one-liner note inline, not a blocker.

🟢 Minor — no test added. _verify_model has no unit coverage today (grep _verify_model returns just the implementation). A small mocked test exercising the three LLM branches — content present, finish_reason="length" + empty, fully empty/exception — would lock the contract so a future "make verification stricter" refactor doesn't silently re-introduce the regression. Nice-to-have follow-up, not a blocker for this fix.

🟢 Nit — max_tokens=64 choice. Reasonable for "Say 'ok'"-class prompts. If reports surface of a thinking model whose <think> block routinely exceeds 64 tokens even on this trivial input, bumping to ~128 is cheap. Premature to do now.

Verification

PR description shows before/after on Strix Halo with all four chat-profile models passing. The repro is precise (model IDs, Lemonade version, llama.cpp lane), and the manual curl against /v1/chat/completions is a clean control proving the models themselves were never the problem.

Suggested follow-up (separate PR)

Add tests/unit/test_init_verify.py with three cases against a mocked LemonadeClient.chat_completions:

  1. {"choices":[{"message":{"content":"ok"}, "finish_reason":"stop"}]}(True, None)
  2. {"choices":[{"message":{"content":""}, "finish_reason":"length"}]}(True, None) (the new soft-pass)
  3. {"choices":[{"message":{"content":""}, "finish_reason":"stop"}]}(False, "Empty response")

That locks in the three-way contract this PR establishes.
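A rough shape for that test, assuming _verify_model is importable at module level, takes a client plus a model id, and returns a (passed, error) tuple; the signature and import path are assumptions:

from unittest.mock import MagicMock

import pytest

from gaia.installer.init_command import _verify_model  # assumed import path

# The three response shapes from the review above.
CASES = [
    ({"choices": [{"message": {"content": "ok"}, "finish_reason": "stop"}]}, (True, None)),
    ({"choices": [{"message": {"content": ""}, "finish_reason": "length"}]}, (True, None)),
    ({"choices": [{"message": {"content": ""}, "finish_reason": "stop"}]}, (False, "Empty response")),
]

@pytest.mark.parametrize("response, expected", CASES)
def test_verify_model_contract(response, expected):
    client = MagicMock()
    client.chat_completions.return_value = response
    assert _verify_model(client, "Qwen3-0.6B-GGUF") == expected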


Ready to merge from my read. cc @kovtcharov-amd for final approval.

Collaborator

@itomek left a comment


Thanks for digging into this @bong-water-water-bong, the diagnosis is solid.

Before merging I want to flag something I ran into when I tested the patch end-to-end. On main today the chat profile no longer ships Qwen3 — it defaults to Gemma-4-E4B-it-GGUF (see INIT_PROFILES). So the verifier's primary target right now is Gemma, not Qwen3, and Gemma reacts badly to the /no_think token.

Reproducible against a live Lemonade instance with this PR applied:

| Model | Before | After |
| --- | --- | --- |
| Gemma-4-E4B-it-GGUF (chat default) | True · 0.22s · content="ok" | True · 3.87s · empty content, 241 chars reasoning_content, finish_reason="length" — passes only via the new soft-pass branch |
| LFM2-1.2B-GGUF | True · content="Okay." | True · content="Okay" |
| LFM2-8B-A1B-GGUF | True · content="Ok." | True · content="ok \nno_think" (echoing /no_think as literal text) |

Gemma's reasoning trace on Say 'ok' /no_think:

Thinking Process:

  1. Analyze the Request: The user has provided the input "Say ok /no_think". This is a command instructing the AI to output a specific phrase or choice.
  2. Identify the Options: The options are "ok" and "/no_think"…

It's parsing /no_think as a multiple-choice option and burning the full 64-token budget. Verification passes only via the soft-pass branch — the signal drops from "model produced a coherent reply" to "model produced any tokens." Same prompt without /no_think still gets a clean 2-token "ok".

So the patch doesn't break verification, but it's ~17× slower for Gemma and the health signal is weaker.

A few shapes that keep the Qwen3 fix without regressing Gemma:

  1. Gate /no_think on Qwen3-family IDs (if "qwen3" in model_id.lower(): prompt += " /no_think"). Keeps the fix where it's needed.
  2. Drop /no_think entirely and rely on the max_tokens=64 bump + soft-pass alone. Qwen3 finishes a short <think> well under 64 tokens, so this should be enough.
  3. Tighten the soft-pass to also require usage.completion_tokens > 0 (and non-empty reasoning_content when present), so a model that genuinely streams nothing still fails; a rough sketch follows this list.
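For option 3, the tightened check could look something like this (the usage and reasoning_content field names are assumed from the responses quoted in this thread):

def _length_soft_pass(choice, usage):
    """Hypothetical tightened soft-pass check; field names are assumptions."""
    if choice.get("finish_reason") != "length":
        return False
    # require evidence that tokens were actually streamed before the cutoff
    if (usage or {}).get("completion_tokens", 0) <= 0:
        return False
    message = choice.get("message", {})
    if "reasoning_content" in message and not (message.get("reasoning_content") or "").strip():
        return False
    return True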

Do you specifically need Qwen3 catalog support? If yes, option 1; if not, option 2 + 3. Also worth a unit test in tests/unit/ mocking LemonadeClient.chat_completions with the three response shapes so this doesn't silently regress.

Requesting changes for now so we don't merge a Gemma regression in service of a Qwen fix.

bong-water-water-bong and others added 4 commits May 2, 2026 18:50
Verification step in `gaia init` was failing with "Empty response" for
modern thinking-mode chat models (Qwen3 family, etc.). Three small
changes in `_verify_model`:

1. Append `/no_think` to the verification prompt — explicitly suppress
   thinking mode where the model supports the tag.
2. Bump max_tokens from 10 to 64 — gives short thinking blocks room to
   finish before content tokens start, even when the tag isn't honored.
3. Treat `finish_reason == "length"` with empty `content` as a soft
   pass. If the model was producing tokens (length cutoff) it's clearly
   alive; the budget was just too tight for it to surface a final
   answer. False-failing on this case marks otherwise-working models as
   broken and prompts users to delete + re-download (~22 GB for
   Qwen3.5-35B-A3B).

Repro before this patch:
  gaia init --profile chat
  # ❌ Qwen3.5-35B-A3B-GGUF - Empty response
  # ❌ Qwen3-0.6B-GGUF - Empty response
  # (manual chat with both models works fine via /v1/chat/completions)

After this patch all four models in the chat profile verify cleanly on
a Strix Halo box (Lemonade Server 10.2.0, llama.cpp b1231 lane).

Three independent bugs combined to keep the desktop window from
ever rendering on a Linux Wayland/X11 session — only the tray icon
and Python backend ever came up. Each fix is independently necessary:

- assets/icon.png: 4096×4096 → 512×512. The icon decoded to RGBA was
  exactly 67,108,864 bytes; Electron tries to send it to the X server
  in a single request, X11 rejects with "Cannot send request of length
  67108896", and the renderer dies before ready-to-show. 512×512 is
  the standard high-DPI Linux icon size and well under the X11 limit.

- main.cjs window-all-closed: during bootstrap, the install-progress
  dialog is the only window in flight and closes before the main
  window is created. With trayManager not yet initialized, the
  default handler called app.quit(), will-quit started cleanup, and
  the whenReady continuation raced into createWindow() during
  teardown — V8 crashed with DisallowJavascriptExecutionScope on
  Wayland (silent exit on X11). New startupInProgress flag suppresses
  the quit until the main window's first ready-to-show.

- main.cjs autoUpdater.init: now gated on GAIA_ENABLE_AUTOUPDATE=1.
  By default the updater points at upstream amd/gaia's release
  channel and a successful "update" overwrites a forked binary with
  upstream's, silently destroying any divergence (mobile-QR pairing,
  packaging tweaks, anything). Forks must opt in explicitly.

Verified end-to-end: window now opens on default Wayland, stable for
40+ seconds (was crashing ≤20 s before), agent session created and
a chat message round-tripped through the builder agent.

The Agent UI's WelcomeScreen reads /api/system/status, which checks
for ~/.gaia/chat/initialized (src/gaia/ui/routers/system.py:270) to
decide whether to show the "First time? Run gaia init --profile chat"
banner. Nothing in the codebase ever writes that marker, so the
banner was permanently sticky no matter how many times init ran or
how cleanly it succeeded — verified by grep across src/.

Touch the marker at the end of run() right before _print_completion,
so any successful gaia init (any profile) clears the banner. Failure
to write is non-fatal — we warn and proceed, since init really did
succeed and a stale banner is a softer failure than aborting.

Adds a Mobile Access gateway in the Agent UI that exposes the
backend to the user's phone via a copyable URL + QR code. The
sidebar toggle (re)starts the tunnel in local mode every click —
fresh URL every time — so an expired pairing can be refreshed
without restarting the desktop app.

- frontend: App.tsx handleMobileToggle deduplicated to a single
  startTunnel({ mode: 'local' }) path; MobileAccessModal renders
  QR + URL; api.ts + types add the mode parameter
- backend: ui/tunnel.py grows the local-mode adapter,
  ui/routers/tunnel.py exposes it, ui/server.py wires the router
- cli: gaia tunnel surface for testing without the UI

Fork divergence over upstream amd/gaia — see
project_1bit_gaia_mobile_qr in the project memory.
@bong-water-water-bong force-pushed the fix/verify-thinking-models branch from e4b59d4 to ae783d8 on May 2, 2026 at 21:55.
@github-actions bot added the cli (CLI changes) label on May 2, 2026.

Labels

cli CLI changes
