
feat(chat): route Gemma via LiteLLM, add Qwen 3.6, deadline + fallback#694

Open
Movm wants to merge 1 commit into `master` from `feat/chat-fallback-gemma-litellm`

Conversation


@Movm Movm commented Apr 26, 2026

Summary

Regolo's gemma4-31b endpoint hangs upstream — every notebook chat with the default Gemma model spun forever because the AI SDK has no built-in first-token deadline. This PR fixes the immediate hang, adds resilience for similar future outages, and adds Qwen 3.6 27B as a new selectable model.

What changes for users

  • Gemma 4 still appears as a model option but now routes through LiteLLM (Verdigado-hosted gemma4) instead of Regolo. No UI change; old gemma-regolo selections are migrated to gemma-litellm on next page load.
  • Qwen 3.6 27B is a new selectable model. Reasoning streamer wires up automatically (already in REGOLO_REASONING_MODELS).
  • If a model fails to produce a first content token within 20s, the server silently falls back to a sibling on the other provider. No user-visible UI for fallbacks — just a log line in the browser console (per request from product).

Chinese-only-when-selected firewall

This is the load-bearing constraint of this PR. Qwen entries in `AVAILABLE_MODELS` intentionally have NO `fallback` field. The firewall works in both directions:

  • Never auto-route INTO Qwen (would violate the informed-consent boundary — users see the explicit "Chinesisches Modell – unterliegt staatlicher Zensur" ("Chinese model – subject to state censorship") warning before opting in)
  • Never silently auto-route OUT of Qwen (the user picked Qwen for a reason; substituting hides the failure)

Encoded in code, not just docs: `streamWithFallback` early-returns when `primary.config?.fallback` is undefined. Reviewers can verify the firewall by reading `AVAILABLE_MODELS` — Qwen entries omit the field, and the comment block at `ModelConfig.fallback` explains why.
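For reviewers skimming without the diff, the firewall shape described above looks roughly like the following sketch. All names (`ModelConfig`, `AVAILABLE_MODELS`, `streamWithFallback`, the model IDs) are taken from this description, but the bodies are illustrative, not the PR's actual code:

```typescript
// Sketch of the Chinese-only-when-selected firewall: an absent
// `fallback` field means "never silently substitute this model".
interface ModelConfig {
  provider: 'regolo' | 'litellm';
  model: string;
  // Absent fallback = no auto-routing out of this model.
  fallback?: string;
}

const AVAILABLE_MODELS: Record<string, ModelConfig> = {
  'gemma-litellm': { provider: 'litellm', model: 'gemma4', fallback: 'gpt-oss-regolo' },
  'gpt-oss-regolo': { provider: 'regolo', model: 'gpt-oss', fallback: 'gemma-litellm' },
  // Qwen entries intentionally omit `fallback` — the firewall.
  'qwen-3.6-27b': { provider: 'regolo', model: 'qwen3.6-27b' },
};

async function streamWithFallback(
  primaryId: string,
  stream: (id: string) => Promise<string | null>, // null = stream failed
): Promise<string | null> {
  const result = await stream(primaryId);
  if (result !== null) return result;
  const fallbackId = AVAILABLE_MODELS[primaryId]?.fallback;
  if (fallbackId === undefined) return null; // firewall: fail loudly, no substitution
  console.log(`[Notebook] Model fallback: ${primaryId} → ${fallbackId}`);
  return stream(fallbackId);
}
```

Note that the firewall is structural: nobody has to remember a rule, because a missing field cannot be auto-routed through.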

Pre-existing bug fixed

`getModel('litellm', modelId)` ignored the modelId argument and always used `LITELLM_DEFAULT_MODEL`. Latent footgun — masked today because both LiteLLM-mapped IDs (`litellm`, `gemma-litellm`) want the same physical model. Adding any second LiteLLM-served model would have silently routed to the wrong one. One-line fix included.
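The bug and its one-line fix can be sketched as follows. The constant name and `getModel` signature are assumptions based on this description, not the real module:

```typescript
// Hypothetical sketch of the pre-existing getModel bug.
const LITELLM_DEFAULT_MODEL = 'gemma4';

function getModel(provider: string, modelId?: string): string {
  if (provider === 'litellm') {
    // Before: `return LITELLM_DEFAULT_MODEL;` — modelId silently ignored.
    // After: honor an explicit modelId, default only when absent.
    return modelId ?? LITELLM_DEFAULT_MODEL;
  }
  return modelId ?? '';
}
```

With only one LiteLLM-served physical model the bug was invisible; the fix matters the moment a second LiteLLM model ID is added.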

Implementation highlights

  • New `litellmFetchWithThinkingDisabled` (sibling to `regoloFetchWithThinkingDisabled`) injects Ollama's top-level `think: false` so gemma streams content into `delta.content` instead of burning the entire token budget on `reasoning`.
  • 20s first-token deadline guards against silent upstream hangs. Single shared deadline across initial probes (was accidentally giving 40s grace via per-call setTimeout).
  • Reasoning streamer split into Phase-1 (race vs deadline until first text chunk) + Phase-2 (drain without race) — eliminates per-chunk Promise.race microtask hops after first content arrives.
  • Native `AbortSignal.any()` (Node 20.3+) replaces a hand-rolled `composeAbortSignals` helper.
  • `streamAndAccumulate` / `streamAndAccumulateWithReasoning` keep their old null-on-failure shape for existing chat router callers via a shared `wrapWithCompatCatch` factory; they get the deadline + empty-completion safety nets for free.
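The deadline and two-phase streaming bullets above can be illustrated with a simplified model. This is a sketch under stated assumptions — `streamWithDeadline` and its shape are illustrative, not the actual `responseStreamingService` code — but it shows the single shared deadline, the Phase-1 race, and the race-free Phase-2 drain:

```typescript
const FIRST_TOKEN_DEADLINE_MS = 20_000;

// Phase-1: race each read against ONE shared deadline until the first
// non-empty chunk. Phase-2: plain drain, no per-chunk Promise.race hops.
async function* streamWithDeadline(
  chunks: AsyncIterable<string>,
  deadlineMs = FIRST_TOKEN_DEADLINE_MS,
  signal?: AbortSignal,
): AsyncGenerator<string> {
  // Native AbortSignal.any (Node 20.3+) combines caller abort + deadline.
  const deadline = signal
    ? AbortSignal.any([signal, AbortSignal.timeout(deadlineMs)])
    : AbortSignal.timeout(deadlineMs);

  const timeout = new Promise<never>((_, reject) => {
    if (deadline.aborted) return reject(new Error('first_token_timeout'));
    deadline.addEventListener(
      'abort',
      () => reject(new Error('first_token_timeout')),
      { once: true },
    );
  });
  timeout.catch(() => {}); // avoid a late unhandled rejection after Phase-1

  const it = chunks[Symbol.asyncIterator]();

  // Phase 1: shared deadline across all initial probes (not per-call).
  let first: IteratorResult<string>;
  while (true) {
    first = await Promise.race([it.next(), timeout]);
    if (first.done) return;
    if (first.value.length > 0) break; // first real content token
  }
  yield first.value;

  // Phase 2: drain without racing.
  for (let r = await it.next(); !r.done; r = await it.next()) {
    yield r.value;
  }
}
```

The earlier bug the second bullet fixes falls out naturally here: creating a fresh `setTimeout` per probe would reset the clock each time, which is how the old code accidentally granted ~40s of grace.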

Out of scope (flagged for follow-up)

  • Vision pipeline also uses broken Regolo gemma (`providers.ts:20` — `VISION_MODEL = { provider: 'regolo', model: 'gemma4-31b' }`). Same outage hits image analysis; same fallback policy could apply. Separate PR.
  • Worker pool / monitor / voice paths also call `getModel()` but are non-streaming, so failure shape differs. Bigger refactor.

Test plan

  • Manual: Berlin notebook chat with Gemma 4 (default) returns an answer. Pre-PR: spins forever. Post-PR: gemma streams via LiteLLM (~10s reasoning preamble swallowed by AI SDK, then content streams).
  • Manual: Pick Qwen 3.6 27B in the model picker — answer streams with visible reasoning chunks via the existing reasoning UI.
  • Manual: Pick Qwen 120B and ask a question — works as before; if Regolo Qwen happens to be down, the error message says "Antwort konnte nicht generiert werden" ("The answer could not be generated") — no silent fallback to a non-Chinese model.
  • Manual: With browser DevTools open, force Regolo to fail (block `api.regolo.ai` in Network panel) and pick GPT-OSS — server falls back to LiteLLM gemma; no UI banner; `[Notebook] Model fallback: gpt-oss-regolo → gemma-litellm (first_token_timeout)` appears in console.
  • Persistence: load app with stale localStorage `selectedModel: 'gemma-regolo'` — chatStore migration v6 upgrades to `gemma-litellm` automatically.
  • Regression: existing chat routes (`chatGraphContractRouter`, `chatGraphController`, `searchGraphController`) still work — they use the backward-compatible `streamAndAccumulate` wrapper; behavior unchanged plus the new 20s deadline as bonus safety.
  • `pnpm --filter @gruenerator/api typecheck`
  • `pnpm --filter @gruenerator/chat typecheck`
  • `pnpm --filter @gruenerator/api test` (specifically `responseStreaming` if any tests exist)
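The persistence step in the test plan (chatStore migration v6 upgrading stale `gemma-regolo` selections) could look roughly like this. The state shape and version field are assumptions for illustration, not the real chatStore:

```typescript
// Hypothetical sketch of the v6 chatStore migration.
interface ChatState {
  version: number;
  selectedModel: string;
}

function migrateChatState(state: ChatState): ChatState {
  if (state.version >= 6) return state; // already migrated
  return {
    ...state,
    version: 6,
    // Old Regolo-backed Gemma ID → new LiteLLM-backed ID.
    selectedModel:
      state.selectedModel === 'gemma-regolo' ? 'gemma-litellm' : state.selectedModel,
  };
}
```

The key property to verify manually is that non-Gemma selections (e.g. a Qwen ID) pass through untouched.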

Notes

Regolo's gemma4-31b endpoint hangs upstream — every notebook chat with
the default Gemma model just spun forever because the AI SDK has no
built-in first-token deadline. This change:

- Adds litellmFetchWithThinkingDisabled (sibling to
  regoloFetchWithThinkingDisabled) injecting Ollama's `think: false` so
  LiteLLM-served gemma streams content instead of burning its entire
  token budget on `reasoning`.
- Re-routes the user-facing "Gemma 4" model from Regolo → LiteLLM. Old
  `gemma-regolo` ID is aliased server-side and migrated client-side
  (chatStore v6) to the new `gemma-litellm` ID.
- Adds Qwen 3.6 27B as a selectable model (already in the existing
  Regolo reasoning-stream allowlist, so no extra wiring).
- Introduces a 20s first-token deadline + single-step cross-provider
  fallback (gemma-litellm ↔ gpt-oss-regolo) in responseStreamingService.
  Qwen entries intentionally have no `fallback` field — the
  Chinese-only-when-selected firewall (informed-consent boundary,
  documented in ModelConfig).
- Fixes pre-existing bug: getModel('litellm', modelId) ignored the
  modelId arg and always used LITELLM_DEFAULT_MODEL.

Fallback is silent end-user-side: server emits a `fallback` SSE event,
both runtime adapters log it to the browser console, no UI banner.

Implementation notes:
- streamAndAccumulate / streamAndAccumulateWithReasoning now have a
  shared `wrapWithCompatCatch` factory and an `*OrThrow` internal layer
  used by streamWithFallback. Existing chat router callers see the same
  null-on-failure shape, plus the new deadline + empty-completion
  safety nets for free.
- Single shared deadline across initial-probe iterations (was
  accidentally giving 40s grace via per-call setTimeout).
- Reasoning streamer split into Phase-1 (race vs deadline until first
  text) + Phase-2 (drain without race) — eliminates wasted Promise.race
  microtask hops on every reasoning chunk after first content.
- Uses native AbortSignal.any() (Node 20.3+) instead of a hand-rolled
  composeAbortSignals helper.
