
feat(chat): route Gemma via LiteLLM, add Qwen 3.6, deadline + fallback#694

Open
Movm wants to merge 1 commit into `master` from `feat/chat-fallback-gemma-litellm`

Conversation


@Movm Movm commented Apr 26, 2026

Summary

Regolo's gemma4-31b endpoint hangs upstream — every notebook chat with the default Gemma model spun forever because the AI SDK has no built-in first-token deadline. This PR fixes the immediate hang, adds resilience for similar future outages, and adds Qwen 3.6 27B as a new selectable model.

What changes for users

  • Gemma 4 still appears as a model option but now routes through LiteLLM (Verdigado-hosted gemma4) instead of Regolo. No UI change; old gemma-regolo selections are migrated to gemma-litellm on next page load.
  • Qwen 3.6 27B is a new selectable model. Reasoning streamer wires up automatically (already in REGOLO_REASONING_MODELS).
  • If a model fails to produce a first content token within 20s, the server silently falls back to a sibling on the other provider. No user-visible UI for fallbacks — just a log line in the browser console (per request from product).

Chinese-only-when-selected firewall

This is the load-bearing constraint of this PR. Qwen entries in `AVAILABLE_MODELS` intentionally have NO `fallback` field. The firewall works in both directions:

  • Never auto-route INTO Qwen (would violate the informed-consent boundary — users see the explicit "Chinesisches Modell – unterliegt staatlicher Zensur" ("Chinese model – subject to state censorship") warning before opting in)
  • Never silently auto-route OUT of Qwen (the user picked Qwen for a reason; substituting hides the failure)

Encoded in code, not just docs: `streamWithFallback` early-returns when `primary.config?.fallback` is undefined. Reviewers can verify the firewall by reading `AVAILABLE_MODELS` — Qwen entries omit the field, and the comment block at `ModelConfig.fallback` explains why.
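For reviewers skimming without the diff, the firewall shape described above looks roughly like the following sketch. All names (`ModelConfig`, `AVAILABLE_MODELS`, `streamWithFallback`, the model IDs) are taken from this description, but the bodies are illustrative, not the PR's actual code:

```typescript
// Sketch of the Chinese-only-when-selected firewall: an absent
// `fallback` field means "never silently substitute this model".
interface ModelConfig {
  provider: 'regolo' | 'litellm';
  model: string;
  // Absent fallback = no auto-routing out of this model.
  fallback?: string;
}

const AVAILABLE_MODELS: Record<string, ModelConfig> = {
  'gemma-litellm': { provider: 'litellm', model: 'gemma4', fallback: 'gpt-oss-regolo' },
  'gpt-oss-regolo': { provider: 'regolo', model: 'gpt-oss', fallback: 'gemma-litellm' },
  // Qwen entries intentionally omit `fallback` — the firewall.
  'qwen-3.6-27b': { provider: 'regolo', model: 'qwen3.6-27b' },
};

async function streamWithFallback(
  primaryId: string,
  stream: (id: string) => Promise<string | null>, // null = stream failed
): Promise<string | null> {
  const result = await stream(primaryId);
  if (result !== null) return result;
  const fallbackId = AVAILABLE_MODELS[primaryId]?.fallback;
  if (fallbackId === undefined) return null; // firewall: fail loudly, no substitution
  console.log(`[Notebook] Model fallback: ${primaryId} → ${fallbackId}`);
  return stream(fallbackId);
}
```

Note that the firewall is structural: nobody has to remember a rule, because a missing field cannot be auto-routed through.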

Pre-existing bug fixed

`getModel('litellm', modelId)` ignored the modelId argument and always used `LITELLM_DEFAULT_MODEL`. Latent footgun — masked today because both LiteLLM-mapped IDs (`litellm`, `gemma-litellm`) want the same physical model. Adding any second LiteLLM-served model would have silently routed to the wrong one. One-line fix included.
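The bug and its one-line fix can be sketched as follows. The constant name and `getModel` signature are assumptions based on this description, not the real module:

```typescript
// Hypothetical sketch of the pre-existing getModel bug.
const LITELLM_DEFAULT_MODEL = 'gemma4';

function getModel(provider: string, modelId?: string): string {
  if (provider === 'litellm') {
    // Before: `return LITELLM_DEFAULT_MODEL;` — modelId silently ignored.
    // After: honor an explicit modelId, default only when absent.
    return modelId ?? LITELLM_DEFAULT_MODEL;
  }
  return modelId ?? '';
}
```

With only one LiteLLM-served physical model the bug was invisible; the fix matters the moment a second LiteLLM model ID is added.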

Implementation highlights

  • New `litellmFetchWithThinkingDisabled` (sibling to `regoloFetchWithThinkingDisabled`) injects Ollama's top-level `think: false` so gemma streams content into `delta.content` instead of burning the entire token budget on `reasoning`.
  • 20s first-token deadline guards against silent upstream hangs. Single shared deadline across initial probes (was accidentally giving 40s grace via per-call setTimeout).
  • Reasoning streamer split into Phase-1 (race vs deadline until first text chunk) + Phase-2 (drain without race) — eliminates per-chunk Promise.race microtask hops after first content arrives.
  • Native `AbortSignal.any()` (Node 20.3+) replaces a hand-rolled `composeAbortSignals` helper.
  • `streamAndAccumulate` / `streamAndAccumulateWithReasoning` keep their old null-on-failure shape for existing chat router callers via a shared `wrapWithCompatCatch` factory; they get the deadline + empty-completion safety nets for free.
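The deadline and two-phase streaming bullets above can be illustrated with a simplified model. This is a sketch under stated assumptions — `streamWithDeadline` and its shape are illustrative, not the actual `responseStreamingService` code — but it shows the single shared deadline, the Phase-1 race, and the race-free Phase-2 drain:

```typescript
const FIRST_TOKEN_DEADLINE_MS = 20_000;

// Phase-1: race each read against ONE shared deadline until the first
// non-empty chunk. Phase-2: plain drain, no per-chunk Promise.race hops.
async function* streamWithDeadline(
  chunks: AsyncIterable<string>,
  deadlineMs = FIRST_TOKEN_DEADLINE_MS,
  signal?: AbortSignal,
): AsyncGenerator<string> {
  // Native AbortSignal.any (Node 20.3+) combines caller abort + deadline.
  const deadline = signal
    ? AbortSignal.any([signal, AbortSignal.timeout(deadlineMs)])
    : AbortSignal.timeout(deadlineMs);

  const timeout = new Promise<never>((_, reject) => {
    if (deadline.aborted) return reject(new Error('first_token_timeout'));
    deadline.addEventListener(
      'abort',
      () => reject(new Error('first_token_timeout')),
      { once: true },
    );
  });
  timeout.catch(() => {}); // avoid a late unhandled rejection after Phase-1

  const it = chunks[Symbol.asyncIterator]();

  // Phase 1: shared deadline across all initial probes (not per-call).
  let first: IteratorResult<string>;
  while (true) {
    first = await Promise.race([it.next(), timeout]);
    if (first.done) return;
    if (first.value.length > 0) break; // first real content token
  }
  yield first.value;

  // Phase 2: drain without racing.
  for (let r = await it.next(); !r.done; r = await it.next()) {
    yield r.value;
  }
}
```

The earlier bug the second bullet fixes falls out naturally here: creating a fresh `setTimeout` per probe would reset the clock each time, which is how the old code accidentally granted ~40s of grace.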

Out of scope (flagged for follow-up)

  • Vision pipeline also uses broken Regolo gemma (`providers.ts:20` — `VISION_MODEL = { provider: 'regolo', model: 'gemma4-31b' }`). Same outage hits image analysis; same fallback policy could apply. Separate PR.
  • Worker pool / monitor / voice paths also call `getModel()` but are non-streaming, so failure shape differs. Bigger refactor.

Test plan

  • Manual: Berlin notebook chat with Gemma 4 (default) returns an answer. Pre-PR: spins forever. Post-PR: gemma streams via LiteLLM (~10s reasoning preamble swallowed by AI SDK, then content streams).
  • Manual: Pick Qwen 3.6 27B in the model picker — answer streams with visible reasoning chunks via the existing reasoning UI.
  • Manual: Pick Qwen 120B and ask a question — works as before; if Regolo Qwen happens to be down, the error message says "Antwort konnte nicht generiert werden" ("The answer could not be generated") — no silent fallback to a non-Chinese model.
  • Manual: With browser DevTools open, force Regolo to fail (block `api.regolo.ai` in Network panel) and pick GPT-OSS — server falls back to LiteLLM gemma; no UI banner; `[Notebook] Model fallback: gpt-oss-regolo → gemma-litellm (first_token_timeout)` appears in console.
  • Persistence: load app with stale localStorage `selectedModel: 'gemma-regolo'` — chatStore migration v6 upgrades to `gemma-litellm` automatically.
  • Regression: existing chat routes (`chatGraphContractRouter`, `chatGraphController`, `searchGraphController`) still work — they use the backward-compatible `streamAndAccumulate` wrapper; behavior unchanged plus the new 20s deadline as bonus safety.
  • `pnpm --filter @gruenerator/api typecheck`
  • `pnpm --filter @gruenerator/chat typecheck`
  • `pnpm --filter @gruenerator/api test` (specifically `responseStreaming` if any tests exist)
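The persistence step in the test plan (chatStore migration v6 upgrading stale `gemma-regolo` selections) could look roughly like this. The state shape and version field are assumptions for illustration, not the real chatStore:

```typescript
// Hypothetical sketch of the v6 chatStore migration.
interface ChatState {
  version: number;
  selectedModel: string;
}

function migrateChatState(state: ChatState): ChatState {
  if (state.version >= 6) return state; // already migrated
  return {
    ...state,
    version: 6,
    // Old Regolo-backed Gemma ID → new LiteLLM-backed ID.
    selectedModel:
      state.selectedModel === 'gemma-regolo' ? 'gemma-litellm' : state.selectedModel,
  };
}
```

The key property to verify manually is that non-Gemma selections (e.g. a Qwen ID) pass through untouched.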

Notes

Regolo's gemma4-31b endpoint hangs upstream — every notebook chat with
the default Gemma model just spun forever because the AI SDK has no
built-in first-token deadline. This change:

- Adds litellmFetchWithThinkingDisabled (sibling to
  regoloFetchWithThinkingDisabled) injecting Ollama's `think: false` so
  LiteLLM-served gemma streams content instead of burning its entire
  token budget on `reasoning`.
- Re-routes the user-facing "Gemma 4" model from Regolo → LiteLLM. Old
  `gemma-regolo` ID is aliased server-side and migrated client-side
  (chatStore v6) to the new `gemma-litellm` ID.
- Adds Qwen 3.6 27B as a selectable model (already in the existing
  Regolo reasoning-stream allowlist, so no extra wiring).
- Introduces a 20s first-token deadline + single-step cross-provider
  fallback (gemma-litellm ↔ gpt-oss-regolo) in responseStreamingService.
  Qwen entries intentionally have no `fallback` field — the
  Chinese-only-when-selected firewall (informed-consent boundary,
  documented in ModelConfig).
- Fixes pre-existing bug: getModel('litellm', modelId) ignored the
  modelId arg and always used LITELLM_DEFAULT_MODEL.

Fallback is silent end-user-side: server emits a `fallback` SSE event,
both runtime adapters log it to the browser console, no UI banner.

Implementation notes:
- streamAndAccumulate / streamAndAccumulateWithReasoning now have a
  shared `wrapWithCompatCatch` factory and an `*OrThrow` internal layer
  used by streamWithFallback. Existing chat router callers see the same
  null-on-failure shape, plus the new deadline + empty-completion
  safety nets for free.
- Single shared deadline across initial-probe iterations (was
  accidentally giving 40s grace via per-call setTimeout).
- Reasoning streamer split into Phase-1 (race vs deadline until first
  text) + Phase-2 (drain without race) — eliminates wasted Promise.race
  microtask hops on every reasoning chunk after first content.
- Uses native AbortSignal.any() (Node 20.3+) instead of a hand-rolled
  composeAbortSignals helper.
