[Bug]: `runtime_fallback` shows "Model Fallback" toast but primary session keeps retrying exhausted OpenAI model (OpenCode 1.17.5)

### Prerequisites

- [x] I will write this issue in English (see our [Language Policy](https://github.com/code-yeongyu/oh-my-openagent/blob/dev/CONTRIBUTING.md#language-policy))
- [x] I have searched existing issues to avoid duplicates
- [x] I am using the latest version of oh-my-openagent
- [x] I have read the [documentation](https://github.com/code-yeongyu/oh-my-openagent#readme) or asked an AI coding agent with this project's GitHub URL loaded and couldn't find the answer

### Bug Description

## Summary

With `runtime_fallback.enabled: true`, the TUI displays **"Model Fallback — Switching to …"** when the primary model hits quota / usage-limit errors, but OpenCode logs show the **primary ultraworker session continues streaming on the same exhausted provider/model** (`openai/gpt-5.4`). The configured fallback (`litellm/openai.eu.gpt-5.5`) is never used for that session.

This is **not** a model-name mismatch: fallback models are registered in OpenCode, listed by `opencode models`, and match the gateway catalog. Subagent fallback **does** work via a different code path (background-agent respawn on LiteLLM).

---

## Environment

| Component | Version |
|-----------|---------|
| OpenCode | 1.17.5 |
| oh-my-openagent | 4.11.1 |
| OS | macOS (darwin 25.4.0) |
| Agent | `Sisyphus - ultraworker` (config key: `sisyphus`) |
| Primary model | `openai/gpt-5.4` |
| First fallback | `litellm/openai.eu.gpt-5.5` |

### Relevant `oh-my-openagent.json` snippet

```json
{
  "runtime_fallback": {
    "enabled": true,
    "max_fallback_attempts": 5,
    "cooldown_seconds": 60,
    "notify_on_fallback": true
  },
  "agents": {
    "sisyphus": {
      "model": "openai/gpt-5.4",
      "variant": "high",
      "fallback_models": [
        { "model": "litellm/openai.eu.gpt-5.5", "variant": "high" },
        { "model": "litellm/vertex_ai.anthropic.claude-opus-4-8", "variant": "max" },
        { "model": "opencode/deepseek-v4-flash-free" },
        { "model": "lmstudio/qwen/qwen3-coder-30b" }
      ]
    }
  }
}
```

Fallback LiteLLM models are also registered under `opencode.json` → `provider.litellm.models` and appear in `opencode models`.


## Root cause analysis (from OmO 4.11.1 bundled source)

Investigated in `node_modules/oh-my-openagent/dist/index.js` (maps to `packages/omo-opencode/src/hooks/runtime-fallback/*`).

### 1. Toast is shown **before** auto-retry dispatch succeeds

`dispatchFallbackRetry()` calls `prepareFallback()`, shows the toast, **then** calls `autoRetryWithFallback()`:

```javascript
// fallback-retry-dispatcher.ts (conceptual — from dist/index.js ~103196)
async function dispatchFallbackRetry(deps, helpers, options) {
  const result = prepareFallback(...);
  if (result.success && deps.config.notify_on_fallback) {
    await deps.ctx.client.tui.showToast({
      body: {
        title: "Model Fallback",
        message: `Switching to ${result.newModel?.split("/").pop()} for next request`,
        ...
      }
    });
  }
  if (result.success && result.newModel) {
    await helpers.autoRetryWithFallback(...);  // may fail or be skipped AFTER toast
  }
}
```

**Impact:** Users see "Switching to …" even when `autoRetryWithFallback` is gated, fails, or loses a race.

### 2. Auto-retry can be silently skipped by `promptAsync` gate

`createAutoRetryDispatcher()` dispatches via `dispatchInternalPrompt({ mode: "async", ... })` and only treats `status === "dispatched" || status === "queued"` as success (`isInternalPromptDispatchAccepted`). Otherwise it logs and returns without fallback:

- `Auto-retry skipped by promptAsync gate`
- `Retry already in flight, skipping`
- `Session active, queueing fallback dispatch` (may not result in fallback model stream)

OpenCode’s built-in post-error loop (`stream error` → `cancel session` → retry same model ~30–44s later) appears to win this race.

### 3. `chat.message` hook restores agent primary after cooldown

When `runtime_fallback` is enabled, `createChatMessageHandler2()` can **override user messages back to `originalModel`** (agent config primary) when:

- `currentModel !== originalModel`
- `pendingFallbackModel` is cleared
- `originalModel` is not in cooldown (60s default)

```javascript
// chat-message-handler.ts (conceptual — from dist/index.js ~103079)
if (state.currentModel !== state.originalModel && !state.pendingFallbackModel
    && !isModelInCooldown(state.originalModel, state, config.cooldown_seconds)) {
  // "Restoring preferred primary model"
  output.message.model = { providerID: "openai", modelID: "gpt-5.4" };
  return;
}
```

**Impact:** Even after a successful fallback, the next user message can revert the session to the exhausted OpenAI primary (matches log section D).

### 4. Two fallback systems — primary uses the fragile one

When `runtime_fallback.enabled === true`, legacy `model-fallback` event handlers are disabled (`shouldHandleModelFallback()` returns false). Primary ultraworker depends entirely on `runtime_fallback`. Subagents use **background-agent fallback** (`tryFallbackRetry` / respawn), which continues to work.

### 5. Quota errors are classified retryable, but integration is incomplete

`classifyRuntimeFallbackError()` correctly maps `"The usage limit has been reached"` → `quota_exceeded`, and `isRuntimeFallbackRetryableError()` returns `true` for that type. The classifier is fine; the **dispatch / race / restore** logic is not.

### 6. OmO `[runtime-fallback]` logs not visible in OpenCode log file

No `[runtime-fallback]` lines appear in `opencode.log`, making production diagnosis difficult. Consider routing plugin logs to the same sink or documenting where to find them.

---

## Suggested fixes (for maintainers)

### Fix 1 — Defer toast until dispatch is accepted (high priority)

**File:** `packages/omo-opencode/src/hooks/runtime-fallback/fallback-retry-dispatcher.ts`

Move toast **after** successful `autoRetryWithFallback`, or pass a callback:

```typescript
export async function dispatchFallbackRetry(deps, helpers, options) {
  const result = prepareFallback(options.sessionID, options.state, options.fallbackModels, deps.config);
  if (!result.success || !result.newModel) {
    log(`[runtime-fallback] Fallback preparation failed`, { ... });
    return { ok: false, reason: result.error };
  }

  const dispatchResult = await helpers.autoRetryWithFallback(
    options.sessionID,
    result.newModel,
    options.resolvedAgent,
    options.source,
  );

  if (dispatchResult?.accepted && deps.config.notify_on_fallback) {
    await deps.ctx.client.tui.showToast({
      body: {
        title: "Model Fallback",
        message: `Switched to ${formatModel(result.newModel)}`,
        variant: "warning",
        duration: 5000,
      },
    }).catch(() => {});
  } else if (deps.config.notify_on_fallback) {
    await deps.ctx.client.tui.showToast({
      body: {
        title: "Model Fallback Failed",
        message: `Could not switch to ${formatModel(result.newModel)}. ${dispatchResult?.reason ?? "Retry blocked."}`,
        variant: "error",
        duration: 8000,
      },
    }).catch(() => {});
  }

  return { ok: dispatchResult?.accepted ?? false, newModel: result.newModel };
}
```

`autoRetryWithFallback` should return `{ accepted: boolean, reason?: string }` instead of `void`.
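
As a rough sketch of that return shape, the dispatch status could be mapped to an outcome object as below. Only `"dispatched"` and `"queued"` come from the bundled code; the other status names, the `DispatchOutcome` type, and `toDispatchOutcome` are hypothetical names used for illustration.

```typescript
// Hypothetical result type for autoRetryWithFallback (shape proposed in this issue).
type DispatchOutcome = { accepted: boolean; reason?: string };

// "dispatched" and "queued" are the statuses the bundled code treats as accepted.
// "gated", "in_flight" and "failed" are illustrative assumptions, not real status names.
type PromptDispatchStatus = "dispatched" | "queued" | "gated" | "in_flight" | "failed";

// Translate a dispatch status into an outcome the caller can act on
// (for example to pick the success or failure toast).
function toDispatchOutcome(status: PromptDispatchStatus): DispatchOutcome {
  switch (status) {
    case "dispatched":
    case "queued":
      return { accepted: true };
    case "gated":
      return { accepted: false, reason: "Auto-retry skipped by promptAsync gate" };
    case "in_flight":
      return { accepted: false, reason: "Retry already in flight" };
    case "failed":
      return { accepted: false, reason: "Dispatch failed" };
  }
}
```

With a result like this, the caller no longer has to infer success from the absence of an exception.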

### Fix 2 — Abort OpenCode same-model retry before fallback dispatch (high priority)

**Files:**
- `auto-retry-dispatch.ts`
- `message-update-handler.ts`

On `quota_exceeded` / usage-limit errors, **always** call `abortSessionRequest(sessionID, "message.updated.quota-fallback")` and add the session to `internallyAbortedSessions` **before** `dispatchFallbackRetry`, similar to the existing `session.status.retry-signal` path.

Today, abort with internal marker is only guaranteed for:

```typescript
source === "session.status.retry-signal"
  || source === "message.updated.retry-signal"
  || source === "session.timeout"
```

Extend to quota / usage-limit classification so `session.error` from the abort does not call `resetRetryState()`.
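
One possible shape for the extended check is sketched below. The three source strings and the `quota_exceeded` type are taken from this report; `shouldAbortBeforeFallback` and `ABORTING_SOURCES` are hypothetical names, and the real condition may live inline rather than in a helper.

```typescript
// Sources that, per the snippet above, already abort with the internal marker.
const ABORTING_SOURCES = new Set<string>([
  "session.status.retry-signal",
  "message.updated.retry-signal",
  "session.timeout",
]);

// Hypothetical helper: abort the in-flight request before dispatching the
// fallback when the source is one of the known retry signals OR the error
// was classified as quota_exceeded.
function shouldAbortBeforeFallback(source: string, errorType: string | undefined): boolean {
  return ABORTING_SOURCES.has(source) || errorType === "quota_exceeded";
}
```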

### Fix 3 — Do not restore primary while provider is quota-blocked (medium priority)

**File:** `chat-message-handler.ts`

Skip "Restoring preferred primary model" when:

- `originalModel` provider recently failed with `quota_exceeded`, or
- `state.failedModels.has(originalModel)` and still in cooldown, or
- `state.currentModel` is a successful fallback (fallbackIndex >= 0)

```typescript
function shouldRestorePrimary(state: FallbackState, config: RuntimeFallbackConfig): boolean {
  if (state.pendingFallbackModel) return false;
  if (state.fallbackIndex >= 0 && state.currentModel !== state.originalModel) {
    return false; // stay on active fallback until user explicitly changes model
  }
  if (isModelInCooldown(state.originalModel, state, config.cooldown_seconds)) {
    return false;
  }
  return state.currentModel !== state.originalModel;
}
```

Optionally add config: `runtime_fallback.restore_primary_after_cooldown: false` (default `false` when fallbacks configured).
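
To illustrate why the cooldown alone is not enough, here is a minimal sketch of a time-based cooldown check. The `CooldownState` shape (failure timestamps in a `Map`) and the name `isInCooldown` are assumptions for illustration; the real `isModelInCooldown` may be implemented differently.

```typescript
// Assumed shape: failure timestamps (ms since epoch) keyed by "provider/model".
interface CooldownState {
  failedModels: Map<string, number>;
}

// Returns true while the model failed less than cooldownSeconds ago.
function isInCooldown(
  model: string,
  state: CooldownState,
  cooldownSeconds: number,
  nowMs: number,
): boolean {
  const failedAt = state.failedModels.get(model);
  if (failedAt === undefined) return false; // never failed, so no cooldown
  return nowMs - failedAt < cooldownSeconds * 1000;
}
```

Under this sketch, with `cooldown_seconds: 60`, a primary that failed 61 seconds ago counts as available again, even though a usage-limit error like the one in the logs can persist far longer than that.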

### Fix 4 — Persist fallback model on session record (medium priority)

After accepted fallback dispatch, call OpenCode session update so the core loop picks up the new model:

```typescript
await ctx.client.session.update({
  path: { id: sessionID },
  body: {
    model: {
      providerID: parsed.providerID,
      modelID: parsed.modelID,
    },
  },
  query: { directory: ctx.directory },
});
```

This reduces reliance on winning the `promptAsync` race against OpenCode’s internal retry.
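
The `parsed` value in the snippet above is not defined in this report. A minimal sketch of how it could be derived is shown below, assuming the provider is everything before the first `/` (this matters for entries such as `lmstudio/qwen/qwen3-coder-30b`, whose model ID itself contains a slash). `ModelRef` and `parseModelRef` are hypothetical names.

```typescript
interface ModelRef {
  providerID: string;
  modelID: string;
}

// Split "provider/model" on the FIRST slash only, so model IDs that
// contain further slashes stay intact.
function parseModelRef(ref: string): ModelRef {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    throw new Error(`Expected "provider/model", got "${ref}"`);
  }
  return { providerID: ref.slice(0, slash), modelID: ref.slice(slash + 1) };
}
```

For example, `parseModelRef("litellm/openai.eu.gpt-5.5")` yields provider `litellm` and model `openai.eu.gpt-5.5`.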

### Fix 5 — Surface plugin logs in OpenCode log (low priority)

Route `[runtime-fallback]` log lines to the same structured logger OpenCode uses, or document `OMO_LOG_LEVEL=debug` and output path. Would have saved hours of diagnosis.

### Fix 6 — Integration test (recommended)

Add a test that simulates:

1. Primary stream error with message `The usage limit has been reached`
2. `message.updated` with assistant `error`
3. Assert `promptAsync` body contains fallback model
4. Assert no second primary stream on original provider without fallback dispatch
5. Assert toast fires only after accepted dispatch
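
A rough, framework-free sketch of assertions 3 and 5 follows. It does not use the real plugin: `dispatchThenToast` is a simplified, synchronous stand-in for the Fix 1 ordering, and every name in it (`Outcome`, `HarnessEvent`, `runScenario`) is hypothetical. A real test would drive the actual hooks with mocked `ctx.client` calls.

```typescript
// Stand-in types; the real plugin types are not reproduced here.
type Outcome = { accepted: boolean; reason?: string };
type HarnessEvent =
  | { kind: "prompt"; model: string }
  | { kind: "toast"; title: string };

// Simplified stand-in for the Fix 1 ordering:
// dispatch first, then choose the toast from the dispatch outcome.
function dispatchThenToast(
  fallbackModel: string,
  autoRetry: (model: string) => Outcome,
  events: HarnessEvent[],
): Outcome {
  const outcome = autoRetry(fallbackModel);
  events.push({
    kind: "toast",
    title: outcome.accepted ? "Model Fallback" : "Model Fallback Failed",
  });
  return outcome;
}

// Runs one scenario and returns the recorded events in order.
function runScenario(gateAccepts: boolean): HarnessEvent[] {
  const events: HarnessEvent[] = [];
  const fakeAutoRetry = (model: string): Outcome => {
    if (!gateAccepts) {
      return { accepted: false, reason: "Auto-retry skipped by promptAsync gate" };
    }
    // Record the model that would go into the promptAsync body.
    events.push({ kind: "prompt", model });
    return { accepted: true };
  };
  dispatchThenToast("litellm/openai.eu.gpt-5.5", fakeAutoRetry, events);
  return events;
}
```

In the accepted scenario the recorded order is prompt, then success toast; in the gated scenario only the failure toast is recorded and no prompt is sent.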

---

## Workaround (config only — not a fix)

Set the agent primary to a LiteLLM model so that primary requests go through the corporate gateway and no longer depend on the direct OpenAI quota:

```json
"sisyphus": {
  "model": "litellm/openai.eu.gpt-5.5",
  "variant": "high",
  "fallback_models": [
    { "model": "litellm/vertex_ai.anthropic.claude-opus-4-8", "variant": "max" },
    ...
  ]
}
```

This avoids the broken OpenAI-primary → cross-provider fallback path but does not fix the underlying bug.




### Steps to Reproduce


1. Configure `sisyphus` primary on direct OpenAI (`openai/gpt-5.4`) with LiteLLM fallbacks as above.
2. Enable `runtime_fallback` with `notify_on_fallback: true`.
3. Start a long-lived **Sisyphus - ultraworker** session on a project.
4. Exhaust the OpenAI subscription quota (or trigger repeated `The usage limit has been reached` errors).
5. Observe the TUI toast: **"Model Fallback — Switching to openai.eu.gpt-5.5(high) for next request"** (or similar).
6. Inspect OpenCode logs (`~/.local/share/opencode/log/opencode.log`) for `stream providerID=…` lines on the **same session ID**.
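
To make the log inspection in the last step less manual, the `key=value` log lines can be parsed and grouped per session. This is a minimal sketch that assumes the line format seen in the log excerpts in this report (space-separated `key=value` pairs, values optionally double-quoted); `parseLogLine` and `streamedModels` are hypothetical helper names, and the log text is passed in as a string so no file API is assumed.

```typescript
// Parse one key=value log line into a record. Values may be bare or double-quoted.
function parseLogLine(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pattern = /([\w.]+)=(?:"([^"]*)"|(\S*))/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    // Group 2 is the quoted value, group 3 the bare value.
    fields[match[1]] = match[2] ?? match[3];
  }
  return fields;
}

// Collect the distinct "providerID/modelID" pairs that a session streamed on.
function streamedModels(log: string, sessionId: string): string[] {
  const seen = new Set<string>();
  for (const line of log.split("\n")) {
    const f = parseLogLine(line);
    if (f["message"] === "stream" && f["session.id"] === sessionId) {
      seen.add(`${f["providerID"]}/${f["modelID"]}`);
    }
  }
  return Array.from(seen);
}
```

If the result for the affected session contains only `openai/gpt-5.4` after the toast was shown, the fallback was never actually used.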

---

## Checklist for repro confirmation

- [ ] `runtime_fallback.enabled: true`
- [ ] Primary on direct OpenAI (`openai/*`), fallback on different provider (`litellm/*`)
- [ ] OpenAI quota exhausted (`The usage limit has been reached`)
- [ ] Long-lived session (not a fresh session)
- [ ] Compare `stream providerID=` lines vs TUI toast for same `session.id`


### Expected Behavior


After a quota / usage-limit error on the primary model:

1. OmO selects the next fallback from `fallback_models`.
2. OmO dispatches a retry (`promptAsync`) with the fallback model in the request body.
3. OpenCode logs show subsequent primary streams on the fallback provider, e.g. `providerID=litellm modelID=openai.eu.gpt-5.5`.
4. Toast is shown **only after** the fallback dispatch is accepted (or clearly indicates failure).



### Actual Behavior


1. TUI toast appears announcing fallback to `litellm/openai.eu.gpt-5.5`.
2. OpenCode logs show **repeated** streams on the **same exhausted model**:
   - `providerID=openai modelID=gpt-5.4`
   - `ERROR … The usage limit has been reached`
   - `cancel session`
   - ~30–44s later → same `openai/gpt-5.4` stream again
3. **`litellm/openai.eu.gpt-5.5` never appears** in logs for the affected session after quota hit.
4. Session can remain stuck retrying OpenAI for hours until the user manually picks a LiteLLM model in the TUI.



### Doctor Output

```shell
~ bunx oh-my-openagent doctor --verbose

 oMoMoMoMo Doctor

System Information
────────────────────────────────────────
  ✓ opencode    1.17.5
  ✓ oh-my-openagent 4.11.1
  ✓ loaded      4.11.1
  ✓ bun         1.3.14
  ✓ path        /opt/homebrew/bin/opencode

Configuration
────────────────────────────────────────
  ✓ /Users/sfarida002/.config/opencode/opencode.jsonc (valid)

Tools
────────────────────────────────────────
  ✓ LSP         1 server
                    lsp-tools-mcp (*)
  ✓ ast-grep CLI installed
  ✓ comment-checker installed
  ✓ gh CLI installed (samer-farida_pwcit)

MCPs
────────────────────────────────────────
  ✓ websearch
  ✓ context7
  ✓ grep_app
  ✓ lsp

System
────────────────────────────────────────
OpenCode: 1.17.5
Plugin expected: 4.11.1
Plugin loaded: 4.11.1
Bun: 1.3.14

Configuration
────────────────────────────────────────
Path: /Users/sfarida002/.config/opencode/oh-my-openagent.json

TUI Plugin
────────────────────────────────────────
opencode.json: /Users/sfarida002/.config/opencode/opencode.jsonc
tui.json: /Users/sfarida002/.config/opencode/tui.json

Tools
────────────────────────────────────────
AST-Grep CLI: yes
Comment checker: yes
LSP: 1 server(s)
GH CLI: installed (authenticated)
MCP: builtin=4, user=0

Models
────────────────────────────────────────
═══ Available Models (from cache) ═══

  Providers in cache: 146
  Sample: requesty, qiniu-ai, alibaba-cn, regolo-ai, stackit, vercel...
  Total models: 5278
  Cache: /Users/sfarida002/.cache/opencode/models.json
  ℹ Runtime: only connected providers used
  Refresh: opencode models --refresh

═══ Configured Models ═══

Agents:
  ● sisyphus: openai/gpt-5.4 (high) [capabilities: snapshot-backed]
  ● hephaestus: openai/gpt-5.4-mini (medium) [capabilities: snapshot-backed]
  ● oracle: openai/gpt-5.4-mini (high) [capabilities: snapshot-backed]
  ● librarian: openai/gpt-5.4-mini-fast [capabilities: snapshot-backed]
  ● explore: openai/gpt-5.4-mini-fast [capabilities: snapshot-backed]
  ● multimodal-looker: openai/gpt-5.4-mini (medium) [capabilities: snapshot-backed]
  ● prometheus: openai/gpt-5.4 (high) [capabilities: snapshot-backed]
  ● metis: openai/gpt-5.4-mini [capabilities: snapshot-backed]
  ● momus: openai/gpt-5.4-mini (high) [capabilities: snapshot-backed]
  ● atlas: openai/gpt-5.4-mini [capabilities: snapshot-backed]
  ● sisyphus-junior: openai/gpt-5.4-mini [capabilities: snapshot-backed]

Categories:
  ● visual-engineering: openai/gpt-5.4-mini (high) [capabilities: snapshot-backed]
  ● ultrabrain: openai/gpt-5.4 (high) [capabilities: snapshot-backed]
  ● deep: openai/gpt-5.4-mini (medium) [capabilities: snapshot-backed]
  ● artistry: openai/gpt-5.4-mini (high) [capabilities: snapshot-backed]
  ● quick: openai/gpt-5.4-mini-fast [capabilities: snapshot-backed]
  ● unspecified-low: openai/gpt-5.4-mini [capabilities: snapshot-backed]
  ● unspecified-high: openai/gpt-5.4-mini (max) [capabilities: snapshot-backed]
  ● writing: openai/gpt-5.4-mini-fast [capabilities: snapshot-backed]

● = user override, ○ = provider fallback

Summary
────────────────────────────────────────
  5 passed, 0 failed, 0 warnings
  Total: 6 checks in 511ms
```

### Error Logs

## Log excerpts

Log file: `~/.local/share/opencode/log/opencode.log`

### A. Failing primary session — toast shown, fallback never used

Session: `ses_123c60247ffed3UX9zA1bZ0nNO`  
Agent: `Sisyphus - ultraworker`

```shell
timestamp=2026-06-19T13:48:54.839Z level=INFO run=2093686d message=stream providerID=openai modelID=gpt-5.4 session.id=ses_123c60247ffed3UX9zA1bZ0nNO small=false agent="Sisyphus - ultraworker" mode=primary
timestamp=2026-06-19T13:48:55.317Z level=ERROR run=2093686d message="stream error" providerID=openai modelID=gpt-5.4 session.id=ses_123c60247ffed3UX9zA1bZ0nNO small=false agent="Sisyphus - ultraworker" mode=primary error.error="AI_APICallError: The usage limit has been reached"
timestamp=2026-06-19T13:48:55.321Z level=INFO run=2093686d message=cancel session.id=ses_123c60247ffed3UX9zA1bZ0nNO

timestamp=2026-06-19T13:49:39.092Z level=INFO run=2093686d message=stream providerID=openai modelID=gpt-5.4 session.id=ses_123c60247ffed3UX9zA1bZ0nNO small=false agent="Sisyphus - ultraworker" mode=primary
timestamp=2026-06-19T13:49:39.633Z level=ERROR run=2093686d message="stream error" providerID=openai modelID=gpt-5.4 session.id=ses_123c60247ffed3UX9zA1bZ0nNO small=false agent="Sisyphus - ultraworker" mode=primary error.error="AI_APICallError: The usage limit has been reached"

timestamp=2026-06-19T13:54:55.343Z level=INFO run=2093686d message=stream providerID=openai modelID=gpt-5.4 session.id=ses_123c60247ffed3UX9zA1bZ0nNO small=false agent="Sisyphus - ultraworker" mode=primary
timestamp=2026-06-19T13:54:56.219Z level=ERROR run=2093686d message="stream error" providerID=openai modelID=gpt-5.4 session.id=ses_123c60247ffed3UX9zA1bZ0nNO small=false agent="Sisyphus - ultraworker" mode=primary error.error="AI_APICallError: The usage limit has been reached"
timestamp=2026-06-19T13:54:56.221Z level=INFO run=2093686d message=cancel session.id=ses_123c60247ffed3UX9zA1bZ0nNO
```

**Note:** A repo-wide search for `openai.eu.gpt-5.5` on this session ID returns **zero** matches. The session eventually moved to `litellm/openai.eu.gpt-5.3-codex` only after manual model selection (~15:05), not via the configured fallback chain.

### B. Session where fallback eventually worked (after new user message)

Session: `ses_12458f0d8ffelx0gCO7PDXmJgU`

Last OpenAI failure:

```shell
timestamp=2026-06-18T19:08:43.382Z level=INFO run=97a2d187 message=stream providerID=openai modelID=gpt-5.5 session.id=ses_12458f0d8ffelx0gCO7PDXmJgU small=false agent="Sisyphus - ultraworker" mode=primary
timestamp=2026-06-18T19:08:44.509Z level=ERROR run=97a2d187 message="stream error" providerID=openai modelID=gpt-5.5 session.id=ses_12458f0d8ffelx0gCO7PDXmJgU small=false agent="Sisyphus - ultraworker" mode=primary error.error="AI_APICallError: The usage limit has been reached"
```

First successful LiteLLM stream (~17 minutes later, new run id, new user message):

```shell
timestamp=2026-06-18T19:25:08.436Z level=INFO run=82341f16 message=stream providerID=litellm modelID=openai.eu.gpt-5.5 session.id=ses_12458f0d8ffelx0gCO7PDXmJgU small=false agent="Sisyphus - ultraworker" mode=primary
```

### C. Subagent fallback works (different code path)

Same parent session `ses_123c60247ffed3UX9zA1bZ0nNO`, subagents after OpenAI quota errors:

```shell
timestamp=2026-06-18T19:40:35.049Z level=INFO … model.id=vertex_ai.anthropic.claude-sonnet-4-6 model.providerID=litellm … parentID=ses_123c60247ffed3UX9zA1bZ0nNO
```

Subsequent subagent streams use `providerID=litellm modelID=vertex_ai.anthropic.claude-sonnet-4-6`. Background-agent fallback respawn works; **primary `runtime_fallback` does not.**

### D. Primary session forced back to OpenAI on user message

Session had been on `litellm/openai.eu.gpt-5.4-mini` for ~16 hours. On new user messages at 11:28, streams switched to agent-config primary:

```shell
timestamp=2026-06-19T11:28:20.163Z level=INFO run=2093686d message=stream providerID=openai modelID=gpt-5.4 session.id=ses_123c60247ffed3UX9zA1bZ0nNO small=false agent="Sisyphus - ultraworker" mode=primary
```

This aligns with `chat.message` hook behavior that restores agent-config primary after cooldown (see root cause).

### Configuration

```json

```

### Additional Context

- OpenCode version: 1.17.5
- oh-my-openagent version: 4.11.1 (`bunx oh-my-openagent doctor --verbose` — all checks passed)
- Model names verified: all `litellm/*` fallbacks exist in `opencode.json`, `opencode models`, and gateway catalog
- When `runtime_fallback` is disabled, legacy `model-fallback` is also disabled for events — there is no automatic cross-provider fallback for primary sessions in the current config shape


### Operating System

macOS

### OpenCode Version

1.17.5
