| 1 | +# google-gemini/gemini-cli PR #26225 — fix(core): fail fast in MessageBus.request() on publish failure |
| 2 | + |
| 3 | +- PR: https://github.com/google-gemini/gemini-cli/pull/26225 |
| 4 | +- Head SHA: `ab4c6461dbba40c64dbac3607f70d5823432ecd1` |
| 5 | +- Files touched: `packages/core/src/agents/local-invocation.ts` (+7/-5), `packages/core/src/confirmation-bus/message-bus.test.ts` (+28/-19), `packages/core/src/confirmation-bus/message-bus.ts` (+6/-2), `packages/core/src/scheduler/scheduler.ts` (+11/-6), `packages/core/src/scheduler/state-manager.ts` (+7/-5), `packages/core/src/tools/tools.ts` (+10/-10) (+69/-47, 6 files). Fixes #22588. |
| 6 | + |
| 7 | +## Specific citations |
| 8 | + |
| 9 | +- The bug class: `MessageBus.publish()` at `message-bus.ts:140-147` had a `try/catch` that emitted the `'error'` event but **swallowed the error** (no rethrow). `MessageBus.request()` at `:240` then called `this.publish(...)` as a floating promise (under an `// eslint-disable-next-line ...` comment) and waited up to 60s on the response timer. Net effect: any publish failure (e.g. policy check rejection, schema validation failure) caused `request()` to silently hang for 60s instead of rejecting immediately. The fix is two changes: `throw error;` after `this.emit('error', error)` at `message-bus.ts:147`, and `.catch((error) => { cleanup(); reject(error); })` on the publish call inside `request()` at `:243-246`. |
| 10 | +- The `cleanup()` invocation at `:244` is the critical detail: without it, the response-handler subscription and the timeout would leak even after the early reject. `cleanup` is defined earlier in `request()` (outside this diff slice); it removes the response listener and clears the timeout. Verified by the `request-flow` tests in the message-bus test file. |
| 11 | +- Five "fire-and-forget publish" call sites are hardened with `.catch(() => {})` to suppress unhandled-rejection process crashes now that `publish()` rethrows: `local-invocation.ts:93-97` (`SUBAGENT_ACTIVITY`), `state-manager.ts:247-252` (`TOOL_CALLS_UPDATE`), `tools.ts:245-251` (`UPDATE_POLICY`), and the test-only callers at `message-bus.test.ts:296-302, :340-346`. The scheduler at `scheduler.ts:170-178` uses an explicit `try { await ... } catch (error) { debugLogger.error('Failed to publish confirmation response', error); }` — better, because that path is meaningful for diagnosing why a confirmation response didn't reach the requester. |
| 12 | +- The `tools.ts:366-369` change is the subtlest: the prior `try { void this.messageBus.publish(request); } catch { cleanup(); resolve('allow'); }` could only catch a synchronous throw, never a *promise rejection*, because the promise was discarded with `void` rather than awaited. The fix is `this.messageBus.publish(request).catch(() => { cleanup(); resolve('allow'); });`, which handles the rejection via `.catch` and falls back to `resolve('allow')` (the "fail-open" policy of the previous code is preserved on the unhappy path). |
| 13 | +- Test changes at `message-bus.test.ts:31-38, :49-58, :253-262` flip the assertion shape from `await messageBus.publish({...invalid...})` (silently absorbed) to `await expect(messageBus.publish({...})).rejects.toThrow('Invalid message structure')` and `.rejects.toThrow('Policy check failed')`. Three sites updated, all consistent. The `errorHandler` mock is still asserted to be called — both the emit AND the throw are now expected, locking the new contract that the `'error'` event is preserved AND the promise rejects. |
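The rethrow-plus-`.catch` mechanism described in the citations above can be sketched in isolation. This is a minimal illustration, not the project's code: `MiniBus`, its listener map, and the `emittedErrors` array (standing in for the `'error'` event) are hypothetical names, and the real `MessageBus` has correlation IDs and policy checks that are omitted here.

```typescript
// Simplified, hypothetical stand-in for the real MessageBus (illustrative names).
type Message = { type: string; payload?: unknown };
type Listener = (message: Message) => void;

class MiniBus {
  private readonly listeners = new Map<string, Set<Listener>>();
  // Stands in for the 'error' event: every publish failure is recorded here.
  readonly emittedErrors: unknown[] = [];

  subscribe(type: string, listener: Listener): void {
    const set = this.listeners.get(type) ?? new Set<Listener>();
    set.add(listener);
    this.listeners.set(type, set);
  }

  unsubscribe(type: string, listener: Listener): void {
    this.listeners.get(type)?.delete(listener);
  }

  listenerCount(type: string): number {
    return this.listeners.get(type)?.size ?? 0;
  }

  async publish(message: Message): Promise<void> {
    try {
      if (!message.type) throw new Error('Invalid message structure');
      this.listeners.get(message.type)?.forEach((listener) => listener(message));
    } catch (error) {
      this.emittedErrors.push(error); // observers still see the failure
      throw error; // and the returned promise now rejects
    }
  }

  request(message: Message, responseType: string, timeoutMs = 60000): Promise<Message> {
    return new Promise<Message>((resolve, reject) => {
      const cleanup = (): void => {
        clearTimeout(timer);
        this.unsubscribe(responseType, onResponse);
      };
      const onResponse: Listener = (response) => {
        cleanup();
        resolve(response);
      };
      const timer = setTimeout(() => {
        cleanup();
        reject(new Error('Request timed out'));
      }, timeoutMs);

      this.subscribe(responseType, onResponse);
      // Fail fast: a rejected publish tears down the listener and timer
      // and rejects the request instead of waiting for the timeout.
      this.publish(message).catch((error: unknown) => {
        cleanup();
        reject(error);
      });
    });
  }
}
```

In this sketch, `new MiniBus().request({ type: '' }, 'response')` rejects with `Invalid message structure` without waiting on the timer, and `listenerCount('response')` is back to zero afterwards because `cleanup()` ran before `reject`.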
| 14 | + |
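The `try { void ... } catch` pitfall cited for `tools.ts` is easy to reproduce outside the codebase. The sketch below is an assumption-laden illustration: `failingPublish`, `observeHandlers`, and `confirmWithFallback` are hypothetical names, and the response listener and timeout of the real confirmation path are left out.

```typescript
type Decision = 'allow' | 'deny';

// Hypothetical publish() that fails asynchronously, like a rejected policy check.
function failingPublish(): Promise<void> {
  return Promise.reject(new Error('Policy check failed'));
}

// Reports which handler actually observes the rejection.
async function observeHandlers(): Promise<{ syncCatchRan: boolean; asyncCatchRan: boolean }> {
  let syncCatchRan = false;
  let asyncCatchRan = false;

  const pending = failingPublish();
  try {
    void pending; // evaluates the promise and discards it; nothing is thrown here
  } catch {
    syncCatchRan = true; // never runs for a promise rejection
  }
  // Only a handler attached to the promise itself sees the failure.
  await pending.catch(() => {
    asyncCatchRan = true;
  });
  return { syncCatchRan, asyncCatchRan };
}

// Post-fix shape: the fallback is attached to the promise, not to a try block.
function confirmWithFallback(publish: () => Promise<void>): Promise<Decision> {
  return new Promise<Decision>((resolve) => {
    publish().catch(() => resolve('allow')); // fail-open, as in the cited code
  });
}
```

`observeHandlers()` resolves with `syncCatchRan` false and `asyncCatchRan` true, which is the gap the old code had; `confirmWithFallback(failingPublish)` resolves to `'allow'`.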
| 15 | +## Verdict: merge-after-nits |
| 16 | + |
| 17 | +## Concerns / nits |
| 18 | + |
| 19 | +1. **Three `.catch(() => {})` no-op handlers** at `local-invocation.ts:97`, `state-manager.ts:252`, `tools.ts:251` silently drop publish errors that the new contract explicitly emits. This is the right runtime behavior (these are fire-and-forget activity/policy-update broadcasts), but a one-line `debugLogger.error('Failed to publish X', err)` (matching the scheduler's pattern at `scheduler.ts:177`) would preserve diagnostic visibility without changing the swallow-and-continue semantics. Three nearly-identical no-op catches in the same PR is a smell that the diagnostic pattern should be uniform. |
| 20 | +2. **The `tools.ts:366-369` fail-open `resolve('allow')`** on publish failure is surprising for a confirmation-bus path. If the bus is down, defaulting to "allow" is a fail-open policy decision that should at minimum log a warning. The pre-PR code had the same fail-open behavior (in the synchronous `catch` branch), so this PR isn't introducing the policy — it's just making the async case match — but worth a one-line comment at the call site explaining *why* fail-open is correct for this path (probably: "if the bus isn't running, there's no UI to gate on, so the request must proceed"). Without the comment a future security review may flag this as a regression. |
| 21 | +3. **`scheduler.ts:170-178` has a `try/await/catch`** while sibling sites use `.publish(...).catch(...)`. Both are correct, but mixing styles in the same PR for the same operation makes the change harder to read. Either pattern is fine; the PR should pick one and stick to it. |
| 22 | +4. **`message-bus.test.ts:255` change from `.resolves.not.toThrow()` to `.rejects.toThrow('Policy check failed')`** is a behavior change masquerading as a test fix. Previously the test asserted that a policy-rejected publish *did not throw*; now it asserts it *does* throw. This is the documented intent of the PR (fail-fast), but the previous test was locking the *old* swallow-and-emit behavior, which means callers outside the diff slice may have relied on policy-rejected publishes returning normally. Only the 4-5 sites visible here are hardened; a grep across the rest of the codebase for `messageBus.publish(` calls with no `.catch` or surrounding `try/await/catch` would identify any that now throw or leave a rejection unhandled. |
| 23 | +5. **No regression test added for `request()` itself failing fast on publish failure.** The test changes assert that `publish()` rejects, but the headline bug is "`request()` hangs for 60s when publish fails". A test that does `messageBus.publish = vi.fn(() => Promise.reject(new Error('boom'))); await expect(messageBus.request(...)).rejects.toThrow('boom')`, run under `vi.useFakeTimers()` so it can assert the rejection arrives without advancing the 60s timer, would lock the headline fix. The current test surface locks the underlying primitive (`publish` rethrows) but not the user-visible property (`request` doesn't hang). |
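One way to address nits 1 and 3 together is a small shared helper, so every fire-and-forget site swallows the rejection but still logs it. The following is only a sketch: `publishBestEffort`, `Publisher`, and `LogFn` are hypothetical names, and the logger is passed in as a plain function rather than assuming the exact signature of `debugLogger.error`.

```typescript
// Hypothetical minimal shape of anything with an async publish().
interface Publisher<M> {
  publish(message: M): Promise<void>;
}

// Injected logger; a real call site might pass a wrapper around its debug logger.
type LogFn = (message: string, error: unknown) => void;

// Publishes without letting a failure propagate, but records why it failed.
// The returned promise always resolves, so callers can ignore it safely.
function publishBestEffort<M>(
  bus: Publisher<M>,
  message: M,
  label: string,
  logError: LogFn,
): Promise<void> {
  return bus.publish(message).catch((error: unknown) => {
    logError(`Failed to publish ${label}`, error);
  });
}
```

A call site would then read something like `void publishBestEffort(bus, update, 'TOOL_CALLS_UPDATE', log)`, keeping the swallow-and-continue semantics while making the three broadcast sites and the scheduler report failures the same way.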