fix(core): never let shell exit results hang on the output drain (#25166)#27842
fix(core): never let shell exit results hang on the output drain (#25166)#27842MartinCajiao wants to merge 3 commits into
Conversation
The exit result of a PTY execution is gated on the output processing chain: finalize() - the only path that resolves the result - ran exclusively through Promise.race(processingChain, abortFired) with no rejection handling. A chunk that threw anywhere in the rendering pipeline poisoned the chain, the race rejected, and finalize() never ran. The tool call then stayed `executing` forever and the UI kept reporting the shell as awaiting input after the process had already exited (google-gemini#25166); the global unhandledRejection handler logs and continues, so this manifested as a silent hang rather than a crash. - Settle every output chunk even when it throws (try/catch around the chunk executor, try/finally around the terminal write callback), logging instead of poisoning the chain. - Treat a rejected chain as drained and run finalize() on both race outcomes. - Make finalize() idempotent and throw-proof: failures while rendering or serializing the final buffer degrade the captured output instead of skipping completeWithResult(). - Guard the deferred (debounced) render: it runs in a timer outside any caller try/catch, where a throw becomes an uncaught exception that kills the CLI.
Even with every chunk settling, the exit result still waits on the output chain draining through the headless terminal, and a write callback that is never invoked (xterm swallows callbacks on disposed or paused terminals; Windows ConPTY keeps flushing data after exit while the PTY is destroyed immediately) used to leave the execution unresolved forever - the visible symptom of google-gemini#25166. After exit, an idle watchdog now polls drain progress: every settled chunk resets the window, so a slow but advancing drain (large final bursts against a 300k-line scrollback) is never cut short, and only a genuinely stuck chain trips it. When it fires, the execution finalizes with the output buffered so far and logs a warning so field reports can confirm which vector was hit. The interval is unref d and cleared by finalize().
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where shell executions could hang indefinitely in an 'awaiting input' state after the process had actually exited. The root cause was an unguarded output-processing chain that would block the finalization of the execution if any chunk failed or if the terminal write callback was never invoked. The changes introduce a robust, fault-tolerant processing pipeline and an idle-based watchdog that guarantees the execution lifecycle completes, ensuring the UI accurately reflects the process status. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
📊 PR Size: size/L
|
There was a problem hiding this comment.
Code Review
This pull request introduces a post-exit output drain watchdog to resolve issues where shell executions could hang indefinitely due to stalled output processing or swallowed terminal write callbacks. It adds robust error handling around the rendering pipeline and chunk processing to ensure that failures do not block exit finalization. Additionally, new unit tests are added to verify exit finalization resilience and the watchdog behavior. The reviewer feedback suggests replacing Date.now() with performance.now() to provide a monotonic clock source, ensuring that system clock adjustments do not cause premature or delayed timeouts.
| let lastDrainActivityAt = Date.now(); | ||
| const markDrainActivity = () => { | ||
| lastDrainActivityAt = Date.now(); | ||
| }; |
There was a problem hiding this comment.
Using Date.now() for measuring elapsed time or timeouts can be unreliable if the system clock is adjusted (e.g., via NTP synchronization, VM migration, or manual changes). If the clock jumps forward, it can cause premature timeouts; if it jumps backward, it can delay the timeout.
To ensure robustness, prefer using performance.now(), which provides a monotonic clock that is guaranteed to only increase and is immune to system clock adjustments.
| let lastDrainActivityAt = Date.now(); | |
| const markDrainActivity = () => { | |
| lastDrainActivityAt = Date.now(); | |
| }; | |
| let lastDrainActivityAt = performance.now(); | |
| const markDrainActivity = () => { | |
| lastDrainActivityAt = performance.now(); | |
| }; |
| if (Date.now() - lastDrainActivityAt >= DRAIN_STALL_TIMEOUT_MS) { | ||
| res('drain-stalled'); | ||
| } |
There was a problem hiding this comment.
Use performance.now() instead of Date.now() to ensure monotonic time measurement, preventing issues caused by system clock adjustments.
| if (Date.now() - lastDrainActivityAt >= DRAIN_STALL_TIMEOUT_MS) { | |
| res('drain-stalled'); | |
| } | |
| if (performance.now() - lastDrainActivityAt >= DRAIN_STALL_TIMEOUT_MS) { | |
| res('drain-stalled'); | |
| } |
Addresses review feedback: Date.now() is wall-clock time, so an NTP adjustment or VM migration could fire the stall watchdog prematurely (clock jumps forward) or delay it (clock jumps backward). performance.now() is monotonic and immune to clock adjustments. The wall-clock Date.now() uses for history timestamps are untouched.
|
Both suggestions addressed in 5a0083b: the drain watchdog now uses the monotonic clock (performance.now()) for both the activity marker and the stall check, so NTP/wall-clock adjustments can neither fire it prematurely nor delay it. The Date.now() uses for history timestamps are intentionally unchanged (those are genuine wall-clock values). Full suite still green: 71/71, typecheck clean. |
TLDR
Shell commands could complete while the CLI stayed stuck showing the shell as awaiting input (#25166). The exit result of a PTY execution is gated on the output-processing chain, and that gate had no error handling and no bound: a single chunk that threw anywhere in the rendering pipeline — or whose xterm write callback was never invoked — left the execution unresolved forever. The tool call then never left
executing, soactiveBackgroundExecutionIdstayed set and the UI kept reporting an active shell after the process had already exited.Failure chain
useShellInactivityStatusshows the awaiting/focus state whileactivePtyIdis set.activePtyIdderives from the executing tool call (useGeminiStream.ts): it clears only when the tool's result promise settles.finalize()insideptyProcess.onExit(shellExecutionService.ts), which ran exclusively through:processingChainrejects the race, and with no rejection handlerfinalize()never runs (the CLI's globalunhandledRejectionhandler logs and continues, so this manifests as a silent hang, not a crash);headlessTerminal.writecallback never fires (xterm swallows callbacks on disposed/paused terminals; Windows ConPTY keeps flushing data after exit while the PTY is destroyed immediately on exit) leaves the chain pending forever, and no timeout existed;finalize()itself could throw (render(true), final serialization), skippingcompleteWithResult.What changed
Commit 1 — pure correctness, no behavior change on the happy path:
finalize()on both race outcomes;finalize()is idempotent and throw-proof end to end: a failure while rendering or serializing the final buffer degrades the captured output instead of hanging the execution;Commit 2 — bounded drain (idle watchdog):
DRAIN_STALL_TIMEOUT_MSwindow (2s, polled at 250ms,unref'd and cleared on finalize), the execution finalizes with the output buffered so far and logs a warning. The watchdog is idle-based — every settled chunk resets the window — so a slow but advancing drain (large final bursts against a 300k-line scrollback) is never cut short; only a genuinely stuck chain trips it.The exit result now always reaches the scheduler; in the worst pathological case the trailing render is degraded, never the exit code, and the stall is logged for diagnosis.
Tests
shellExecutionService.test.ts(existing harness, real headless terminal):shellExecutionService.drain.test.ts(new, controllable terminal mock):mainand pass with this change.Known boundaries (intentionally out of scope)
node-ptynever emitsonExit, nothing in this file can recover — the watchdog lives inside the exit handler. That variant, if it exists in the wild, needs process-liveness tracking and separate evidence.child_processfallback path waits onclose(stdio drain), which a grandchild holding the pipes can delay indefinitely on Windows — same symptom family, different mechanism, and not the default path (enableInteractiveShelldefaults to true). Happy to follow up separately.Fixes #25166