
fix(core): never let shell exit results hang on the output drain (#25166) #27842

Open
MartinCajiao wants to merge 3 commits into google-gemini:main from MartinCajiao:fix/shell-exit-stuck-awaiting-input

Conversation

@MartinCajiao


TLDR

Shell commands could complete while the CLI stayed stuck showing the shell as awaiting input (#25166). The exit result of a PTY execution is gated on the output-processing chain, and that gate had no error handling and no time bound: a single chunk that threw anywhere in the rendering pipeline, or whose xterm write callback was never invoked, left the execution unresolved forever. The tool call then never left the `executing` state, so activeBackgroundExecutionId stayed set and the UI kept reporting an active shell after the process had already exited.

Failure chain

  1. useShellInactivityStatus shows the awaiting/focus state while activePtyId is set.
  2. For model-initiated commands, activePtyId derives from the executing tool call (useGeminiStream.ts): it clears only when the tool's result promise settles.
  3. That promise settles only in finalize() inside ptyProcess.onExit (shellExecutionService.ts), which ran exclusively through:
    Promise.race([processingChain.then(() => 'processed'), abortFired]).then(() => {
      finalize();
    });
  4. Three structural holes:
    • a rejected processingChain rejects the race, and with no rejection handler finalize() never runs (the CLI's global unhandledRejection handler logs and continues, so this manifests as a silent hang, not a crash);
    • a chunk whose headlessTerminal.write callback never fires (xterm swallows callbacks on disposed/paused terminals; Windows ConPTY keeps flushing data after exit while the PTY is destroyed immediately on exit) leaves the chain pending forever, and no timeout existed;
    • finalize() itself could throw (render(true), final serialization), skipping completeWithResult.
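
The first hole can be reproduced in isolation. The following is a minimal sketch, not the code in shellExecutionService.ts: `gateUnguarded`, `gateGuarded`, and `DrainOutcome` are hypothetical names, and only the `Promise.race` shape is taken from step 3. It shows why a rejected chain skips `finalize()` and one way to map the rejection to an outcome so both race results reach `finalize()`.

```typescript
// All names here are hypothetical stand-ins, not the real service code.
type DrainOutcome = 'processed' | 'aborted' | 'chain-rejected';

// The unguarded shape from step 3: if processingChain rejects, the race
// rejects and the .then() callback (and so finalize) never runs.
function gateUnguarded(
  processingChain: Promise<void>,
  abortFired: Promise<void>,
  finalize: () => void,
): Promise<void> {
  return Promise.race([
    processingChain.then(() => 'processed'),
    abortFired,
  ]).then(() => {
    finalize();
  });
}

// A guarded variant: a rejected chain is converted into an outcome value,
// so the race can only fulfill and finalize runs on every path.
function gateGuarded(
  processingChain: Promise<void>,
  abortFired: Promise<void>,
  finalize: () => void,
): Promise<DrainOutcome> {
  const drained = processingChain.then(
    (): DrainOutcome => 'processed',
    (): DrainOutcome => 'chain-rejected',
  );
  const aborted = abortFired.then((): DrainOutcome => 'aborted');
  return Promise.race([drained, aborted]).then((outcome) => {
    finalize();
    return outcome;
  });
}
```

With a chain that rejects and an abort signal that never fires, `gateUnguarded` rejects without calling `finalize`, while `gateGuarded` calls it and reports `'chain-rejected'`.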

What changed

Commit 1 — pure correctness, no behavior change on the happy path:

  • every output-chunk link settles even if it throws (try/catch around the chunk executor, try/finally around the write callback), with a debug log instead of a poisoned chain;
  • the drain race treats a rejected chain as drained and calls finalize() on both race outcomes;
  • finalize() is idempotent and throw-proof end to end: a failure while rendering or serializing the final buffer degrades the captured output instead of hanging the execution;
  • the deferred (debounced) render is guarded: it runs in a 68ms timer outside any caller's try/catch, so a throw there was an uncaught exception that kills the whole CLI — a sibling failure mode of the same unguarded rendering pipeline, surfaced by the regression tests for this change.
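
As an illustration of the first bullet, here is a hedged sketch of a chain link that always settles. `createOutputChain`, `WritableTerminal`, and `onChunkWritten` are assumed names for this example only; the `write(data, callback)` shape mirrors xterm's `Terminal.write`, but the actual wiring in this change may differ.

```typescript
// Hypothetical minimal terminal surface; only write(data, callback) is assumed.
interface WritableTerminal {
  write(data: string, callback?: () => void): void;
}

function createOutputChain(
  terminal: WritableTerminal,
  log: (message: string, error: unknown) => void,
  onChunkWritten?: () => void, // e.g. schedule a render; may throw
) {
  let chain: Promise<void> = Promise.resolve();

  const enqueue = (data: string): void => {
    chain = chain.then(
      () =>
        new Promise<void>((resolve) => {
          try {
            terminal.write(data, () => {
              try {
                onChunkWritten?.();
              } catch (error) {
                log('post-write step failed', error);
              } finally {
                // The link settles whether or not the post-write step threw.
                resolve();
              }
            });
          } catch (error) {
            // The chunk never reached the terminal: log and settle anyway.
            log('chunk failed before reaching the terminal', error);
            resolve();
          }
        }),
    );
  };

  return { enqueue, drained: (): Promise<void> => chain };
}
```

Because every link resolves instead of rejecting, a failing chunk is logged and the chunks queued after it are still processed. A write callback that is never invoked still leaves the chain pending, which is the case the watchdog in Commit 2 covers.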

Commit 2 — bounded drain (idle watchdog):

  • after exit, if no chunk settles for a full DRAIN_STALL_TIMEOUT_MS window (2s, polled at 250ms, unref'd and cleared on finalize), the execution finalizes with the output buffered so far and logs a warning. The watchdog is idle-based — every settled chunk resets the window — so a slow but advancing drain (large final bursts against a 300k-line scrollback) is never cut short; only a genuinely stuck chain trips it.
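
A rough sketch of an idle-based watchdog in this spirit follows. It is an assumption-laden illustration rather than the PR's implementation: `createDrainWatchdog` and its options are hypothetical, the clock is injectable purely so the behavior can be exercised deterministically, and the default clock is `performance.now()`.

```typescript
// Hypothetical helper; the PR's real constants are 2000 ms / 250 ms.
interface DrainWatchdogOptions {
  stallTimeoutMs: number;
  pollIntervalMs: number;
  now?: () => number; // injectable clock, defaults to performance.now()
}

function createDrainWatchdog(options: DrainWatchdogOptions) {
  const now = options.now ?? (() => performance.now());
  let lastActivityAt = now();
  let timer: ReturnType<typeof setInterval> | undefined;

  // Called whenever a chunk settles: resets the idle window.
  const markActivity = (): void => {
    lastActivityAt = now();
  };

  // Called by finalize (or by the watchdog itself) to clear the interval.
  const stop = (): void => {
    if (timer !== undefined) {
      clearInterval(timer);
      timer = undefined;
    }
  };

  const stalled = new Promise<'drain-stalled'>((resolve) => {
    const interval = setInterval(() => {
      if (now() - lastActivityAt >= options.stallTimeoutMs) {
        stop();
        resolve('drain-stalled');
      }
    }, options.pollIntervalMs);
    // Node.js-specific: do not keep the process alive just for the watchdog.
    interval.unref();
    timer = interval;
  });

  return { markActivity, stalled, stop };
}
```

The `stalled` promise would be raced against the drain and abort promises; since only idle time counts, a drain that takes longer than the window in total but keeps settling chunks never trips it.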

The exit result now always reaches the scheduler; in the worst pathological case the trailing render is degraded, never the exit code, and the stall is logged for diagnosis.
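
To make the "degraded render, never the exit code" guarantee concrete, here is a small sketch of an idempotent, throw-proof finalizer. Only `completeWithResult` is a name taken from this PR; `createFinalizer`, `ExitResult`, `renderFinalOutput`, and `fallbackOutput` are hypothetical and the real finalize() does more than this.

```typescript
// Hypothetical result shape for illustration.
interface ExitResult {
  exitCode: number | null;
  output: string;
}

function createFinalizer(
  exitCode: number | null,
  renderFinalOutput: () => string, // final render + serialization; may throw
  fallbackOutput: () => string, // output buffered so far; may also throw
  completeWithResult: (result: ExitResult) => void,
): () => void {
  let finalized = false;
  return () => {
    // Idempotent: a second call (e.g. watchdog and drain both firing) is a no-op.
    if (finalized) return;
    finalized = true;

    let output = '';
    try {
      output = renderFinalOutput();
    } catch {
      try {
        // Degrade to whatever was buffered instead of skipping the result.
        output = fallbackOutput();
      } catch {
        // Keep the empty string: the exit code must still be delivered.
      }
    }
    completeWithResult({ exitCode, output });
  };
}
```

However the rendering steps fail, `completeWithResult` is called exactly once with the exit code intact.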

Tests

  • shellExecutionService.test.ts (existing harness, real headless terminal):
    • rendering throws while processing output → result still resolves with the exit code and the buffer-extracted output;
    • a chunk throws before reaching the terminal → result still resolves, warning logged.
  • shellExecutionService.drain.test.ts (new, controllable terminal mock):
    • a write callback that is never invoked → watchdog finalizes after the stall window, warning logged;
    • a slow drain that keeps making progress past the stall window in total time → never cut short, no warning;
    • nothing left to drain → resolves immediately, no watchdog side effects.
  • Red/green: the two resilience tests and the stuck-callback test hang (time out) against main and pass with this change.

Known boundaries (intentionally out of scope)

  • If node-pty never emits onExit, nothing in this file can recover — the watchdog lives inside the exit handler. That variant, if it exists in the wild, needs process-liveness tracking and separate evidence.
  • The child_process fallback path waits on close (stdio drain), which a grandchild holding the pipes can delay indefinitely on Windows — same symptom family, different mechanism, and not the default path (enableInteractiveShell defaults to true). Happy to follow up separately.

Fixes #25166

The exit result of a PTY execution is gated on the output processing
chain: finalize() - the only path that resolves the result - ran
exclusively through Promise.race(processingChain, abortFired) with no
rejection handling. A chunk that threw anywhere in the rendering
pipeline poisoned the chain, the race rejected, and finalize() never
ran. The tool call then stayed `executing` forever and the UI kept
reporting the shell as awaiting input after the process had already
exited (google-gemini#25166); the global unhandledRejection handler logs and
continues, so this manifested as a silent hang rather than a crash.

- Settle every output chunk even when it throws (try/catch around the
  chunk executor, try/finally around the terminal write callback),
  logging instead of poisoning the chain.
- Treat a rejected chain as drained and run finalize() on both race
  outcomes.
- Make finalize() idempotent and throw-proof: failures while rendering
  or serializing the final buffer degrade the captured output instead
  of skipping completeWithResult().
- Guard the deferred (debounced) render: it runs in a timer outside
  any caller try/catch, where a throw becomes an uncaught exception
  that kills the CLI.

Even with every chunk settling, the exit result still waits on the
output chain draining through the headless terminal, and a write
callback that is never invoked (xterm swallows callbacks on disposed
or paused terminals; Windows ConPTY keeps flushing data after exit
while the PTY is destroyed immediately) used to leave the execution
unresolved forever - the visible symptom of google-gemini#25166.

After exit, an idle watchdog now polls drain progress: every settled
chunk resets the window, so a slow but advancing drain (large final
bursts against a 300k-line scrollback) is never cut short, and only a
genuinely stuck chain trips it. When it fires, the execution finalizes
with the output buffered so far and logs a warning so field reports
can confirm which vector was hit. The interval is unref'd and cleared
by finalize().
@MartinCajiao MartinCajiao requested a review from a team as a code owner June 11, 2026 01:50
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where shell executions could hang indefinitely in an 'awaiting input' state after the process had actually exited. The root cause was an unguarded output-processing chain that would block the finalization of the execution if any chunk failed or if the terminal write callback was never invoked. The changes introduce a robust, fault-tolerant processing pipeline and an idle-based watchdog that guarantees the execution lifecycle completes, ensuring the UI accurately reflects the process status.

Highlights

  • Resilience in Output Processing: Added comprehensive error handling to the shell output processing chain, ensuring that exceptions during rendering or serialization do not leave the execution in a permanently hung state.
  • Exit Drain Watchdog: Implemented an idle-based watchdog timer that monitors the output drain process after a shell exits, forcing finalization if the drain stalls for more than 2 seconds.
  • Idempotent Finalization: Refactored the finalization logic to be idempotent and robust, ensuring the CLI always completes the execution lifecycle even if final rendering or serialization fails.
  • Regression Testing: Added new unit tests in shellExecutionService.drain.test.ts and regression cases in shellExecutionService.test.ts to verify behavior under stalled drain conditions and rendering failures.

@github-actions github-actions Bot added the size/l A large sized PR label Jun 11, 2026
@github-actions

github-actions Bot commented Jun 11, 2026


📊 PR Size: size/L

  • Lines changed: 551
  • Additions: +492
  • Deletions: -59
  • Files changed: 3

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a post-exit output drain watchdog to resolve issues where shell executions could hang indefinitely due to stalled output processing or swallowed terminal write callbacks. It adds robust error handling around the rendering pipeline and chunk processing to ensure that failures do not block exit finalization. Additionally, new unit tests are added to verify exit finalization resilience and the watchdog behavior. The reviewer feedback suggests replacing Date.now() with performance.now() to provide a monotonic clock source, ensuring that system clock adjustments do not cause premature or delayed timeouts.

Comment on lines +1065 to +1068
    let lastDrainActivityAt = Date.now();
    const markDrainActivity = () => {
      lastDrainActivityAt = Date.now();
    };

Contributor


high

Using Date.now() for measuring elapsed time or timeouts can be unreliable if the system clock is adjusted (e.g., via NTP synchronization, VM migration, or manual changes). If the clock jumps forward, it can cause premature timeouts; if it jumps backward, it can delay the timeout.

To ensure robustness, prefer using performance.now(), which provides a monotonic clock that is guaranteed to only increase and is immune to system clock adjustments.

Suggested change
    - let lastDrainActivityAt = Date.now();
    - const markDrainActivity = () => {
    -   lastDrainActivityAt = Date.now();
    - };
    + let lastDrainActivityAt = performance.now();
    + const markDrainActivity = () => {
    +   lastDrainActivityAt = performance.now();
    + };
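
The difference between the two clocks can be illustrated with injected clock functions. This is a hypothetical sketch (`createStallCheck` is not part of the PR) that simulates a wall-clock step instead of relying on a real NTP adjustment.

```typescript
// Hypothetical helper: the clock is a parameter so that a wall-clock jump
// can be simulated; the default is the monotonic performance.now().
function createStallCheck(
  timeoutMs: number,
  now: () => number = () => performance.now(),
) {
  let lastActivityAt = now();
  return {
    markActivity(): void {
      lastActivityAt = now();
    },
    isStalled(): boolean {
      return now() - lastActivityAt >= timeoutMs;
    },
  };
}

// Simulated clocks: 10 ms of real time passes while the wall clock is
// stepped forward by one hour (as an NTP correction or VM migration might do).
let simulatedWallClockMs = 1_700_000_000_000;
let simulatedMonotonicMs = 0;
const wallClockCheck = createStallCheck(2000, () => simulatedWallClockMs);
const monotonicCheck = createStallCheck(2000, () => simulatedMonotonicMs);

simulatedWallClockMs += 3_600_000 + 10;
simulatedMonotonicMs += 10;

const wallClockStalled = wallClockCheck.isStalled(); // true: a false positive
const monotonicStalled = monotonicCheck.isStalled(); // false: only 10 ms elapsed
```

The wall-clock-based check reports a stall after 10 ms of real time, which is the premature timeout described in the review comment; the monotonic check does not.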

Comment on lines +1418 to +1420
    if (Date.now() - lastDrainActivityAt >= DRAIN_STALL_TIMEOUT_MS) {
      res('drain-stalled');
    }

Contributor


high

Use performance.now() instead of Date.now() to ensure monotonic time measurement, preventing issues caused by system clock adjustments.

Suggested change
    - if (Date.now() - lastDrainActivityAt >= DRAIN_STALL_TIMEOUT_MS) {
    -   res('drain-stalled');
    - }
    + if (performance.now() - lastDrainActivityAt >= DRAIN_STALL_TIMEOUT_MS) {
    +   res('drain-stalled');
    + }

Addresses review feedback: Date.now() is wall-clock time, so an NTP
adjustment or VM migration could fire the stall watchdog prematurely
(clock jumps forward) or delay it (clock jumps backward).
performance.now() is monotonic and immune to clock adjustments. The
Date.now() calls used for history timestamps, which are genuine
wall-clock values, are untouched.
@MartinCajiao

Author

Both suggestions addressed in 5a0083b: the drain watchdog now uses the monotonic clock (performance.now()) for both the activity marker and the stall check, so NTP/wall-clock adjustments can neither fire it prematurely nor delay it. The Date.now() uses for history timestamps are intentionally unchanged (those are genuine wall-clock values). Full suite still green: 71/71, typecheck clean.

@gemini-cli gemini-cli Bot added priority/p1 Important and should be addressed in the near term. area/core Issues related to User Interface, OS Support, Core Functionality 🔒 maintainer only ⛔ Do not contribute. Internal roadmap item. labels Jun 11, 2026

Labels

area/core Issues related to User Interface, OS Support, Core Functionality 🔒 maintainer only ⛔ Do not contribute. Internal roadmap item. priority/p1 Important and should be addressed in the near term. size/l A large sized PR


Development

Successfully merging this pull request may close these issues.

Shell command execution gets stuck with "Waiting input" after command completes

1 participant