fix(providers/codex): fresh AbortController per retry attempt (#1266)#1371
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughPer-attempt AbortController chaining was added to Codex sendQuery with upfront aborted checks; tests and resolver tests updated; PiProvider shim creation was hardened and tests added; docs updated to document compiled-binary autodetect and CLAUDE_BIN_PATH fallback behavior. ChangesCodex Provider Per-Attempt Cancellation
Pi shim, tests, and docs
Sequence Diagram(s)sequenceDiagram
actor Caller
participant Provider as CodexProvider
participant AttemptCtrl as AttemptController
participant Stream as streamCodexEvents
participant Codex as Codex CLI/SDK
Caller->>Provider: sendQuery(prompt, options.abortSignal)
loop retry attempts
Provider->>AttemptCtrl: new AbortController()
Provider->>AttemptCtrl: addEventListener once (caller.abort -> attemptCtrl.abort)
Provider->>Stream: streamCodexEvents(turnOptions with attemptCtrl.signal)
Stream->>Codex: spawn({ signal: attemptCtrl.signal })
alt caller aborts
Caller->>Provider: abort()
AttemptCtrl->>Codex: signal abort -> terminate child
Codex-->>Stream: error
Stream-->>Provider: catch -> throw "Query aborted"
else SDK crash/error
Codex-->>Stream: error
Stream-->>Provider: catch -> classify/retry
end
Provider->>AttemptCtrl: finally: remove listener + attemptCtrl.abort()
end
Provider-->>Caller: final result or error
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/providers/src/codex/provider.ts (1)
579-585: ConsiderAbortSignal.any()for cleaner signal chaining.The current
addEventListener/removeEventListenerpattern works correctly, butAbortSignal.any([attemptController.signal, requestOptions.abortSignal])is more idiomatic for combining abort signals. Since each retry iteration creates a freshAbortController, you can chain it viaAbortSignal.any()and pass the result torunStreamed—no manual listener bookkeeping needed.This is an optional improvement; the current implementation is correct and handles the retry logic (issue
#1266) appropriately.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/providers/src/codex/provider.ts` around lines 579 - 585, Replace the manual addEventListener/removeEventListener chaining with AbortSignal.any by creating the per-attempt AbortController (attemptController) as before, then compose a combined signal using AbortSignal.any([attemptController.signal, requestOptions?.abortSignal]) and pass that combined signal into runStreamed (instead of wiring onCallerAbort). Remove onCallerAbort and the event listener cleanup; keep attemptController.abort() usage for per-attempt aborts so retries still work the same.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/providers/src/codex/provider.ts`:
- Around line 579-585: Replace the manual addEventListener/removeEventListener
chaining with AbortSignal.any by creating the per-attempt AbortController
(attemptController) as before, then compose a combined signal using
AbortSignal.any([attemptController.signal, requestOptions?.abortSignal]) and
pass that combined signal into runStreamed (instead of wiring onCallerAbort).
Remove onCallerAbort and the event listener cleanup; keep
attemptController.abort() usage for per-attempt aborts so retries still work the
same.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d9c198b9-6ee1-4eff-abd5-84063276c8c9
📒 Files selected for processing (2)
packages/providers/src/codex/provider.test.tspackages/providers/src/codex/provider.ts
|
Hi @truffle-dev — thanks for opening this PR. This repository uses a PR template at
Could you fill those out (even briefly)? The template helps reviewers understand scope, risk, and rollback — it speeds up review significantly. If a section genuinely doesn't apply, just write "N/A" in it rather than leaving it blank. |
|
Filled in. UX Journey + before/after Architecture Diagrams trace the per-attempt signal lifecycle (the fix's whole point), Label Snapshot, Change Metadata, Security (no/no/no/no), Compatibility, Human Verification, Side Effects, Rollback, Risks all populated. Existing Problem/Change/Regression-tests detail kept and slotted into Summary + Validation Evidence + Risks. |
|
Friendly check-in. 18 days since the PR template was filled in per the review request. The per-attempt AbortController fix still stands — happy to rebase against |
Review SummaryVerdict: minor-fixes-needed Good set of changes. The Codex retry-signal fix for #1266 is the headline — the Blocking issues
Suggested fixes
Minor / nice-to-have
Compliments
Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality, docs-impact. |
Blocking
Suggested fixes
Minor / nice-to-have
All 120 affected tests pass; |
|
Thanks for the abort-pre-check and abort-before-stream logging in `3308df155` @truffle-dev — that addresses the blocking feedback. The PR went CONFLICTING after #1651, #1658, and #1459 landed on dev with overlapping changes in `packages/providers/src/codex/provider.ts`. The local simulation shows it's a single-file conflict (the others auto-merge cleanly), so the rebase should be quick. Could you rebase onto current dev when you have a minute? |
Fixes coleam00#1266. The codex provider's retry loop reused the caller's AbortSignal across every attempt. When attempt N's Codex subprocess crashes, Node's `spawn({ signal })` linkage aborts the shared signal as part of SIGTERM'ing the dying child. On attempt N+1, `runStreamed` passes that same (already-aborted) signal into the next `spawn`, which SIGTERMs the freshly-spawned child before it reads any input. The "Reading prompt from stdin..." line in the resulting error is Codex CLI's normal startup banner, not a crash locus. The fix: signal assignment moves out of buildTurnOptions and into the retry loop. Each attempt gets a brand-new AbortController; the caller's signal (if provided) is chained in via a once-listener so cancellation still propagates. A try/finally removes the listener and aborts the per-attempt controller once the attempt terminates. Regression tests: - `retry after crash receives a fresh (non-aborted) AbortSignal` captures the signal passed to `runStreamed` at call-time (not by mock.calls reference, which would see the mutated .signal after reassignment) and asserts attempt 1 got a distinct, non-aborted signal. - `caller abort forwards into the active per-attempt signal` aborts the caller mid-attempt and asserts the per-attempt signal observes it. - Two existing tests updated: `buildTurnOptions` no longer attaches the caller's signal, so both "passes signal in TurnOptions" tests now assert presence of an AbortSignal without identity-equality against the caller's. Without the fix, these four tests fail and the rest pass (47/51). With the fix, all 51 pass. Out of scope: the binary HTTP timeout class-B path in the issue.
…eview-pass fixes `streamCodexEvents` now checks `abortSignal?.aborted` before entering the `for await` so a caller abort that lands between attempt setup and the first event surfaces immediately instead of waiting on the next event or end-of-stream. The existing between-events check is retained. Also from the same review pass: - Pi shim: wrap `mkdirSync`/`writeFileSync` in try/catch so EACCES/ ENOSPC surfaces as a classified "Pi shim setup failed at <dir>" instead of a raw node:fs error. - Codex retry path: `getLog().debug` before throwing the model-access error from a retry-attempt `startThread`; the outer query_error log only runs for retryable errors. - Docs: ai-assistants.md and configuration.md updated for Claude `~/.local/bin/claude` autodetect; ai-assistants.md gains a Codex autodetect bullet listing the five probed paths. - Tests: `homedir()` instead of `process.env.HOME ?? '/Users/test'` to match the implementation; Windows autodetect probe covered; config-over-autodetect precedence covered.
3308df1 to
d1645b7
Compare
|
Rebased onto current dev at d1645b7. The provider.ts conflict was with #1459's MCP additions to Tests green ( |
Summary
CodexProvider.sendQuery's retry loop reuses the caller'sAbortSignalacross every attempt. When attempt N's Codex subprocess crashes, Node'sspawn({ signal })linkage aborts the linked signal during SIGTERM, and attempt N+1 inherits an already-aborted signal — SIGTERMing the freshly-spawned child before it reads its stdin.buildTurnOptionsinto the retry loop inpackages/providers/src/codex/provider.ts. Each attempt builds a freshAbortController, chains the caller's signal in via a once-listener, and tears the listener back down intry/finally.buildTurnOptionsproduces the same shape minus the signal field. Class B of bug(codex): retry loop reuses caller AbortSignal; crash on attempt N poisons attempt N+1 #1266 (the internal binary HTTP timeout that fires before the provider layer runs) is left out of scope — different surface, different fix. No other provider's retry loop is touched.UX Journey
Before
After
Architecture Diagram
Before
After
Connection inventory:
buildTurnOptionsturnOptions.signalnew AbortController()abortSignaltry/finallysymmetrystreamCodexEventsLabel Snapshot
risk: lowsize: Sadaptersproviders:codexChange Metadata
bugadaptersLinked Issue
Validation Evidence (required)
Security Impact (required)
Compatibility / Migration
sendQueryAPI and abort semantics are unchanged. Caller-suppliedabortSignalstill aborts the active attempt; the only behavioural change is that the retry loop's internal signal lifecycle is isolated from the caller's.Human Verification (required)
runStreamedcall-time viamockImplementation(sincemock.calls[i].signalwould read the post-mutation reference).turnOptions.signalis still always anAbortSignal(the per-attempt one), so downstreamrunStreamedcall-shape is preserved.try/finallytear-down → per-attempt controller is aborted on every exit (success or throw); listener removed.Side Effects / Blast Radius (required)
turnOptions.signal === requestOptions.abortSignal(identity equality) downstream, that assumption no longer holds. The existing test suite did not assert it, and no in-tree caller does.Rollback Plan (required)
git revert <merge-commit-sha>— two files:provider.ts(signal-handling block) andprovider.test.ts(regression coverage).Risks and Mitigations
try/finallyaborting the per-attempt controller on success could surprise downstream code that holds a reference toturnOptions.signalpast the attempt's resolution.runStreamed's consumption of the signal is bounded by the attempt;streamCodexEventsalready disposes of its signal observation when the stream ends. No in-tree consumer holds the signal past the attempt boundary.turnOptions.signalis anAbortSignalbut not identity-equal to the caller's. The new contract is explicit and grep-able.removeEventListener.{ once: true }and removed in finally; the only window for a stale fire is between the abort dispatch and the finally block, both on the same tick. Cleanup is idempotent (controller.abort()on an already-aborted controller is a no-op).Summary by CodeRabbit
Bug Fixes
Documentation
Tests