fix(providers/codex): drop attemptController.abort() from retry finally (#1735)#1740
fix(providers/codex): drop attemptController.abort() from retry finally (#1735)#1740Wirasm wants to merge 1 commit into
Conversation
…ly (#1735) PR #1371 added `attemptController.abort()` in the retry-loop `finally` as a "downstream cleanup" gesture. By the time it fires, codex-sdk's own `finally` has already run `child.removeAllListeners()` + `child.kill()`. The abort then trips Node's internal `spawn({ signal })` listener, which calls `abortChildProcess` -> `emitError` on the now-listenerless child, surfacing as an uncaught process-level AbortError that crashes the workflow ~9-15s into the first codex node. The per-attempt AbortController is short-lived and goes out of scope at iteration end; the caller's signal listener (`onCallerAbort`) is already detached above via `removeEventListener`. No explicit abort needed. Cancel propagation is unaffected: caller abort still flows through the once-listener into the active per-attempt controller while the attempt is running. Adds a regression test asserting that after a clean `runStreamed` completion the per-attempt signal is NOT aborted - the precondition that triggered the crash. The two #1266 regression tests (fresh-signal-per-attempt and caller-abort-forwards) continue to pass unchanged. Fixes #1735
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthrough
ChangesPer-attempt AbortController cleanup race fix
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Automated Code ReviewSummaryPASS with no blockers (0 critical, 0 important). The removal correctly fixes the root cause, cancel propagation is intact, and the existing #1266 regression tests remain valid contracts. The new test is the strongest unit-level guard achievable given the SDK is mocked. Findings✅ Strengths
|
|
Closing in favor of #1739 by @kagura-agent — same root cause analysis, same fix, opened first. Functionally equivalent change to the same two files. Tracking via #1739. |
Summary
AbortControllercleanup (attemptController.abort()in the retry-loopfinally) crashes every codex-provider workflow ~9–15 s into the first node. The abort fires Node's internalspawn({ signal })listener, which callsabortChildProcess→emitErroron a child whose listeners codex-sdk has already torn down — surfacing as an uncaught process-levelAbortError.provider: codexnode (archon-comprehensive-pr-review,archon-smart-pr-review, etc.). No workaround. Process-level uncaught exception bypasses surrounding try/catch.attemptController.abort()(and its 3-line comment) frompackages/providers/src/codex/provider.tsretry-loopfinally. Added one regression test inprovider.test.ts.AbortController-per-attempt mechanism from PR fix(providers/codex): fresh AbortController per retry attempt (#1266) #1371 (the legitimate fix for bug(codex): retry loop reuses caller AbortSignal; crash on attempt N poisons attempt N+1 #1266). Cancel propagation.streamCodexEvents.buildTurnOptions. The Claude provider.@openai/codex-sdkversion.UX Journey
Before
After
Architecture Diagram
Before
After
Connection inventory:
CodexProvider.sendQueryretry-loopfinallyattemptController.abort()abortSignalattemptController(viaonCallerAbort)attemptController.signalthread.runStreamed/streamCodexEventsstreamCodexEventsonly readsaborted.caller.abortSignallisteneronCallerAbortremoveEventListenerstill detaches it infinally.Label Snapshot
risk: lowsize: XScore(provider —@archon/providers)providers:codexChange Metadata
bugcore(provider package)Linked Issue
attemptController.abort()in retry finally crashes via codex-sdk's removeAllListeners (regression from #1371) #1735Validation Evidence (required)
bun run validaterun locally. The codex provider's 59 tests (including the new regression test) all pass; the only failures in the full suite are 3 pre-existing@archon/adapterstelegram-markdown > blockquotestests that also fail on unmodifiedorigin/dev(253321b2) — unrelated to this PR.Security Impact (required)
NoNoNoNoCompatibility / Migration
Yes— removes a destructive call inside a try-blockfinally; no public API change.NoNoHuman Verification (required)
melowllc):archon workflow run archon-comprehensive-pr-review --no-worktree "Review PR #<any>"crashes ~9–15 s in on the first codex node with theAbortErrorstack frame atprovider.ts:848.onCallerAbortonce-listener (existing testcaller abort forwards into the active per-attempt signalconfirms).retry after crash receives a fresh (non-aborted) AbortSignalconfirms).codexexecution against a real binary (reporter validated; CI mocks the SDK).Side Effects / Blast Radius (required)
@archon/providerscodex retry loop only.AbortControlleris local-scoped and unobservable from outside the loop body; removing its terminal.abort()cannot affect anything we don't already control.Rollback Plan (required)
git revert <merge-commit>ondev. Single-file, 4-line additive revert; no migrations.AbortError ... at sendQuery (.../codex/provider.ts:NNN:27).Risks and Mitigations
streamCodexEventsor another downstream consumer could begin to rely on the per-attempt signal being aborted at iteration boundary.streamCodexEventsonly readsaborted;runStreamedis the only external consumer and the SDK's ownfinallyhas already completed by this point). The new regression test pins the behavior. If a future consumer needs signal teardown, it should subscribe to its own lifecycle, not piggyback on a local controller.Summary by CodeRabbit
Bug Fixes
Tests