fix(providers/codex): drop attemptController.abort() from retry finally (#1735) #1740

Closed
Wirasm wants to merge 1 commit into dev from fix/issue-1735-codex-attempt-abort

Conversation

@Wirasm
Collaborator

@Wirasm Wirasm commented May 21, 2026

Summary

  • Problem: The per-attempt AbortController cleanup added in PR #1371 (attemptController.abort() in the retry-loop finally) crashes every codex-provider workflow ~9–15 s into the first node. The abort fires Node's internal spawn({ signal }) listener, which calls abortChildProcess → emitError on a child whose listeners codex-sdk has already torn down, surfacing as an uncaught process-level AbortError.
  • Why it matters: 100% reproducible regression on any DAG workflow with a provider: codex node (archon-comprehensive-pr-review, archon-smart-pr-review, etc.). No workaround. Process-level uncaught exception bypasses surrounding try/catch.
  • What changed: Removed attemptController.abort() (and its 3-line comment) from packages/providers/src/codex/provider.ts retry-loop finally. Added one regression test in provider.test.ts.
  • What did NOT change: The fresh-AbortController-per-attempt mechanism from PR #1371 (the legitimate fix for #1266). Cancel propagation. streamCodexEvents. buildTurnOptions. The Claude provider. @openai/codex-sdk version.
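
The crash mechanism in the Problem bullet rests on a general Node.js rule: emitting 'error' on an EventEmitter that has no 'error' listener throws the error instead of delivering it. The sketch below illustrates only that rule, assuming a plain EventEmitter is a fair stand-in for the child process handle. `emitErrorAfterTeardown` is a hypothetical helper and is not part of Archon or codex-sdk.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical helper: a plain EventEmitter stands in for the child process handle.
function emitErrorAfterTeardown(removeListenersFirst: boolean): "handled" | "thrown" {
  const child = new EventEmitter();
  child.on("error", () => {
    // While an 'error' listener exists, the error is delivered here.
  });
  if (removeListenersFirst) {
    // Mirrors the SDK-side teardown described above (removeAllListeners()).
    child.removeAllListeners();
  }
  const abortError = new Error("The operation was aborted");
  abortError.name = "AbortError";
  try {
    // With no 'error' listener left, emit() throws the error itself.
    child.emit("error", abortError);
    return "handled";
  } catch {
    return "thrown";
  }
}
```

This synchronous sketch can catch the throw only because it calls emit() directly. The PR reports that in the real failure the error surfaces at process level and bypasses the surrounding try/catch.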

UX Journey

Before

User                  Archon (web/CLI/adapters)          codex provider             codex-sdk + spawn()
────                  ─────────────────────────          ──────────────             ───────────────────
runs workflow ─────▶  schedules codex node  ──────────▶  sendQuery() retry loop ─▶  spawn(codex, { signal })
                                                         awaits runStreamed         child runs OK
                                                         ◀───────────────────────── SDK finally:
                                                                                      removeAllListeners()
                                                                                      child.kill()
                                                         finally:
                                                           attemptController.abort()
                                                           └─▶ Node onAbortListener fires
                                                                └─▶ abortChildProcess()
                                                                     └─▶ emitError on listenerless child
                                                                          └─▶ UNCAUGHT AbortError
                                                         ◀── workflow crashes ──
sees crash ◀──────── workflow run failed

After

User                  Archon                              codex provider             codex-sdk + spawn()
────                  ──────                              ──────────────             ───────────────────
runs workflow ─────▶  schedules codex node  ──────────▶  sendQuery() retry loop ─▶  spawn(codex, { signal })
                                                         awaits runStreamed         child runs OK
                                                         ◀───────────────────────── SDK finally:
                                                                                      removeAllListeners()
                                                                                      child.kill()
                                                         finally:
                                                           removeEventListener(...)   [*unchanged*]
                                                           **(no abort)**             [-removed]
                                                         loop iteration ends; controller out of scope, GC'd
                                                         emits chunks ──────────────▶
sees output ◀────────  streams response

Architecture Diagram

Before

[caller]
  │ abortSignal
  ▼
[CodexProvider.sendQuery retry loop]
  ├── new AbortController() per attempt   [unchanged contract]
  ├── caller.addEventListener('abort', onCallerAbort, { once: true })
  ├── turnOptions.signal = attemptController.signal
  ├── thread.runStreamed(prompt, turnOptions)
  │      └─▶ [codex-sdk] spawn(codexBin, { signal })  ─── Node installs internal onAbortListener
  │             ...
  │             finally { rl.close(); child.removeAllListeners(); child.kill(); }
  └── finally {
        caller.removeEventListener('abort', onCallerAbort);   ── correct cleanup
        attemptController.abort();                            ── ✗ trips listenerless child via spawn-internal listener
      }

After

[caller]
  │ abortSignal
  ▼
[CodexProvider.sendQuery retry loop]
  ├── new AbortController() per attempt           [~unchanged]
  ├── caller.addEventListener('abort', onCallerAbort, { once: true })
  ├── turnOptions.signal = attemptController.signal
  ├── thread.runStreamed(prompt, turnOptions)
  │      └─▶ [codex-sdk] spawn(codexBin, { signal })
  │             finally { rl.close(); child.removeAllListeners(); child.kill(); }
  └── finally {
        caller.removeEventListener('abort', onCallerAbort);   ── retained
        [- attemptController.abort()]                         ── REMOVED
      }
  (attemptController falls out of lexical scope at iteration end → GC eligible)

Connection inventory:

| From | To | Status | Notes |
| --- | --- | --- | --- |
| CodexProvider.sendQuery retry-loop finally | attemptController.abort() | removed | Eliminates the abort that trips the spawn-internal listener after the SDK has cleaned up the child. |
| Caller's abortSignal | attemptController (via onCallerAbort) | unchanged | Still chained while the attempt is active; cancel propagation unaffected. |
| attemptController.signal | thread.runStreamed / streamCodexEvents | unchanged | Per-attempt signal still passed; streamCodexEvents only reads aborted. |
| caller.abortSignal listener | onCallerAbort | unchanged | removeEventListener still detaches it in finally. |
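
The "After" shape of the retry loop can be sketched as follows. This is a simplified illustration, not the code in provider.ts: `runWithRetries`, `Attempt`, and `maxAttempts` are hypothetical names, and the real loop's retry classification and streaming are left out.

```typescript
// Hypothetical signature for one attempt; the real provider calls thread.runStreamed.
type Attempt<T> = (signal: AbortSignal, attempt: number) => Promise<T>;

async function runWithRetries<T>(
  run: Attempt<T>,
  callerSignal: AbortSignal | undefined,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Fresh controller per attempt, so a failed attempt cannot poison the next one.
    const attemptController = new AbortController();
    const onCallerAbort = (): void => attemptController.abort();
    callerSignal?.addEventListener("abort", onCallerAbort, { once: true });
    try {
      return await run(attemptController.signal, attempt);
    } catch (err) {
      lastError = err;
    } finally {
      // Detach the caller listener; deliberately no attemptController.abort() here.
      callerSignal?.removeEventListener("abort", onCallerAbort);
    }
  }
  throw lastError;
}
```

Each iteration hands out a distinct signal, and nothing aborts that signal once the attempt has settled, which is the property the connection inventory records as "removed".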

Label Snapshot

  • Risk: risk: low
  • Size: size: XS
  • Scope: core (provider — @archon/providers)
  • Module: providers:codex

Change Metadata

  • Change type: bug
  • Primary scope: core (provider package)

Linked Issue

Validation Evidence (required)

bun run type-check    # all packages: Exited with code 0
bun run lint          # eslint --cache: clean, no warnings (max-warnings 0 enforced)
bun run format:check  # All matched files use Prettier code style!
bun test packages/providers/src/codex/provider.test.ts
# 59 pass / 0 fail / 130 expect() calls / 250ms
bun --filter @archon/providers test
# 67 + 1 + ... — 0 fail across the @archon/providers split
  • Evidence provided: full bun run validate run locally. The codex provider's 59 tests (including the new regression test) all pass; the only failures in the full suite are 3 pre-existing @archon/adapters telegram-markdown > blockquotes tests that also fail on unmodified origin/dev (253321b2) — unrelated to this PR.
  • Skipped commands: none.

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes — removes a destructive call inside a try-block finally; no public API change.
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

  • Verified scenarios:
    • Pre-fix repro behavior (per issue reporter melowllc): archon workflow run archon-comprehensive-pr-review --no-worktree "Review PR #<any>" crashes ~9–15 s in on the first codex node with the AbortError stack frame at provider.ts:848.
    • Post-fix: reporter verified locally that the same workflow completes end-to-end (review posted, auto-fix commits pushed).
    • All 59 codex provider unit tests pass on this branch (including the new regression test and the two pre-existing #1266 regression tests that prove cancel propagation and fresh-signal-per-attempt still work).
  • Edge cases checked:
    • Caller abort mid-attempt → still propagates via onCallerAbort once-listener (existing test caller abort forwards into the active per-attempt signal confirms).
    • Retry after crash → still gets a fresh, non-aborted signal (existing test retry after crash receives a fresh (non-aborted) AbortSignal confirms).
    • Clean completion → per-attempt signal stays non-aborted (new test asserts).
  • What was not verified by me directly: end-to-end live codex execution against a real binary (reporter validated; CI mocks the SDK).
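
The first edge case above depends on how the caller's signal is chained to the per-attempt controller. A minimal sketch of that chaining, using only the standard AbortController API; `linkAttempt` is a hypothetical helper introduced here for illustration:

```typescript
// Hypothetical helper: chains a caller signal to a fresh per-attempt controller.
function linkAttempt(callerSignal: AbortSignal): {
  attemptSignal: AbortSignal;
  detach: () => void;
} {
  const attemptController = new AbortController();
  const onCallerAbort = (): void => attemptController.abort();
  // { once: true } removes the listener automatically after it fires.
  callerSignal.addEventListener("abort", onCallerAbort, { once: true });
  return {
    attemptSignal: attemptController.signal,
    // Safe to call even if the once-listener has already fired (it is a no-op then).
    detach: () => callerSignal.removeEventListener("abort", onCallerAbort),
  };
}

// Caller aborts while the attempt is active: the abort is forwarded.
const activeCaller = new AbortController();
const active = linkAttempt(activeCaller.signal);
activeCaller.abort();
active.detach();

// Caller aborts after the attempt was detached: nothing is forwarded.
const lateCaller = new AbortController();
const finished = linkAttempt(lateCaller.signal);
finished.detach();
lateCaller.abort();
```

After this runs, `active.attemptSignal.aborted` is true and `finished.attemptSignal.aborted` is false, which matches the cancel-propagation and clean-completion cases listed above.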

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: @archon/providers codex retry loop only.
  • Potential unintended effects: none expected. The AbortController is local-scoped and unobservable from outside the loop body; removing its terminal .abort() cannot affect anything we don't already control.
  • Guardrails/monitoring for early detection: the new regression test fails if the abort is reintroduced. Live workflows surface the crash immediately (uncaught exception), so reintroduction would be visible within minutes.

Rollback Plan (required)

  • Fast rollback command/path: git revert <merge-commit> on dev. Single-file, 4-line additive revert; no migrations.
  • Feature flags or config toggles: none — the abort was unconditional.
  • Observable failure symptoms: codex-provider DAG workflows crashing within ~15 s of starting the first codex node with AbortError ... at sendQuery (.../codex/provider.ts:NNN:27).

Risks and Mitigations

  • Risk: A future caller of streamCodexEvents or another downstream consumer could begin to rely on the per-attempt signal being aborted at iteration boundary.
    • Mitigation: there are no such consumers today (streamCodexEvents only reads aborted; runStreamed is the only external consumer and the SDK's own finally has already completed by this point). The new regression test pins the behavior. If a future consumer needs signal teardown, it should subscribe to its own lifecycle, not piggyback on a local controller.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a codex-provider crash: the per-attempt abort signal was being aborted after a query had already completed successfully, which raised an uncaught AbortError and failed the workflow.
  • Tests

    • Added a regression test asserting that the per-attempt signal is still non-aborted after a successful query completes.

Review Change Stack

…ly (#1735)

PR #1371 added `attemptController.abort()` in the retry-loop `finally` as a
"downstream cleanup" gesture. By the time it fires, codex-sdk's own `finally`
has already run `child.removeAllListeners()` + `child.kill()`. The abort then
trips Node's internal `spawn({ signal })` listener, which calls
`abortChildProcess` -> `emitError` on the now-listenerless child, surfacing
as an uncaught process-level AbortError that crashes the workflow ~9-15s into
the first codex node.

The per-attempt AbortController is short-lived and goes out of scope at
iteration end; the caller's signal listener (`onCallerAbort`) is already
detached above via `removeEventListener`. No explicit abort needed.

Cancel propagation is unaffected: caller abort still flows through the
once-listener into the active per-attempt controller while the attempt is
running.

Adds a regression test asserting that after a clean `runStreamed` completion
the per-attempt signal is NOT aborted - the precondition that triggered the
crash. The two #1266 regression tests (fresh-signal-per-attempt and
caller-abort-forwards) continue to pass unchanged.

Fixes #1735
@coderabbitai

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 42d9cd8e-62d9-4698-8c58-3dce8a6fded9

📥 Commits

Reviewing files that changed from the base of the PR and between 253321b and 56230f5.

📒 Files selected for processing (2)
  • packages/providers/src/codex/provider.test.ts
  • packages/providers/src/codex/provider.ts
💤 Files with no reviewable changes (1)
  • packages/providers/src/codex/provider.ts

📝 Walkthrough

CodexProvider.sendQuery no longer explicitly aborts the per-attempt AbortController in its finally block, and a regression test validates the per-attempt signal remains unaborted after clean completion, fixing a race condition that caused uncaught exceptions when the codex SDK had already cleaned up the child process.

Changes

Per-attempt AbortController cleanup race fix

| Layer / File(s) | Summary |
| --- | --- |
| Remove explicit per-attempt abort and add regression test (packages/providers/src/codex/provider.ts, packages/providers/src/codex/provider.test.ts) | Removed the attemptController.abort() call that fired after the SDK's cleanup and caused uncaught exceptions. Added regression test that verifies the per-attempt signal is not aborted after successful turn.completed completion. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related issues

  • coleam00/Archon#1735: Directly addresses the attemptController.abort() cleanup-race crash that occurs when the codex SDK's finally already cleaned up the child process before Archon's abort fires.

Possibly related PRs

  • coleam00/Archon#1371: This PR refines the per-attempt AbortController handling introduced there; PR #1371 added the problematic attemptController.abort() call, and this PR removes it to fix the unintended crash.

Poem

🐰 A signal once tangled in cleanup's quick race,
Now exits gracefully, no crash in its trace,
The test stands as witness: unaborted it stays,
Fresh starts for each attempt, through retry's long maze! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: removing attemptController.abort() from the retry finally block in the codex provider. |
| Description check | ✅ Passed | The description comprehensively covers all template sections with detailed problem statement, UX journey, architecture diagrams, validation evidence, and rollback plans. |
| Linked Issues check | ✅ Passed | The code changes directly address issue #1735: removing the problematic attemptController.abort() call that crashes workflows via the codex-sdk's abort listener after child cleanup. |
| Out of Scope Changes check | ✅ Passed | All changes are tightly scoped to fixing the regression: removing one line from provider.ts and adding one regression test to provider.test.ts, both directly tied to issue #1735. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@Wirasm
Collaborator Author

Wirasm commented May 21, 2026

🔍 Automated Code Review

Summary

PASS with no blockers (0 critical, 0 important). The removal correctly fixes the root cause, cancel propagation is intact, and the existing #1266 regression tests remain valid contracts. The new test is the strongest unit-level guard achievable given the SDK is mocked.

Findings

✅ Strengths

  • Root-cause fix is complete. The only subscribers to attemptController.signal are: onCallerAbort (which writes, not reads), streamCodexEvents (polls aborted, no persistent listener), and turnOptions.signal (consumed by the SDK's already-finished spawn). Nothing left needs the finally abort.
  • Cancel propagation untouched. onCallerAbort is still registered with { once: true } at loop entry, still wired to attemptController.abort(), and removeEventListener in finally is race-safe (no-op if the once listener already fired).
  • Minimal and surgical. A 4-line removal (the abort call plus its stale 3-line comment). No interface changes, no new error paths.
  • Existing #1266 tests remain valid. retry after crash receives a fresh (non-aborted) AbortSignal and caller abort forwards into the active per-attempt signal both still exercise the contracts they were written to pin.

⚠️ Suggestions (non-blocking)

  • packages/providers/src/codex/provider.test.ts — the new test asserts capturedSignal?.aborted === false at a single point in time. That's the right property and unavoidable at unit-test granularity (the SDK is mocked, so the actual spawn({ signal }) path that produced the AbortError is not exercised). A marginally stronger guard would also verify the captured signal's object identity is unchanged, but that's cosmetic. The test's comment is honest about what it tests — leave as is.
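
The shape of such a test can be sketched as follows. This is not the test in provider.test.ts: `RunStreamed`, `sendQueryOnce`, and `capturedSignalAborted` are hypothetical stand-ins, and the `abortInFinally` flag exists only to contrast the pre-fix and post-fix behaviour against a mocked SDK call.

```typescript
// Hypothetical stand-in for the mocked SDK entry point.
type RunStreamed = (prompt: string, opts: { signal: AbortSignal }) => Promise<string>;

// Simplified single attempt; abortInFinally = true reproduces the pre-fix cleanup.
async function sendQueryOnce(runStreamed: RunStreamed, abortInFinally: boolean): Promise<string> {
  const attemptController = new AbortController();
  try {
    return await runStreamed("hello", { signal: attemptController.signal });
  } finally {
    if (abortInFinally) {
      attemptController.abort();
    }
  }
}

// Captures the signal handed to the mock and reports its state after completion.
async function capturedSignalAborted(abortInFinally: boolean): Promise<boolean> {
  const captured: AbortSignal[] = [];
  const mockRunStreamed: RunStreamed = async (_prompt, opts) => {
    captured.push(opts.signal);
    return "turn.completed";
  };
  await sendQueryOnce(mockRunStreamed, abortInFinally);
  if (captured.length !== 1) {
    throw new Error("mock was not called exactly once");
  }
  return captured[0].aborted;
}
```

With the mock in place, `capturedSignalAborted(false)` resolves to false and `capturedSignalAborted(true)` resolves to true, so the assertion separates the two behaviours even though the real spawn({ signal }) path is never exercised.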

🔒 Security

  • No security concerns. No new permissions, capabilities, network calls, secret handling, or filesystem access scope.

Checklist

  • Fix addresses root cause from investigation (line 848 abort → SDK-cleaned child → uncaught AbortError)
  • Code follows codebase patterns (matches PR #1371's existing once-listener structure)
  • Tests cover the change (new test pins the post-completion non-aborted property)
  • No obvious bugs introduced; existing regression tests unaffected

Self-reviewed by Claude — ready for human review.

@Wirasm
Collaborator Author

Wirasm commented May 21, 2026

Closing in favor of #1739 by @kagura-agent — same root cause analysis, same fix, opened first. Functionally equivalent change to the same two files. Tracking via #1739.

@Wirasm Wirasm closed this May 21, 2026
@Wirasm Wirasm deleted the fix/issue-1735-codex-attempt-abort branch May 21, 2026 11:36

Development

Successfully merging this pull request may close these issues.

bug(providers/codex): attemptController.abort() in retry finally crashes via codex-sdk's removeAllListeners (regression from #1371)

1 participant