Skip to content

Commit 5283ea9

Browse files
authored
fix(providers/codex): fresh AbortController per retry attempt (#1266) (#1371)
* fix(providers/codex): create a fresh AbortController per retry attempt Fixes #1266. The codex provider's retry loop reused the caller's AbortSignal across every attempt. When attempt N's Codex subprocess crashes, Node's `spawn({ signal })` linkage aborts the shared signal as part of SIGTERM'ing the dying child. On attempt N+1, `runStreamed` passes that same (already-aborted) signal into the next `spawn`, which SIGTERMs the freshly-spawned child before it reads any input. The "Reading prompt from stdin..." line in the resulting error is Codex CLI's normal startup banner, not a crash locus. The fix: signal assignment moves out of buildTurnOptions and into the retry loop. Each attempt gets a brand-new AbortController; the caller's signal (if provided) is chained in via a once-listener so cancellation still propagates. A try/finally removes the listener and aborts the per-attempt controller once the attempt terminates. Regression tests: - `retry after crash receives a fresh (non-aborted) AbortSignal` captures the signal passed to `runStreamed` at call-time (not by mock.calls reference, which would see the mutated .signal after reassignment) and asserts attempt 1 got a distinct, non-aborted signal. - `caller abort forwards into the active per-attempt signal` aborts the caller mid-attempt and asserts the per-attempt signal observes it. - Two existing tests updated: `buildTurnOptions` no longer attaches the caller's signal, so both "passes signal in TurnOptions" tests now assert presence of an AbortSignal without identity-equality against the caller's. Without the fix, these four tests fail and the rest pass (47/51). With the fix, all 51 pass. Out of scope: the binary HTTP timeout class-B path in the issue. * fix(providers/codex): synchronous abort check at stream entry, plus review-pass fixes `streamCodexEvents` now checks `abortSignal?.aborted` before entering the `for await` so a caller abort that lands between attempt setup and the first event surfaces immediately instead of waiting on the next event or end-of-stream. The existing between-events check is retained. Also from the same review pass: - Pi shim: wrap `mkdirSync`/`writeFileSync` in try/catch so EACCES/ ENOSPC surfaces as a classified "Pi shim setup failed at <dir>" instead of a raw node:fs error. - Codex retry path: `getLog().debug` before throwing the model-access error from a retry-attempt `startThread`; the outer query_error log only runs for retryable errors. - Docs: ai-assistants.md and configuration.md updated for Claude `~/.local/bin/claude` autodetect; ai-assistants.md gains a Codex autodetect bullet listing the five probed paths. - Tests: `homedir()` instead of `process.env.HOME ?? '/Users/test'` to match the implementation; Windows autodetect probe covered; config-over-autodetect precedence covered.
1 parent ef5a381 commit 5283ea9

7 files changed

Lines changed: 257 additions & 67 deletions

File tree

packages/docs-web/src/content/docs/getting-started/ai-assistants.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -43,7 +43,7 @@ See [Anthropic's setup guide](https://code.claude.com/docs/en/setup) for the ful
4343

4444
### Binary path configuration (compiled binaries only)
4545

46-
Compiled Archon binaries cannot auto-discover Claude Code at runtime. Supply the path via either:
46+
In compiled Archon binaries, if `claude` is not on the default install path Archon autodetects, supply the path via either:
4747

4848
1. **Environment variable** (highest precedence):
4949
```ini
@@ -55,8 +55,9 @@ Compiled Archon binaries cannot auto-discover Claude Code at runtime. Supply the
5555
claude:
5656
claudeBinaryPath: /absolute/path/to/claude
5757
```
58+
3. **Autodetect** (zero-config fallback): Archon probes `~/.local/bin/claude` (POSIX) and `%USERPROFILE%\.local\bin\claude.exe` (Windows), matching the native curl/PowerShell installer layouts.
5859

59-
If neither is set in a compiled binary, Archon throws with install instructions on first Claude query.
60+
If none of the three resolves in a compiled binary, Archon throws with install instructions on first Claude query.
6061

6162
The Claude Agent SDK accepts either the native compiled binary or a JS `cli.js`.
6263

@@ -173,6 +174,7 @@ In compiled Archon binaries, if `codex` is not on the default PATH Archon expect
173174
codexBinaryPath: /absolute/path/to/codex
174175
```
175176
3. **Vendor directory** (zero-config fallback): drop the native binary at `~/.archon/vendor/codex/codex` (or `codex.exe` on Windows).
177+
4. **Autodetect** (zero-config fallback): if the vendor directory is empty, Archon probes the common npm-global install layouts: `~/.npm-global/bin/codex` (POSIX), `/opt/homebrew/bin/codex` (macOS Apple Silicon), `/usr/local/bin/codex` (macOS Intel and Linux), `%APPDATA%\npm\codex.cmd` and `%USERPROFILE%\.npm-global\codex.cmd` (Windows). For other npm prefixes or custom layouts, set `CODEX_BIN_PATH` or the config path explicitly.
176178

177179
Dev mode (`bun run`) does not require any of the above — the SDK resolves `codex` via `node_modules`.
178180

packages/docs-web/src/content/docs/getting-started/configuration.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ Set these in your shell or `.env` file:
1414

1515
| Variable | Required | Description |
1616
|----------|----------|-------------|
17-
| `CLAUDE_BIN_PATH` | Yes (binary builds) | Absolute path to the Claude Code SDK's `cli.js`. Required in compiled Archon binaries unless `assistants.claude.claudeBinaryPath` is set. Dev mode (`bun run`) auto-resolves via `node_modules`. |
17+
| `CLAUDE_BIN_PATH` | No (binary builds autodetect `~/.local/bin/claude`) | Absolute path to the Claude Code binary or SDK `cli.js`. Overrides autodetection in compiled Archon binaries. Falls back to `assistants.claude.claudeBinaryPath`, then to the native-installer path. Dev mode (`bun run`) auto-resolves via `node_modules`. |
1818
| `CLAUDE_USE_GLOBAL_AUTH` | No | Set to `true` to use credentials from `claude /login` (default when no other Claude token is set) |
1919
| `CLAUDE_CODE_OAUTH_TOKEN` | No | OAuth token from `claude setup-token` (alternative to global auth) |
2020
| `CLAUDE_API_KEY` | No | Anthropic API key for pay-per-use (alternative to global auth) |

packages/providers/src/codex/binary-resolver.test.ts

Lines changed: 30 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,8 @@
44
* Must run in its own bun test invocation because it mocks @archon/paths
55
* with BUNDLED_IS_BINARY=true, which conflicts with other test files.
66
*/
7+
import { homedir } from 'node:os';
8+
79
import { describe, test, expect, mock, beforeEach, afterAll, spyOn } from 'bun:test';
810
import { createMockLogger } from '../test/mocks/logger';
911

@@ -89,8 +91,7 @@ describe('resolveCodexBinaryPath (binary mode)', () => {
8991

9092
test('autodetects npm global install at ~/.npm-global/bin/codex (POSIX)', async () => {
9193
if (process.platform === 'win32') return; // POSIX-only probe
92-
const home = process.env.HOME ?? '/Users/test';
93-
const expected = `${home}/.npm-global/bin/codex`;
94+
const expected = `${homedir()}/.npm-global/bin/codex`;
9495
fileExistsSpy = spyOn(resolver, 'fileExists').mockImplementation(
9596
(path: string) => path === expected
9697
);
@@ -103,6 +104,33 @@ describe('resolveCodexBinaryPath (binary mode)', () => {
103104
);
104105
});
105106

107+
test('autodetects npm global install at %APPDATA%\\npm\\codex.cmd (Windows)', async () => {
108+
if (process.platform !== 'win32') return; // Windows-only probe
109+
const appData = process.env.APPDATA ?? 'C:\\Users\\test\\AppData\\Roaming';
110+
const expected = `${appData}\\npm\\codex.cmd`;
111+
fileExistsSpy = spyOn(resolver, 'fileExists').mockImplementation(
112+
(path: string) => path === expected
113+
);
114+
115+
const result = await resolver.resolveCodexBinaryPath();
116+
expect(result).toBe(expected);
117+
expect(mockLogger.info).toHaveBeenCalledWith(
118+
{ binaryPath: expected, source: 'autodetect' },
119+
'codex.binary_resolved'
120+
);
121+
});
122+
123+
test('config codexBinaryPath takes precedence over autodetect', async () => {
124+
// Both the explicit config path AND a typical autodetect path are
125+
// present on disk; config must win. Mirrors the env-over-config and
126+
// env-over-autodetect tests above so the four-tier precedence
127+
// (env → config → vendor → autodetect) is fully covered.
128+
fileExistsSpy = spyOn(resolver, 'fileExists').mockReturnValue(true);
129+
130+
const result = await resolver.resolveCodexBinaryPath('/explicit/config/codex');
131+
expect(result).toBe('/explicit/config/codex');
132+
});
133+
106134
test('autodetects homebrew install on Apple Silicon', async () => {
107135
if (process.platform !== 'darwin' || process.arch !== 'arm64') {
108136
// `/opt/homebrew/bin/codex` is only probed on darwin-arm64; on other

packages/providers/src/codex/provider.test.ts

Lines changed: 106 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -686,7 +686,7 @@ describe('CodexProvider', () => {
686686
);
687687
});
688688

689-
test('passes abortSignal as signal in TurnOptions', async () => {
689+
test('passes a per-attempt AbortSignal in TurnOptions when caller provides one', async () => {
690690
mockRunStreamed.mockResolvedValue({
691691
events: (async function* () {
692692
yield { type: 'turn.completed', usage: defaultUsage };
@@ -702,13 +702,16 @@ describe('CodexProvider', () => {
702702
chunks.push(chunk);
703703
}
704704

705-
expect(mockRunStreamed).toHaveBeenCalledWith(
706-
'test prompt',
707-
expect.objectContaining({ signal: controller.signal })
708-
);
705+
// Signal passed to runStreamed is the per-attempt signal, not the
706+
// caller's signal directly. Aborting the caller still propagates via
707+
// the forwarding once-listener (covered by separate tests below).
708+
const call = mockRunStreamed.mock.calls[0];
709+
expect(call[0]).toBe('test prompt');
710+
expect(call[1].signal).toBeInstanceOf(AbortSignal);
711+
expect(call[1].signal).not.toBe(controller.signal);
709712
});
710713

711-
test('passes empty TurnOptions when no outputFormat or abortSignal', async () => {
714+
test('passes a per-attempt AbortSignal in TurnOptions even when caller provides none', async () => {
712715
mockRunStreamed.mockResolvedValue({
713716
events: (async function* () {
714717
yield { type: 'turn.completed', usage: defaultUsage };
@@ -720,7 +723,10 @@ describe('CodexProvider', () => {
720723
chunks.push(chunk);
721724
}
722725

723-
expect(mockRunStreamed).toHaveBeenCalledWith('test prompt', {});
726+
expect(mockRunStreamed).toHaveBeenCalledWith(
727+
'test prompt',
728+
expect.objectContaining({ signal: expect.any(AbortSignal) })
729+
);
724730
});
725731

726732
test('creates a per-call Codex instance when env is provided', async () => {
@@ -1605,4 +1611,97 @@ describe('sendQuery decomposition behaviors', () => {
16051611
expect(systemChunks.length).toBeGreaterThanOrEqual(1);
16061612
expect(systemChunks.some(c => c.type === 'system' && c.content.includes('Task 1'))).toBe(true);
16071613
}, 5_000);
1614+
1615+
// Regression for issue #1266 (crash class A).
1616+
// Before the fix, buildTurnOptions captured the caller's abortSignal once
1617+
// before the retry loop, and the same signal object was passed to every
1618+
// runStreamed attempt. Node.js aborts the spawn-linked signal when a
1619+
// subprocess crashes, so attempt N's crash left `turnOptions.signal`
1620+
// already aborted, and attempt N+1 was SIGTERM'd before it could read the
1621+
// prompt. The fix creates a fresh AbortController per attempt and chains
1622+
// the caller's signal through a once-listener.
1623+
test('retry after crash receives a fresh (non-aborted) AbortSignal', async () => {
1624+
// Capture signals at call-time. Inspecting mockRunStreamed.mock.calls
1625+
// after the fact reads from a shared turnOptions reference whose .signal
1626+
// has since been rewritten; that's fine for the implementation (each
1627+
// spawn() captures the signal at its own call) but misleading here.
1628+
const signalsAtCallTime: Array<{ signal: AbortSignal; aborted: boolean }> = [];
1629+
let callCount = 0;
1630+
mockRunStreamed.mockImplementation((_prompt: unknown, opts: { signal?: AbortSignal }) => {
1631+
const s = opts.signal!;
1632+
signalsAtCallTime.push({ signal: s, aborted: s.aborted });
1633+
callCount++;
1634+
if (callCount === 1) {
1635+
return Promise.reject(new Error('codex exec crashed'));
1636+
}
1637+
return Promise.resolve({
1638+
events: (async function* () {
1639+
yield {
1640+
type: 'item.completed',
1641+
item: { type: 'agent_message', text: 'recovered', id: 'r' },
1642+
};
1643+
yield { type: 'turn.completed', usage: defaultUsage };
1644+
})(),
1645+
});
1646+
});
1647+
1648+
const callerController = new AbortController();
1649+
1650+
const chunks = [];
1651+
for await (const chunk of client.sendQuery('test', '/workspace', undefined, {
1652+
abortSignal: callerController.signal,
1653+
})) {
1654+
chunks.push(chunk);
1655+
}
1656+
1657+
expect(mockRunStreamed).toHaveBeenCalledTimes(2);
1658+
expect(signalsAtCallTime).toHaveLength(2);
1659+
// Distinct signal objects per attempt.
1660+
expect(signalsAtCallTime[1].signal).not.toBe(signalsAtCallTime[0].signal);
1661+
// Attempt 1's signal was NOT aborted at the moment of spawn, even
1662+
// though attempt 0 crashed. This is the exact property that was
1663+
// broken in the old implementation.
1664+
expect(signalsAtCallTime[1].aborted).toBe(false);
1665+
// Caller signal was never aborted.
1666+
expect(callerController.signal.aborted).toBe(false);
1667+
}, 5_000);
1668+
1669+
test('caller abort forwards into the active per-attempt signal', async () => {
1670+
const callerController = new AbortController();
1671+
1672+
let capturedSignal: AbortSignal | undefined;
1673+
mockRunStreamed.mockImplementation((_prompt, opts: { signal?: AbortSignal }) => {
1674+
capturedSignal = opts.signal;
1675+
return Promise.resolve({
1676+
events: (async function* () {
1677+
yield {
1678+
type: 'item.completed',
1679+
item: { type: 'agent_message', text: 'partial', id: '1' },
1680+
};
1681+
// Caller aborts mid-stream; this must surface on the per-attempt signal.
1682+
callerController.abort();
1683+
yield {
1684+
type: 'item.completed',
1685+
item: { type: 'agent_message', text: 'should not appear', id: '2' },
1686+
};
1687+
yield { type: 'turn.completed', usage: defaultUsage };
1688+
})(),
1689+
});
1690+
});
1691+
1692+
const consumeGenerator = async (): Promise<void> => {
1693+
for await (const _ of client.sendQuery('test', '/workspace', undefined, {
1694+
abortSignal: callerController.signal,
1695+
})) {
1696+
// consume
1697+
}
1698+
};
1699+
1700+
await expect(consumeGenerator()).rejects.toThrow('Query aborted');
1701+
// The signal observed by runStreamed is the per-attempt one, and it
1702+
// reflects the caller's abort via the forwarding listener.
1703+
expect(capturedSignal).toBeDefined();
1704+
expect(capturedSignal).not.toBe(callerController.signal);
1705+
expect(capturedSignal?.aborted).toBe(true);
1706+
}, 5_000);
16081707
});

packages/providers/src/codex/provider.ts

Lines changed: 81 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -287,9 +287,10 @@ function buildTurnOptions(requestOptions?: SendQueryOptions): {
287287
if (requestOptions?.nodeConfig?.output_format && !requestOptions?.outputFormat) {
288288
turnOptions.outputSchema = requestOptions.nodeConfig.output_format;
289289
}
290-
if (requestOptions?.abortSignal) {
291-
turnOptions.signal = requestOptions.abortSignal;
292-
}
290+
// Signal assignment is intentionally per-attempt (in sendQuery's retry
291+
// loop), not here. Reusing a single AbortSignal across retries can poison
292+
// later attempts once any earlier attempt's subprocess is SIGTERM'd.
293+
// See issue #1266.
293294
return { turnOptions, hasOutputFormat };
294295
}
295296

@@ -314,6 +315,11 @@ async function* streamCodexEvents(
314315
const state: CodexStreamState = {};
315316
let accumulatedText = '';
316317

318+
if (abortSignal?.aborted) {
319+
getLog().info('query_aborted_before_stream');
320+
throw new Error('Query aborted');
321+
}
322+
317323
// If the iterator closes without a terminal event (e.g. the model was
318324
// rejected before the turn even started), we synthesize a fail-stop result
319325
// after the loop so the dag-executor's `msg.isError` branch catches it
@@ -760,57 +766,86 @@ export class CodexProvider implements IAgentProvider {
760766
throw new Error('Query aborted');
761767
}
762768

763-
if (attempt > 0) {
764-
getLog().debug({ cwd, attempt }, 'starting_new_thread');
765-
try {
766-
thread = codex.startThread(threadOptions);
767-
} catch (startError) {
768-
const err = startError as Error;
769-
if (isModelAccessError(err.message)) {
770-
throw new Error(buildModelAccessMessage(requestOptions?.model));
771-
}
772-
throw new Error(`Codex query failed: ${err.message}`);
773-
}
769+
// Fresh AbortController per attempt. Caller's abortSignal, if any, is
770+
// chained in via a once-listener so cancellation still propagates.
771+
// Without this, a signal aborted during attempt N (e.g. when the
772+
// Codex subprocess crashes and Node.js reacts to the `spawn({ signal })`
773+
// linkage) would wire an already-aborted signal into attempt N+1's
774+
// `spawn`, SIGTERMing the freshly spawned child before it reads any
775+
// input. The "Reading prompt from stdin..." in the resulting error is
776+
// Codex CLI's startup banner, not an indicator of crash location.
777+
// See issue #1266.
778+
const attemptController = new AbortController();
779+
const onCallerAbort = (): void => {
780+
attemptController.abort();
781+
};
782+
if (requestOptions?.abortSignal) {
783+
requestOptions.abortSignal.addEventListener('abort', onCallerAbort, { once: true });
774784
}
785+
turnOptions.signal = attemptController.signal;
775786

776787
try {
777-
// 4. Run streamed turn
778-
const result = await thread.runStreamed(prompt, turnOptions);
779-
780-
// 5. Stream normalized events (fresh state per attempt to avoid dedup leaks)
781-
yield* streamCodexEvents(
782-
result.events as AsyncIterable<Record<string, unknown>>,
783-
hasOutputFormat,
784-
thread.id,
785-
requestOptions?.abortSignal,
786-
Boolean(requestOptions?.nodeConfig?.mcp)
787-
);
788-
return;
789-
} catch (error) {
790-
const err = error as Error;
791-
792-
if (requestOptions?.abortSignal?.aborted) {
793-
throw new Error('Query aborted');
788+
if (attempt > 0) {
789+
getLog().debug({ cwd, attempt }, 'starting_new_thread');
790+
try {
791+
thread = codex.startThread(threadOptions);
792+
} catch (startError) {
793+
const err = startError as Error;
794+
if (isModelAccessError(err.message)) {
795+
getLog().debug({ attempt, errorClass: 'model_access' }, 'query_error_pre_retry');
796+
throw new Error(buildModelAccessMessage(requestOptions?.model));
797+
}
798+
throw new Error(`Codex query failed: ${err.message}`);
799+
}
794800
}
795801

796-
const { enrichedError, errorClass, shouldRetry } = classifyAndEnrichCodexError(
797-
err,
798-
requestOptions?.model
799-
);
802+
try {
803+
// 4. Run streamed turn
804+
const result = await thread.runStreamed(prompt, turnOptions);
805+
806+
// 5. Stream normalized events (fresh state per attempt to avoid dedup leaks)
807+
yield* streamCodexEvents(
808+
result.events as AsyncIterable<Record<string, unknown>>,
809+
hasOutputFormat,
810+
thread.id,
811+
attemptController.signal,
812+
Boolean(requestOptions?.nodeConfig?.mcp)
813+
);
814+
return;
815+
} catch (error) {
816+
const err = error as Error;
800817

801-
getLog().error(
802-
{ err, errorClass, attempt, maxRetries: MAX_SUBPROCESS_RETRIES },
803-
'query_error'
804-
);
818+
if (requestOptions?.abortSignal?.aborted) {
819+
throw new Error('Query aborted');
820+
}
805821

806-
if (!shouldRetry || attempt >= MAX_SUBPROCESS_RETRIES) {
807-
throw enrichedError;
808-
}
822+
const { enrichedError, errorClass, shouldRetry } = classifyAndEnrichCodexError(
823+
err,
824+
requestOptions?.model
825+
);
826+
827+
getLog().error(
828+
{ err, errorClass, attempt, maxRetries: MAX_SUBPROCESS_RETRIES },
829+
'query_error'
830+
);
809831

810-
const delayMs = this.retryBaseDelayMs * Math.pow(2, attempt);
811-
getLog().info({ attempt, delayMs, errorClass }, 'retrying_query');
812-
await new Promise(resolve => setTimeout(resolve, delayMs));
813-
lastError = enrichedError;
832+
if (!shouldRetry || attempt >= MAX_SUBPROCESS_RETRIES) {
833+
throw enrichedError;
834+
}
835+
836+
const delayMs = this.retryBaseDelayMs * Math.pow(2, attempt);
837+
getLog().info({ attempt, delayMs, errorClass }, 'retrying_query');
838+
await new Promise(resolve => setTimeout(resolve, delayMs));
839+
lastError = enrichedError;
840+
}
841+
} finally {
842+
if (requestOptions?.abortSignal) {
843+
requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
844+
}
845+
// Signal to any downstream consumers that this attempt is done.
846+
// Next iteration creates a fresh controller; caller's signal state
847+
// is unchanged.
848+
attemptController.abort();
814849
}
815850
}
816851

0 commit comments

Comments
 (0)