Wire structured error propagation through UnitResult #5
Description
Problem
UnitResult has status: "completed" | "cancelled" | "error" but only "completed" and "cancelled" are ever set. The "error" variant is dead code — nothing ever produces it.
All error context (provider errors, timeouts, idle watchdog kills) is discarded at the resolve boundary:
- `resolveAgentEnd(event)` always produces `{ status: "completed", event }`
- `resolveAgentEndCancelled()` always produces `{ status: "cancelled" }` with zero context
The journal's error classification (added in the OTel improvements, #4) relies on regex-matching message content — fragile and unreliable across providers.
Error info that exists but gets thrown away
| Call Site | File | Available Context | What Gets Passed |
|---|---|---|---|
| agent_end handler | `bootstrap/agent-end-recovery.ts:131` | `lastMsg.errorMessage`, `lastMsg.stopReason`, `classifyProviderError()` result | Just the raw event |
| Hard timeout | `auto-timers.ts:229` | `unitType`, `unitId`, timeout duration | Nothing (`resolveAgentEndCancelled()`) |
| Idle watchdog | `auto-timers.ts:195` | idle threshold, `lastProgressAt`, tool state | Nothing (`resolveAgentEndCancelled()`) |
| Session creation failure | `auto/run-unit.ts:61` | The actual `Error` object | `{ status: "cancelled" }` |
| Session timeout | `auto/run-unit.ts:67` | Timeout constant | `{ status: "cancelled" }` |
Key architectural constraint
When stopReason === "error", handleAgentEnd in agent-end-recovery.ts does recovery (network retries → model fallbacks → provider error pause) and never calls resolveAgentEnd. Error-path pauses go through pauseAuto → resolveAgentEndCancelled(). So the cancellation path is where most error context needs to flow.
Proposed Solution
1. Add errorContext to UnitResult (auto/types.ts)
```ts
export interface UnitResult {
  status: "completed" | "cancelled" | "error";
  event?: AgentEndEvent;
  errorContext?: {
    message: string;
    category: "provider" | "timeout" | "idle" | "network" | "aborted" | "session-failed" | "unknown";
    stopReason?: string;
    isTransient?: boolean;
    retryAfterMs?: number;
  };
}
```

A single optional object: either you have error context or you don't.
2. Extend resolve functions (auto/resolve.ts)
- `resolveAgentEndCancelled(errorContext?)`: accept optional context so callers can say why they cancelled
- `resolveAgentEnd(event)`: inspect `lastMsg.stopReason` / `lastMsg.errorMessage` and produce `status: "error"` with `errorContext` when appropriate (finally activating the dead `"error"` variant)
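A minimal sketch of what the two extended resolve functions might compute. The `AgentEndEvent` shape (a `messages` array whose last entry is `lastMsg`) and the helper names `cancelledResult` and `agentEndResult` are assumptions for illustration; the real functions presumably also settle a pending promise, which is omitted here.

```ts
type ErrorCategory =
  | "provider" | "timeout" | "idle" | "network"
  | "aborted" | "session-failed" | "unknown";

interface ErrorContext {
  message: string;
  category: ErrorCategory;
  stopReason?: string;
  isTransient?: boolean;
  retryAfterMs?: number;
}

// Assumed minimal event shape; the real AgentEndEvent may differ.
interface AgentEndEvent {
  messages: Array<{ stopReason?: string; errorMessage?: string }>;
}

interface UnitResult {
  status: "completed" | "cancelled" | "error";
  event?: AgentEndEvent;
  errorContext?: ErrorContext;
}

// Hypothetical pure core of resolveAgentEndCancelled(errorContext?):
// callers that know why they cancelled pass the reason along.
function cancelledResult(errorContext?: ErrorContext): UnitResult {
  return errorContext
    ? { status: "cancelled", errorContext }
    : { status: "cancelled" };
}

// Hypothetical pure core of resolveAgentEnd(event):
// promote an errored last message to status "error".
function agentEndResult(event: AgentEndEvent): UnitResult {
  const lastMsg = event.messages[event.messages.length - 1];
  if (lastMsg && (lastMsg.stopReason === "error" || lastMsg.errorMessage)) {
    return {
      status: "error",
      event,
      errorContext: {
        message: lastMsg.errorMessage ?? "Agent ended with stopReason=error",
        // Placeholder category; a real implementation could map the
        // classifyProviderError() result onto a category here.
        category: "unknown",
        stopReason: lastMsg.stopReason,
      },
    };
  }
  return { status: "completed", event };
}
```

Keeping the result construction separate from the promise plumbing would make both branches easy to unit test, though whether that split fits `auto/resolve.ts` is a judgment call.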
3. Wire context at call sites
| File | Call Site | Change |
|---|---|---|
| `auto-timers.ts:195` | idle watchdog | `resolveAgentEndCancelled({ message: "Idle watchdog", category: "idle" })` |
| `auto-timers.ts:229` | hard timeout | `resolveAgentEndCancelled({ message: "Hard timeout", category: "timeout" })` |
| `run-unit.ts:61` | session error | `return { status: "cancelled", errorContext: { message: msg, category: "session-failed" } }` |
| `run-unit.ts:67` | session timeout | `return { status: "cancelled", errorContext: { message: "Session creation timeout", category: "timeout" } }` |
| `auto.ts:796` | `pauseAuto` | No change: generic pause, too many callers |
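As a rough illustration of the two `run-unit.ts` rows, the sketch below races session creation against a timeout and returns a cancelled result that says why. The function name `createSessionOrCancel`, the `SessionFactory` type, and the `null` "keep going" return value are all hypothetical; the actual control flow in `run-unit.ts` is not shown in this issue.

```ts
// Hypothetical stand-in for whatever creates the session in run-unit.ts.
type SessionFactory = () => Promise<unknown>;

interface CancelledResult {
  status: "cancelled";
  errorContext: { message: string; category: "timeout" | "session-failed" };
}

// Unique sentinel so a session value can never be mistaken for a timeout.
const TIMED_OUT = Symbol("timed-out");

// Returns a cancelled result with context, or null if the session was created.
async function createSessionOrCancel(
  createSession: SessionFactory,
  timeoutMs: number,
): Promise<CancelledResult | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), timeoutMs);
  });
  try {
    const outcome = await Promise.race([createSession(), timeout]);
    if (outcome === TIMED_OUT) {
      return {
        status: "cancelled",
        errorContext: { message: "Session creation timeout", category: "timeout" },
      };
    }
    return null; // session exists; the caller continues with the unit
  } catch (err) {
    // Preserve the real failure message instead of discarding the Error.
    const msg = err instanceof Error ? err.message : String(err);
    return {
      status: "cancelled",
      errorContext: { message: msg, category: "session-failed" },
    };
  } finally {
    clearTimeout(timer); // avoid a dangling timer on the success/failure paths
  }
}
```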
4. Replace regex heuristics in journal emit (auto/phases.ts)
Replace the fragile message-content regex classification with direct errorContext field access:
```ts
if (unitResult.errorContext) {
  errorDetail = unitResult.errorContext.message;
  errorType = unitResult.errorContext.category;
} else if (unitResult.status === "cancelled") {
  errorDetail = `cancelled:${unitType}/${unitId}`;
  errorType = "unknown";
}
```

What we're NOT changing
- `auto.ts:pauseAuto`: generic cancellation path, too many callers to thread context through
- `agent-end-recovery.ts`: the recovery layer handles errors before they reach the resolve boundary
- `AgentEndEvent` type: stays minimal; error extraction happens in `resolveAgentEnd`
Files touched
- `src/resources/extensions/gsd/auto/types.ts`: `UnitResult.errorContext`
- `src/resources/extensions/gsd/auto/resolve.ts`: both resolve functions
- `src/resources/extensions/gsd/auto/run-unit.ts`: session error/timeout paths
- `src/resources/extensions/gsd/auto-timers.ts`: idle/hard timeout paths
- `src/resources/extensions/gsd/auto/phases.ts`: replace regex with `errorContext`
- `src/resources/extensions/gsd/tests/journal-integration.test.ts`: update + new tests
- `src/resources/extensions/gsd/tests/auto-loop.test.ts`: new tests
Verification
```sh
npx tsc --noEmit 2>&1 | grep -v "src/cli.ts"
npx tsx --test src/resources/extensions/gsd/tests/journal*.test.ts
npx tsx --test src/resources/extensions/gsd/tests/auto-loop*.test.ts
```
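The new journal tests could assert the step 4 mapping directly. The sketch below wraps that branch in a hypothetical pure helper, `classifyForJournal`, and checks it with `node:assert`; the helper name, the trimmed `UnitResult` shape, and the sample `unitType`/`unitId` values are assumptions, not code from the repository.

```ts
import assert from "node:assert/strict";

// Trimmed copy of the proposed UnitResult, limited to the fields used here.
interface UnitResult {
  status: "completed" | "cancelled" | "error";
  errorContext?: { message: string; category: string };
}

// Hypothetical helper mirroring the journal-emit branch from step 4.
function classifyForJournal(
  unitResult: UnitResult,
  unitType: string,
  unitId: string,
): { errorDetail?: string; errorType?: string } {
  if (unitResult.errorContext) {
    return {
      errorDetail: unitResult.errorContext.message,
      errorType: unitResult.errorContext.category,
    };
  }
  if (unitResult.status === "cancelled") {
    return { errorDetail: `cancelled:${unitType}/${unitId}`, errorType: "unknown" };
  }
  return {}; // completed without error context: nothing to record
}

// Structured context wins over any fallback.
assert.deepEqual(
  classifyForJournal(
    { status: "cancelled", errorContext: { message: "Idle watchdog", category: "idle" } },
    "task",
    "T01",
  ),
  { errorDetail: "Idle watchdog", errorType: "idle" },
);

// A bare cancellation still yields a deterministic detail string.
assert.deepEqual(
  classifyForJournal({ status: "cancelled" }, "task", "T01"),
  { errorDetail: "cancelled:task/T01", errorType: "unknown" },
);

// Completed units produce no error fields.
assert.deepEqual(classifyForJournal({ status: "completed" }, "task", "T01"), {});
```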