Problem
UnitResult has status: "completed" | "cancelled" | "error" but only "completed" and "cancelled" are ever set. The "error" variant is dead code — nothing ever produces it.
All error context (provider errors, timeouts, idle watchdog kills) is discarded at the resolve boundary:
resolveAgentEnd(event) always produces { status: "completed", event }
resolveAgentEndCancelled() always produces { status: "cancelled" } with zero context
The journal's error classification (added in the OTel improvements, #4) relies on regex-matching message content — fragile and unreliable across providers.
Error info that exists but gets thrown away
| Call Site | File | Available Context | What Gets Passed |
| --- | --- | --- | --- |
| agent_end handler | bootstrap/agent-end-recovery.ts:131 | lastMsg.errorMessage, lastMsg.stopReason, classifyProviderError() result | Just the raw event |
| Hard timeout | auto-timers.ts:229 | unitType, unitId, timeout duration | Nothing (resolveAgentEndCancelled()) |
| Idle watchdog | auto-timers.ts:195 | idle threshold, lastProgressAt, tool state | Nothing (resolveAgentEndCancelled()) |
| Session creation failure | auto/run-unit.ts:61 | The actual Error object | { status: "cancelled" } |
| Session timeout | auto/run-unit.ts:67 | Timeout constant | { status: "cancelled" } |
Key architectural constraint
When stopReason === "error", handleAgentEnd in agent-end-recovery.ts does recovery (network retries → model fallbacks → provider error pause) and never calls resolveAgentEnd. Error-path pauses go through pauseAuto → resolveAgentEndCancelled(). So the cancellation path is where most error context needs to flow.
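The constraint above can be illustrated with a toy model of the current flow. This is only a sketch: the event shape (`lastMsg` as a direct field), the callback standing in for the pending promise, and the helpers `startUnit` and `settle` are assumptions made for illustration, and the recovery steps are elided.

```typescript
// Toy model of the current flow. All shapes are simplified assumptions,
// not the real gsd types.
interface AgentEndEvent {
  lastMsg: { stopReason?: string; errorMessage?: string }; // hypothetical field
}

interface UnitResult {
  status: "completed" | "cancelled" | "error";
  event?: AgentEndEvent;
}

type OnResult = (result: UnitResult) => void;

// Callback for the unit currently in flight (stands in for the pending promise).
let pending: OnResult | null = null;

function startUnit(onResult: OnResult): void {
  pending = onResult;
}

function settle(result: UnitResult): void {
  const onResult = pending;
  pending = null; // settle at most once per unit
  if (onResult) onResult(result);
}

function resolveAgentEnd(event: AgentEndEvent): void {
  settle({ status: "completed", event });
}

function resolveAgentEndCancelled(): void {
  settle({ status: "cancelled" }); // no argument, so no reason can be recorded
}

// Generic pause: shared by many callers, knows nothing about the cause.
function pauseAuto(): void {
  resolveAgentEndCancelled();
}

function handleAgentEnd(event: AgentEndEvent): void {
  if (event.lastMsg.stopReason === "error") {
    // Retries and model fallbacks are elided; assume recovery was exhausted.
    pauseAuto();
    return;
  }
  resolveAgentEnd(event);
}

const seen: UnitResult[] = [];
startUnit((result) => { seen.push(result); });
handleAgentEnd({ lastMsg: { stopReason: "error", errorMessage: "provider overloaded" } });
console.log(seen[0]); // { status: 'cancelled' }
```

In this model the provider error message is available inside `handleAgentEnd` but never reaches the result, which is the gap the proposal targets.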
Proposed Solution
1. Add errorContext to UnitResult (auto/types.ts)
    export interface UnitResult {
      status: "completed" | "cancelled" | "error";
      event?: AgentEndEvent;
      errorContext?: {
        message: string;
        category: "provider" | "timeout" | "idle" | "network" | "aborted" | "session-failed" | "unknown";
        stopReason?: string;
        isTransient?: boolean;
        retryAfterMs?: number;
      };
    }
Single optional object — you have error context or you don't.
2. Extend resolve functions (auto/resolve.ts)
resolveAgentEndCancelled(errorContext?) — accept optional context so callers can say why they cancelled
resolveAgentEnd(event) — inspect lastMsg.stopReason/lastMsg.errorMessage and produce status: "error" with errorContext when appropriate (finally activating the dead "error" variant)
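A minimal sketch of what the two extended resolve functions might compute, written as pure builders so the logic is easy to test. The names `buildCancelledResult` and `buildAgentEndResult`, the `messages` array on the event, and the choice of `"unknown"` as the default category are assumptions; the real functions would hand the result to the pending unit promise, and could refine the category with `classifyProviderError()`.

```typescript
type ErrorCategory =
  | "provider" | "timeout" | "idle" | "network"
  | "aborted" | "session-failed" | "unknown";

interface ErrorContext {
  message: string;
  category: ErrorCategory;
  stopReason?: string;
  isTransient?: boolean;
  retryAfterMs?: number;
}

// Assumed event shape: the last entry of `messages` plays the role of lastMsg.
interface AgentEndEvent {
  messages: Array<{ stopReason?: string; errorMessage?: string }>;
}

interface UnitResult {
  status: "completed" | "cancelled" | "error";
  event?: AgentEndEvent;
  errorContext?: ErrorContext;
}

// Cancellation keeps its status but can now say why it happened.
function buildCancelledResult(errorContext?: ErrorContext): UnitResult {
  return errorContext
    ? { status: "cancelled", errorContext }
    : { status: "cancelled" };
}

// Agent end: promote to "error" when the last message reports one.
function buildAgentEndResult(agentEnd: AgentEndEvent): UnitResult {
  const lastMsg = agentEnd.messages[agentEnd.messages.length - 1];
  if (lastMsg?.stopReason === "error") {
    return {
      status: "error",
      event: agentEnd,
      errorContext: {
        message: lastMsg.errorMessage ?? "Agent ended with stopReason error",
        category: "unknown", // placeholder; a classifier could narrow this
        stopReason: lastMsg.stopReason,
      },
    };
  }
  return { status: "completed", event: agentEnd };
}
```

Keeping `errorContext` absent rather than set to an empty object preserves the "you have error context or you don't" property, so consumers only need a single truthiness check.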
3. Wire context at call sites
| File | Call Site | Change |
| --- | --- | --- |
| auto-timers.ts:195 | idle watchdog | resolveAgentEndCancelled({ message: "Idle watchdog", category: "idle" }) |
| auto-timers.ts:229 | hard timeout | resolveAgentEndCancelled({ message: "Hard timeout", category: "timeout" }) |
| run-unit.ts:61 | session error | return { status: "cancelled", errorContext: { message: msg, category: "session-failed" } } |
| run-unit.ts:67 | session timeout | return { status: "cancelled", errorContext: { message: "Session creation timeout", category: "timeout" } } |
| auto.ts:796 | pauseAuto | No change: generic pause, too many callers |
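As a rough sketch of the idle watchdog row, the call site could build its context from the values it already has and pass it through a cancel callback. The helpers `idleContext` and `startIdleWatchdog`, their parameters, and the enriched message text are hypothetical; the table above only commits to the message "Idle watchdog" and the category "idle".

```typescript
interface ErrorContext {
  message: string;
  category: "provider" | "timeout" | "idle" | "network" | "aborted" | "session-failed" | "unknown";
}

// Stand-in for resolveAgentEndCancelled(errorContext?).
type Cancel = (errorContext?: ErrorContext) => void;

// Pure check: returns a context once the unit has been idle for too long.
function idleContext(
  lastProgressAt: number,
  now: number,
  idleThresholdMs: number,
): ErrorContext | undefined {
  const idleMs = now - lastProgressAt;
  if (idleMs < idleThresholdMs) return undefined;
  return {
    message: `Idle watchdog: no progress for ${idleMs}ms (threshold ${idleThresholdMs}ms)`,
    category: "idle",
  };
}

// Polls for idleness and cancels once, with the reason attached.
// Returns a function that stops the watchdog.
function startIdleWatchdog(
  getLastProgressAt: () => number,
  idleThresholdMs: number,
  checkIntervalMs: number,
  cancel: Cancel,
): () => void {
  const timer = setInterval(() => {
    const ctx = idleContext(getLastProgressAt(), Date.now(), idleThresholdMs);
    if (ctx) {
      clearInterval(timer); // fire at most once
      cancel(ctx);
    }
  }, checkIntervalMs);
  return () => clearInterval(timer);
}
```

The hard timeout path would follow the same pattern with category "timeout". Separating the pure check from the timer keeps the context-building logic testable without fake clocks.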
4. Replace regex heuristics in journal emit (auto/phases.ts)
Replace the fragile message-content regex classification with direct errorContext field access:
    if (unitResult.errorContext) {
      errorDetail = unitResult.errorContext.message;
      errorType = unitResult.errorContext.category;
    } else if (unitResult.status === "cancelled") {
      errorDetail = `cancelled:${unitType}/${unitId}`;
      errorType = "unknown";
    }
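The branch above could be factored into a small pure helper, which also makes the three outcomes easy to cover in the new tests. This is a sketch only: `journalErrorFields`, the `JournalErrorFields` shape, and the sample unit type and id are made up for illustration.

```typescript
interface UnitResult {
  status: "completed" | "cancelled" | "error";
  errorContext?: { message: string; category: string };
}

interface JournalErrorFields {
  errorDetail?: string;
  errorType?: string;
}

// Maps a unit result to the error fields written to the journal.
function journalErrorFields(
  unitResult: UnitResult,
  unitType: string,
  unitId: string,
): JournalErrorFields {
  if (unitResult.errorContext) {
    // Structured context wins, whatever the status is.
    return {
      errorDetail: unitResult.errorContext.message,
      errorType: unitResult.errorContext.category,
    };
  }
  if (unitResult.status === "cancelled") {
    // Cancelled without a reason, e.g. via the generic pauseAuto path.
    return { errorDetail: `cancelled:${unitType}/${unitId}`, errorType: "unknown" };
  }
  return {}; // completed without error context: nothing to record
}

const fields = journalErrorFields(
  { status: "cancelled", errorContext: { message: "Hard timeout", category: "timeout" } },
  "example-type",
  "example-id",
);
console.log(fields.errorType); // timeout
```

Because `pauseAuto` is left unchanged, the second branch remains reachable and still reports "unknown" for cancellations that arrive without context.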
What we're NOT changing
auto.ts:pauseAuto — generic cancellation path, too many callers to thread context through
agent-end-recovery.ts — the recovery layer handles errors before they reach the resolve boundary
AgentEndEvent type — stays minimal; error extraction happens in resolveAgentEnd
Files touched
src/resources/extensions/gsd/auto/types.ts — UnitResult.errorContext
src/resources/extensions/gsd/auto/resolve.ts — both resolve functions
src/resources/extensions/gsd/auto/run-unit.ts — session error/timeout paths
src/resources/extensions/gsd/auto-timers.ts — idle/hard timeout paths
src/resources/extensions/gsd/auto/phases.ts — replace regex with errorContext
src/resources/extensions/gsd/tests/journal-integration.test.ts — update + new tests
src/resources/extensions/gsd/tests/auto-loop.test.ts — new tests
Verification
npx tsc --noEmit 2>&1 | grep -v "src/cli.ts"
npx tsx --test src/resources/extensions/gsd/tests/journal*.test.ts
npx tsx --test src/resources/extensions/gsd/tests/auto-loop*.test.ts