Wire structured error propagation through UnitResult #5

@igouss

Problem

UnitResult has status: "completed" | "cancelled" | "error" but only "completed" and "cancelled" are ever set. The "error" variant is dead code — nothing ever produces it.

All error context (provider errors, timeouts, idle watchdog kills) is discarded at the resolve boundary:

  • resolveAgentEnd(event) always produces { status: "completed", event }
  • resolveAgentEndCancelled() always produces { status: "cancelled" } with zero context

The journal's error classification (added in the OTel improvements, #4) relies on regex-matching message content — fragile and unreliable across providers.

Error info that exists but gets thrown away

| Call Site | File | Available Context | What Gets Passed |
|---|---|---|---|
| agent_end handler | bootstrap/agent-end-recovery.ts:131 | lastMsg.errorMessage, lastMsg.stopReason, classifyProviderError() result | Just the raw event |
| Hard timeout | auto-timers.ts:229 | unitType, unitId, timeout duration | Nothing (resolveAgentEndCancelled()) |
| Idle watchdog | auto-timers.ts:195 | idle threshold, lastProgressAt, tool state | Nothing (resolveAgentEndCancelled()) |
| Session creation failure | auto/run-unit.ts:61 | The actual Error object | { status: "cancelled" } |
| Session timeout | auto/run-unit.ts:67 | Timeout constant | { status: "cancelled" } |

Key architectural constraint

When stopReason === "error", handleAgentEnd in agent-end-recovery.ts runs recovery (network retries → model fallbacks → provider error pause) and never calls resolveAgentEnd. Error-path pauses go through pauseAuto → resolveAgentEndCancelled(). So the cancellation path is where most error context needs to flow.


Proposed Solution

1. Add errorContext to UnitResult (auto/types.ts)

export interface UnitResult {
  status: "completed" | "cancelled" | "error";
  event?: AgentEndEvent;
  errorContext?: {
    message: string;
    category: "provider" | "timeout" | "idle" | "network" | "aborted" | "session-failed" | "unknown";
    stopReason?: string;
    isTransient?: boolean;
    retryAfterMs?: number;
  };
}

Single optional object — you have error context or you don't.
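
To illustrate how a consumer might read the optional object, here is a minimal sketch. It repeats the proposed interface and adds a helper, `retryDelayMs`, that is purely hypothetical and not part of this proposal. `AgentEndEvent` is stubbed as a placeholder, and the default delay is an arbitrary assumption.

```typescript
// Placeholder for the real AgentEndEvent type defined elsewhere in the codebase.
type AgentEndEvent = { messages?: unknown[] };

type ErrorCategory =
  | "provider" | "timeout" | "idle" | "network"
  | "aborted" | "session-failed" | "unknown";

interface ErrorContext {
  message: string;
  category: ErrorCategory;
  stopReason?: string;
  isTransient?: boolean;
  retryAfterMs?: number;
}

interface UnitResult {
  status: "completed" | "cancelled" | "error";
  event?: AgentEndEvent;
  errorContext?: ErrorContext;
}

// Hypothetical consumer: returns a retry delay for transient failures,
// or null when the context does not suggest a retry.
function retryDelayMs(result: UnitResult, defaultDelayMs = 1000): number | null {
  const ctx = result.errorContext;
  if (!ctx || !ctx.isTransient) return null;
  return ctx.retryAfterMs ?? defaultDelayMs;
}

const rateLimited: UnitResult = {
  status: "error",
  errorContext: {
    message: "rate limited",
    category: "provider",
    isTransient: true,
    retryAfterMs: 5000,
  },
};

const idleKill: UnitResult = {
  status: "cancelled",
  errorContext: { message: "Idle watchdog", category: "idle" },
};
```

Because the whole context is one optional field, a single presence check covers every caller; there is no need to test several loosely related optional fields separately.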

2. Extend resolve functions (auto/resolve.ts)

  • resolveAgentEndCancelled(errorContext?) — accept optional context so callers can say why they cancelled
  • resolveAgentEnd(event) — inspect lastMsg.stopReason/lastMsg.errorMessage and produce status: "error" with errorContext when appropriate (finally activating the dead "error" variant)
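
The two changes above could look roughly like the sketch below. It models the result-building part only, as pure functions with hypothetical names (`buildCancelledResult`, `buildAgentEndResult`); how the real resolve functions hand the result to the waiting unit is not shown. The event shape (a `messages` array whose last entry carries `stopReason` and `errorMessage`) and the choice of `"unknown"` as the category are assumptions.

```typescript
type ErrorCategory =
  | "provider" | "timeout" | "idle" | "network"
  | "aborted" | "session-failed" | "unknown";

interface ErrorContext {
  message: string;
  category: ErrorCategory;
  stopReason?: string;
}

// Assumed shapes, for illustration only.
interface AgentMessage { stopReason?: string; errorMessage?: string }
interface AgentEndEvent { messages: AgentMessage[] }

interface UnitResult {
  status: "completed" | "cancelled" | "error";
  event?: AgentEndEvent;
  errorContext?: ErrorContext;
}

// Cancellation: callers may say why they cancelled; omitting it keeps today's shape.
function buildCancelledResult(errorContext?: ErrorContext): UnitResult {
  return errorContext ? { status: "cancelled", errorContext } : { status: "cancelled" };
}

// Agent end: inspect the last message and produce "error" when it reports one.
function buildAgentEndResult(event: AgentEndEvent): UnitResult {
  const lastMsg = event.messages[event.messages.length - 1];
  if (lastMsg?.stopReason === "error") {
    return {
      status: "error",
      event,
      errorContext: {
        message: lastMsg.errorMessage ?? 'Agent ended with stopReason "error"',
        category: "unknown", // assumption: a provider classifier could refine this
        stopReason: lastMsg.stopReason,
      },
    };
  }
  return { status: "completed", event };
}
```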

3. Wire context at call sites

| File | Call Site | Change |
|---|---|---|
| auto-timers.ts:195 | idle watchdog | resolveAgentEndCancelled({ message: "Idle watchdog", category: "idle" }) |
| auto-timers.ts:229 | hard timeout | resolveAgentEndCancelled({ message: "Hard timeout", category: "timeout" }) |
| run-unit.ts:61 | session error | return { status: "cancelled", errorContext: { message: msg, category: "session-failed" } } |
| run-unit.ts:67 | session timeout | return { status: "cancelled", errorContext: { message: "Session creation timeout", category: "timeout" } } |
| auto.ts:796 | pauseAuto | No change (generic pause, too many callers) |
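
The call sites already have more detail on hand than the fixed strings above (idle threshold, lastProgressAt, unitType, unitId, the Error object). As an optional extension, small builder functions could fold that detail into the message. The sketch below is an assumption about how that might look; the function names and message formats are hypothetical and not part of the proposal.

```typescript
type ErrorCategory =
  | "provider" | "timeout" | "idle" | "network"
  | "aborted" | "session-failed" | "unknown";

interface ErrorContext { message: string; category: ErrorCategory }

// Idle watchdog: the threshold and last progress timestamp are known at the call site.
function idleContext(
  idleThresholdMs: number,
  lastProgressAt: number,
  now: number = Date.now(),
): ErrorContext {
  const idleMs = now - lastProgressAt;
  return {
    message: `Idle watchdog: no progress for ${idleMs}ms (threshold ${idleThresholdMs}ms)`,
    category: "idle",
  };
}

// Hard timeout: the unit identity and timeout duration are known at the call site.
function hardTimeoutContext(unitType: string, unitId: string, timeoutMs: number): ErrorContext {
  return {
    message: `Hard timeout: ${unitType}/${unitId} exceeded ${timeoutMs}ms`,
    category: "timeout",
  };
}

// Session creation failure: preserve the message of whatever was thrown.
function sessionFailedContext(err: unknown): ErrorContext {
  const message = err instanceof Error ? err.message : String(err);
  return { message, category: "session-failed" };
}
```

A timer callback would then pass the result straight through, for example `resolveAgentEndCancelled(hardTimeoutContext(unitType, unitId, timeoutMs))`.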

4. Replace regex heuristics in journal emit (auto/phases.ts)

Replace the fragile message-content regex classification with direct errorContext field access:

if (unitResult.errorContext) {
  errorDetail = unitResult.errorContext.message;
  errorType = unitResult.errorContext.category;
} else if (unitResult.status === "cancelled") {
  errorDetail = `cancelled:${unitType}/${unitId}`;
  errorType = "unknown";
}
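
For unit testing, the same branching can be pulled into a standalone function. This is a sketch only: `journalErrorFields` is a hypothetical name, the types are trimmed to the fields the branch reads, and the sample unit type and ID in the usage lines are made up.

```typescript
type ErrorCategory =
  | "provider" | "timeout" | "idle" | "network"
  | "aborted" | "session-failed" | "unknown";

interface UnitResult {
  status: "completed" | "cancelled" | "error";
  errorContext?: { message: string; category: ErrorCategory };
}

interface JournalErrorFields {
  errorDetail?: string;
  errorType?: ErrorCategory;
}

// Structured context wins; a bare cancellation falls back to a generic label;
// anything else leaves both fields unset.
function journalErrorFields(
  unitResult: UnitResult,
  unitType: string,
  unitId: string,
): JournalErrorFields {
  if (unitResult.errorContext) {
    return {
      errorDetail: unitResult.errorContext.message,
      errorType: unitResult.errorContext.category,
    };
  }
  if (unitResult.status === "cancelled") {
    return { errorDetail: `cancelled:${unitType}/${unitId}`, errorType: "unknown" };
  }
  return {};
}

// Usage with made-up unit identifiers.
const withContext = journalErrorFields(
  { status: "cancelled", errorContext: { message: "Hard timeout", category: "timeout" } },
  "task",
  "T01",
);
const bareCancel = journalErrorFields({ status: "cancelled" }, "task", "T01");
```

No message text is parsed anywhere, so the classification no longer depends on how a given provider words its errors.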

What we're NOT changing

  • auto.ts:pauseAuto — generic cancellation path, too many callers to thread context through
  • agent-end-recovery.ts — the recovery layer handles errors before they reach the resolve boundary
  • AgentEndEvent type — stays minimal; error extraction happens in resolveAgentEnd

Files touched

  1. src/resources/extensions/gsd/auto/types.ts (UnitResult.errorContext)
  2. src/resources/extensions/gsd/auto/resolve.ts — both resolve functions
  3. src/resources/extensions/gsd/auto/run-unit.ts — session error/timeout paths
  4. src/resources/extensions/gsd/auto-timers.ts — idle/hard timeout paths
  5. src/resources/extensions/gsd/auto/phases.ts — replace regex with errorContext
  6. src/resources/extensions/gsd/tests/journal-integration.test.ts — update + new tests
  7. src/resources/extensions/gsd/tests/auto-loop.test.ts — new tests

Verification

npx tsc --noEmit 2>&1 | grep -v "src/cli.ts"
npx tsx --test src/resources/extensions/gsd/tests/journal*.test.ts
npx tsx --test src/resources/extensions/gsd/tests/auto-loop*.test.ts
