Skip to content

Harness should fail fast on authentication errors instead of retrying #2058

Description

@fullsend-ai-retro

What happened

On PR #2052, the review agent was dispatched (run 27188931660). Claude Code CLI exited immediately (81ms) with authentication_failed and apiKeySource: "none" — zero tokens consumed. The harness treated this as a normal agent failure and retried (iteration 2), which failed identically (58ms, same auth error). After 2 failed iterations, the workflow exited with validation failed after 2 iteration(s). Total wall time wasted on sandbox setup and teardown across both iterations was ~2 minutes, plus the cognitive cost of a human seeing a 'Failure' status on the PR.

What could go better

Authentication failures are infrastructure issues (credential misconfiguration, OIDC token not forwarded into sandbox, Vertex AI setup problems). They will never self-resolve by retrying within the same workflow run. The harness currently treats all agent failures uniformly and retries up to the configured iteration limit. This wastes runner time and produces misleading failure statuses. I am confident this is a systemic issue — any transient credential delivery failure would trigger the same pointless retry. Issue #1032 proposes a pre-flight auth check which would help catch some cases, but even with pre-flight checks, auth can fail at runtime (e.g., token expiry mid-run per #750). A complementary post-failure classification is needed.

Proposed change

In the harness retry loop (the iteration logic that re-runs the agent on validation failure), add exit-reason classification before retrying. When the agent's JSONL transcript or exit output contains authentication_failed or apiKeySource: "none", classify it as a non-retryable infrastructure failure and exit immediately with a clear error message (e.g., Error: agent authentication failed — not retrying. Check credential configuration.). This should be implemented in the harness runner code that handles iteration/retry decisions. The classification could be extended over time to cover other non-retryable failures (e.g., sandbox creation failures, network policy blocks).

Validation criteria

On the next occurrence of an authentication failure in any agent run, the harness should: (1) not retry the agent, (2) exit within seconds of the first failure, and (3) include 'authentication' or 'credential' in the error message posted to the PR. Verify by searching workflow logs for runs where both iterations failed with identical auth errors — these should no longer occur.


Generated by retro agent from #2052

Metadata

Metadata

Assignees

No one assigned

    Labels

    component/harnessAgent harness, config, and skills loadingcomponent/runnerAgent runner behavior and lifecyclefeatureFeature-category issue awaiting human prioritizationtriagedTriaged but awaiting human prioritization

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    Status
    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions