What happened
On PR #2052, the review agent was dispatched (run 27188931660). Claude Code CLI exited immediately (81ms) with authentication_failed and apiKeySource: "none" — zero tokens consumed. The harness treated this as a normal agent failure and retried (iteration 2), which failed identically (58ms, same auth error). After 2 failed iterations, the workflow exited with validation failed after 2 iteration(s). Total wall time wasted on sandbox setup and teardown across both iterations was ~2 minutes, plus the cognitive cost of a human seeing a 'Failure' status on the PR.
What could go better
Authentication failures are infrastructure issues (credential misconfiguration, OIDC token not forwarded into sandbox, Vertex AI setup problems). They will never self-resolve by retrying within the same workflow run. The harness currently treats all agent failures uniformly and retries up to the configured iteration limit. This wastes runner time and produces misleading failure statuses. I am confident this is a systemic issue — any transient credential delivery failure would trigger the same pointless retry. Issue #1032 proposes a pre-flight auth check which would help catch some cases, but even with pre-flight checks, auth can fail at runtime (e.g., token expiry mid-run per #750). A complementary post-failure classification is needed.
Proposed change
In the harness retry loop (the iteration logic that re-runs the agent on validation failure), add exit-reason classification before retrying. When the agent's JSONL transcript or exit output contains authentication_failed or apiKeySource: "none", classify it as a non-retryable infrastructure failure and exit immediately with a clear error message (e.g., Error: agent authentication failed — not retrying. Check credential configuration.). This should be implemented in the harness runner code that handles iteration/retry decisions. The classification could be extended over time to cover other non-retryable failures (e.g., sandbox creation failures, network policy blocks).
Validation criteria
On the next occurrence of an authentication failure in any agent run, the harness should: (1) not retry the agent, (2) exit within seconds of the first failure, and (3) include 'authentication' or 'credential' in the error message posted to the PR. Verify by searching workflow logs for runs where both iterations failed with identical auth errors — these should no longer occur.
Generated by retro agent from #2052
What happened
On PR #2052, the review agent was dispatched (run 27188931660). Claude Code CLI exited immediately (81ms) with
authentication_failedandapiKeySource: "none"— zero tokens consumed. The harness treated this as a normal agent failure and retried (iteration 2), which failed identically (58ms, same auth error). After 2 failed iterations, the workflow exited withvalidation failed after 2 iteration(s). Total wall time wasted on sandbox setup and teardown across both iterations was ~2 minutes, plus the cognitive cost of a human seeing a 'Failure' status on the PR.What could go better
Authentication failures are infrastructure issues (credential misconfiguration, OIDC token not forwarded into sandbox, Vertex AI setup problems). They will never self-resolve by retrying within the same workflow run. The harness currently treats all agent failures uniformly and retries up to the configured iteration limit. This wastes runner time and produces misleading failure statuses. I am confident this is a systemic issue — any transient credential delivery failure would trigger the same pointless retry. Issue #1032 proposes a pre-flight auth check which would help catch some cases, but even with pre-flight checks, auth can fail at runtime (e.g., token expiry mid-run per #750). A complementary post-failure classification is needed.
Proposed change
In the harness retry loop (the iteration logic that re-runs the agent on validation failure), add exit-reason classification before retrying. When the agent's JSONL transcript or exit output contains
authentication_failedorapiKeySource: "none", classify it as a non-retryable infrastructure failure and exit immediately with a clear error message (e.g.,Error: agent authentication failed — not retrying. Check credential configuration.). This should be implemented in the harness runner code that handles iteration/retry decisions. The classification could be extended over time to cover other non-retryable failures (e.g., sandbox creation failures, network policy blocks).Validation criteria
On the next occurrence of an authentication failure in any agent run, the harness should: (1) not retry the agent, (2) exit within seconds of the first failure, and (3) include 'authentication' or 'credential' in the error message posted to the PR. Verify by searching workflow logs for runs where both iterations failed with identical auth errors — these should no longer occur.
Generated by retro agent from #2052