Keep GCP token fresh and raise retry budget for OpenShell Vertex runs#156
Keep GCP token fresh and raise retry budget for OpenShell Vertex runs#156EmilienM wants to merge 2 commits into
Conversation
The OpenShell gateway mints the sandbox's Vertex access token (3600s lifetime) and its refresh worker is meant to rotate it ahead of expiry. Around the hourly boundary it can let the token lapse: a transient mint failure is only retried every 60s while the old token keeps aging. This surfaces as a burst of "unknown" API retries that exhausts the agent's retry budget and kills the run mid-way (see NVIDIA/OpenShell PR #1763). Force-rotate the gateway credential every 20 min from the host while the agent runs, phase-offset by 10 min so rotations land at 10/30/50/70 min and never coincide with the hourly token-expiry boundary. Gated to the OpenShell backend + Vertex auth path (no-op for podman/local and API-key auth). Extracted rotate_token() from provider.py so both the initial setup rotation and the keepalive thread share the same codepath. Signed-off-by: Emilien Macchi <emacchi@redhat.com> Co-Authored-By: Claude <noreply@anthropic.com>
The ~hourly Vertex token-rotation lapse produces a burst of retryable "unknown" 401s. On stock 60-min token intervals a long lapse window can exhaust all 10 default retries and kill the run. The 20-min token keepalive shortens those windows, and this widens the retry budget to 20 so even an unlucky long lapse recovers instead of exhausting. Belt-and-suspenders with the keepalive. Set CLAUDE_CODE_MAX_RETRIES=20 in the sandbox env script, only for Vertex auth (API-key auth is unaffected). Overridable via the CLAUDE_CODE_MAX_RETRIES env var on the host. Signed-off-by: Emilien Macchi <emacchi@redhat.com> Co-Authored-By: Claude <noreply@anthropic.com>
📝 WalkthroughWalkthrough
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 maintainer-images-build-onlyWaiting for
This rule is failing.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agentic_ci/backends/openshell/provider.py (1)
184-205:⚠️ Potential issue | 🟠 MajorAdd timeout to subprocess call in
rotate_token().
rotate_token()calls_run(..., check=True)without a timeout parameter. Since this function is invoked from the_token_keepalive()daemon thread and the main cleanup path useskeepalive.join(timeout=5), an indefinite subprocess block will leave the thread hanging and prevent graceful shutdown. Addtimeout=15(or appropriate value) to the_run()call, as shown in theprovider_exists()function pattern at line 60.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agentic_ci/backends/openshell/provider.py` around lines 184 - 205, The `rotate_token()` function calls `_run()` without a timeout parameter, which can cause the subprocess to hang indefinitely since this function is invoked from the `_token_keepalive()` daemon thread and the main cleanup path uses `keepalive.join(timeout=5)`, preventing graceful shutdown. Add a `timeout` parameter (such as `timeout=15`) to the `_run()` call in `rotate_token()`, following the same pattern used in the `provider_exists()` function for consistent subprocess handling across the module.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/agentic_ci/backends/openshell/__init__.py`:
- Around line 215-218: The CLAUDE_CODE_MAX_RETRIES environment variable is being
exported without validation, which can cause issues if it contains non-integer,
negative, or excessively large values that break retry behavior or extend run
time/cost. When auth_mode is "vertex", parse the retrieved max_retries value to
validate it is a valid non-negative integer within acceptable bounds before
exporting it. If the value fails validation (cannot be parsed as an integer, is
negative, or exceeds a reasonable maximum), fall back to using
_DEFAULT_MAX_RETRIES instead of exporting invalid data.
---
Outside diff comments:
In `@src/agentic_ci/backends/openshell/provider.py`:
- Around line 184-205: The `rotate_token()` function calls `_run()` without a
timeout parameter, which can cause the subprocess to hang indefinitely since
this function is invoked from the `_token_keepalive()` daemon thread and the
main cleanup path uses `keepalive.join(timeout=5)`, preventing graceful
shutdown. Add a `timeout` parameter (such as `timeout=15`) to the `_run()` call
in `rotate_token()`, following the same pattern used in the `provider_exists()`
function for consistent subprocess handling across the module.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: bb8f44ca-db2b-4f34-8646-b8debaef6962
📒 Files selected for processing (4)
src/agentic_ci/backends/openshell/__init__.pysrc/agentic_ci/backends/openshell/provider.pytests/test_backend.pytests/test_openshell_provider.py
| if self.harness.auth_mode == "vertex": | ||
| max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES) | ||
| lines.append(f"export CLAUDE_CODE_MAX_RETRIES={shlex.quote(max_retries)}") | ||
|
|
There was a problem hiding this comment.
Validate and bound CLAUDE_CODE_MAX_RETRIES before exporting it.
This path exports arbitrary env text. Non-integer/negative values can break retry behavior, and very large values can unintentionally extend run time/cost.
Proposed fix
if self.harness.auth_mode == "vertex":
- max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES)
- lines.append(f"export CLAUDE_CODE_MAX_RETRIES={shlex.quote(max_retries)}")
+ raw_max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES)
+ try:
+ max_retries = int(raw_max_retries)
+ except ValueError:
+ max_retries = int(_DEFAULT_MAX_RETRIES)
+ max_retries = max(1, min(max_retries, 50))
+ lines.append(f"export CLAUDE_CODE_MAX_RETRIES={max_retries}")As per coding guidelines, "**: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps 4. Performance problems."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agentic_ci/backends/openshell/__init__.py` around lines 215 - 218, The
CLAUDE_CODE_MAX_RETRIES environment variable is being exported without
validation, which can cause issues if it contains non-integer, negative, or
excessively large values that break retry behavior or extend run time/cost. When
auth_mode is "vertex", parse the retrieved max_retries value to validate it is a
valid non-negative integer within acceptable bounds before exporting it. If the
value fails validation (cannot be parsed as an integer, is negative, or exceeds
a reasonable maximum), fall back to using _DEFAULT_MAX_RETRIES instead of
exporting invalid data.
Source: Coding guidelines
Summary
The OpenShell gateway's token refresh worker has a known race condition (NVIDIA/OpenShell PR #1763) where it can let the Vertex AI access token lapse around the hourly expiry boundary. A transient mint failure is retried only every 60s while the old token ages, producing a burst of 401s that exhausts claude-code's retry budget and kills the run mid-way. This was observed in the rfe-assessor runner (runs dying at ~40% completion on the daily job).
Both fixes are generic to any OpenShell + Vertex runner and belong in the shared framework rather than individual runner repos.
Commit 1: Token keepalive thread
rotate_token()fromprovider.pyso both the initial setup rotation and the keepalive thread share the same codepath.OpenShellBackend.run()that force-rotates the gateway credential every 20 minutes from the host while the agent runs.Commit 2: Retry budget bump
CLAUDE_CODE_MAX_RETRIES=20(default is 10) in the sandbox env script for Vertex auth runs.CLAUDE_CODE_MAX_RETRIESenv var on the host.Credit: Both patterns discovered and validated by Jason Greene in the rfe-assessor runner (commits
2fbedef,c2cf2c4,ad11a32).Test plan
tox -e py313— 580 tests pass (7 new tests added)tox -e lint— cleantox -e check-format— cleantox -e typecheck— clean🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Tests