Keep GCP token fresh and raise retry budget for OpenShell Vertex runs #156

Open
EmilienM wants to merge 2 commits into main from openshell-token-keepalive

EmilienM (Contributor) commented Jun 21, 2026

Summary

The OpenShell gateway's token refresh worker has a known race condition (NVIDIA/OpenShell PR #1763) where it can let the Vertex AI access token lapse around the hourly expiry boundary. A transient mint failure is retried only every 60s while the old token ages, producing a burst of 401s that exhausts claude-code's retry budget and kills the run mid-way. This was observed in the rfe-assessor runner (runs dying at ~40% completion on the daily job).

Both fixes are generic to any OpenShell + Vertex runner and belong in the shared framework rather than individual runner repos.

Commit 1: Token keepalive thread

  • Extracts rotate_token() from provider.py so both the initial setup rotation and the keepalive thread share the same codepath.
  • Adds a background keepalive thread in OpenShellBackend.run() that force-rotates the gateway credential every 20 minutes from the host while the agent runs.
  • Phase-offsets the first rotation by 10 minutes so the 20-min cadence lands at 10/30/50/70 min and never coincides with the ~hourly token-expiry boundary that the gateway refresh worker and the agent's client token cache already act on.
  • Gated to the OpenShell backend + Vertex auth path only (no-op for podman/local and API-key auth).
  • Thread is properly joined with a timeout in a finally block.
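A minimal sketch of the keepalive pattern described in the list above, not the code from this PR. The names `token_keepalive` and `run_with_keepalive` and the `rotate` callable are hypothetical stand-ins for `_token_keepalive()`, `OpenShellBackend.run()` and `rotate_token()`; the real signatures may differ.

```python
import subprocess
import threading
from typing import Callable

FIRST_DELAY_S = 10 * 60  # phase offset before the first rotation
INTERVAL_S = 20 * 60     # cadence between later rotations


def token_keepalive(
    stop: threading.Event,
    rotate: Callable[[], None],
    first_delay: float = FIRST_DELAY_S,
    interval: float = INTERVAL_S,
) -> None:
    """Call rotate() after first_delay, then every interval, until stop is set."""
    delay = first_delay
    # Event.wait() returns True as soon as stop is set, so shutdown is prompt.
    while not stop.wait(delay):
        try:
            rotate()
        except subprocess.CalledProcessError as exc:
            # A failed rotation must not end the loop; log only the exit code.
            print(f"token rotation failed (exit code {exc.returncode})")
        delay = interval


def run_with_keepalive(
    agent: Callable[[], int], rotate: Callable[[], None], **timing: float
) -> int:
    """Run agent() while a daemon thread keeps rotating the token."""
    stop = threading.Event()
    thread = threading.Thread(
        target=token_keepalive, args=(stop, rotate), kwargs=timing, daemon=True
    )
    thread.start()
    try:
        return agent()
    finally:
        stop.set()
        thread.join(timeout=5)  # bounded wait, as in the PR description
```

Waiting on the event rather than sleeping means the `finally` block does not have to wait out a full 20-minute interval before the thread notices the stop request.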

Commit 2: Retry budget bump

  • Sets CLAUDE_CODE_MAX_RETRIES=20 (default is 10) in the sandbox env script for Vertex auth runs.
  • Belt-and-suspenders with the keepalive: even if an unlucky long token lapse occurs, the wider retry budget lets the agent recover instead of exhausting its retries.
  • Overridable via the CLAUDE_CODE_MAX_RETRIES env var on the host.
  • Only injected for Vertex auth (API-key auth is unaffected).

Credit: Both patterns discovered and validated by Jason Greene in the rfe-assessor runner (commits 2fbedef, c2cf2c4, ad11a32).

Test plan

  • tox -e py313 — 580 tests pass (7 new tests added)
  • tox -e lint — clean
  • tox -e check-format — clean
  • tox -e typecheck — clean
  • Validate in a real OpenShell + Vertex run (long-running agent should no longer die from token lapse)

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Implemented automatic token refresh for Vertex authentication to maintain valid credentials and prevent session interruptions during long-running operations
    • Increased the API retry budget for Vertex-authenticated runs to improve reliability during token rotation scenarios
  • Tests

    • Added test coverage for token keepalive and rotation mechanisms, including error handling

EmilienM and others added 2 commits June 21, 2026 19:49
The OpenShell gateway mints the sandbox's Vertex access token (3600s
lifetime) and its refresh worker is meant to rotate it ahead of expiry.
Around the hourly boundary it can let the token lapse: a transient mint
failure is only retried every 60s while the old token keeps aging. This
surfaces as a burst of "unknown" API retries that exhausts the agent's
retry budget and kills the run mid-way (see NVIDIA/OpenShell PR #1763).

Force-rotate the gateway credential every 20 min from the host while
the agent runs, phase-offset by 10 min so rotations land at
10/30/50/70 min and never coincide with the hourly token-expiry
boundary. Gated to the OpenShell backend + Vertex auth path (no-op for
podman/local and API-key auth).

Extracted rotate_token() from provider.py so both the initial setup
rotation and the keepalive thread share the same codepath.
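
The 10/30/50/70 schedule in the commit message above can be checked with a little arithmetic. This sketch assumes the expiry boundary falls on multiples of 60 minutes from the start of the run, which the message implies but does not state; the helper names are hypothetical.

```python
def rotation_minutes(first: int, interval: int, horizon: int) -> list[int]:
    """Minutes from run start at which a rotation fires, up to horizon inclusive."""
    return list(range(first, horizon + 1, interval))


def gap_to_hour_boundary(minute: int) -> int:
    """Distance in minutes from `minute` to the nearest multiple of 60."""
    offset = minute % 60
    return min(offset, 60 - offset)


offset_schedule = rotation_minutes(first=10, interval=20, horizon=120)
print(offset_schedule)  # [10, 30, 50, 70, 90, 110]
print(min(gap_to_hour_boundary(m) for m in offset_schedule))  # 10

# Without the 10-minute phase offset, every third rotation lands on the boundary.
plain_schedule = rotation_minutes(first=20, interval=20, horizon=120)
print(min(gap_to_hour_boundary(m) for m in plain_schedule))  # 0
```

Under that assumption, the offset keeps every rotation at least 10 minutes away from the boundary.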

Signed-off-by: Emilien Macchi <emacchi@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>

The ~hourly Vertex token-rotation lapse produces a burst of retryable
"unknown" 401s. On stock 60-min token intervals a long lapse window
can exhaust all 10 default retries and kill the run. The 20-min token
keepalive shortens those windows, and this widens the retry budget to
20 so even an unlucky long lapse recovers instead of exhausting.
Belt-and-suspenders with the keepalive.

Set CLAUDE_CODE_MAX_RETRIES=20 in the sandbox env script, only for
Vertex auth (API-key auth is unaffected). Overridable via the
CLAUDE_CODE_MAX_RETRIES env var on the host.

Signed-off-by: Emilien Macchi <emacchi@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
coderabbitai Bot commented Jun 21, 2026

📝 Walkthrough

provider.py gains a rotate_token() function encapsulating the subprocess call to force the OpenShell gateway to mint a fresh GCP access token. __init__.py adds subprocess and threading imports, defines _token_keepalive(stop) — a loop that calls rotate_token() on a timed interval and catches CalledProcessError — and modifies OpenShellBackend.run() to start a daemon thread running this loop for Vertex-auth runs, stopping it via an event in a finally block. _write_env_script() now injects CLAUDE_CODE_MAX_RETRIES (default "20") for Vertex auth. New tests cover rotate_token() propagation behavior, _token_keepalive loop/stop/error-logging paths, and CLAUDE_CODE_MAX_RETRIES env-script generation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| No Sensitive Data In Logs | ⚠️ Warning | The _token_keepalive function prints exc.stderr directly without redaction, exposing potential GCP tokens or error messages from the openshell provider command. | Redact sensitive stderr output in _token_keepalive using a filter similar to provider.py's _run function, or use a structured logger that handles redaction automatically. |
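
One way to address the warning above is to pass stderr through a redaction filter before printing it. This is a sketch only: the patterns are assumptions (Google OAuth access tokens commonly start with `ya29.`, and HTTP auth headers carry `Bearer <token>`) and are not taken from the `_run` filter in provider.py, which is not shown here.

```python
import re

# Assumed token shapes; adjust to match whatever provider.py already redacts.
_GCP_TOKEN = re.compile(r"ya29\.[A-Za-z0-9._-]+")
_BEARER = re.compile(r"(bearer\s+)[A-Za-z0-9._~+/=-]+", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace anything that looks like an access token with a placeholder."""
    text = _GCP_TOKEN.sub("[REDACTED]", text)
    # Keep the "Bearer " prefix so the log line stays readable.
    return _BEARER.sub(r"\1[REDACTED]", text)


print(redact("mint failed: token ya29.a0Abc-123_x rejected"))
print(redact("Authorization: Bearer abc.def.ghi"))
```

Pattern-based redaction only catches the shapes it knows about, so logging the exit code alone remains the safer default where stderr is not needed.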
✅ Passed checks (9 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly summarizes the two main changes: token keepalive mechanism for Vertex runs and increased retry budget. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Contribution Quality And Spam Detection | ✅ Passed | PR demonstrates legitimate domain knowledge, references specific upstream issues, modifies multiple files with new tests, and describes problem with technical precision. Single code quality gap (un... |
| No Hardcoded Secrets | ✅ Passed | No hardcoded secrets found. Test fixtures use placeholder values ("sk-test", "test-project"). Credentials read from files/env at runtime. Secret redaction logic present. No base64 strings >32 chars... |
| No Weak Cryptography | ✅ Passed | No weak cryptographic primitives, custom crypto implementations, or insecure secret comparisons detected. Token rotation delegates to OpenShell CLI, and all new imports are standard library only. |
| No Injection Vectors | ✅ Passed | No injection vectors matching CWE-78 (shell=True), CWE-89 (SQL concat), CWE-94 (eval/exec), CWE-502 (pickle/yaml), or CWE-79 (innerHTML) were found. User env var input is safely quoted with shlex.q... |
| No Privileged Containers | ✅ Passed | This PR modifies only Python source and test files (OpenShell backend token rotation logic); no Dockerfiles, Kubernetes manifests, Helm templates, or container security configurations are changed.... |

mergify Bot commented Jun 21, 2026


Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 maintainer-images-build-only

Waiting for

  • #approved-reviews-by>=1
This rule is failing.
  • #approved-reviews-by>=1
  • -conflict
  • -draft
  • check-success = Build ci-openshell
  • check-success = Build ci-podman
  • check-success = Build claude-runner
  • check-success = Build claude-sandbox
  • check-success = Build docs
  • check-success = Build opencode-runner
  • check-success = Build opencode-sandbox
  • check-success = E2E claude-runner
  • check-success = E2E opencode-runner
  • check-success = E2E openshell-sandbox
  • check-success = Lint image scripts
  • check-success = Run Tox (check-format)
  • check-success = Run Tox (lint)
  • check-success = Run Tox (py310)
  • check-success = Run Tox (py311)
  • check-success = Run Tox (py312)
  • check-success = Run Tox (py313)
  • check-success = Run Tox (typecheck)
  • check-success = Test image scripts

coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/agentic_ci/backends/openshell/provider.py (1)

184-205: ⚠️ Potential issue | 🟠 Major

Add timeout to subprocess call in rotate_token().

rotate_token() calls _run(..., check=True) without a timeout parameter. Since this function is invoked from the _token_keepalive() daemon thread and the main cleanup path uses keepalive.join(timeout=5), an indefinite subprocess block will leave the thread hanging and prevent graceful shutdown. Add timeout=15 (or appropriate value) to the _run() call, as shown in the provider_exists() function pattern at line 60.
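
A sketch of the bounded subprocess call this comment asks for. `_run()` in provider.py is not shown in this conversation, so the example calls `subprocess.run` directly; `run_bounded` is a hypothetical helper and the command is a placeholder, not the actual openshell invocation.

```python
import subprocess
import sys


def run_bounded(cmd: list[str], timeout: float) -> bool:
    """Run cmd; return True on success, False on a non-zero exit or a timeout."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True, timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising, so the
        # calling thread is released once the timeout elapses.
        return False
    except subprocess.CalledProcessError:
        return False


# Placeholder command standing in for the provider call.
print(run_bounded([sys.executable, "-c", "pass"], timeout=15))
```

With a timeout in place, the keepalive thread can block for at most that long on one rotation, which makes the 5-second `join` in the cleanup path far more likely to succeed.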

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agentic_ci/backends/openshell/provider.py` around lines 184 - 205, The
`rotate_token()` function calls `_run()` without a timeout parameter, which can
cause the subprocess to hang indefinitely since this function is invoked from
the `_token_keepalive()` daemon thread and the main cleanup path uses
`keepalive.join(timeout=5)`, preventing graceful shutdown. Add a `timeout`
parameter (such as `timeout=15`) to the `_run()` call in `rotate_token()`,
following the same pattern used in the `provider_exists()` function for
consistent subprocess handling across the module.

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bb8f44ca-db2b-4f34-8646-b8debaef6962

📥 Commits

Reviewing files that changed from the base of the PR and between c97dcaf and 862ed71.

📒 Files selected for processing (4)
  • src/agentic_ci/backends/openshell/__init__.py
  • src/agentic_ci/backends/openshell/provider.py
  • tests/test_backend.py
  • tests/test_openshell_provider.py

Comment on lines +215 to +218
if self.harness.auth_mode == "vertex":
    max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES)
    lines.append(f"export CLAUDE_CODE_MAX_RETRIES={shlex.quote(max_retries)}")

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate and bound CLAUDE_CODE_MAX_RETRIES before exporting it.

This path exports arbitrary env text. Non-integer/negative values can break retry behavior, and very large values can unintentionally extend run time/cost.

Proposed fix
 if self.harness.auth_mode == "vertex":
-    max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES)
-    lines.append(f"export CLAUDE_CODE_MAX_RETRIES={shlex.quote(max_retries)}")
+    raw_max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES)
+    try:
+        max_retries = int(raw_max_retries)
+    except ValueError:
+        max_retries = int(_DEFAULT_MAX_RETRIES)
+    max_retries = max(1, min(max_retries, 50))
+    lines.append(f"export CLAUDE_CODE_MAX_RETRIES={max_retries}")

As per coding guidelines, "**: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps 4. Performance problems."
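
The proposed fix can also be written as a standalone function, which makes the fallback and clamping easy to unit test. This is a sketch under assumptions: `bounded_max_retries` is a hypothetical name, the environment is passed in as a mapping instead of being read from `os.environ`, and the ceiling of 50 is the value from the diff above, not a documented limit.

```python
DEFAULT_MAX_RETRIES = 20
MAX_RETRIES_CEILING = 50  # ceiling taken from the proposed diff


def bounded_max_retries(env: dict[str, str]) -> int:
    """Return a retry count in [1, MAX_RETRIES_CEILING], defaulting on bad input."""
    raw = env.get("CLAUDE_CODE_MAX_RETRIES", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or non-integer text falls back to the default.
        return DEFAULT_MAX_RETRIES
    return max(1, min(value, MAX_RETRIES_CEILING))


print(bounded_max_retries({}))                                   # 20
print(bounded_max_retries({"CLAUDE_CODE_MAX_RETRIES": "999"}))   # 50
print(bounded_max_retries({"CLAUDE_CODE_MAX_RETRIES": "-3"}))    # 1
```

Because the result is always an `int`, the export line no longer needs `shlex.quote`.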

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agentic_ci/backends/openshell/__init__.py` around lines 215 - 218, The
CLAUDE_CODE_MAX_RETRIES environment variable is being exported without
validation, which can cause issues if it contains non-integer, negative, or
excessively large values that break retry behavior or extend run time/cost. When
auth_mode is "vertex", parse the retrieved max_retries value to validate it is a
valid non-negative integer within acceptable bounds before exporting it. If the
value fails validation (cannot be parsed as an integer, is negative, or exceeds
a reasonable maximum), fall back to using _DEFAULT_MAX_RETRIES instead of
exporting invalid data.

Source: Coding guidelines
