Keep GCP token fresh and raise retry budget for OpenShell Vertex runs by EmilienM · Pull Request #156 · opendatahub-io/agentic-ci

EmilienM · 2026-06-21T23:52:07Z

Summary

The OpenShell gateway's token refresh worker has a known race condition (NVIDIA/OpenShell PR #1763) where it can let the Vertex AI access token lapse around the hourly expiry boundary. A transient mint failure is retried only every 60s while the old token ages, producing a burst of 401s that exhausts claude-code's retry budget and kills the run mid-way. This was observed in the rfe-assessor runner (runs dying at ~40% completion on the daily job).

Both fixes are generic to any OpenShell + Vertex runner and belong in the shared framework rather than individual runner repos.

Commit 1: Token keepalive thread

- Extracts rotate_token() from provider.py so both the initial setup rotation and the keepalive thread share the same codepath.
- Adds a background keepalive thread in OpenShellBackend.run() that force-rotates the gateway credential every 20 minutes from the host while the agent runs.
- Phase-offsets the first rotation by 10 minutes so the 20-min cadence lands at 10/30/50/70 min and never coincides with the ~hourly token-expiry boundary that the gateway refresh worker and the agent's client token cache already act on.
- Gated to the OpenShell backend + Vertex auth path only (no-op for podman/local and API-key auth).
- The thread is joined with a timeout in a finally block.
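The phase-offset reasoning in the list above can be checked with a few lines of arithmetic. This is a sketch only: it assumes the expiry boundary falls on multiples of the 3600s token lifetime measured from the start of the run, and the constant and function names are illustrative rather than taken from the PR.

```python
# Assumption: expiry boundaries sit at multiples of the token lifetime,
# counted from the start of the run. Names here are illustrative.
TOKEN_LIFETIME_S = 3600      # Vertex access token lifetime from the PR text
FIRST_ROTATION_S = 10 * 60   # phase offset of the first forced rotation
INTERVAL_S = 20 * 60         # cadence of later forced rotations


def rotation_times(run_length_s: int) -> list[int]:
    """Seconds since run start at which a forced rotation would fire."""
    return list(range(FIRST_ROTATION_S, run_length_s + 1, INTERVAL_S))


def distance_to_boundary(t: int) -> int:
    """Seconds between time t and the nearest hourly expiry boundary."""
    phase = t % TOKEN_LIFETIME_S
    return min(phase, TOKEN_LIFETIME_S - phase)


times = rotation_times(4 * 3600)
print([t // 60 for t in times[:4]])                       # [10, 30, 50, 70]
print(min(distance_to_boundary(t) for t in times) // 60)  # 10
```

Under that assumption every forced rotation stays at least 10 minutes away from a boundary, which is the property the offset is meant to provide.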

Commit 2: Retry budget bump

- Sets CLAUDE_CODE_MAX_RETRIES=20 (default is 10) in the sandbox env script for Vertex auth runs.
- Belt-and-suspenders with the keepalive: even if an unlucky long token lapse occurs, the wider retry budget lets the agent recover instead of exhausting.
- Overridable via the CLAUDE_CODE_MAX_RETRIES env var on the host.
- Only injected for Vertex auth (API-key auth is unaffected).
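The override and gating rules above could look roughly like the following. This is a minimal sketch, not the PR's _write_env_script(): the function name max_retries_export and the explicit environ parameter are assumptions made so the example can run on its own.

```python
import os
import shlex
from typing import Mapping, Optional

DEFAULT_MAX_RETRIES = "20"  # default named in the PR description


def max_retries_export(
    auth_mode: str, environ: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return the export line for the sandbox env script, or None if not Vertex.

    Hypothetical helper: the real code appends to a list of script lines.
    """
    if auth_mode != "vertex":
        return None  # other auth modes get no retry override
    value = environ.get("CLAUDE_CODE_MAX_RETRIES", DEFAULT_MAX_RETRIES)
    # shlex.quote keeps an odd host value from being interpreted by the shell.
    return f"export CLAUDE_CODE_MAX_RETRIES={shlex.quote(value)}"


print(max_retries_export("vertex", {}))
print(max_retries_export("vertex", {"CLAUDE_CODE_MAX_RETRIES": "30"}))
```

Quoting makes the line safe to source, but it does not check that the value is a sensible integer.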

Credit: Both patterns discovered and validated by Jason Greene in the rfe-assessor runner (commits 2fbedef, c2cf2c4, ad11a32).

Test plan

- tox -e py313: 580 tests pass (7 new tests added)
- tox -e lint: clean
- tox -e check-format: clean
- tox -e typecheck: clean
- Validate in a real OpenShell + Vertex run (long-running agent should no longer die from token lapse)

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Implemented automatic token refresh for Vertex authentication to maintain valid credentials and prevent session interruptions during long-running operations
- Increased the claude-code retry budget (CLAUDE_CODE_MAX_RETRIES) for Vertex-authenticated runs to improve reliability during token rotation

Tests
- Added test coverage for token keepalive and rotation mechanisms, including error handling

The OpenShell gateway mints the sandbox's Vertex access token (3600s lifetime), and its refresh worker is meant to rotate it ahead of expiry. Around the hourly boundary it can let the token lapse: a transient mint failure is only retried every 60s while the old token keeps aging. This surfaces as a burst of "unknown" API retries that exhausts the agent's retry budget and kills the run mid-way (see NVIDIA/OpenShell PR #1763).

Force-rotate the gateway credential every 20 min from the host while the agent runs, phase-offset by 10 min so rotations land at 10/30/50/70 min and never coincide with the hourly token-expiry boundary. Gated to the OpenShell backend + Vertex auth path (no-op for podman/local and API-key auth).

Extracted rotate_token() from provider.py so both the initial setup rotation and the keepalive thread share the same codepath.

Signed-off-by: Emilien Macchi <emacchi@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>

The ~hourly Vertex token-rotation lapse produces a burst of retryable "unknown" 401s. On stock 60-min token intervals a long lapse window can exhaust all 10 default retries and kill the run. The 20-min token keepalive shortens those windows, and this widens the retry budget to 20 so even an unlucky long lapse recovers instead of exhausting. Belt-and-suspenders with the keepalive.

Set CLAUDE_CODE_MAX_RETRIES=20 in the sandbox env script, only for Vertex auth (API-key auth is unaffected). Overridable via the CLAUDE_CODE_MAX_RETRIES env var on the host.

Signed-off-by: Emilien Macchi <emacchi@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2026-06-21T23:52:22Z

📝 Walkthrough

provider.py gains a rotate_token() function encapsulating the subprocess call to force the OpenShell gateway to mint a fresh GCP access token. __init__.py adds subprocess and threading imports, defines _token_keepalive(stop) — a loop that calls rotate_token() on a timed interval and catches CalledProcessError — and modifies OpenShellBackend.run() to start a daemon thread running this loop for Vertex-auth runs, stopping it via an event in a finally block. _write_env_script() now injects CLAUDE_CODE_MAX_RETRIES (default "20") for Vertex auth. New tests cover rotate_token() propagation behavior, _token_keepalive loop/stop/error-logging paths, and CLAUDE_CODE_MAX_RETRIES env-script generation.
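The loop-plus-event pattern described in the walkthrough might be sketched as below. This is an illustration under assumptions, not the code from __init__.py: rotate is passed in as a callable so the example runs without the OpenShell CLI, run_with_keepalive is a hypothetical stand-in for the relevant part of OpenShellBackend.run(), and the sketch catches any Exception where the walkthrough mentions CalledProcessError.

```python
import threading
import time
from typing import Callable


def token_keepalive(
    stop: threading.Event,
    rotate: Callable[[], None],
    first_delay: float = 10 * 60,
    interval: float = 20 * 60,
) -> None:
    """Call rotate() after first_delay seconds, then every interval seconds."""
    delay = first_delay
    # Event.wait returns True as soon as stop is set, so shutdown is prompt.
    while not stop.wait(delay):
        try:
            rotate()
        except Exception as exc:
            # Log only the exception type; raw stderr could contain secrets.
            print(f"token keepalive: rotation failed ({type(exc).__name__})")
        delay = interval


def run_with_keepalive(
    agent: Callable[[], None],
    rotate: Callable[[], None],
    first_delay: float = 10 * 60,
    interval: float = 20 * 60,
) -> None:
    """Run agent() while a daemon thread keeps the token fresh."""
    stop = threading.Event()
    worker = threading.Thread(
        target=token_keepalive,
        args=(stop, rotate, first_delay, interval),
        daemon=True,
    )
    worker.start()
    try:
        agent()
    finally:
        stop.set()              # wake the loop out of its wait
        worker.join(timeout=5)  # bounded join, as in the walkthrough


# Tiny intervals so the demo finishes quickly.
run_with_keepalive(
    agent=lambda: time.sleep(0.2),
    rotate=lambda: print("rotated"),
    first_delay=0.05,
    interval=0.05,
)
```

Waiting on the event rather than sleeping is what lets the finally block stop the thread without waiting out a full 20-minute interval.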

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
No Sensitive Data In Logs	⚠️ Warning	The _token_keepalive function prints exc.stderr directly without redaction, exposing potential GCP tokens or error messages from the openshell provider command.	Redact sensitive stderr output in _token_keepalive using a filter similar to provider.py's _run function, or use a structured logger that handles redaction automatically.
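One way to act on the warning above is to scrub stderr before printing it. The sketch below is not the filter from provider.py's _run function, which is not shown here. It assumes that Google OAuth access tokens commonly start with "ya29." (an observed convention, not a documented guarantee) and that bearer headers may appear in error text; the redact name is hypothetical.

```python
import re

# Assumed patterns: neither is taken from the PR or from provider.py.
_BEARER = re.compile(r"\bbearer\s+\S+", re.IGNORECASE)
_GOOGLE_TOKEN = re.compile(r"\bya29\.[\w.\-]+")


def redact(text: str) -> str:
    """Replace likely credentials in text with a placeholder."""
    text = _BEARER.sub("Bearer [REDACTED]", text)
    text = _GOOGLE_TOKEN.sub("[REDACTED]", text)
    return text


print(redact("mint failed: token ya29.a0Abc-123_x rejected"))
# mint failed: token [REDACTED] rejected
```

A pattern list like this only catches known shapes, so logging a fixed message plus the return code is the more conservative option when the stderr content is not needed.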

✅ Passed checks (9 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and clearly summarizes the two main changes: token keepalive mechanism for Vertex runs and increased retry budget.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Contribution Quality And Spam Detection	✅ Passed	PR demonstrates legitimate domain knowledge, references specific upstream issues, modifies multiple files with new tests, and describes problem with technical precision. Single code quality gap (un...
No Hardcoded Secrets	✅ Passed	No hardcoded secrets found. Test fixtures use placeholder values ("sk-test", "test-project"). Credentials read from files/env at runtime. Secret redaction logic present. No base64 strings >32 chars...
No Weak Cryptography	✅ Passed	No weak cryptographic primitives, custom crypto implementations, or insecure secret comparisons detected. Token rotation delegates to OpenShell CLI, and all new imports are standard library only.
No Injection Vectors	✅ Passed	No injection vectors matching CWE-78 (shell=True), CWE-89 (SQL concat), CWE-94 (eval/exec), CWE-502 (pickle/yaml), or CWE-79 (innerHTML) were found. User env var input is safely quoted with shlex.q...
No Privileged Containers	✅ Passed	This PR modifies only Python source and test files (OpenShell backend token rotation logic); no Dockerfiles, Kubernetes manifests, Helm templates, or container security configurations are changed....


mergify · 2026-06-21T23:52:25Z

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/agentic_ci/backends/openshell/provider.py (1)
184-205: ⚠️ Potential issue | 🟠 Major

Add timeout to subprocess call in rotate_token().

rotate_token() calls _run(..., check=True) without a timeout parameter. Since this function is invoked from the _token_keepalive() daemon thread and the main cleanup path uses keepalive.join(timeout=5), an indefinite subprocess block will leave the thread hanging and prevent graceful shutdown. Add timeout=15 (or appropriate value) to the _run() call, as shown in the provider_exists() function pattern at line 60.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agentic_ci/backends/openshell/provider.py` around lines 184 - 205, The
`rotate_token()` function calls `_run()` without a timeout parameter, which can
cause the subprocess to hang indefinitely since this function is invoked from
the `_token_keepalive()` daemon thread and the main cleanup path uses
`keepalive.join(timeout=5)`, preventing graceful shutdown. Add a `timeout`
parameter (such as `timeout=15`) to the `_run()` call in `rotate_token()`,
following the same pattern used in the `provider_exists()` function for
consistent subprocess handling across the module.
Source: Coding guidelines
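The effect of the suggested timeout can be illustrated with plain subprocess.run. This sketch does not use the project's _run() helper or the real openshell command line, neither of which is shown here; run_bounded and its string return values are hypothetical, and the child processes are just Python one-liners.

```python
import subprocess
import sys
from typing import Sequence


def run_bounded(argv: Sequence[str], timeout: float = 15.0) -> str:
    """Run argv and report 'ok', 'failed', or 'timeout' without hanging."""
    try:
        subprocess.run(
            argv, check=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return "timeout"
    except subprocess.CalledProcessError:
        return "failed"
    return "ok"


print(run_bounded([sys.executable, "-c", "pass"]))  # ok
print(run_bounded([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.2))  # timeout
```

With a bound like this, a keepalive loop that calls the rotation command gets control back within the timeout and can then notice its stop event.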



ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bb8f44ca-db2b-4f34-8646-b8debaef6962

📥 Commits

Reviewing files that changed from the base of the PR and between c97dcaf and 862ed71.

📒 Files selected for processing (4)

src/agentic_ci/backends/openshell/__init__.py
src/agentic_ci/backends/openshell/provider.py
tests/test_backend.py
tests/test_openshell_provider.py

coderabbitai · 2026-06-21T23:57:35Z

+        if self.harness.auth_mode == "vertex":
+            max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES)
+            lines.append(f"export CLAUDE_CODE_MAX_RETRIES={shlex.quote(max_retries)}")
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate and bound CLAUDE_CODE_MAX_RETRIES before exporting it.

This path exports arbitrary env text. Non-integer/negative values can break retry behavior, and very large values can unintentionally extend run time/cost.

Proposed fix

         if self.harness.auth_mode == "vertex":
-            max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES)
-            lines.append(f"export CLAUDE_CODE_MAX_RETRIES={shlex.quote(max_retries)}")
+            raw_max_retries = os.environ.get("CLAUDE_CODE_MAX_RETRIES", _DEFAULT_MAX_RETRIES)
+            try:
+                max_retries = int(raw_max_retries)
+            except ValueError:
+                max_retries = int(_DEFAULT_MAX_RETRIES)
+            max_retries = max(1, min(max_retries, 50))
+            lines.append(f"export CLAUDE_CODE_MAX_RETRIES={max_retries}")

As per coding guidelines, "**: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps 4. Performance problems."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agentic_ci/backends/openshell/__init__.py` around lines 215 - 218, The CLAUDE_CODE_MAX_RETRIES environment variable is being exported without validation, which can cause issues if it contains non-integer, negative, or excessively large values that break retry behavior or extend run time/cost. When auth_mode is "vertex", parse the retrieved max_retries value to validate it is a valid non-negative integer within acceptable bounds before exporting it. If the value fails validation (cannot be parsed as an integer, is negative, or exceeds a reasonable maximum), fall back to using _DEFAULT_MAX_RETRIES instead of exporting invalid data.

Source: Coding guidelines
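The proposed validation can be pulled out into a standalone function to see how it treats edge cases. This is a sketch of the reviewer's suggestion, not merged code: bounded_max_retries is a hypothetical name, and the bounds 1 and 50 are the ones in the proposed fix.

```python
def bounded_max_retries(raw: str, default: int = 20, low: int = 1, high: int = 50) -> int:
    """Parse raw as an integer retry count, then clamp it into [low, high]."""
    try:
        value = int(raw)
    except ValueError:
        value = default  # unparseable text falls back to the default
    return max(low, min(value, high))


for raw in ["30", "abc", "-5", "1000"]:
    print(raw, "->", bounded_max_retries(raw))
# 30 -> 30
# abc -> 20
# -5 -> 1
# 1000 -> 50
```

Note that this version clamps out-of-range values to the nearest bound, whereas the prompt text above asks for a fallback to _DEFAULT_MAX_RETRIES in those cases; either reading would need to be chosen deliberately.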

EmilienM and others added 2 commits June 21, 2026 19:49

EmilienM wants to merge 2 commits into main from openshell-token-keepalive

Merge Protections

🔴 maintainer-images-build-only
