fix: pin supervisor image to match openshell CLI version #2795

Merged
ralphbean merged 1 commit into main from fix-pin-supervisor-image
Jun 30, 2026
@waynesun09

Member

Summary

  • Pin the OpenShell supervisor image to ghcr.io/nvidia/openshell/supervisor:${OPENSHELL_VERSION} via gateway.toml so the in-container supervisor always matches the installed gateway/CLI version.
  • Applies to both action.yml (production agent runs) and functional-tests.yml (e2e tests).

Root Cause

OpenShell v0.0.73 was released at 15:31 UTC on June 30, re-pointing the supervisor:latest tag to a build containing NVIDIA/OpenShell#2001 (fix(supervisor): drop sandbox child capability bounding set). The cap_drop_bound() call fails with EINVAL in rootless Podman on GitHub Actions runners because CAP_SETPCAP is unavailable in user namespaces. The supervisor crashes immediately:

WARN openshell_supervisor_process::netns: Failed to delete network namespace
Error: × Invalid argument (os error 22)

The gateway defaults to supervisor:latest (not pinned to its own version), so every new runner after 15:31 UTC pulls the broken v0.0.73 supervisor regardless of what CLI version is installed. This is why reverting the CLI to 0.0.63 (PR #2787) did not fix sandbox creation.

Fix

Write $HOME/.config/openshell/gateway.toml in the gateway configuration step, setting supervisor_image to the version-tagged image matching OPENSHELL_VERSION from openshell-version.sh. This removes the dependency on the :latest tag and keeps the CLI and supervisor versions locked together.
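
A minimal sketch of what such a configuration step could look like, not the exact lines from this PR. It assumes that openshell-version.sh exports OPENSHELL_VERSION, that the key sits under an [openshell.gateway] table (inferred from the key path quoted in the Qodo summary), and that version tags have the same form as the version string; the fallback value is a placeholder for illustration only.

```shell
set -eu

# In the real workflow the version would come from sourcing the script, e.g.:
#   . ./openshell-version.sh   # hypothetical path, not shown in this thread
# Placeholder fallback so the sketch runs on its own.
OPENSHELL_VERSION="${OPENSHELL_VERSION:-0.0.63}"

config_dir="${HOME}/.config/openshell"
mkdir -p "${config_dir}"

# Assumption: table name [openshell.gateway]; verify against the OpenShell docs.
# The heredoc is unquoted so ${OPENSHELL_VERSION} is expanded when the file is written.
cat > "${config_dir}/gateway.toml" <<EOF
[openshell.gateway]
supervisor_image = "ghcr.io/nvidia/openshell/supervisor:${OPENSHELL_VERSION}"
EOF

# Print the result so the pinned tag shows up in the job log.
cat "${config_dir}/gateway.toml"
```

Writing the file before the gateway starts matters here: the pin only helps if the gateway reads it before it pulls the supervisor image.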

Test plan

  • Verify sandbox creation succeeds in a fullsend-ai/.fullsend agent run after merge
  • Verify functional-tests.yml e2e tests pass with the pinned supervisor
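
As a complement to the manual checks above, a small guard could confirm that the written config pins the expected tag rather than falling back to :latest. This is a sketch only: check_supervisor_pin is a hypothetical helper, and the file layout and version value are the same assumptions as in the PR description, not something taken from the diff.

```shell
# Hypothetical helper: succeeds only if the config pins the supervisor
# image to exactly the given version tag.
check_supervisor_pin() {
  config_file="$1"
  expected_version="$2"
  expected="ghcr.io/nvidia/openshell/supervisor:${expected_version}"
  # -F matches the string literally, so dots in the version are not wildcards;
  # the closing quote prevents 0.0.6 from matching 0.0.63.
  grep -F -q "supervisor_image = \"${expected}\"" "${config_file}"
}

# Demo against a throwaway file with an illustrative version.
tmp="$(mktemp)"
printf '%s\n' \
  '[openshell.gateway]' \
  'supervisor_image = "ghcr.io/nvidia/openshell/supervisor:0.0.63"' > "${tmp}"

if check_supervisor_pin "${tmp}" 0.0.63; then
  echo "supervisor image is pinned"
fi
```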

Fixes #2792

The OpenShell gateway defaults to pulling
ghcr.io/nvidia/openshell/supervisor:latest for the sandbox supervisor
binary. When NVIDIA released v0.0.73 (2026-06-30 15:31 UTC), the
:latest tag was re-pointed to a supervisor that drops the Linux
capability bounding set (NVIDIA/OpenShell#2001), which crashes with
EINVAL in rootless Podman on GitHub Actions runners.

Write a gateway.toml that pins supervisor_image to the version from
openshell-version.sh so the supervisor always matches the installed
gateway and is immune to upstream :latest tag changes.

Fixes #2792

Assisted-by: Claude (investigation, fix)
Signed-off-by: Wayne Sun <gsun@redhat.com>
@waynesun09 waynesun09 requested a review from a team as a code owner June 30, 2026 18:39
@qodo-code-review

PR Summary by Qodo

Pin OpenShell supervisor image to installed CLI version via gateway.toml

🐞 Bug fix ⚙️ Configuration changes 🕐 10-20 Minutes

AI Description

• Pin OpenShell supervisor_image to ${OPENSHELL_VERSION} to avoid :latest regressions.
• Write ~/.config/openshell/gateway.toml during gateway configuration for prod and e2e.
• Ensure GitHub Actions runs use a supervisor matching the installed gateway/CLI.
Diagram

graph TD
  A["GitHub Action / Workflow"] --> B["Configure gateway step"] --> C["source openshell-version.sh"] --> D["Write gateway.toml"] --> E["OpenShell gateway"] --> F{{"Pull supervisor image"}} --> G["ghcr.io/nvidia/openshell/supervisor:${OPENSHELL_VERSION}"]
High-Level Assessment

The following are alternative approaches to this PR:

1. Pin supervisor by immutable digest
  • ➕ Fully eliminates tag-mutation risk (even for version tags)
  • ➕ Reproducible sandbox behavior across time and runners
  • ➖ Requires a reliable source-of-truth mapping version→digest
  • ➖ More operational overhead when releasing/rolling versions
2. Upgrade (or patch) to a fixed supervisor release
  • ➕ Removes the root incompatibility for rootless Podman
  • ➕ Avoids accumulating version pin logic across clients
  • ➖ Depends on upstream release timing and adoption
  • ➖ Does not protect against future :latest regressions if defaults remain unpinned
3. Set supervisor image via an action input/env var only
  • ➕ Avoids writing persistent config files under $HOME
  • ➕ Keeps configuration localized to the action interface
  • ➖ Requires gateway to support/consume that mechanism consistently
  • ➖ Harder to reuse across both action.yml and workflow consumers without duplication

Recommendation: The current approach (writing gateway.toml with a version-tagged supervisor_image) is the best immediate mitigation: it restores determinism, aligns CLI↔supervisor versions, and avoids relying on :latest. If the project wants maximum reproducibility, consider a follow-up to pin by digest or add an official action input that maps versions to digests.

Files changed (2) +16 / -0

Bug fix (2) +16 / -0
functional-tests.yml: Pin supervisor image during e2e gateway configuration (+8/-0)

• Sources 'openshell-version.sh' and writes '~/.config/openshell/gateway.toml' to set 'openshell.gateway.supervisor_image' to the version-tagged supervisor image. This makes functional tests immune to upstream 'supervisor:latest' retags and keeps supervisor aligned with the installed OpenShell version.

.github/workflows/functional-tests.yml

action.yml: Pin supervisor image for production agent runs (+8/-0)

• Sources the action’s 'openshell-version.sh' and writes '~/.config/openshell/gateway.toml' with 'supervisor_image = ghcr.io/nvidia/openshell/supervisor:${OPENSHELL_VERSION}'. This ensures the in-container supervisor matches the gateway/CLI version and avoids breakage from ':latest' changes.

action.yml

@github-actions

Site preview

Preview: https://4f8247bf-site.fullsend-ai.workers.dev

Commit: facd02e71bcd31b052cb83725a977f742529e98a

@fullsend-ai-review

fullsend-ai-review Bot commented Jun 30, 2026

🤖 Review · ❌ Terminated · Started 6:42 PM UTC · Ended 6:58 PM UTC
Commit: b95b879 · View workflow run →

@qodo-code-review

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Great, no issues found!

Qodo reviewed your code and found no material issues that require review

@codecov

codecov Bot commented Jun 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@fullsend-ai-review

🤖 Finished Review · ❌ Failure · Started 6:42 PM UTC · Completed 6:58 PM UTC
Commit: facd02e · View workflow run →

@ralphbean ralphbean added this pull request to the merge queue Jun 30, 2026
Merged via the queue into main with commit ce66d92 Jun 30, 2026
17 checks passed
@ralphbean ralphbean deleted the fix-pin-supervisor-image branch June 30, 2026 19:08
ralphbean added a commit that referenced this pull request Jun 30, 2026
The composite action (action.yml) is the entry point for all agent runs
in CI. Changes like pinning the supervisor image (#2795) affect sandbox
creation but were not triggering e2e tests. Add action.yml to both the
paths trigger and the relevance grep.

Motivated-by: #2792
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
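
The "relevance grep" mentioned in this commit message is not shown in the thread. A rough sketch of the idea, assuming the check reads changed file paths on stdin and matches them against a pattern list; is_e2e_relevant and the pattern below are hypothetical stand-ins for whatever the workflow actually uses.

```shell
# Hypothetical relevance check: succeeds if any changed path (one per line
# on stdin) should trigger the e2e tests. The pattern list is illustrative.
is_e2e_relevant() {
  grep -E -q '^(action\.yml|\.github/workflows/functional-tests\.yml)$'
}

# Demo with a fixed list of changed files; a real job would typically feed
# in something like the output of `git diff --name-only`.
if printf '%s\n' 'README.md' 'action.yml' | is_e2e_relevant; then
  echo "run e2e tests"
fi
```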
@fullsend-ai-retro

fullsend-ai-retro Bot commented Jun 30, 2026

🤖 Finished Retro · ❌ Failure · Started 7:13 PM UTC · Completed 7:20 PM UTC
Commit: facd02e · View workflow run →

Development

Successfully merging this pull request may close these issues.

Sandbox enters Error phase immediately on creation across all agent types and openshell versions

3 participants