NVIDIA
diff --git a/.agents/skills/build-from-issue/SKILL.md b/.agents/skills/build-from-issue/SKILL.md
Lines changed: 1 addition & 0 deletions
diff --git a/.agents/skills/create-spike/SKILL.md b/.agents/skills/create-spike/SKILL.md
Lines changed: 3 additions & 1 deletion
diff --git a/.agents/skills/debug-openshell-cluster/SKILL.md b/.agents/skills/debug-openshell-cluster/SKILL.md
Lines changed: 35 additions & 2 deletions
diff --git a/.agents/skills/openshell-cli/SKILL.md b/.agents/skills/openshell-cli/SKILL.md
Lines changed: 1 addition & 0 deletions
diff --git a/.agents/skills/openshell-cli/cli-reference.md b/.agents/skills/openshell-cli/cli-reference.md
Lines changed: 2 additions & 0 deletions
diff --git a/.agents/skills/test-release-canary/SKILL.md b/.agents/skills/test-release-canary/SKILL.md
Lines changed: 119 additions & 0 deletions
diff --git a/.cargo/config.toml b/.cargo/config.toml
Lines changed: 8 additions & 0 deletions
@@ -148,6 +148,7 @@ In the prompt, instruct the reviewer to:
    - **Medium**: Multiple files/components, some design decisions, but well-scoped
    - **High**: Cross-cutting changes, architectural decisions needed, significant unknowns
 8. Call out risks, unknowns, and decisions that need stakeholder input.
+9. Assess **LSM compatibility** — if the change touches process identity, `/proc` filesystem access, binary execution, or inter-process visibility, flag whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. In particular, tests that fork+exec into system binaries will fail on SELinux-enforcing hosts due to cross-label `/proc/<pid>/exe` access restrictions.
 
 ### A2: Post the Plan Comment
 
 
@@ -91,7 +91,9 @@ The prompt to the reviewer **must** instruct it to:
 
 9. **Check architecture docs** in the `architecture/` directory for relevant documentation about the affected subsystems.
 
-10. **Determine the issue type:** `feat`, `fix`, `refactor`, `chore`, `perf`, or `docs`.
+10. **Assess Linux Security Module (LSM) impact.** If the change involves process identity, `/proc` filesystem access, file labeling, binary execution, or inter-process visibility, call out whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. For example: reading `/proc/<pid>/exe` across an SELinux domain boundary returns ENOENT, not EACCES. Tests that fork+exec into system binaries (different SELinux label) will fail on enforcing hosts. Flag any LSM-sensitive code paths and recommend mitigations.
+
+11. **Determine the issue type:** `feat`, `fix`, `refactor`, `chore`, `perf`, or `docs`.
 
 ### What makes a good investigation prompt
 
 
@@ -67,13 +67,42 @@ docker run --rm --entrypoint /openshell-sandbox "${OPENSHELL_DOCKER_SUPERVISOR_I
 openshell status
 ```
 
+For Docker GPU failures, check CDI support and NVIDIA CDI discovery separately:
+
+```bash
+docker info --format '{{json .CDISpecDirs}}'
+docker info --format '{{json .DiscoveredDevices}}'
+for dir in /etc/cdi /var/run/cdi; do
+  if [ -d "$dir" ]; then
+    find "$dir" -maxdepth 1 -type f \( -name '*.yaml' -o -name '*.json' \) -print
+  else
+    echo "$dir missing"
+  fi
+done
+systemctl is-enabled nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true
+systemctl is-active nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true
+systemctl status nvidia-cdi-refresh.service nvidia-cdi-refresh.path --no-pager --lines=50
+journalctl -u nvidia-cdi-refresh.service --no-pager --lines=100
+```
+
+When the NVIDIA Container Toolkit CDI refresh units are not enabled or no NVIDIA CDI spec has been generated, enable them and trigger a refresh:
+
+```bash
+sudo systemctl enable --now nvidia-cdi-refresh.path
+sudo systemctl enable --now nvidia-cdi-refresh.service
+sudo systemctl restart nvidia-cdi-refresh.service
+docker info --format '{{json .DiscoveredDevices}}'
+```
+
 Common findings:
 
 - Docker daemon unavailable: start Docker Desktop or Docker Engine.
 - Gateway process stopped: inspect exit status and logs.
 - Sandbox image missing or pull denied: verify image reference and registry credentials.
 - Docker driver cannot initialize because it cannot find `openshell-sandbox`: verify `OPENSHELL_DOCKER_SUPERVISOR_BIN`, the sibling binary next to `openshell-gateway`, or the configured supervisor image contains `/openshell-sandbox`.
 - Sandbox never registers: check gateway logs and supervisor callback endpoint.
+- Supervisor image exits before printing `openshell-sandbox --version`: the image should be the scratch supervisor image from `deploy/docker/Dockerfile.supervisor` and must contain a static executable at `/openshell-sandbox`.
+- `mise run e2e:docker:gpu` fails with `docker info --format json did not report any discovered NVIDIA CDI GPU devices`: Docker may report `CDISpecDirs` while still having no generated NVIDIA CDI specs. Verify `.DiscoveredDevices` contains entries such as `nvidia.com/gpu=all`, verify `/etc/cdi` or `/var/run/cdi` contains a generated NVIDIA spec, and check that `nvidia-cdi-refresh.service` and `nvidia-cdi-refresh.path` from NVIDIA Container Toolkit are enabled and healthy. The service is a one-shot unit, so `inactive (dead)` can be normal after a successful run; use `systemctl status` and `journalctl` to distinguish success from a skipped or failed refresh. NVIDIA recommends enabling the path and service units, and restarting `nvidia-cdi-refresh.service` to regenerate missing or stale CDI specs. If specs are generated but Docker still reports no discovered devices, restart Docker or reload the daemon and re-check `docker info`.
 
 For source checkout development, restart the local gateway with:
 
@@ -113,7 +142,6 @@ Check required Helm deployment secrets:
 
 ```bash
 kubectl -n openshell get secret \
-  openshell-ssh-handshake \
   openshell-server-tls \
   openshell-server-client-ca \
   openshell-client-tls
@@ -126,7 +154,11 @@ kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec
 helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
 ```
 
-The gateway image and `server.supervisorImage` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
+The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
+
+For local/external pull mode (the default local path via `mise run cluster`), local images are tagged to the configured local registry base, pushed to that registry, and pulled by k3s via the `registries.yaml` mirror endpoint. The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).
+
+Gateway image builds stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate, including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`, is copied into the staged workspace there.
 
 For plaintext local evaluation, confirm the chart has:
 
@@ -196,6 +228,7 @@ openshell logs <sandbox-name>
 | `openshell status` fails | Gateway endpoint unreachable or auth mismatch | `openshell gateway info`, gateway logs |
 | Gateway starts but sandbox create fails | Compute driver cannot reach runtime | Docker/Podman/Kubernetes/VM driver logs |
 | Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs |
+| Docker GPU e2e fails before GPU sandbox comparison | NVIDIA CDI specs are missing or Docker has not discovered them | `docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service` |
 | Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod <pod>` |
 | Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell` |
 | CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
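
The Docker GPU row above can be sketched as a small check, assuming the JSON printed by `docker info --format '{{json .DiscoveredDevices}}'` contains device names such as `nvidia.com/gpu=all` verbatim. The helper name `has_nvidia_cdi_device` is hypothetical and is not part of OpenShell or Docker.

```shell
# Hypothetical helper (not part of OpenShell): reads the JSON text of
# `.DiscoveredDevices` on stdin and succeeds only if at least one NVIDIA CDI
# device name (for example nvidia.com/gpu=all) appears in it.
# Assumption: device names are embedded verbatim in the JSON output.
has_nvidia_cdi_device() {
  grep -q 'nvidia\.com/gpu='
}

# Example usage (requires a running Docker daemon, so it is left commented out):
#   docker info --format '{{json .DiscoveredDevices}}' | has_nvidia_cdi_device \
#     || echo "no NVIDIA CDI devices discovered; check nvidia-cdi-refresh.service"
```

Matching on the text rather than parsing the JSON keeps the check independent of the exact field layout, which may differ between Docker versions.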
 
@@ -141,6 +141,7 @@ openshell sandbox create \
 Key flags:
 - `--provider`: Attach one or more providers (repeatable)
 - `--policy`: Custom policy YAML (otherwise uses built-in default or `OPENSHELL_SANDBOX_POLICY` env var)
+- `--cpu`, `--memory`: Set per-sandbox compute sizing. Docker/Podman apply limits; Kubernetes applies matching requests and limits.
 - `--upload <PATH>[:<DEST>]`: Upload local files into the sandbox (default dest: `/sandbox`)
 - `--no-keep`: Delete the sandbox after the initial command or shell exits
 - `--forward <PORT>`: Forward a local port and keep the sandbox alive
 
@@ -143,6 +143,8 @@ Create a sandbox through the active gateway, wait for readiness, then connect or
 | `--no-keep` | Delete sandbox after the initial command or shell exits |
 | `--provider <NAME>` | Provider to attach (repeatable) |
 | `--policy <PATH>` | Path to custom policy YAML |
+| `--cpu <QUANTITY>` | CPU amount for the sandbox (for example: `500m`, `1`, `2.5`) |
+| `--memory <QUANTITY>` | Memory amount for the sandbox (for example: `512Mi`, `4Gi`, `8G`) |
 | `--forward <PORT>` | Forward local port to sandbox (keeps the sandbox alive) |
 | `--tty` | Force pseudo-terminal allocation |
 | `--no-tty` | Disable pseudo-terminal allocation |
 
@@ -0,0 +1,119 @@
+---
+name: test-release-canary
+description: Manually dispatch and iterate on the Release Canary workflow that smoke-tests published OpenShell artifacts (install.sh on macOS/Ubuntu/Fedora, Helm chart on kind) after each Release Dev publish. Use when changing `.github/workflows/release-canary.yml`, validating a release before tagging, debugging a canary failure, or reproducing a canary job locally. Trigger keywords - release canary, release-canary, canary failed, canary dispatch, test release canary, post-release smoke, install.sh canary, helm chart canary, kind canary, dispatch canary.
+---
+
+# Test Release Canary
+
+The Release Canary (`.github/workflows/release-canary.yml`) smoke-tests published OpenShell artifacts after each `Release Dev` run. It is the last automated checkpoint before tagging a public release: a red canary means that either the published `dev` chart and images or the latest tagged `install.sh` artifacts do not install on a stock environment.
+
+## What the canary verifies
+
+| Job | Runner | Verifies |
+|---|---|---|
+| `macos` | `macos-latest-xlarge` | `install.sh` resolves the Homebrew formula, brew installs the cask, and `openshell status` reaches the brew-services–backed local gateway with the VM driver. |
+| `ubuntu` | `ubuntu-latest` | `install.sh` installs the Debian package, the post-install systemd user service starts, and `openshell status` reaches the local gateway with the Docker driver. |
+| `fedora` | `fedora:latest` container | `install.sh` installs the RPM packages, the local gateway starts under Podman, and `openshell status` succeeds. |
+| `kubernetes` | `ubuntu-latest` + kind | `helm install oci://ghcr.io/nvidia/openshell/helm-chart --version 0.0.0-dev` succeeds in a kind cluster, the gateway pod becomes Ready, port-forward exposes 8080, and the released CLI registers the in-cluster gateway and runs `openshell status` against it. |
+
+`install.sh` defaults to the *latest tagged* release — the canary is therefore checking that the most recent public release still installs, not the just-published `dev` build. The `kubernetes` job is the exception: it pins to `0.0.0-dev` chart + `:dev` images.
+
+## Trigger paths
+
+The workflow has two triggers:
+
+```yaml
+on:
+  workflow_dispatch:
+  workflow_run:
+    workflows: ["Release Dev"]
+    types: [completed]
+```
+
+- **Automatic.** Every successful `Release Dev` run (on `main` or a manual dispatch of Release Dev) fires the canary. Each job gates on `github.event.workflow_run.conclusion == 'success'` so a failed Release Dev does not run the canary.
+- **Manual.** `workflow_dispatch` lets you run the canary on demand against any branch's workflow definition.
+
+When dispatched manually, `github.event.workflow_run.head_sha` is empty and the workflow falls back to `github.sha` (the branch tip) for the `install.sh` URL.
+
+## Manual dispatch
+
+Run the canary as-is on the current branch:
+
+```shell
+gh workflow run release-canary.yml --ref "$(git branch --show-current)"
+```
+
+Watch the run that starts:
+
+```shell
+sleep 5  # let GitHub register the dispatch
+gh run list --workflow release-canary.yml --limit 1
+gh run watch "$(gh run list --workflow release-canary.yml --limit 1 --json databaseId --jq '.[0].databaseId')"
+```
+
+View only failed jobs after completion:
+
+```shell
+gh run view <run-id> --log-failed
+```
+
+## Iterating on the canary itself
+
+When you change `release-canary.yml` on a branch, a manual dispatch on that branch tests *your branch's workflow logic* against *main's published artifacts* (`0.0.0-dev` chart, `:dev` images, latest tagged install.sh assets). This is what you want for iterating on the canary — you're validating that the canary still works against known-good artifacts.
+
+Note that `install.sh` is pulled from `raw.githubusercontent.com/NVIDIA/OpenShell/${head_sha}/install.sh`, so changes to `install.sh` on your branch *are* exercised even though the binaries it downloads come from the latest public tag.
+
+## Testing artifacts from a specific SHA
+
+`Release Dev` publishes two chart versions for every dev build (see `.github/actions/release-helm-oci/action.yml:89-102`):
+
+- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev`: floating, overwritten on every main push.
+- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev.<sha>`: immutable, with `appVersion` set to the same SHA so it pulls `ghcr.io/nvidia/openshell/gateway:<sha>` and `ghcr.io/nvidia/openshell/supervisor:<sha>`.
+
+To smoke-test the chart for a specific dev build, dispatch `Release Dev` on the branch first, then run the kind canary steps locally pointed at the SHA-pinned chart (see "Local kind reproduction" below). The release-canary workflow itself does not currently expose `chart_version` / `image_tag` inputs.
+
+## Local kind reproduction
+
+The `kubernetes` job can be reproduced on any machine with Docker, `kind`, the released `openshell` CLI, and `mise install`-provided `kubectl` + `helm`:
+
+```shell
+kind create cluster --name release-canary-local
+
+helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \
+  --version 0.0.0-dev \
+  --namespace openshell --create-namespace \
+  --set server.disableTls=true \
+  --set pkiInitJob.enabled=false \
+  --wait --timeout 5m
+
+kubectl wait --namespace openshell \
+  --for=condition=Ready pod \
+  --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=openshell" \
+  --timeout=300s
+
+kubectl port-forward --namespace openshell svc/openshell 8080:8080 &
+openshell gateway add http://127.0.0.1:8080 --local --name kind
+openshell status
+```
+
+Swap `0.0.0-dev` for `0.0.0-dev.<sha>` to pin to a specific dev build. Tear down with `kind delete cluster --name release-canary-local`.
+
+If `--name` is omitted, loopback registration derives the gateway name `openshell`, which collides with the `install.sh`-installed local gateway. Always pass `--name kind` (or another distinct name) when registering a gateway alongside a local install.
+
+## Diagnosing failures
+
+| Symptom | Likely cause | Where to look |
+|---|---|---|
+| `macos`/`ubuntu`/`fedora` job fails on `install.sh` | Latest tagged release missing an asset, checksum mismatch, or `install.sh` regression on this branch. | Job log around the `curl … install.sh \| sh` step. |
+| `macos`/`ubuntu`/`fedora` job fails on `openshell status` | Local gateway service did not start (systemd/brew/podman). Often a driver issue. | Service logs in the job log; `OPENSHELL_DRIVERS` env in the "Ensure …" step. |
+| `kubernetes` job fails on `helm install --wait` | Chart did not deploy in 5 min — usually image pull failure or readiness probe failing. | "Diagnostics on failure" step dumps `helm status`, manifest, pod describe, pod logs. |
+| `kubernetes` job fails on `kubectl wait` | Gateway pod stuck `CrashLoopBackOff` or `ImagePullBackOff`. | Diagnostics dump; check `:dev` image existence at `ghcr.io/nvidia/openshell/gateway`. |
+| `kubernetes` job fails on `openshell gateway add` or `status` | Port-forward not reachable, or CLI/gateway proto mismatch. | `port-forward.log` and `openshell gateway list` in the diagnostics dump. |
+
+The `kubernetes` job's diagnostics step (which runs only on `if: failure()`) emits, in order: helm status, rendered manifest, `kubectl get all`, pod descriptions, pod logs (200 lines per container), port-forward log, gateway list, CLI version. Read it top to bottom; most failures become apparent by the time you reach the manifest or pod logs.
+
+## Related
+
+- `helm-dev-environment` skill — local k3d-based dev environment (more featureful than the canary's kind cluster, but uses Skaffold-built local images, not published artifacts).
+- `watch-github-actions` skill — generic `gh run` workflow monitoring.
+- `debug-openshell-cluster` skill — runtime gateway/sandbox diagnostics that pair with the kind job's diagnostics dump.
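
The chart-version and `install.sh` URL rules described in the skill above can be sketched as two small helpers. Both function names are hypothetical, the `https://` scheme is an assumption, and whether `<sha>` is a full or abbreviated commit SHA is not stated in the skill, so the value is passed through unchanged.

```shell
# Hypothetical helpers (not part of the release-canary workflow).

# Print the Helm chart version to test: the floating 0.0.0-dev when no SHA is
# given, otherwise the immutable 0.0.0-dev.<sha> form.
canary_chart_version() {
  sha="${1:-}"
  if [ -n "$sha" ]; then
    printf '0.0.0-dev.%s\n' "$sha"
  else
    printf '0.0.0-dev\n'
  fi
}

# Print the install.sh URL. The first argument mirrors
# github.event.workflow_run.head_sha (empty on manual dispatch); the second
# mirrors github.sha and is used as the fallback.
canary_install_url() {
  head_sha="${1:-}"
  fallback_sha="${2:-}"
  printf 'https://raw.githubusercontent.com/NVIDIA/OpenShell/%s/install.sh\n' \
    "${head_sha:-$fallback_sha}"
}
```

For a local kind reproduction, the first helper could feed `helm install ... --version "$(canary_chart_version "$sha")"`, leaving `sha` empty to test the floating chart.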
@@ -0,0 +1,8 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+[env]
+# z3-sys bindgen needs the z3 include path. On some distros (e.g. RHEL/Fedora)
+# the header lives in /usr/include/z3/ rather than /usr/include/. The extra -I
+# is harmless on systems where the path doesn't exist.
+BINDGEN_EXTRA_CLANG_ARGS = "-I/usr/include/z3"