Commit 461e382

chore: merge main into l7 wildcard fix
# Conflicts: # mise.lock
2 parents 2ec9fba + e3f009f commit 461e382

335 files changed

Lines changed: 46170 additions & 11783 deletions

.agents/skills/build-from-issue/SKILL.md

Lines changed: 1 addition & 0 deletions
@@ -148,6 +148,7 @@ In the prompt, instruct the reviewer to:
 - **Medium**: Multiple files/components, some design decisions, but well-scoped
 - **High**: Cross-cutting changes, architectural decisions needed, significant unknowns
 8. Call out risks, unknowns, and decisions that need stakeholder input.
+9. Assess **LSM compatibility**: if the change touches process identity, `/proc` filesystem access, binary execution, or inter-process visibility, flag whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. In particular, tests that fork+exec into system binaries will fail on SELinux-enforcing hosts due to cross-label `/proc/<pid>/exe` access restrictions.

 ### A2: Post the Plan Comment

.agents/skills/create-spike/SKILL.md

Lines changed: 3 additions & 1 deletion
@@ -91,7 +91,9 @@ The prompt to the reviewer **must** instruct it to:

 9. **Check architecture docs** in the `architecture/` directory for relevant documentation about the affected subsystems.

-10. **Determine the issue type:** `feat`, `fix`, `refactor`, `chore`, `perf`, or `docs`.
+10. **Assess Linux Security Module (LSM) impact.** If the change involves process identity, `/proc` filesystem access, file labeling, binary execution, or inter-process visibility, call out whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. For example: reading `/proc/<pid>/exe` across an SELinux domain boundary returns ENOENT, not EACCES. Tests that fork+exec into system binaries (different SELinux label) will fail on enforcing hosts. Flag any LSM-sensitive code paths and recommend mitigations.
+
+11. **Determine the issue type:** `feat`, `fix`, `refactor`, `chore`, `perf`, or `docs`.

 ### What makes a good investigation prompt
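As a rough illustration of the `/proc/<pid>/exe` behaviour called out in step 10 (not part of this commit), a helper can separate "the process is gone" from "the process exists but its exe link cannot be read". The function name `exe_status` and its output strings are hypothetical. Kernel threads also have no readable exe link, so the `unreadable` result is only a hint of an LSM denial, not proof of one.

```shell
# Hypothetical helper: classify the result of reading /proc/<pid>/exe.
# Prints one of: "ok <path>", "gone", "unreadable".
exe_status() {
  if [ ! -d "/proc/$1" ]; then
    # No /proc entry at all: the process does not exist (or /proc is not mounted).
    echo "gone"
  elif target=$(readlink "/proc/$1/exe" 2>/dev/null); then
    echo "ok ${target}"
  else
    # The process exists but the link could not be read, e.g. a cross-label
    # SELinux access that surfaces as ENOENT rather than EACCES.
    echo "unreadable"
  fi
}

exe_status "$$"
```

Checking for the `/proc/<pid>` directory first is what removes the ambiguity of a bare ENOENT from the link read.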

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 35 additions & 2 deletions
@@ -67,13 +67,42 @@ docker run --rm --entrypoint /openshell-sandbox "${OPENSHELL_DOCKER_SUPERVISOR_I
 openshell status
 ```

+For Docker GPU failures, check CDI support and NVIDIA CDI discovery separately:
+
+```bash
+docker info --format '{{json .CDISpecDirs}}'
+docker info --format '{{json .DiscoveredDevices}}'
+for dir in /etc/cdi /var/run/cdi; do
+  if [ -d "$dir" ]; then
+    find "$dir" -maxdepth 1 -type f \( -name '*.yaml' -o -name '*.json' \) -print
+  else
+    echo "$dir missing"
+  fi
+done
+systemctl is-enabled nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true
+systemctl is-active nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true
+systemctl status nvidia-cdi-refresh.service nvidia-cdi-refresh.path --no-pager --lines=50
+journalctl -u nvidia-cdi-refresh.service --no-pager --lines=100
+```
+
+When the NVIDIA Container Toolkit CDI refresh units are not enabled or no NVIDIA CDI spec has been generated, enable them and trigger a refresh:
+
+```bash
+sudo systemctl enable --now nvidia-cdi-refresh.path
+sudo systemctl enable --now nvidia-cdi-refresh.service
+sudo systemctl restart nvidia-cdi-refresh.service
+docker info --format '{{json .DiscoveredDevices}}'
+```
+
 Common findings:

 - Docker daemon unavailable: start Docker Desktop or Docker Engine.
 - Gateway process stopped: inspect exit status and logs.
 - Sandbox image missing or pull denied: verify image reference and registry credentials.
 - Docker driver cannot initialize because it cannot find `openshell-sandbox`: verify `OPENSHELL_DOCKER_SUPERVISOR_BIN`, the sibling binary next to `openshell-gateway`, or the configured supervisor image contains `/openshell-sandbox`.
 - Sandbox never registers: check gateway logs and supervisor callback endpoint.
+- Supervisor image exits before printing `openshell-sandbox --version`: the image should be the scratch supervisor image from `deploy/docker/Dockerfile.supervisor` and must contain a static executable at `/openshell-sandbox`.
+- `mise run e2e:docker:gpu` fails with `docker info --format json did not report any discovered NVIDIA CDI GPU devices`: Docker may report `CDISpecDirs` while still having no generated NVIDIA CDI specs. Verify `.DiscoveredDevices` contains entries such as `nvidia.com/gpu=all`, verify `/etc/cdi` or `/var/run/cdi` contains a generated NVIDIA spec, and check that `nvidia-cdi-refresh.service` and `nvidia-cdi-refresh.path` from NVIDIA Container Toolkit are enabled and healthy. The service is a one-shot unit, so `inactive (dead)` can be normal after a successful run; use `systemctl status` and `journalctl` to distinguish success from a skipped or failed refresh. NVIDIA recommends enabling the path and service units, and restarting `nvidia-cdi-refresh.service` to regenerate missing or stale CDI specs. If specs are generated but Docker still reports no discovered devices, restart Docker or reload the daemon and re-check `docker info`.

 For source checkout development, restart the local gateway with:

@@ -113,7 +142,6 @@ Check required Helm deployment secrets:

 ```bash
 kubectl -n openshell get secret \
-  openshell-ssh-handshake \
   openshell-server-tls \
   openshell-server-client-ca \
   openshell-client-tls
@@ -126,7 +154,11 @@ kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec
 helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
 ```

-The gateway image and `server.supervisorImage` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
+The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
+
+For local/external pull mode (the default local path via `mise run cluster`), local images are tagged to the configured local registry base, pushed to that registry, and pulled by k3s via the `registries.yaml` mirror endpoint. The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).
+
+Gateway image builds stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate, including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`, is copied into the staged workspace there.

 For plaintext local evaluation, confirm the chart has:

@@ -196,6 +228,7 @@ openshell logs <sandbox-name>
 | `openshell status` fails | Gateway endpoint unreachable or auth mismatch | `openshell gateway info`, gateway logs |
 | Gateway starts but sandbox create fails | Compute driver cannot reach runtime | Docker/Podman/Kubernetes/VM driver logs |
 | Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs |
+| Docker GPU e2e fails before GPU sandbox comparison | NVIDIA CDI specs are missing or Docker has not discovered them | `docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service` |
 | Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod <pod>` |
 | Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell` |
 | CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
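The CDI triage added in this file comes down to one question: does Docker list any NVIDIA CDI device? A minimal sketch of that check follows. The helper name `has_nvidia_cdi_device` is hypothetical, and it only pattern-matches the `nvidia.com/gpu=` prefix mentioned in the findings rather than parsing JSON, because the exact shape of the `.DiscoveredDevices` output is assumed here, not confirmed by this commit.

```shell
# Hypothetical helper: succeed if the text on stdin mentions an NVIDIA CDI device.
# Intended input: docker info --format '{{json .DiscoveredDevices}}'
has_nvidia_cdi_device() {
  grep -q 'nvidia\.com/gpu='
}

# Example with canned input (this JSON shape is an assumption).
if printf '%s\n' '[{"Source":"cdi","ID":"nvidia.com/gpu=all"}]' | has_nvidia_cdi_device; then
  echo "NVIDIA CDI devices discovered"
else
  echo "no NVIDIA CDI devices discovered"
fi
```

On a real host, pipe the `docker info` command into the helper instead of the canned string; a failure then points at the `nvidia-cdi-refresh` units or the spec directories listed in the table.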

.agents/skills/openshell-cli/SKILL.md

Lines changed: 1 addition & 0 deletions
@@ -141,6 +141,7 @@ openshell sandbox create \
 Key flags:
 - `--provider`: Attach one or more providers (repeatable)
 - `--policy`: Custom policy YAML (otherwise uses built-in default or `OPENSHELL_SANDBOX_POLICY` env var)
+- `--cpu`, `--memory`: Set per-sandbox compute sizing. Docker/Podman apply limits; Kubernetes applies matching requests and limits.
 - `--upload <PATH>[:<DEST>]`: Upload local files into the sandbox (default dest: `/sandbox`)
 - `--no-keep`: Delete the sandbox after the initial command or shell exits
 - `--forward <PORT>`: Forward a local port and keep the sandbox alive

.agents/skills/openshell-cli/cli-reference.md

Lines changed: 2 additions & 0 deletions
@@ -143,6 +143,8 @@ Create a sandbox through the active gateway, wait for readiness, then connect or
 | `--no-keep` | Delete sandbox after the initial command or shell exits |
 | `--provider <NAME>` | Provider to attach (repeatable) |
 | `--policy <PATH>` | Path to custom policy YAML |
+| `--cpu <QUANTITY>` | CPU amount for the sandbox (for example: `500m`, `1`, `2.5`) |
+| `--memory <QUANTITY>` | Memory amount for the sandbox (for example: `512Mi`, `4Gi`, `8G`) |
 | `--forward <PORT>` | Forward local port to sandbox (keeps the sandbox alive) |
 | `--tty` | Force pseudo-terminal allocation |
 | `--no-tty` | Disable pseudo-terminal allocation |
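To make the quantity examples concrete, here is a sketch of how values such as `500m` or `512Mi` could be normalized. It assumes Kubernetes-style suffixes (`m` for millicores, `Ki`/`Mi`/`Gi` as powers of 1024, `K`/`M`/`G` as powers of 1000), which this commit does not spell out. The function names are hypothetical, and this is not how the OpenShell CLI itself parses the flags.

```shell
# Hypothetical helpers, assuming Kubernetes-style quantity suffixes.

# Multiply a decimal value by a factor and print the rounded integer result.
scale() {
  awk -v v="$1" -v f="$2" 'BEGIN { printf "%.0f\n", v * f }'
}

# Convert a CPU quantity (500m, 1, 2.5) to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  scale "$1" 1000 ;;
  esac
}

# Convert a memory quantity (512Mi, 4Gi, 8G) to bytes.
# Binary suffixes are matched before decimal ones so "Mi" is not read as "M".
memory_to_bytes() {
  case "$1" in
    *Ki) scale "${1%Ki}" 1024 ;;
    *Mi) scale "${1%Mi}" 1048576 ;;
    *Gi) scale "${1%Gi}" 1073741824 ;;
    *K)  scale "${1%K}" 1000 ;;
    *M)  scale "${1%M}" 1000000 ;;
    *G)  scale "${1%G}" 1000000000 ;;
    *)   printf '%s\n' "$1" ;;
  esac
}

cpu_to_millicores 2.5
memory_to_bytes 512Mi
```

Under these assumptions `8G` and `8Gi` differ by roughly 7 percent, which is worth keeping in mind when comparing limits across the Docker, Podman, and Kubernetes drivers.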
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
+---
+name: test-release-canary
+description: Manually dispatch and iterate on the Release Canary workflow that smoke-tests published OpenShell artifacts (install.sh on macOS/Ubuntu/Fedora, Helm chart on kind) after each Release Dev publish. Use when changing `.github/workflows/release-canary.yml`, validating a release before tagging, debugging a canary failure, or reproducing a canary job locally. Trigger keywords - release canary, release-canary, canary failed, canary dispatch, test release canary, post-release smoke, install.sh canary, helm chart canary, kind canary, dispatch canary.
+---
+
+# Test Release Canary
+
+The Release Canary (`.github/workflows/release-canary.yml`) smoke-tests the artifacts a `Release Dev` run just published. It is the last automated checkpoint before tagging a public release: if the canary is red, the published `dev` artifacts do not install on a stock environment.
+
+## What the canary verifies
+
+| Job | Runner | Verifies |
+|---|---|---|
+| `macos` | `macos-latest-xlarge` | `install.sh` resolves the Homebrew formula, brew installs the cask, and `openshell status` reaches the brew-services–backed local gateway with the VM driver. |
+| `ubuntu` | `ubuntu-latest` | `install.sh` installs the Debian package, the post-install systemd user service starts, and `openshell status` reaches the local gateway with the Docker driver. |
+| `fedora` | `fedora:latest` container | `install.sh` installs the RPM packages, the local gateway starts under Podman, and `openshell status` succeeds. |
+| `kubernetes` | `ubuntu-latest` + kind | `helm install oci://ghcr.io/nvidia/openshell/helm-chart --version 0.0.0-dev` succeeds in a kind cluster, the gateway pod becomes Ready, port-forward exposes 8080, and the released CLI registers the in-cluster gateway and runs `openshell status` against it. |
+
+`install.sh` defaults to the *latest tagged* release; the canary is therefore checking that the most recent public release still installs, not the just-published `dev` build. The `kubernetes` job is the exception: it pins to `0.0.0-dev` chart + `:dev` images.
+
+## Trigger paths
+
+The workflow has two triggers:
+
+```yaml
+on:
+  workflow_dispatch:
+  workflow_run:
+    workflows: ["Release Dev"]
+    types: [completed]
+```
+
+- **Automatic.** Every successful `Release Dev` run (on `main` or a manual dispatch of Release Dev) fires the canary. Each job gates on `github.event.workflow_run.conclusion == 'success'` so a failed Release Dev does not run the canary.
+- **Manual.** `workflow_dispatch` lets you run the canary on demand against any branch's workflow definition.
+
+When dispatched manually, `github.event.workflow_run.head_sha` is empty and the workflow falls back to `github.sha` (the branch tip) for the `install.sh` URL.
+
+## Manual dispatch
+
+Run the canary as-is on the current branch:
+
+```shell
+gh workflow run release-canary.yml --ref "$(git branch --show-current)"
+```
+
+Watch the run that starts:
+
+```shell
+sleep 5 # let GitHub register the dispatch
+gh run list --workflow release-canary.yml --limit 1
+gh run watch "$(gh run list --workflow release-canary.yml --limit 1 --json databaseId --jq '.[0].databaseId')"
+```
+
+View only failed jobs after completion:
+
+```shell
+gh run view <run-id> --log-failed
+```
+
+## Iterating on the canary itself
+
+When you change `release-canary.yml` on a branch, a manual dispatch on that branch tests *your branch's workflow logic* against *main's published artifacts* (`0.0.0-dev` chart, `:dev` images, latest tagged install.sh assets). This is what you want for iterating on the canary: you're validating that the canary still works against known-good artifacts.
+
+Note `install.sh` is pulled from `raw.githubusercontent.com/NVIDIA/OpenShell/${head_sha}/install.sh`, so changes to `install.sh` on your branch *are* exercised even though the binaries it downloads are from the latest public tag.
+
+## Testing artifacts from a specific SHA
+
+`Release Dev` publishes two chart versions for every dev build (see `.github/actions/release-helm-oci/action.yml:89-102`):
+
+- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev`: floating, overwritten on every main push.
+- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev.<sha>`: immutable, `appVersion` set to the same SHA so it pulls `ghcr.io/nvidia/openshell/gateway:<sha>` and `:supervisor:<sha>`.
+
+To smoke-test the chart for a specific dev build, dispatch `Release Dev` on the branch first, then run the kind canary steps locally pointed at the SHA-pinned chart (see "Local kind reproduction" below). The release-canary workflow itself does not currently expose `chart_version` / `image_tag` inputs.
+
+## Local kind reproduction
+
+The `kubernetes` job can be reproduced on any machine with Docker and `mise install`-provided `kubectl` + `helm`:
+
+```shell
+kind create cluster --name release-canary-local
+
+helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \
+  --version 0.0.0-dev \
+  --namespace openshell --create-namespace \
+  --set server.disableTls=true \
+  --set pkiInitJob.enabled=false \
+  --wait --timeout 5m
+
+kubectl wait --namespace openshell \
+  --for=condition=Ready pod \
+  --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=openshell" \
+  --timeout=300s
+
+kubectl port-forward --namespace openshell svc/openshell 8080:8080 &
+openshell gateway add http://127.0.0.1:8080 --local --name kind
+openshell status
+```
+
+Swap `0.0.0-dev` for `0.0.0-dev.<sha>` to pin to a specific dev build. Tear down with `kind delete cluster --name release-canary-local`.
+
+Loopback registration auto-derives the gateway name to `openshell` if `--name` is omitted, which collides with the `install.sh`-installed local gateway; always pass `--name kind` (or another distinct name) when registering in addition to a local install.
+
+## Diagnosing failures
+
+| Symptom | Likely cause | Where to look |
+|---|---|---|
+| `macos`/`ubuntu`/`fedora` job fails on `install.sh` | Latest tagged release missing an asset, checksum mismatch, or `install.sh` regression on this branch. | Job log around the `curl … install.sh \| sh` step. |
+| `macos`/`ubuntu`/`fedora` job fails on `openshell status` | Local gateway service did not start (systemd/brew/podman). Often a driver issue. | Service logs in the job log; `OPENSHELL_DRIVERS` env in the "Ensure …" step. |
+| `kubernetes` job fails on `helm install --wait` | Chart did not deploy in 5 min, usually image pull failure or readiness probe failing. | "Diagnostics on failure" step dumps `helm status`, manifest, pod describe, pod logs. |
+| `kubernetes` job fails on `kubectl wait` | Gateway pod stuck `CrashLoopBackOff` or `ImagePullBackOff`. | Diagnostics dump; check `:dev` image existence at `ghcr.io/nvidia/openshell/gateway`. |
+| `kubernetes` job fails on `openshell gateway add` or `status` | Port-forward not reachable, or CLI/gateway proto mismatch. | `port-forward.log` and `openshell gateway list` in the diagnostics dump. |
+
+The `kubernetes` job's diagnostics step (only runs `if: failure()`) emits, in order: helm status, rendered manifest, `kubectl get all`, pod descriptions, pod logs (200 lines per container), port-forward log, gateway list, CLI version. Read it top-to-bottom; most failures fall out by the manifest or pod logs.
+
+## Related
+
+- `helm-dev-environment` skill: local k3d-based dev environment (more featureful than the canary's kind cluster, but uses Skaffold-built local images, not published artifacts).
+- `watch-github-actions` skill: generic `gh run` workflow monitoring.
+- `debug-openshell-cluster` skill: runtime gateway/sandbox diagnostics that pair with the kind job's diagnostics dump.
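The SHA fallback and chart pinning described in this skill can be sketched as two small helpers. The function names are hypothetical, the `https://` scheme is assumed, and the logic is an approximation written from the prose above, not a copy of `release-canary.yml`.

```shell
# Hypothetical helpers written from the skill's description, not from the workflow file.

# Build the install.sh URL: use the triggering run's head SHA when set,
# otherwise fall back to the branch tip (the manual-dispatch case).
install_sh_url() {
  head_sha="$1"
  fallback_sha="$2"
  printf 'https://raw.githubusercontent.com/NVIDIA/OpenShell/%s/install.sh\n' \
    "${head_sha:-$fallback_sha}"
}

# Build the chart version: floating 0.0.0-dev, or 0.0.0-dev.<sha> when a SHA is given.
chart_version() {
  if [ -n "${1:-}" ]; then
    printf '0.0.0-dev.%s\n' "$1"
  else
    printf '0.0.0-dev\n'
  fi
}

# "abc1234" is a placeholder, not a real commit.
install_sh_url "" "abc1234"
chart_version "abc1234"
```

The output of `chart_version` is the value to pass to `helm install --version` when reproducing the `kubernetes` job against a pinned dev build.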

.cargo/config.toml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+[env]
+# z3-sys bindgen needs the z3 include path. On some distros (e.g. RHEL/Fedora)
+# the header lives in /usr/include/z3/ rather than /usr/include/. The extra -I
+# is harmless on systems where the path doesn't exist.
+BINDGEN_EXTRA_CLANG_ARGS = "-I/usr/include/z3"
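For context, a quick way to see which include directory actually holds the z3 header on a given machine is sketched below. The helper name `find_z3_header_dir` is hypothetical, and `z3.h` as the header filename is an assumption that this commit does not state.

```shell
# Hypothetical helper: print the first directory argument that contains z3.h.
find_z3_header_dir() {
  for dir in "$@"; do
    if [ -f "${dir}/z3.h" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

find_z3_header_dir /usr/include /usr/include/z3 \
  || echo "z3.h not found in the checked directories"
```

If the helper prints `/usr/include/z3`, the extra `-I` flag in this config is what lets bindgen find the header on that host.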
