| 1 | +--- |
| 2 | +name: test-release-canary |
| 3 | +description: Manually dispatch and iterate on the Release Canary workflow that smoke-tests published OpenShell artifacts (install.sh on macOS/Ubuntu/Fedora, Helm chart on kind) after each Release Dev publish. Use when changing `.github/workflows/release-canary.yml`, validating a release before tagging, debugging a canary failure, or reproducing a canary job locally. Trigger keywords - release canary, release-canary, canary failed, canary dispatch, test release canary, post-release smoke, install.sh canary, helm chart canary, kind canary, dispatch canary. |
| 4 | +--- |
| 5 | + |
| 6 | +# Test Release Canary |
| 7 | + |
| 8 | +The Release Canary (`.github/workflows/release-canary.yml`) runs after each `Release Dev` publish and smoke-tests published OpenShell artifacts. It is the last automated checkpoint before tagging a public release: if the canary is red, either the latest tagged release no longer installs through `install.sh`, or the `dev` Helm chart and images do not deploy on a stock environment. |
| 9 | + |
| 10 | +## What the canary verifies |
| 11 | + |
| 12 | +| Job | Runner | Verifies | |
| 13 | +|---|---|---| |
| 14 | +| `macos` | `macos-latest-xlarge` | `install.sh` resolves the Homebrew formula, brew installs the cask, and `openshell status` reaches the brew-services–backed local gateway with the VM driver. | |
| 15 | +| `ubuntu` | `ubuntu-latest` | `install.sh` installs the Debian package, the post-install systemd user service starts, and `openshell status` reaches the local gateway with the Docker driver. | |
| 16 | +| `fedora` | `fedora:latest` container | `install.sh` installs the RPM packages, the local gateway starts under Podman, and `openshell status` succeeds. | |
| 17 | +| `kubernetes` | `ubuntu-latest` + kind | `helm install oci://ghcr.io/nvidia/openshell/helm-chart --version 0.0.0-dev` succeeds in a kind cluster, the gateway pod becomes Ready, port-forward exposes 8080, and the released CLI registers the in-cluster gateway and runs `openshell status` against it. | |
| 18 | + |
| 19 | +`install.sh` defaults to the *latest tagged* release, so the `macos`, `ubuntu`, and `fedora` jobs check that the most recent public release still installs, not the just-published `dev` build. The `kubernetes` job is the exception: it pins the `0.0.0-dev` chart and the `:dev` images. |
| 20 | + |
| 21 | +## Trigger paths |
| 22 | + |
| 23 | +The workflow has two triggers: |
| 24 | + |
| 25 | +```yaml |
| 26 | +on: |
| 27 | + workflow_dispatch: |
| 28 | + workflow_run: |
| 29 | + workflows: ["Release Dev"] |
| 30 | + types: [completed] |
| 31 | +``` |
| 32 | + |
| 33 | +- **Automatic.** The `workflow_run` trigger fires whenever a `Release Dev` run completes (on `main` or a manual dispatch of Release Dev). Each job gates on `github.event.workflow_run.conclusion == 'success'`, so the canary jobs run only after a successful Release Dev and are skipped after a failed one. |
| 34 | +- **Manual.** `workflow_dispatch` lets you run the canary on demand against any branch's workflow definition. |
| 35 | + |
| 36 | +When dispatched manually, `github.event.workflow_run.head_sha` is empty and the workflow falls back to `github.sha` (the branch tip) for the `install.sh` URL. |
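
The fallback described above can be sketched in shell. This is a minimal illustration, not the workflow's actual step: the function name `install_sh_url` and its two positional arguments are hypothetical stand-ins for `github.event.workflow_run.head_sha` and `github.sha`, and the `https://` scheme is assumed. The path follows the `raw.githubusercontent.com/NVIDIA/OpenShell/<sha>/install.sh` shape this skill cites for `install.sh`.

```shell
# Hypothetical helper: pick the SHA that install.sh is fetched from.
# $1 stands in for github.event.workflow_run.head_sha (empty on manual dispatch).
# $2 stands in for github.sha (the tip of the dispatched branch).
install_sh_url() {
  head_sha="$1"
  fallback_sha="$2"
  # ${var:-default} substitutes the default when var is unset or empty.
  sha="${head_sha:-$fallback_sha}"
  printf 'https://raw.githubusercontent.com/NVIDIA/OpenShell/%s/install.sh\n' "$sha"
}

# Automatic run: the commit of the triggering Release Dev run is used.
install_sh_url "abc123" "def456"
# Manual dispatch: head_sha is empty, so the branch tip is used.
install_sh_url "" "def456"
```

In the workflow itself, the same choice would typically be written as a GitHub Actions expression rather than a shell function; the sketch only shows the selection logic.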
| 37 | + |
| 38 | +## Manual dispatch |
| 39 | + |
| 40 | +Run the canary as-is on the current branch: |
| 41 | + |
| 42 | +```shell |
| 43 | +gh workflow run release-canary.yml --ref "$(git branch --show-current)" |
| 44 | +``` |
| 45 | + |
| 46 | +Watch the run that starts: |
| 47 | + |
| 48 | +```shell |
| 49 | +sleep 5 # let GitHub register the dispatch |
| 50 | +gh run list --workflow release-canary.yml --limit 1 |
| 51 | +gh run watch "$(gh run list --workflow release-canary.yml --limit 1 --json databaseId --jq '.[0].databaseId')" |
| 52 | +``` |
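
A fixed `sleep 5` followed by "take the newest run" can pick up an unrelated run if the dispatch registers slowly or another run starts at the same time. One possible refinement, sketched below, is to poll and filter by branch and event. The helper name `latest_dispatch_run_id` and its retry defaults are hypothetical; the `gh run list` flags used (`--workflow`, `--branch`, `--event`, `--limit`, `--json`, `--jq`) are standard `gh` options.

```shell
# Hypothetical helper: print the databaseId of the newest manually dispatched
# canary run on a branch, retrying while GitHub registers the dispatch.
# $1 = branch name, $2 = maximum attempts (default 10).
latest_dispatch_run_id() {
  branch="$1"
  attempts="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    run_id="$(gh run list \
      --workflow release-canary.yml \
      --branch "$branch" \
      --event workflow_dispatch \
      --limit 1 \
      --json databaseId --jq '.[0].databaseId')"
    # Treat both an empty result and a literal "null" as "not registered yet".
    if [ -n "$run_id" ] && [ "$run_id" != "null" ]; then
      printf '%s\n' "$run_id"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "no workflow_dispatch run found on $branch" >&2
  return 1
}

# Usage (not executed here):
#   gh run watch "$(latest_dispatch_run_id "$(git branch --show-current)")"
```

This narrows the lookup but does not eliminate the race: an earlier manual dispatch on the same branch can still be returned if the new run has not been registered yet.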
| 53 | + |
| 54 | +View only failed jobs after completion: |
| 55 | + |
| 56 | +```shell |
| 57 | +gh run view <run-id> --log-failed |
| 58 | +``` |
| 59 | + |
| 60 | +## Iterating on the canary itself |
| 61 | + |
| 62 | +When you change `release-canary.yml` on a branch, a manual dispatch on that branch tests *your branch's workflow logic* against *main's published artifacts* (`0.0.0-dev` chart, `:dev` images, latest tagged `install.sh` assets). This is what you want when iterating on the canary: it validates that the workflow still passes against known-good artifacts. |
| 63 | + |
| 64 | +Note that `install.sh` itself is pulled from `raw.githubusercontent.com/NVIDIA/OpenShell/${head_sha}/install.sh`, so changes to `install.sh` on your branch *are* exercised even though the binaries it downloads come from the latest public tag. |
| 65 | + |
| 66 | +## Testing artifacts from a specific SHA |
| 67 | + |
| 68 | +`Release Dev` publishes two chart versions for every dev build (see `.github/actions/release-helm-oci/action.yml:89-102`): |
| 69 | + |
| 70 | +- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev`: floating, overwritten on every main push. |
| 71 | +- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev.<sha>`: immutable, with `appVersion` set to the same SHA so it pulls `ghcr.io/nvidia/openshell/gateway:<sha>` and the matching `supervisor:<sha>` image. |
| 72 | + |
| 73 | +To smoke-test the chart for a specific dev build, dispatch `Release Dev` on the branch first, then run the kind canary steps locally pointed at the SHA-pinned chart (see "Local kind reproduction" below). The release-canary workflow itself does not currently expose `chart_version` / `image_tag` inputs. |
| 74 | + |
| 75 | +## Local kind reproduction |
| 76 | + |
| 77 | +The `kubernetes` job can be reproduced on any machine with Docker, `kind`, the released `openshell` CLI, and `mise install`-provided `kubectl` + `helm`: |
| 78 | + |
| 79 | +```shell |
| 80 | +kind create cluster --name release-canary-local |
| 81 | + |
| 82 | +helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \ |
| 83 | + --version 0.0.0-dev \ |
| 84 | + --namespace openshell --create-namespace \ |
| 85 | + --set server.disableTls=true \ |
| 86 | + --set pkiInitJob.enabled=false \ |
| 87 | + --wait --timeout 5m |
| 88 | + |
| 89 | +kubectl wait --namespace openshell \ |
| 90 | + --for=condition=Ready pod \ |
| 91 | + --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=openshell" \ |
| 92 | + --timeout=300s |
| 93 | + |
| 94 | +kubectl port-forward --namespace openshell svc/openshell 8080:8080 & |
| 95 | +openshell gateway add http://127.0.0.1:8080 --local --name kind |
| 96 | +openshell status |
| 97 | +``` |
| 98 | + |
| 99 | +Swap `0.0.0-dev` for `0.0.0-dev.<sha>` to pin to a specific dev build. Tear down with `kind delete cluster --name release-canary-local`. |
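
The swap between the floating and SHA-pinned chart versions can be captured in a small helper. This is a sketch only: `canary_chart_version` is a hypothetical name, and whether `<sha>` is the full or an abbreviated commit SHA is not stated here, so check the release action before relying on it.

```shell
# Hypothetical helper: build the value passed to `helm install --version`.
# With no argument it returns the floating dev version; with a SHA it returns
# the immutable per-build version in the 0.0.0-dev.<sha> form.
canary_chart_version() {
  sha="${1:-}"
  if [ -n "$sha" ]; then
    printf '0.0.0-dev.%s\n' "$sha"
  else
    printf '0.0.0-dev\n'
  fi
}

# Usage (not executed here):
#   helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \
#     --version "$(canary_chart_version "$SHA")" \
#     --namespace openshell --create-namespace
```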
| 100 | + |
| 101 | +If `--name` is omitted, loopback registration auto-derives the gateway name `openshell`, which collides with the local gateway that `install.sh` sets up. Always pass `--name kind` (or another distinct name) when registering the kind gateway on a machine that also has a local install. |
| 102 | + |
| 103 | +## Diagnosing failures |
| 104 | + |
| 105 | +| Symptom | Likely cause | Where to look | |
| 106 | +|---|---|---| |
| 107 | +| `macos`/`ubuntu`/`fedora` job fails on `install.sh` | Latest tagged release missing an asset, checksum mismatch, or `install.sh` regression on this branch. | Job log around the `curl … install.sh \| sh` step. | |
| 108 | +| `macos`/`ubuntu`/`fedora` job fails on `openshell status` | Local gateway service did not start (systemd/brew/podman). Often a driver issue. | Service logs in the job log; `OPENSHELL_DRIVERS` env in the "Ensure …" step. | |
| 109 | +| `kubernetes` job fails on `helm install --wait` | Chart did not deploy in 5 min — usually image pull failure or readiness probe failing. | "Diagnostics on failure" step dumps `helm status`, manifest, pod describe, pod logs. | |
| 110 | +| `kubernetes` job fails on `kubectl wait` | Gateway pod stuck `CrashLoopBackOff` or `ImagePullBackOff`. | Diagnostics dump; check `:dev` image existence at `ghcr.io/nvidia/openshell/gateway`. | |
| 111 | +| `kubernetes` job fails on `openshell gateway add` or `status` | Port-forward not reachable, or CLI/gateway proto mismatch. | `port-forward.log` and `openshell gateway list` in the diagnostics dump. | |
| 112 | + |
| 113 | +The `kubernetes` job's diagnostics step (runs only when the job fails, via `if: failure()`) emits, in order: helm status, rendered manifest, `kubectl get all`, pod descriptions, pod logs (200 lines per container), port-forward log, gateway list, CLI version. Read it top to bottom; most failures become apparent by the time you reach the rendered manifest or the pod logs. |
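
When reproducing a failure on a local kind cluster, a similar dump can be collected by hand. The sketch below follows the order listed above but is not the workflow's actual step: the function names, the `DRY_RUN` switch, the label selector, and `openshell --version` are assumptions, and `port-forward.log` exists only if you redirected the `kubectl port-forward` output to that file.

```shell
# Sketch of a local diagnostics dump. Set DRY_RUN=1 to print the commands
# instead of running them.
NAMESPACE="${NAMESPACE:-openshell}"
RELEASE="${RELEASE:-openshell}"

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    # Keep going after a failed command so later sections still print.
    "$@" || true
  fi
}

dump_canary_diagnostics() {
  run helm status "$RELEASE" --namespace "$NAMESPACE"
  run helm get manifest "$RELEASE" --namespace "$NAMESPACE"
  run kubectl get all --namespace "$NAMESPACE"
  run kubectl describe pods --namespace "$NAMESPACE"
  run kubectl logs --namespace "$NAMESPACE" \
    --selector "app.kubernetes.io/instance=$RELEASE" \
    --all-containers --prefix --tail=200
  run cat port-forward.log          # assumes port-forward output was redirected here
  run openshell gateway list
  run openshell --version           # assumed flag; the source only says "CLI version"
}

# Usage (not executed here):
#   dump_canary_diagnostics > canary-diagnostics.txt 2>&1
```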
| 114 | + |
| 115 | +## Related |
| 116 | + |
| 117 | +- `helm-dev-environment` skill — local k3d-based dev environment (more featureful than the canary's kind cluster, but uses Skaffold-built local images, not published artifacts). |
| 118 | +- `watch-github-actions` skill — generic `gh run` workflow monitoring. |
| 119 | +- `debug-openshell-cluster` skill — runtime gateway/sandbox diagnostics that pair with the kind job's diagnostics dump. |