Commit 7bc4d60

docs: cover OCI/Azure image-test workflows and link from README
Add per-cloud documentation for the two new sanity-test workflows shipped in 339cacd (OCI) and 286026f (Azure):

- OCI_TEST.md describes oci-test.yml: input shape (Compute Image OCID), display-name regex, arch -> shape map, instance lifecycle, the six in-VM assertions, and OCI-Console links.
- AZURE_TEST.md describes azure-test.yml: Compute Gallery Path input, Kitten vs stable version parsing, VHD-URI-driven arch derivation, the 21 RBAC actions the OIDC SP needs, full VM + peers cleanup, and the same six in-VM assertions. Cross-references AZURE_GALLERY.md for the copy-pasteable compute_gallery_path that the release notification emits.

Update the CI/CD Workflows table in README.md to list both new docs alongside the existing per-cloud build/release/marketplace docs, and add one bullet to AZURE_GALLERY.md noting the structured "- Created: ..." line that azure_uploader.sh now emits for AZURE_TEST.md to consume.
1 parent 91ad4a1 commit 7bc4d60

4 files changed

Lines changed: 422 additions & 0 deletions

AZURE_GALLERY.md

Lines changed: 1 addition & 0 deletions
@@ -70,6 +70,7 @@ Bash script that handles the actual image conversion, upload, and gallery operat
- Supports dry-run mode (default) — pass `-f` to execute actual operations
- Calculates unique image indices to avoid naming collisions
- Replicates images to multiple target regions
- For each created image-version, prints a structured `- Created: '<gallery>/<image_def>/<image_ver>'` line that the release workflow captures into `IMAGE_INFO_SUMMARY` and forwards to Mattermost. The string is in the exact `compute_gallery_path` format consumed by [`AZURE_TEST.md`](AZURE_TEST.md), so the release notification ends with a copy-pasteable test-workflow input.

**Usage:**
```bash

AZURE_TEST.md

Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
# Azure Compute Gallery Image Testing

## Overview

This repository includes a GitHub Actions workflow that sanity-tests AlmaLinux OS image versions after they are published to an Azure Compute Gallery. The workflow launches a fresh VM from a given gallery image version, runs a small set of release, architecture, disk, and `dnf` assertions over SSH, collects the installed-package list, tears down the VM and its auto-created peer resources under `always()`, and posts a Mattermost summary.

It is the Azure counterpart of [`OCI_TEST.md`](OCI_TEST.md).

## Files

### `.github/workflows/azure-test.yml`

Workflow for validating a Compute Gallery image version end-to-end.

**What it does:**
- Accepts a `compute_gallery_path` of the form `gallery_name/vm_image_definition/vm_image_version` (e.g. `almalinux/almalinux-9-gen2/9.7.2026050101`)
- Resolves the gallery image-version resource ID and source VHD URI via `az sig image-version show`
- Derives the architecture from the source VHD filename using the same regex pair as [`AZURE_GALLERY.md`](AZURE_GALLERY.md), so any image definition that the release workflow publishes is automatically supported
- Generates an ephemeral ed25519 SSH keypair, creates a test VM with `az vm create --nsg-rule SSH`, waits for SSH, runs the assertions, then deletes the VM, OS disk, NIC, public IP, and NSG by their auto-generated names
- Uploads the package list as a workflow artifact
- Sends a Mattermost notification with portal links to the gallery image and the (now-deleted) test VM

**Usage:**
```
Trigger via GitHub UI: Actions → Azure: Test Image

Inputs:
- compute_gallery_path: gallery_name/vm_image_definition/vm_image_version
  (e.g. almalinux/almalinux-9-gen2/9.7.2026050101)
- notify_mattermost: Send notification to Mattermost (default: true)
```

The release workflow [`azure-to-gallery.yml`](AZURE_GALLERY.md) emits a structured `- Created: '<gallery>/<def>/<ver>'` line for every uploaded image-version, so the Mattermost release notification ends with a copy-pasteable `compute_gallery_path` for this workflow.
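
To illustrate how the structured `- Created: ...` line can be turned back into a workflow input, here is a minimal sketch. The function name `extract_created_paths` and the idea of reading the uploader output from stdin are assumptions made for this example; the real release workflow may capture the line differently.

```bash
# Hypothetical helper: print the quoted path from every "- Created: '...'" line on stdin.
# Assumes the path is wrapped in single quotes, as in the format described above.
extract_created_paths() {
  sed -n "s/^[[:space:]]*- Created: '\([^']*\)'.*/\1/p"
}
```

For example, `extract_created_paths < uploader.log` (with `uploader.log` being a hypothetical saved log) would print one `compute_gallery_path` per created image-version and ignore all other lines.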

## Required GitHub Configuration

### Secrets
| Secret | Description |
|--------|-------------|
| `AZURE_CLIENT_ID` | Azure service principal client ID |
| `AZURE_TENANT_ID` | Azure tenant ID |
| `AZURE_SUBSCRIPTION_ID` | Azure subscription ID |
| `MATTERMOST_WEBHOOK_URL` | Mattermost incoming webhook URL |

### Variables (`vars.*`)
| Variable | Description |
|----------|-------------|
| `MATTERMOST_CHANNEL` | Mattermost channel for notifications |

### GitHub Permissions
The workflow requires:
- `id-token: write` for Azure OIDC authentication via `azure/login@v3`
- `contents: read` for repository checkout

### Workflow-level `env`
Resource group and region are pinned at the workflow level (matching the convention in `tools/azure_uploader.sh`):

| Env | Value |
|-----|-------|
| `RESOURCE_GROUP` | `rg-alma-images` (holds both the gallery and the test VM) |
| `AZURE_LOCATION` | `East US` |
| `AZURE_PORTAL_BASE_URL` | `https://portal.azure.com/#@/resource` |
| `SSH_USER` (job-level) | `almalinux` |

## Required Azure RBAC

The OIDC service principal behind `AZURE_CLIENT_ID` needs the following 21 actions, assigned at the `rg-alma-images` resource-group scope:

```
Microsoft.Compute/galleries/images/read
Microsoft.Compute/virtualMachines/write
Microsoft.Compute/virtualMachines/delete
Microsoft.Compute/virtualMachines/deletePreservedOSDisk/action
Microsoft.Compute/disks/delete
Microsoft.Network/networkInterfaces/write
Microsoft.Network/networkInterfaces/join/action
Microsoft.Network/networkInterfaces/delete
Microsoft.Network/networkSecurityGroups/read
Microsoft.Network/networkSecurityGroups/write
Microsoft.Network/networkSecurityGroups/join/action
Microsoft.Network/networkSecurityGroups/delete
Microsoft.Network/publicIPAddresses/read
Microsoft.Network/publicIPAddresses/write
Microsoft.Network/publicIPAddresses/join/action
Microsoft.Network/publicIPAddresses/delete
Microsoft.Network/virtualNetworks/write
Microsoft.Network/virtualNetworks/subnets/join/action
Microsoft.Resources/deployments/read
Microsoft.Resources/deployments/write
Microsoft.Resources/deployments/operationStatuses/read
```

The same list is duplicated as a comment in the workflow header, so a future maintainer composing a least-privilege custom role does not have to rediscover it through trial-and-error dispatches.
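
As a starting point for such a custom role, the sketch below writes a role-definition JSON file containing the 21 actions listed above. The function name `write_role_definition`, the role name, and the description are placeholders invented for this example; the subscription ID is passed in by the caller, and the scope follows the `rg-alma-images` resource group named above.

```bash
# Hypothetical helper: write a custom-role definition with the 21 actions to a file.
# Role name and description are illustrative placeholders, not values from the workflow.
write_role_definition() {
  local subscription_id=$1 out_file=$2
  cat > "$out_file" <<EOF
{
  "Name": "AlmaLinux Image Test Runner (example)",
  "Description": "Least-privilege role sketch for azure-test.yml",
  "Actions": [
    "Microsoft.Compute/galleries/images/read",
    "Microsoft.Compute/virtualMachines/write",
    "Microsoft.Compute/virtualMachines/delete",
    "Microsoft.Compute/virtualMachines/deletePreservedOSDisk/action",
    "Microsoft.Compute/disks/delete",
    "Microsoft.Network/networkInterfaces/write",
    "Microsoft.Network/networkInterfaces/join/action",
    "Microsoft.Network/networkInterfaces/delete",
    "Microsoft.Network/networkSecurityGroups/read",
    "Microsoft.Network/networkSecurityGroups/write",
    "Microsoft.Network/networkSecurityGroups/join/action",
    "Microsoft.Network/networkSecurityGroups/delete",
    "Microsoft.Network/publicIPAddresses/read",
    "Microsoft.Network/publicIPAddresses/write",
    "Microsoft.Network/publicIPAddresses/join/action",
    "Microsoft.Network/publicIPAddresses/delete",
    "Microsoft.Network/virtualNetworks/write",
    "Microsoft.Network/virtualNetworks/subnets/join/action",
    "Microsoft.Resources/deployments/read",
    "Microsoft.Resources/deployments/write",
    "Microsoft.Resources/deployments/operationStatuses/read"
  ],
  "NotActions": [],
  "AssignableScopes": [
    "/subscriptions/${subscription_id}/resourceGroups/rg-alma-images"
  ]
}
EOF
}
```

A file produced this way could then be passed to `az role definition create --role-definition @role.json`, where `role.json` is whatever output path was chosen. Review the generated JSON before creating the role, since this is only a sketch.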

## Compute Gallery Path Parsing

The single workflow input is split on `/` into three components, and the version component is then split on `.` into three parts:

| Shape | Example | `ALMA_VERSION` | `DATESTAMP_ITERATION` | `RELEASE_STRING` |
|-------|---------|----------------|----------------------|-----------------|
| Stable AlmaLinux 9 | `almalinux/almalinux-9-gen2/9.7.2026050101` | `9.7` | `2026050101` | `AlmaLinux release 9.7` |
| Stable AlmaLinux 10 | `almalinux_ci/almalinux-ci-10-arm64-gen2/10.1.202605020` | `10.1` | `202605020` | `AlmaLinux release 10.1` |
| Kitten 10 | `almalinux_ci/almalinux-ci-kitten-10-x64-gen2/10.20260501.0` | `10` | `20260501.0` | `AlmaLinux Kitten release 10` |

A `*kitten*` branch in the parse step handles the Kitten `Major.Datestamp.Iteration` shape (no minor); stable AlmaLinux uses `Major.Minor.Patch`, where the patch component carries the datestamp and iteration.

`CUSTOM_IMAGE_NAME` (used as the artifact name and notification label) is derived from the source VHD filename without the `.vhd` extension, so it matches the artifact name produced by `azure-to-gallery.yml`.
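
The parsing rules in the table can be sketched as a small Bash function. This is an illustration under stated assumptions, not the workflow's actual parse step: the function name `parse_gallery_path`, the validation regex, and the choice to match `*kitten*` against the image definition name are all assumptions made for this example.

```bash
# Hypothetical helper: split a compute_gallery_path and set ALMA_VERSION,
# DATESTAMP_ITERATION, and RELEASE_STRING as described in the table above.
parse_gallery_path() {
  local path=$1 gallery definition version major middle last
  # Assumed validation: three slash-separated parts and a three-part dotted version.
  local re='^[^/]+/[^/]+/[0-9]+\.[0-9]+\.[0-9]+$'
  if [[ ! $path =~ $re ]]; then
    echo "Invalid Compute Gallery Path: $path" >&2
    return 1
  fi
  IFS=/ read -r gallery definition version <<< "$path"
  IFS=. read -r major middle last <<< "$version"
  case "$definition" in
    *kitten*)
      # Kitten shape: Major.Datestamp.Iteration (no minor version).
      ALMA_VERSION=$major
      DATESTAMP_ITERATION="$middle.$last"
      RELEASE_STRING="AlmaLinux Kitten release $major"
      ;;
    *)
      # Stable shape: Major.Minor.Patch.
      ALMA_VERSION="$major.$middle"
      DATESTAMP_ITERATION=$last
      RELEASE_STRING="AlmaLinux release $major.$middle"
      ;;
  esac
}
```

Calling `parse_gallery_path "almalinux_ci/almalinux-ci-kitten-10-x64-gen2/10.20260501.0"` sets `ALMA_VERSION` to `10` and `DATESTAMP_ITERATION` to `20260501.0`, matching the Kitten row of the table.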

## Architecture Detection

Architecture is **not** mapped from the gallery name; it is derived from the source VHD filename returned by `az sig image-version show`. The workflow tries both regexes that the release path uses:

```bash
regex_azure='-([0-9]+\.?[0-9]*)-([0-9]{8,9}(\.[0-9])?).*\.(x86_64|aarch64|arm64)'
regex_simple='almalinux-([0-9]+\.[0-9]+)-(x86_64|aarch64|arm64)\.([0-9]{8})'
```

An `arm64` match is normalised to `aarch64` so that the in-VM `rpm -q ... | grep <arch>` test keeps working. The architecture then maps to a default Azure VM size:

| Architecture | VM size |
|---|---|
| `x86_64` | `Standard_D2as_v5` |
| `aarch64` | `Standard_D2ps_v5` |

The same defaults are used for Gen1 and 64K-page-size variants until a need to differentiate them arises.
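
A minimal sketch of this detection logic follows. The two regexes and the size map are taken from the text above; the function name `detect_arch` and the sample filenames in the usage note are hypothetical, and the real workflow step may be structured differently.

```bash
regex_azure='-([0-9]+\.?[0-9]*)-([0-9]{8,9}(\.[0-9])?).*\.(x86_64|aarch64|arm64)'
regex_simple='almalinux-([0-9]+\.[0-9]+)-(x86_64|aarch64|arm64)\.([0-9]{8})'

# Hypothetical helper: set ALMA_ARCH and VM_SIZE from a VHD filename.
# Returns 1 when neither regex matches.
detect_arch() {
  local vhd_name=$1 arch
  if [[ $vhd_name =~ $regex_azure ]]; then
    arch=${BASH_REMATCH[4]}   # architecture is the 4th capture group of regex_azure
  elif [[ $vhd_name =~ $regex_simple ]]; then
    arch=${BASH_REMATCH[2]}   # architecture is the 2nd capture group of regex_simple
  else
    echo "Could not parse architecture from VHD source: $vhd_name" >&2
    return 1
  fi
  # Normalise arm64 to aarch64 so the in-VM rpm check sees the expected string.
  if [[ $arch == arm64 ]]; then
    arch=aarch64
  fi
  case "$arch" in
    x86_64)  VM_SIZE=Standard_D2as_v5 ;;
    aarch64) VM_SIZE=Standard_D2ps_v5 ;;
  esac
  ALMA_ARCH=$arch
}
```

With an invented filename such as `AlmaLinux-9-Azure-9.7-20260501.x86_64.vhd`, the first regex matches and the function selects `Standard_D2as_v5`; an invented `almalinux-10.1-arm64.20260502.vhd` falls through to the second regex and is normalised to `aarch64`.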
125+
126+
## Test Assertions
127+
128+
Once SSH is reachable on the VM, the following checks run in sequence (failure of any aborts the workflow):
129+
130+
1. **AlmaLinux release**`grep '<RELEASE_STRING>' /etc/almalinux-release`
131+
2. **Release package**`rpm -qf /etc/almalinux-release` (resolved on the VM, so it works for both stable and Kitten release packages)
132+
3. **System architecture**`rpm -q --qf='%{ARCH}\n' <RELEASE_PACKAGE> | grep '<ALMA_ARCH>'`
133+
4. **Disk and filesystems**`lsblk` listing
134+
5. **Root filesystem resize** — root must be ≥ 98 GiB (the OS-disk-size-gb passed to `az vm create` is 100 GiB)
135+
6. **Updates available**`sudo dnf check-update` (exit code `100` is treated as success — it just means updates are pending)
136+
7. **Installed-package list**`rpm -qa --queryformat '%{NAME}\n' | sort > /tmp/<CUSTOM_IMAGE_NAME>.txt`, then SCP'd back and uploaded as a workflow artifact
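
Two of these checks involve a threshold or an exit-code convention that is easy to get wrong, so here is a hedged sketch of both. The helper names `root_size_ok` and `check_update_ok` are invented for this example, and measuring the root filesystem in bytes is an assumption; the workflow may measure it differently.

```bash
# 98 GiB expressed in bytes; the floor for the root filesystem check.
MIN_ROOT_BYTES=$((98 * 1024 * 1024 * 1024))

# Hypothetical helper: succeed when the given size in bytes meets the 98 GiB floor.
root_size_ok() {
  local size_bytes=$1
  (( size_bytes >= MIN_ROOT_BYTES ))
}

# Hypothetical helper: classify a `dnf check-update` exit code.
# 0 means no updates, 100 means updates are pending; both count as success.
check_update_ok() {
  case "$1" in
    0|100) return 0 ;;
    *)     return 1 ;;
  esac
}
```

On the VM these could be driven with something like `root_size_ok "$(df -B1 --output=size / | tail -n 1)"` (GNU `df`) and `rc=0; sudo dnf check-update || rc=$?; check_update_ok "$rc"`. Capturing the exit code in `rc` keeps a `set -e` script from aborting on the expected `100`.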

## Workflow Process

```mermaid
graph TD
    A[Trigger Workflow] --> V[Validate compute_gallery_path]
    V --> P[Parse Compute Gallery Path]
    P --> D[Install dependencies: netcat-openbsd]
    D --> L[Azure login: azure/login@v3]
    L --> R[Resolve gallery image version + architecture<br/>az sig image-version show + jq from VHD URI]
    R --> K[Generate ephemeral SSH keypair: ed25519]
    K --> C[Launch test VM: az vm create --nsg-rule SSH]
    C --> IP[Resolve VM public IP]
    IP --> W[Wait for SSH: 60 × 10 s nc]
    W --> T[Run image tests: release/arch/disk/dnf/packages]
    T --> U[Upload packages list artifact]
    U --> S[Job summary: portal links]
    S --> CL[Terminate test VM<br/>VM + OS disk + NIC + Public IP + NSG]
    CL --> N[Send Mattermost notification]
```
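
The `Wait for SSH` node in the diagram polls port 22 up to 60 times at 10-second intervals. A minimal sketch of such a loop is shown below. The function names `probe_ssh` and `wait_for_ssh`, and the exact `nc` flags (`-z` for a connect-only probe, `-w 5` for a 5-second timeout), are assumptions for this example rather than the workflow's actual step.

```bash
# Hypothetical single probe: succeed if TCP port 22 accepts a connection.
# Assumes an nc that supports -z and -w, such as netcat-openbsd.
probe_ssh() {
  nc -z -w 5 "$1" 22
}

# Hypothetical polling loop: try probe_ssh up to `attempts` times,
# sleeping `delay` seconds between tries (defaults: 60 tries, 10 s apart).
wait_for_ssh() {
  local host=$1 attempts=${2:-60} delay=${3:-10} i
  for (( i = 1; i <= attempts; i++ )); do
    if probe_ssh "$host"; then
      echo "SSH reachable on $host after $i attempt(s)"
      return 0
    fi
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  echo "SSH did not become reachable on $host" >&2
  return 1
}
```

With the defaults, `wait_for_ssh "$VM_PUBLIC_IP"` (where `VM_PUBLIC_IP` is a hypothetical variable holding the resolved address) gives up after roughly ten minutes, which is consistent with the timeout message described under Troubleshooting.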

## VM Lifecycle

The VM is named `azure-test-${ALMA_VERSION}-${DATESTAMP_ITERATION}-${ALMA_ARCH}-${GITHUB_RUN_ID}` (Azure VM names allow dots, so the version dot is preserved as-is for grep-ability in audit logs). `--nsg-rule SSH` opens port 22 from anywhere for the lifetime of the VM, which is acceptable because the VM is short-lived and the SSH key is ephemeral.

The `Terminate test VM` step runs under `if: always() && env.VM_NAME != ''` and deletes the resources below. Each call is wrapped in `|| true` so that cleanup always advances:

| Resource | Auto-generated name | `az` command |
|----------|--------------------|--------------|
| VM | `${VM_NAME}` | `az vm delete --yes --force-deletion true` |
| OS disk | resolved via `az vm show` (`storageProfile.osDisk.name`) | `az disk delete --yes --no-wait` |
| NIC | `${VM_NAME}VMNic` | `az network nic delete --no-wait` |
| Public IP | `${VM_NAME}PublicIP` | `az network public-ip delete --no-wait` |
| NSG | `${VM_NAME}NSG` | `az network nsg delete --no-wait` |

Because of the `|| true` guards, the step runs all six `az` calls even with `set -e` in effect, regardless of any one failing.
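
The cleanup pattern described above can be sketched as follows. The function name `cleanup_vm` and its two positional parameters are assumptions for this example; the `az` subcommands and deletion flags are the ones listed in the table, and the `-g`/`-n`/`--query`/`-o tsv` arguments are an assumed way of wiring them together.

```bash
# Hypothetical cleanup helper: delete the test VM and its auto-created peers.
# Every az call ends in `|| true`, so one failure never stops the rest under `set -e`.
cleanup_vm() {
  local rg=$1 vm=$2 os_disk
  # Resolve the OS disk name first; it cannot be looked up once the VM is gone.
  os_disk=$(az vm show -g "$rg" -n "$vm" --query storageProfile.osDisk.name -o tsv) || true
  az vm delete -g "$rg" -n "$vm" --yes --force-deletion true || true
  if [ -n "$os_disk" ]; then
    az disk delete -g "$rg" -n "$os_disk" --yes --no-wait || true
  fi
  az network nic delete -g "$rg" -n "${vm}VMNic" --no-wait || true
  az network public-ip delete -g "$rg" -n "${vm}PublicIP" --no-wait || true
  az network nsg delete -g "$rg" -n "${vm}NSG" --no-wait || true
}
```

In this sketch the disk deletion is skipped when the lookup returns nothing, which avoids calling `az disk delete` with an empty name. Whether the real step guards that case the same way is not stated in the source.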

## Testing

1. **First test against an aarch64 release** (private CI gallery):
   ```
   compute_gallery_path = almalinux_ci/almalinux-ci-10-arm64-gen2/10.1.202605020
   ```
2. **First test against an x86_64 stable release** (public gallery):
   ```
   compute_gallery_path = almalinux/almalinux-9-gen2/9.7.2026050101
   ```
3. **Kitten release**:
   ```
   compute_gallery_path = almalinux_ci/almalinux-ci-kitten-10-x64-gen2/10.20260501.0
   ```

After each run, verify cleanup with:
```bash
az resource list -g rg-alma-images --query "[?contains(name, '<run_id>')]"
# Expected: []
```

## Troubleshooting

### Common Issues

1. **"Invalid Compute Gallery Path" validation error**
   - The regex requires three slash-separated parts and a three-part dotted version. Kitten paths (`gallery/def/Major.Datestamp.Iteration`) and stable paths (`gallery/def/Major.Minor.Patch`) are both accepted.

2. **"Gallery image version not found"**
   - The `az sig image-version show` call returned a non-zero exit code. Confirm the path with `az sig image-version list -g rg-alma-images -r <gallery> -i <def>`.

3. **"Could not extract image-version metadata"**
   - The `az` call succeeded but `jq` could not find `id` or the source VHD URI under either `.storageProfile.osDiskImage.source.uri` or `.properties.storageProfile.osDiskImage.source.uri`. The raw JSON is dumped to the run log for inspection.

4. **"Could not parse architecture from VHD source"**
   - The source VHD filename did not match either `regex_azure` or `regex_simple`. Inspect the VHD URI in the run log; the parsing rule lives in [`AZURE_GALLERY.md`](AZURE_GALLERY.md) and may need to be extended for the new filename shape on the release path first.

5. **"AuthorizationFailed" on `Microsoft.Compute/galleries/images/read`**
   - The service principal lacks read permission on the gallery. Grant the 21 RBAC actions listed above at `rg-alma-images` scope (or attach a custom role with the same set).

6. **"SSH did not become reachable within 10 minutes"**
   - The VM came up but port 22 never became reachable from the runner. Possible causes: the NSG rule did not apply (rare), cloud-init has not finished, or the SSH user is wrong (the workflow assumes `almalinux`; older AlmaLinux Azure images sometimes only accept `azureuser`).

7. **"Root filesystem resize check failed"**
   - The root filesystem on the test VM did not auto-grow to ≥ 98 GiB. This indicates a `cloud-init` / `growpart` regression in the published image.

8. **`dnf check-update` exits with a code other than `0` or `100`**
   - Repo metadata fetch failure or signed metadata mismatch. Re-run; if the failure persists, check that `RELEASE_VERSION` repo data matches the image's release.

### Linter Warnings

GitHub Actions YAML linters may show "Context access might be invalid" warnings for environment variables set via `$GITHUB_ENV`. These are false positives; the workflow functions correctly.

## Support

- Azure Portal: https://portal.azure.com
- Azure Compute Gallery docs: https://learn.microsoft.com/en-us/azure/virtual-machines/azure-compute-gallery
- AlmaLinux Cloud SIG Chat: https://chat.almalinux.org/almalinux/channels/sigcloud
- Workflow run logs: GitHub Actions tab in the repository
