Skip to content

feat: Add airgap test example with Azure DevOps pipeline#17

Open
lajosnagyquix wants to merge 56 commits intomainfrom
quix-70784-airgap-cluster-tf
Open

feat: Add airgap test example with Azure DevOps pipeline#17
lajosnagyquix wants to merge 56 commits intomainfrom
quix-70784-airgap-cluster-tf

Conversation

@lajosnagyquix
Copy link

Summary

  • Add examples/airgap-test/ - automated airgap release verification for Quix BYOC deployments (ticket 70784)
  • Creates AKS cluster with NSG-based egress filtering, installs Quix BYOC using only quixregistry.azurecr.io, runs verification tests, then tears down
  • Extends quix-aks module with spot instance support (priority, eviction_policy, spot_max_price fields)

What's included

Infrastructure (main.tf, variables.tf, outputs.tf):

  • AKS cluster with 2 system + 2 spot workload nodes (Standard_D4s_v3)
  • NSG rules: allow Azure services, MCR CDN, deny all other outbound
  • NSG applied after cluster bootstrap to avoid blocking node registration
  • All resources tagged with run_id, purpose=airgap-test, temporary=true
  • Remote azurerm backend for CI/CD state management

Scripts:

  • run-airgap-test.sh - Main orchestration (terraform + helm install + verify + teardown)
  • run-in-container.sh - Local Docker wrapper for consistent tooling
  • scripts/install.sh - Helm install with credential management
  • scripts/verify.sh - 6-point health check (namespaces, pods, job, deployments, ingress)
  • scripts/healthcheck.sh - Detailed pod health checks
  • scripts/teardown.sh - Terraform destroy + Azure CLI fallback + orphan detection

CI/CD:

  • azure-pipelines.yml - Azure DevOps pipeline (manual + nightly schedule)
  • Dockerfile - Container with terraform, kubectl, helm, az-cli
  • byoc-values.yaml.template - Helm values for airgap deployment

Module changes (quix-aks):

  • Add spot instance fields to node pool variable type

Test plan

  • Local testing achieved 13/13 verification tests passing (~35 min total)
  • Azure DevOps pipeline first run after variable group setup
  • Verify terraform state persists across pipeline stages

@lajosnagyquix lajosnagyquix force-pushed the quix-70784-airgap-cluster-tf branch from d5f492f to f5c6c4d Compare February 18, 2026 19:01
Lajos Nagy added 29 commits February 24, 2026 11:26
- Add examples/airgap-test/ for automated airgap release testing
- Include NSG rules for simulating airgap (blocks external, allows Azure/MCR)
- Add healthcheck.sh for comprehensive platform health verification
- Add spot instance support to quix-aks module (priority, eviction_policy, spot_max_price)
- Add node_labels support for additional node pools
- Add validation to prevent Spot on system pools

Part of ticket 70784 - Automate Airgap release test [Azure]
- Remove spot instance configuration from workload pool
- Spot pools require tolerations on ALL pods and new node pools
  can't bootstrap in airgap mode (NSG blocks required traffic)
- Update workload_node_count minimum to 2 (required for platform)
- Fix healthcheck.sh to not exit early on check failures
Spot instances provide ~60-80% cost savings. By NOT applying the
NoSchedule taint, all pods can schedule freely without needing
tolerations in every helm chart and operator.

The spot label is still applied for visibility/debugging.
AKS nodes require unrestricted internet access during bootstrap to
pull system images from MCR and register with the control plane.
The DenyInternet NSG rule was blocking this traffic, causing cluster
creation to hang indefinitely.

Changes:
- NSG association now depends on workload node pool completion
- Removed NSG association from AKS cluster's depends_on
- Added explanatory comments documenting the timing requirement
- Increase system node_count from 1 to 2 (required for full platform deployment)
- Add documented byoc-values.yaml.template with key airgap settings:
  - loggingEnabled=true (deploys MinIO)
  - monitoringEnabled=true (deploys prometheus/grafana)
  - minio storage class=managed-csi (Azure Files doesn't support chmod)
  - helmNamespace="" (fixes double-helm path in workspace-service)
  - Spot instance tolerations for workload scheduling
- run-airgap-test.sh: Complete CI/CD simulation script
  - Creates AKS cluster with terraform
  - Installs BYOC from generated values file
  - Verifies all pods running (especially workspace-service)
  - Always tears down on exit via trap (even on failure)
  - Configurable via env vars and CLI args

- azure-pipelines.yml: Azure DevOps build pipeline
  - Daily scheduled runs at 2 AM UTC
  - Manual trigger with skip-destroy option for debugging
  - Orphan detection stage for cleanup
  - Uses variable group for secrets

Required secrets:
  QUIX_ACR_USERNAME, QUIX_ACR_PASSWORD
  QUIX_ZIP_CLIENT_ID, QUIX_ZIP_CLIENT_SECRET, QUIX_ZIP_TENANT_ID
  QUIX_LICENSE_KEY

Usage: ./run-airgap-test.sh [--run-id ID] [--skip-destroy]
- Fix wc -l parsing bug in wait_for_nodes() that caused arithmetic errors
  on macOS due to embedded newlines. Changed grep -v | wc -l to grep -cv.
- Update default K8s version from 1.30 to 1.33.6 (1.30 is now LTS-only).
- Add .gitignore for generated byoc-values.yaml (contains credentials).
- Complete toolchain for running airgap tests consistently
- Includes: Terraform 1.7.5, Azure CLI, kubectl 1.33.0, Helm 3.14.0
- Based on mcr.microsoft.com/azure-cli:cbl-mariner2.0 (linux/amd64)
- Eliminates macOS vs Linux inconsistencies

Also fixed: strip whitespace from node count to fix macOS arithmetic error
- Uses quixregistry.azurecr.io/airgap-test-runner:latest
- Eliminates need for tool installation steps
- Consistent tooling between local dev and CI/CD
- Updated K8s version to 1.33.6
grep -c returns exit code 1 when count is 0 (no matches).
With set -e, this caused the script to fail immediately when
all nodes were Ready. Added || true to suppress the exit code.
- Uncomment backend "azurerm" {} block so pipeline -backend-config
  flags are applied. Without this, each stage uses ephemeral local
  state and teardown cannot destroy provisioned resources.
- Add terraform state files, .terraform dir, and tfvars to gitignore
  to prevent accidental commits of state or secrets.
…scripts

Add precheck_registry() to run-airgap-test.sh that validates all required
Helm charts and container images exist in quixregistry before spending time
on terraform apply. Replace inline pod-checking verify_deployment() with a
call to scripts/healthcheck.sh for comprehensive 4-section health verification.

Delete dead standalone scripts (install.sh, verify.sh, teardown.sh) that were
superseded by the integrated run-airgap-test.sh orchestration.
…re healthcheck

Fixes discovered during end-to-end smoke test:
- Switch precheck from broken OAuth to az acr CLI for registry validation
- Correct chart/image repo paths to match repomirror config
- Fix values generation: use field-specific sed instead of blanket CHANGEME
- Add local backend override for terraform init without remote storage
- Add AcrPull role assignment for kubelet identity on quixregistry
- Make healthcheck config-aware: read platform-variables configmap to skip
  disabled components instead of failing on them
- Fix grep pipefail crash in healthcheck diagnostics
- Precheck now reads BYOC role version files and checks exact chart:version
  tags in the registry, catching version drift (e.g., new release not mirrored)
- Shows latest available version when a specific version is missing
- Clean up stale helm releases before install (handles --skip-destroy re-runs)
- Separate infrastructure charts (version-checked) from platform charts (existence-checked)
The quixplatform installer kills Ansible roles after 10 minutes by
default. kube-prometheus-stack needs longer on spot instances.
Set ANSIBLE_TIMEOUT=1200 (20m) via image.env in the values template.
dev.sh backgrounds helm and streams installer pod logs via kubectl.
In long-running installs (monitoring takes 12+ min), the log stream
can disconnect, causing dev.sh to exit non-zero even though the
installer pod continues running successfully inside the cluster.

When dev.sh fails, poll the quixplatform-manager-job status directly
via kubectl until it completes or fails, rather than aborting.
az aks get-credentials without --admin may embed a short-lived bearer
token from the SP's OIDC session. When the token expires (~10 min),
helm loses its K8s API connection and exits, even though the installer
pod continues running inside the cluster.

--admin gives client-certificate credentials that don't expire.
--admin appends "-admin" to the kubeconfig context name.
Lajos Nagy added 8 commits February 24, 2026 11:26
When dev.sh disconnects, reconnect to the installer pod's log stream
directly via kubectl logs --follow --since=20m. Logs stream in the
background while the foreground polls job completion status.
This restores visibility into Ansible role output that was previously
lost when dev.sh's log stream broke.
workspace-service constructs OCI chart URLs using the helmNamespace
value from platform-variables configmap. With helmNamespace="" the
Ansible installer correctly defaults to "helm" (Jinja2 default with
boolean=true), but the platform_values.yaml.j2 template passes the
empty string through, causing workspace-service to look for charts
at registry root instead of under helm/.
The airgap test now uses the pre-built installer image that has all
ansible files, helm charts, and dependencies baked in. No zip download
needed at runtime.

- Add INSTALLER_TAG substitution (default: 1.6.7-0.1.20260217631-airgap)
- Disable privateByocStorageAccount (no zip download)
- Remove QUIX_ZIP_* credential requirements
- Log installer image tag at startup
After BYOC install completes, tighten NSG rules to minimum:
- Remove broad AllowAzureCloud (global), AllowMCR, AllowMCR-CDN
  (20.0.0.0/8), AllowACR/Storage-WestEurope
- Replace with regional AzureCloud.westeurope only
- Healthcheck now runs under tight rules proving no hidden deps

Negative test deploys docker.io/library/nginx pod and asserts
ImagePullBackOff to prove external registries are actually blocked.
Skip dev.sh PVC chart upload convenience. The installer pod now
pulls charts directly from quixregistry via registrypullsecret,
matching the actual customer airgap deployment path.
When the installer pod finishes before the poll loop detects
completion, kubectl logs --follow exits and the process dies.
kill on a dead PID returns non-zero, which set -e treats as
fatal - skipping the success log and returning 1 from
install_byoc, triggering the EXIT trap.

Fix: kill ... || true on all three paths (success, failure,
timeout) so a dead process doesn't abort the script.
Parse BYOCVersions container_versions.yaml to extract exact image
tags for all 19 platform images (authapi, portalui, workspace-service,
etc.) and verify each image:tag exists in quixregistry before
provisioning infrastructure.

This catches the most common failure mode: BYOCVersions main updated
with new image tags that haven't been mirrored to quixregistry yet.
Previously the precheck only verified repo existence, not the specific
tag the installer would try to pull.
- Use priority 201 instead of reusing 200 (ARM race with deletion)
- Add 5s sleep between deletes and create for ARM propagation
- Wrap nsg rule create in error handling instead of bare set -e
- Update default installer tag to 1.6.7-0.1.20260116617-airgap
@lajosnagyquix lajosnagyquix force-pushed the quix-70784-airgap-cluster-tf branch from 48c2dcf to a6745df Compare February 24, 2026 11:27
Lajos Nagy added 13 commits February 25, 2026 09:10
Move subscription ID, ACR resource ID, and registry hostname out of
committed code. All org-specific config now comes from environment
variables (ACR_REGISTRY, ACR_ID) set by the private ADO pipeline.

- Remove acr_id default containing subscription ID from variables.tf
- Parameterize registry hostname via ACR_REGISTRY env var in scripts
- Replace hardcoded registry in byoc-values.yaml.template with placeholder
- Delete stale azure-pipelines.yml (real pipeline lives in BYOC repo)
- Clean up org-specific references in comments
Spot instances are unreliable (no capacity available), causing clusters
to remain dead. Use on-demand Standard_D4s_v3 with reserved instances
instead. Remove all spot tolerations from BYOC values.
The fat installer has all BYOCVersions baked in. Without this flag,
dev.sh tries to sync from a non-existent BYOCVersions directory.
- Add -input=false to terraform apply and destroy (fail fast on missing vars)
- Compute JOB_DEADLINE_EPOCH (80 min from start) to cap total install time
- install_byoc uses min(INSTALL_TIMEOUT, remaining budget) as effective timeout
- wait_for_installer_job polls against deadline instead of elapsed counter
- Prevents timeout cascade where install + fallback exceeds job timeout
…p test

With acmeEnabled=false and no pre-existing wildcard secret, the ingress
role falls through to an empty TLS secret causing Traefik cert errors.

Fix by generating a self-signed wildcard cert at runtime and injecting it
as fullchainPemBase64/privkeyPemBase64/customCaPemBase64 Helm values.
These flow through cert_secrets.yaml hook -> agent-certificates secret ->
mounted into installer pod -> ingress role creates wildcard TLS secret.

Also adds trust-manager chart and images to registry prechecks since
customCertificateAuthoritySupplied=true activates that role.
Configure Auth0 as the identity provider with tenant quix-byoc (eu).
Auth0 client credentials are injected from the variable group at runtime.
Resolve quix-byoc.eu.auth0.com at runtime and add an NSG rule to allow
HTTPS to Auth0's IPs. This is the one accepted external dependency for
customers who choose Auth0 over Keycloak.
Use crane to thin-pull container_versions.yaml from the fat installer
image before running registry prechecks. This enables platform image
version validation without requiring a BYOCVersions checkout.
Move crane install from runtime download to Dockerfile build step.
Removes GitHub dependency at pipeline runtime.
- Replace dig (bind-utils, not in container) with getent ahosts (glibc)
  for Auth0 DNS resolution in lockdown_network(). Fixes #66854 crash.
- Platform image version checks now emit [WARN] instead of [MISSING],
  tracked separately so mismatches don't fail the precheck. The fat
  installer has versions baked in that may differ from
  container_versions.yaml.
- Guard crane auth login with if-check so credential failures don't
  crash the script under set -e.
@lajosnagyquix lajosnagyquix force-pushed the quix-70784-airgap-cluster-tf branch from a66f78c to cd7bc9c Compare March 3, 2026 15:05
Lajos Nagy added 6 commits March 3, 2026 15:11
The container runs CBL Mariner with GNU tar, which doesn't support
the --include flag (BSD tar). Use positional path argument instead.
Also capture crane stderr for debugging extraction failures.
The fat installer image uses app/ansible-baked/ not app/ansible/ for
its baked assets directory.
Add api.opsgenie.com (priority 203) for platform operational alerts and
expand Auth0 to include both quix-byoc.eu.auth0.com and quix.eu.auth0.com
(priority 202, combined IPs). Both use dynamic DNS resolution at runtime
with graceful fallback on resolution failure.
- Check in a Root CA (RSA 4096, 10yr) so devs can trust it once;
  wildcard cert is still generated fresh per run, signed by this CA.
- Disable samplesLibrary (github.com), SMTP (sendgrid), bookADemoLink
  (quix.io) - all unreachable in airgap.
- Update template comment to reference certs/ca.crt for trust.
The runner container uses CBL-Mariner 2.0 with OpenSSL 1.1.1 which
does not support -copy_extensions or -addext on req. Use an extfile
to pass the subjectAltName extension instead.
- AKS cluster using quix-aks module with spot workload nodes
- Orchestration script supporting create/deploy/destroy/status actions
- Values template for standard BYOC deployment via dev.sh
- Per-environment state and resource naming (rg-quix-devN, aks-quix-devN)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant