feat: Add airgap test example with Azure DevOps pipeline #17
Open
lajosnagyquix wants to merge 56 commits into main
d5f492f to
f5c6c4d
Compare
added 29 commits
February 24, 2026 11:26
- Add examples/airgap-test/ for automated airgap release testing
- Include NSG rules for simulating airgap (blocks external, allows Azure/MCR)
- Add healthcheck.sh for comprehensive platform health verification
- Add spot instance support to quix-aks module (priority, eviction_policy, spot_max_price)
- Add node_labels support for additional node pools
- Add validation to prevent Spot on system pools

Part of ticket 70784 - Automate Airgap release test [Azure]

- Remove spot instance configuration from workload pool
- Spot pools require tolerations on ALL pods and new node pools can't bootstrap in airgap mode (NSG blocks required traffic)
- Update workload_node_count minimum to 2 (required for platform)
- Fix healthcheck.sh to not exit early on check failures
Spot instances provide ~60-80% cost savings. By NOT applying the NoSchedule taint, all pods can schedule freely without needing tolerations in every helm chart and operator. The spot label is still applied for visibility/debugging.
AKS nodes require unrestricted internet access during bootstrap to pull system images from MCR and register with the control plane. The DenyInternet NSG rule was blocking this traffic, causing cluster creation to hang indefinitely.

Changes:
- NSG association now depends on workload node pool completion
- Removed NSG association from AKS cluster's depends_on
- Added explanatory comments documenting the timing requirement

- Increase system node_count from 1 to 2 (required for full platform deployment)
- Add documented byoc-values.yaml.template with key airgap settings:
  - loggingEnabled=true (deploys MinIO)
  - monitoringEnabled=true (deploys prometheus/grafana)
  - minio storage class=managed-csi (Azure Files doesn't support chmod)
  - helmNamespace="" (fixes double-helm path in workspace-service)
  - Spot instance tolerations for workload scheduling
- run-airgap-test.sh: Complete CI/CD simulation script
  - Creates AKS cluster with terraform
  - Installs BYOC from generated values file
  - Verifies all pods running (especially workspace-service)
  - Always tears down on exit via trap (even on failure)
  - Configurable via env vars and CLI args
- azure-pipelines.yml: Azure DevOps build pipeline
  - Daily scheduled runs at 2 AM UTC
  - Manual trigger with skip-destroy option for debugging
  - Orphan detection stage for cleanup
  - Uses variable group for secrets

Required secrets:
  QUIX_ACR_USERNAME, QUIX_ACR_PASSWORD
  QUIX_ZIP_CLIENT_ID, QUIX_ZIP_CLIENT_SECRET, QUIX_ZIP_TENANT_ID
  QUIX_LICENSE_KEY

Usage: ./run-airgap-test.sh [--run-id ID] [--skip-destroy]
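The "always tears down on exit via trap" behaviour in this commit can be illustrated roughly as follows. This is a minimal sketch, not the code in run-airgap-test.sh: `run_with_teardown`, `teardown`, and the `SKIP_DESTROY` variable are hypothetical names, and the teardown body only echoes where a terraform destroy would run.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of trap-based teardown; all names are illustrative.
run_with_teardown() {
  (
    teardown() { echo "teardown: terraform destroy would run here"; }
    # Register the EXIT trap unless the caller asked to keep the cluster.
    if [ "${SKIP_DESTROY:-false}" != "true" ]; then
      trap teardown EXIT
    fi
    "$@"   # run the test steps; the trap fires whether they pass or fail
  )
}

# A failing step still triggers teardown before control returns here.
run_with_teardown false || echo "test steps failed, teardown already ran"
```

Running the steps in a subshell keeps the trap scoped to one test run, so the caller's own EXIT handling is untouched.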
- Fix wc -l parsing bug in wait_for_nodes() that caused arithmetic errors on macOS due to embedded newlines. Changed grep -v | wc -l to grep -cv.
- Update default K8s version from 1.30 to 1.33.6 (1.30 is now LTS-only).
- Add .gitignore for generated byoc-values.yaml (contains credentials).

- Complete toolchain for running airgap tests consistently
- Includes: Terraform 1.7.5, Azure CLI, kubectl 1.33.0, Helm 3.14.0
- Based on mcr.microsoft.com/azure-cli:cbl-mariner2.0 (linux/amd64)
- Eliminates macOS vs Linux inconsistencies

Also fixed: strip whitespace from node count to fix macOS arithmetic error

- Uses quixregistry.azurecr.io/airgap-test-runner:latest
- Eliminates need for tool installation steps
- Consistent tooling between local dev and CI/CD
- Updated K8s version to 1.33.6
grep -c returns exit code 1 when count is 0 (no matches). With set -e, this caused the script to fail immediately when all nodes were Ready. Added || true to suppress the exit code.
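The grep -c behaviour described above can be reproduced without a cluster. The sketch below is illustrative only: `count_not_ready` is a hypothetical helper (not the script's wait_for_nodes()), and the node listing is made-up sample text in the shape of `kubectl get nodes --no-headers`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: $1 is the text of `kubectl get nodes --no-headers`.
count_not_ready() {
  # grep -c prints the count but exits 1 when the count is 0;
  # "|| true" stops set -e from treating that as a failure.
  printf '%s\n' "$1" | grep -cv '[[:space:]]Ready[[:space:]]' || true
}

# Sample input (assumed format): every node is Ready.
nodes='aks-sys-0   Ready   <none>   5m   v1.33.6
aks-wrk-0   Ready   <none>   4m   v1.33.6'

not_ready=$(count_not_ready "$nodes")
echo "not ready: ${not_ready}"
```

Without the `|| true`, the assignment to `not_ready` would abort the script in exactly the case where the cluster is healthy.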
…az acr login in container
- Uncomment backend "azurerm" {} block so pipeline -backend-config
flags are applied. Without this, each stage uses ephemeral local
state and teardown cannot destroy provisioned resources.
- Add terraform state files, .terraform dir, and tfvars to gitignore
to prevent accidental commits of state or secrets.
…scripts

Add precheck_registry() to run-airgap-test.sh that validates all required Helm charts and container images exist in quixregistry before spending time on terraform apply.

Replace inline pod-checking verify_deployment() with a call to scripts/healthcheck.sh for comprehensive 4-section health verification.

Delete dead standalone scripts (install.sh, verify.sh, teardown.sh) that were superseded by the integrated run-airgap-test.sh orchestration.

…re healthcheck

Fixes discovered during end-to-end smoke test:
- Switch precheck from broken OAuth to az acr CLI for registry validation
- Correct chart/image repo paths to match repomirror config
- Fix values generation: use field-specific sed instead of blanket CHANGEME
- Add local backend override for terraform init without remote storage
- Add AcrPull role assignment for kubelet identity on quixregistry
- Make healthcheck config-aware: read platform-variables configmap to skip disabled components instead of failing on them
- Fix grep pipefail crash in healthcheck diagnostics

- Precheck now reads BYOC role version files and checks exact chart:version tags in the registry, catching version drift (e.g., new release not mirrored)
- Shows latest available version when a specific version is missing
- Clean up stale helm releases before install (handles --skip-destroy re-runs)
- Separate infrastructure charts (version-checked) from platform charts (existence-checked)
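One way to read "field-specific sed instead of blanket CHANGEME" is a substitution that only touches the placeholder on a named key. The sketch below assumes a flat `key: CHANGEME` layout; `fill_field` and the keys `registry` and `licenseKey` are hypothetical and not taken from byoc-values.yaml.template.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace CHANGEME only on the line for one key.
# Reads the template on stdin. Values containing "|" or "&" would need escaping.
fill_field() {
  local key="$1" value="$2"
  sed "s|^\([[:space:]]*${key}:\)[[:space:]]*CHANGEME[[:space:]]*\$|\1 ${value}|"
}

# Assumed template shape, for illustration only.
template='registry: CHANGEME
licenseKey: CHANGEME'

printf '%s\n' "$template" | fill_field registry example.azurecr.io
```

A blanket `s/CHANGEME/.../g` would have written the registry value into every placeholder; anchoring on the key leaves the other fields untouched until their own substitution runs.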
The quixplatform installer kills Ansible roles after 10 minutes by default. kube-prometheus-stack needs longer on spot instances. Set ANSIBLE_TIMEOUT=1200 (20m) via image.env in the values template.
dev.sh backgrounds helm and streams installer pod logs via kubectl. In long-running installs (monitoring takes 12+ min), the log stream can disconnect, causing dev.sh to exit non-zero even though the installer pod continues running successfully inside the cluster. When dev.sh fails, poll the quixplatform-manager-job status directly via kubectl until it completes or fails, rather than aborting.
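The fallback polling described above might look roughly like this. It is a sketch under assumptions: `wait_for_job` is a hypothetical helper that takes the status query as a command name so it can be exercised without a cluster, and the kubectl query in the comment assumes a namespace called `quix`.

```shell
#!/usr/bin/env bash
# Hypothetical polling helper.
#   $1: command that prints the job's true condition types (may print nothing)
#   $2: maximum number of polls
#   $3: seconds to sleep between polls
# Returns 0 on completion, 1 on failure, 2 on timeout.
wait_for_job() {
  local status_cmd="$1" max_polls="$2" interval="$3"
  local i status
  for (( i = 0; i < max_polls; i++ )); do
    status="$("$status_cmd")"
    case "$status" in
      *Complete*) return 0 ;;
      *Failed*)   return 1 ;;
    esac
    sleep "$interval"
  done
  return 2
}

# Against a live cluster the status command could wrap (namespace assumed):
#   kubectl get job quixplatform-manager-job -n quix \
#     -o jsonpath='{.status.conditions[?(@.status=="True")].type}'
fake_status() { echo "Complete"; }
wait_for_job fake_status 3 0 && echo "installer job finished"
```

Distinct return codes let the caller tell a failed install from a timeout when deciding what to log before teardown.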
az aks get-credentials without --admin may embed a short-lived bearer token from the SP's OIDC session. When the token expires (~10 min), helm loses its K8s API connection and exits, even though the installer pod continues running inside the cluster. --admin gives client-certificate credentials that don't expire.
--admin appends "-admin" to the kubeconfig context name.
added 8 commits
February 24, 2026 11:26
When dev.sh disconnects, reconnect to the installer pod's log stream directly via kubectl logs --follow --since=20m. Logs stream in the background while the foreground polls job completion status. This restores visibility into Ansible role output that was previously lost when dev.sh's log stream broke.
workspace-service constructs OCI chart URLs using the helmNamespace value from platform-variables configmap. With helmNamespace="" the Ansible installer correctly defaults to "helm" (Jinja2 default with boolean=true), but the platform_values.yaml.j2 template passes the empty string through, causing workspace-service to look for charts at registry root instead of under helm/.
The airgap test now uses the pre-built installer image that has all ansible files, helm charts, and dependencies baked in. No zip download needed at runtime.

- Add INSTALLER_TAG substitution (default: 1.6.7-0.1.20260217631-airgap)
- Disable privateByocStorageAccount (no zip download)
- Remove QUIX_ZIP_* credential requirements
- Log installer image tag at startup

After BYOC install completes, tighten NSG rules to minimum:
- Remove broad AllowAzureCloud (global), AllowMCR, AllowMCR-CDN (20.0.0.0/8), AllowACR/Storage-WestEurope
- Replace with regional AzureCloud.westeurope only
- Healthcheck now runs under tight rules proving no hidden deps

Negative test deploys docker.io/library/nginx pod and asserts ImagePullBackOff to prove external registries are actually blocked.
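The assertion half of the negative test can be sketched as a small predicate on the pod's waiting reason. `image_pull_blocked` and the pod name are hypothetical; accepting `ErrImagePull` as well as `ImagePullBackOff` is an assumption added here, since the kubelet reports it before backing off.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if the waiting reason shows the pull failed.
image_pull_blocked() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) return 0 ;;
    *) return 1 ;;
  esac
}

# Against a live cluster the reason could be read with (pod name assumed):
#   kubectl get pod airgap-negative-test \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
reason="ImagePullBackOff"   # sample value for illustration

if image_pull_blocked "$reason"; then
  echo "external registry is blocked"
else
  echo "external registry is reachable, lockdown is not effective"
fi
```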
Skip dev.sh PVC chart upload convenience. The installer pod now pulls charts directly from quixregistry via registrypullsecret, matching the actual customer airgap deployment path.
When the installer pod finishes before the poll loop detects completion, kubectl logs --follow exits and the process dies. kill on a dead PID returns non-zero, which set -e treats as fatal - skipping the success log and returning 1 from install_byoc, triggering the EXIT trap. Fix: kill ... || true on all three paths (success, failure, timeout) so a dead process doesn't abort the script.
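The failure mode is easy to reproduce with any background process that has already exited. In this sketch `stop_log_stream` is a hypothetical name, and `sleep 0` stands in for the `kubectl logs --follow` process.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: stop a background log follower that may already be gone.
stop_log_stream() {
  local pid="$1"
  # kill returns non-zero for a dead PID; "|| true" keeps set -e from aborting.
  kill "$pid" 2>/dev/null || true
  wait "$pid" 2>/dev/null || true
}

sleep 0 &          # stand-in for the log follower; exits immediately
log_pid=$!
wait "$log_pid"    # the process is now dead and reaped

stop_log_stream "$log_pid"
echo "install_byoc can still log success and return 0"
```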
Parse BYOCVersions container_versions.yaml to extract exact image tags for all 19 platform images (authapi, portalui, workspace-service, etc.) and verify each image:tag exists in quixregistry before provisioning infrastructure. This catches the most common failure mode: BYOCVersions main updated with new image tags that haven't been mirrored to quixregistry yet. Previously the precheck only verified repo existence, not the specific tag the installer would try to pull.
- Use priority 201 instead of reusing 200 (ARM race with deletion)
- Add 5s sleep between deletes and create for ARM propagation
- Wrap nsg rule create in error handling instead of bare set -e
- Update default installer tag to 1.6.7-0.1.20260116617-airgap
48c2dcf to
a6745df
Compare
added 13 commits
February 25, 2026 09:10
Move subscription ID, ACR resource ID, and registry hostname out of committed code. All org-specific config now comes from environment variables (ACR_REGISTRY, ACR_ID) set by the private ADO pipeline.

- Remove acr_id default containing subscription ID from variables.tf
- Parameterize registry hostname via ACR_REGISTRY env var in scripts
- Replace hardcoded registry in byoc-values.yaml.template with placeholder
- Delete stale azure-pipelines.yml (real pipeline lives in BYOC repo)
- Clean up org-specific references in comments
Spot instances are unreliable (no capacity available), causing clusters to remain dead. Use on-demand Standard_D4s_v3 with reserved instances instead. Remove all spot tolerations from BYOC values.
The fat installer has all BYOCVersions baked in. Without this flag, dev.sh tries to sync from a non-existent BYOCVersions directory.
- Add -input=false to terraform apply and destroy (fail fast on missing vars)
- Compute JOB_DEADLINE_EPOCH (80 min from start) to cap total install time
- install_byoc uses min(INSTALL_TIMEOUT, remaining budget) as effective timeout
- wait_for_installer_job polls against deadline instead of elapsed counter
- Prevents timeout cascade where install + fallback exceeds job timeout

…p test

With acmeEnabled=false and no pre-existing wildcard secret, the ingress role falls through to an empty TLS secret causing Traefik cert errors.

Fix by generating a self-signed wildcard cert at runtime and injecting it as fullchainPemBase64/privkeyPemBase64/customCaPemBase64 Helm values. These flow through cert_secrets.yaml hook -> agent-certificates secret -> mounted into installer pod -> ingress role creates wildcard TLS secret.

Also adds trust-manager chart and images to registry prechecks since customCertificateAuthoritySupplied=true activates that role.
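The min(INSTALL_TIMEOUT, remaining budget) rule can be written as a small pure function. This is a sketch: `effective_timeout` is a hypothetical helper, and the INSTALL_TIMEOUT value of 3600 seconds is an assumption for illustration; only the 80-minute deadline comes from the commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints min(install_timeout, deadline - now), floored at 0.
# All arguments are in seconds (now and deadline as epoch seconds).
effective_timeout() {
  local now="$1" deadline="$2" install_timeout="$3"
  local remaining=$(( deadline - now ))
  if (( remaining < 0 )); then
    remaining=0
  fi
  if (( install_timeout < remaining )); then
    echo "$install_timeout"
  else
    echo "$remaining"
  fi
}

start=$(date +%s)
JOB_DEADLINE_EPOCH=$(( start + 80 * 60 ))   # 80 minutes from start
INSTALL_TIMEOUT=3600                        # assumed value, for illustration

echo "effective timeout: $(effective_timeout "$start" "$JOB_DEADLINE_EPOCH" "$INSTALL_TIMEOUT")s"
```

Comparing against an absolute deadline means that time already spent on terraform apply shrinks the install budget automatically, which is what prevents the cascade.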
Configure Auth0 as the identity provider with tenant quix-byoc (eu). Auth0 client credentials are injected from the variable group at runtime.
Resolve quix-byoc.eu.auth0.com at runtime and add an NSG rule to allow HTTPS to Auth0's IPs. This is the one accepted external dependency for customers who choose Auth0 over Keycloak.
Use crane to thin-pull container_versions.yaml from the fat installer image before running registry prechecks. This enables platform image version validation without requiring a BYOCVersions checkout.
Move crane install from runtime download to Dockerfile build step. Removes GitHub dependency at pipeline runtime.
- Replace dig (bind-utils, not in container) with getent ahosts (glibc) for Auth0 DNS resolution in lockdown_network(). Fixes #66854 crash.
- Platform image version checks now emit [WARN] instead of [MISSING], tracked separately so mismatches don't fail the precheck. The fat installer has versions baked in that may differ from container_versions.yaml.
- Guard crane auth login with if-check so credential failures don't crash the script under set -e.
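`getent ahosts` prints one line per address and socket type, so its output needs de-duplicating before it can feed an NSG rule. The sketch below is not the code in lockdown_network(): `unique_addrs` and `resolve_addrs` are hypothetical helpers, and the sample uses documentation-range addresses rather than real Auth0 IPs.

```shell
#!/usr/bin/env bash
# Reduce `getent ahosts` output to a sorted list of unique addresses.
unique_addrs() {
  awk '{ print $1 }' | sort -u
}

# Hypothetical wrapper: resolve a host via glibc; prints nothing if resolution fails.
# Note that getent ahosts may return IPv6 addresses as well as IPv4.
resolve_addrs() {
  getent ahosts "$1" 2>/dev/null | unique_addrs || true
}

# Sample text in the shape getent ahosts produces (addresses are placeholders).
sample='203.0.113.10    STREAM example.test
203.0.113.10    DGRAM
203.0.113.10    RAW
203.0.113.11    STREAM
203.0.113.11    DGRAM
203.0.113.11    RAW'

printf '%s\n' "$sample" | unique_addrs
```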
a66f78c to
cd7bc9c
Compare
added 6 commits
March 3, 2026 15:11
The container runs CBL Mariner with GNU tar, which doesn't support the --include flag (BSD tar). Use positional path argument instead. Also capture crane stderr for debugging extraction failures.
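Extracting a single member by positional path works the same way in GNU and BSD tar. The sketch below builds a throwaway archive to show it; `extract_member` is a hypothetical helper and `app/ansible-baked/example.yaml` is a made-up member path, not the real location of container_versions.yaml inside the image.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: extract one member from a tarball into a directory.
# The member is passed positionally instead of via --include.
extract_member() {
  local tarball="$1" member="$2" dest="$3"
  mkdir -p "$dest"
  tar -xf "$tarball" -C "$dest" "$member"
}

# Build a small archive standing in for an exported image filesystem.
work=$(mktemp -d)
mkdir -p "$work/src/app/ansible-baked"
echo "example: 1.0.0" > "$work/src/app/ansible-baked/example.yaml"
echo "unrelated" > "$work/src/app/other.txt"
tar -cf "$work/image.tar" -C "$work/src" app

extract_member "$work/image.tar" app/ansible-baked/example.yaml "$work/out"
cat "$work/out/app/ansible-baked/example.yaml"
```

The member path has to match the name stored in the archive exactly, including any leading `./`.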
The fat installer image uses app/ansible-baked/ not app/ansible/ for its baked assets directory.
Add api.opsgenie.com (priority 203) for platform operational alerts and expand Auth0 to include both quix-byoc.eu.auth0.com and quix.eu.auth0.com (priority 202, combined IPs). Both use dynamic DNS resolution at runtime with graceful fallback on resolution failure.
- Check in a Root CA (RSA 4096, 10yr) so devs can trust it once; wildcard cert is still generated fresh per run, signed by this CA.
- Disable samplesLibrary (github.com), SMTP (sendgrid), bookADemoLink (quix.io) - all unreachable in airgap.
- Update template comment to reference certs/ca.crt for trust.
The runner container uses CBL-Mariner 2.0 with OpenSSL 1.1.1 which does not support -copy_extensions or -addext on req. Use an extfile to pass the subjectAltName extension instead.
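The extfile approach can be sketched end to end with throwaway keys. Everything here is illustrative: `sign_with_san` is a hypothetical helper, the `example.test` names and 2048-bit keys are placeholders rather than the values used by the test, and the CA is generated on the spot instead of being the checked-in Root CA.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: sign a CSR with a CA, taking subjectAltName from an extfile.
#   $1 csr  $2 CA cert  $3 CA key  $4 output cert  $5 SAN value
sign_with_san() {
  local csr="$1" ca_crt="$2" ca_key="$3" out="$4" san="$5"
  local ext
  ext=$(mktemp)
  printf 'subjectAltName=%s\n' "$san" > "$ext"
  openssl x509 -req -in "$csr" -CA "$ca_crt" -CAkey "$ca_key" -CAcreateserial \
    -out "$out" -days 1 -extfile "$ext" 2>/dev/null
  rm -f "$ext"
}

certs=$(mktemp -d)
# Throwaway CA and wildcard CSR (placeholder names and key sizes).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=Example Test CA" \
  -keyout "$certs/ca.key" -out "$certs/ca.crt" 2>/dev/null
openssl req -new -newkey rsa:2048 -nodes -subj "/CN=*.example.test" \
  -keyout "$certs/wild.key" -out "$certs/wild.csr" 2>/dev/null

sign_with_san "$certs/wild.csr" "$certs/ca.crt" "$certs/ca.key" "$certs/wild.crt" \
  "DNS:*.example.test,DNS:example.test"
openssl x509 -in "$certs/wild.crt" -noout -text | grep 'DNS:'
```

Passing the extension at signing time avoids relying on the CSR carrying it through, which is the step the older OpenSSL could not do.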
- AKS cluster using quix-aks module with spot workload nodes
- Orchestration script supporting create/deploy/destroy/status actions
- Values template for standard BYOC deployment via dev.sh
- Per-environment state and resource naming (rg-quix-devN, aks-quix-devN)
Summary
- examples/airgap-test/ - automated airgap release verification for Quix BYOC deployments (ticket 70784)
- quixregistry.azurecr.io, runs verification tests, then tears down
- quix-aks module with spot instance support (priority, eviction_policy, spot_max_price fields)

What's included

Infrastructure (main.tf, variables.tf, outputs.tf):
- run_id, purpose=airgap-test, temporary=true

Scripts:
- run-airgap-test.sh - Main orchestration (terraform + helm install + verify + teardown)
- run-in-container.sh - Local Docker wrapper for consistent tooling
- scripts/install.sh - Helm install with credential management
- scripts/verify.sh - 6-point health check (namespaces, pods, job, deployments, ingress)
- scripts/healthcheck.sh - Detailed pod health checks
- scripts/teardown.sh - Terraform destroy + Azure CLI fallback + orphan detection

CI/CD:
- azure-pipelines.yml - Azure DevOps pipeline (manual + nightly schedule)
- Dockerfile - Container with terraform, kubectl, helm, az-cli
- byoc-values.yaml.template - Helm values for airgap deployment

Module changes (quix-aks):

Test plan