feat: Migrate Zuul to new production cluster (otcinfra2) by LukasCuperDT · Pull Request #1605 · opentelekomcloud-infra/system-config

LukasCuperDT · 2026-05-14T06:11:34Z

Summary

Migrate Zuul CI/CD to the new production cluster otcinfra2, managed by ArgoCD
(zuul-ci Application, namespace zuul-ci). Vault is vault-k8s (HA, KV v2
at secret/), accessed by ArgoCD Vault Plugin with
<path:secret/data/...#field> placeholders.

This branch supersedes #1583 — same head/base, refreshed description that
matches the actual current state after several rounds of fixes.

Zuul prod overlay (`kubernetes/kustomize/zuul/overlays/prod/`)

- Full kustomize overlay for Zuul on otcinfra2.
- zuul.conf connections (validated against the running scheduler):
  - github → app_id 3404137, key /var/lib/zuul-ssh/github.key
  - gitlab → git.tsi-dev.otc-service.com
  - gitea → gitea-k8s.eco.tsi-dev.otc-service.com (the new k8s gitea, port 2222)
- Repointed gitea connection from gitea.eco.tsi-dev.otc-service.com (old VM
  behind HAProxy) to gitea-k8s.eco.tsi-dev.otc-service.com. Removed the
  192.168.170.200 HAProxy hostAliases workaround; zuul now reaches gitea
  via public DNS / ingress (80.158.58.167) with valid TLS, since it lives in
  the same cluster.
- Bumped Zuul images to 13.1.1-gitea-v6.
- clouds.yaml: cloud_users/448_nodepool + clouds/otcci_nodepool_pool1,
  private: true endpoint (same VPC).
- Patches for zookeeper, executor, merger, scheduler, web, nodepool-launcher.
- Ingress for zuul-web and the webhook endpoint
  (/api/connection, /api/webhook on zuul.eco.tsi-dev.otc-service.com).
- ZK and executor volumes use csi-disk-topology; scheduler keys use
  csi-disk-retain.

Zuul preprod overlay

Mirrors the prod layout (executor/merger/scheduler/web patches, ingress,
PV/PVC) so both environments share structure.

Vault integration

- Vault TLS enabled with self-signed CA via cert-manager; CA replicated to
  zuul-ci namespace for the vault-agent sidecar.
- vault-agent runs as zuul UID 10001 (token-file permissions).
- VAULT_ADDR switched to the internal service; VAULT_API_ADDR set to the
  external ingress URL.
- NetworkPolicy: allow zuul-ci → vault on 8200.
- Auto-unsealer sidecar (uses wget against the API).
- Image pull secret (quay-pull-secret) added on the openstack-plugin init
  container.

Vault paths consumed (all live)

- zuul/database
- zuul/connections/{gitea,github_k8s,gitlab}
- zuul/auth/{zitadel,gitea-actions}
- zuul/oauth2-proxy
- zuul/merger (includes known_hosts, which now contains both old and new gitea
  RSA keys so the merger can clone over SSH)
- zuul/keystore_password, zuul/sshkey
- zuul/logs/otc_technical_user
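
The paths above are referenced from manifests through ArgoCD Vault Plugin placeholders of the form `<path:secret/data/...#field>`. As a rough illustration of how such a placeholder splits into a Vault path and a field name, here is a minimal Python sketch. The regular expression, the `parse_placeholder` helper, and the field name `password` are assumptions made for this example; they are not taken from the overlay or from the plugin's code.

```python
import re

# Pattern for placeholders of the form <path:VAULT_PATH#FIELD>.
PLACEHOLDER = re.compile(r"<path:(?P<path>[^#>]+)#(?P<field>[^>]+)>")


def parse_placeholder(text):
    """Return (path, field) for the first placeholder in text, or None."""
    match = PLACEHOLDER.search(text)
    if match is None:
        return None
    return match.group("path"), match.group("field")


# Hypothetical manifest line; the key and the field name are assumptions.
example = "dbpassword: <path:secret/data/zuul/database#password>"
print(parse_placeholder(example))
```

With KV v2 mounted at secret/, the `data/` segment is part of the API path, which is why it appears between the mount and the logical path.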

Tenant `eco2`

Tenant config lives in opentelekomcloud-infra/zuul-infra (loaded from
branch Migration_of_zuul_to_prod). After this branch, include-branches
restrictions were removed so PRs against any branch are processed.

Misc

Reduced resource requests across otcinfra2 apps.
Nodepool: max-hold-age set, pod resources reduced.
Removed missing publish-otc-docs-pti template from zuul.d/project.yaml.

Status

Scheduler / web / merger / executor running and reachable
gitea webhook reaches scheduler, gitea SSH cloning works (host key fix)
github webhook + recheck pipeline reaching scheduler
Final sign-off after recheck triggers a successful job run
Switch zuul-config PVC to csi-sfs dynamic provisioning (RWX)

- Add zuul, swift-proxy, oauth2-proxy ArgoCD apps to argocd-k8s
- Add new_prod overlay for zuul kustomize
- Add new_prod overlay for oauth2-proxy kustomize
- Add swift-proxy helm chart with values-new-prod.yaml
- Add postgresql helm charts (upstream + local) with csi-disk-retain
- Fix AVP references: align with preprod secret structure
- Fix clouds.yaml: use cloud_users/448_nodepool for creds
- Set private: true for internal OTC endpoint access

…ename new_prod to prod

- Remove clusterIP: None from zuul-web service (breaks ingress routing)
- Remove nodeSelector dedicated=zuul-ci from all 8 base components
- Add preprod overlay from monitoring branch (kustomize-validated)
- Rename new_prod overlay to prod
- Rename swift-proxy values-new-prod.yaml to values-prod.yaml

- Add JSON 6902 patch: clusterIP: None on zuul-web service
- Add JSON 6902 patch: nodeSelector dedicated=zuul-ci on all zuul pods
- Matches live otcci cluster state exactly (zero diff with main)

Rename all new-prod/new_prod references to prod across the production overlay: resource names, labels, cert names, swift container, log paths, k8s context, and tenant config path.

- Fix oauth2-proxy ArgoCD path (new_prod → prod), rename app to oauth2-proxy-zuul
- Set SFS Turbo IP 192.168.170.45 for zuul PVs
- Fix StorageClass to csi-sfsturbo, add Retain reclaim policy
- Change namespace from zuul to zuul-ci (matches Vault roles)
- Add swift-proxy zuul-prod-logs tempauth user
- Add zuul executor swift auth credentials in site-vars.yaml

- Fix ZK cert SANs, zuul.conf, nodepool.yaml: zuul.svc → zuul-ci.svc
- Change zuul-config PV path from / to /zuul-config (avoid SFS Turbo root conflict)
- Add nodeSelector os.name=HCE_OS_2.0_x86_64 to all workloads (CentOS kernel 3.10 breaks bwrap)

WaitForFirstConsumer binding mode ensures EVS disks are provisioned in the same AZ (eu-de-03) as the HCE worker nodes.

…d nodepool-launcher volumes patch

- Include opentelekomcloud/ansible-collection-cloud as untrusted project
- Only include project config (jobs/templates come from zuul-infra)
- Load from main branch

- New connection github-otc (app_id 3404137) for opentelekomcloud org
- Add github-otc.key to zuul-ssh-keys secret
- Move ansible-collection-cloud under github-otc source in tenant config

- Remove github-otc connection, use single github connection for both orgs
- Move ansible-collection-cloud under github source in tenant config
- Remove github-otc.key from zuul-ssh-keys secret

…h only

Separate prod (github_k8s) from preprod (github) AVP references so both environments can coexist with different GitHub App configs.

Override implied branch matcher so PRs targeting main are also matched by the prod project config.

Update project.yaml branches matcher and tenant config include-branches to use Zuul_test instead of Migration_of_zuul_to_prod for testing.

Cluster CPU requests were at 130% of capacity (25.5/19.6 cores). Adjusted requests to match actual usage with headroom:

- vault: server 500m→100m, injector/CSI 250m→50m
- zitadel: 250m→50m, 512Mi→256Mi
- dependencytrack: API 1000m→100m (mem 4G→5Gi for actual 4.6Gi usage)
- dependencytrack frontend: 1000m→100m
- mcaptcha: 500m→200m, redis 250m→50m
- postgresql: primary 250m→50m, metrics 100m→25m
- element: synapse/SFU/haproxy 100m→25m, redis 50m→10m
- cpn frontend: 200m→50m

Total CPU request savings: ~4785m
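
The 130% figure is simply requested cores divided by capacity. A minimal Python sketch of that arithmetic follows; the helper names `millicores` and `request_ratio` are made up for this example, and the per-pod saving shown assumes a single vault server pod, which the commit message does not state.

```python
def millicores(value):
    """Convert a Kubernetes CPU quantity such as '500m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(round(float(value) * 1000))


def request_ratio(requested_cores, capacity_cores):
    """Return requested CPU as a percentage of capacity."""
    return 100.0 * requested_cores / capacity_cores


# Figures quoted in the commit message: 25.5 cores requested, 19.6 available.
before = request_ratio(25.5, 19.6)
print(f"requests before: {before:.0f}% of capacity")

# One reduction from the list: vault server 500m -> 100m (per pod).
saved = millicores("500m") - millicores("100m")
print(f"vault server: {saved}m saved per pod")
```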

Adds a lightweight sidecar to the vault StatefulSet that:

- Checks seal status every 10s via localhost:8200
- If sealed, reads 3 unseal keys from vault-unseal-keys secret
- Runs vault operator unseal for each key

The RBAC for accessing vault-unseal-keys was already set up in the local vault chart (unsealer-rbac.yaml).

vault status returns exit code 2 when sealed, causing the sidecar to think vault is unreachable. Switch to wget-based API calls which correctly detect sealed state and unseal via /v1/sys/unseal.
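
The sidecar itself uses wget in a shell loop; the same logic can be sketched in Python to show why the API is easier to reason about than the CLI exit code. This is only an illustration under assumptions: the `http://` scheme in `VAULT_ADDR`, the helper names, and the injectable `call` parameter are invented for the example, and a TLS-enabled listener would additionally need the CA configured. It relies on the `sealed` field that Vault returns from `/v1/sys/seal-status` and `/v1/sys/unseal`.

```python
import json
import urllib.request

VAULT_ADDR = "http://localhost:8200"  # assumption: scheme is illustrative only


def http_json(method, path, payload=None):
    """Send a JSON request to the Vault API and decode the JSON response."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(VAULT_ADDR + path, data=data, method=method)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def is_sealed(call=http_json):
    """Read the seal state from the API instead of a CLI exit code."""
    return bool(call("GET", "/v1/sys/seal-status")["sealed"])


def unseal(keys, call=http_json):
    """Submit unseal keys one by one; return True if an unseal was attempted."""
    if not is_sealed(call):
        return False  # already unsealed, nothing to do
    for key in keys:
        status = call("POST", "/v1/sys/unseal", {"key": key})
        if not status["sealed"]:
            break  # threshold reached
    return True
```

A sidecar would call `unseal` with the keys read from the secret on each 10-second tick; the `call` parameter exists only so the logic can be exercised without a running Vault.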

- Add max-hold-age: 7200 (2h) to zuul-debian label so stale pods/namespaces auto-expire instead of lingering indefinitely
- Reduce pod requests from 1CPU/1Gi to 500m/512Mi (CI jobs rarely need a full core)
- Increase limits to 2CPU/2Gi to allow burst during builds

GitHub check runs will now show as eco/check instead of production/check. Pipeline regex patterns already use .*/check so no other changes needed.
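
A quick way to confirm that the `.*/check` pattern covers both the old and the new check-run name is to try it directly. The sketch below uses Python's `re` module with a full-string match, which is an assumption about how the pattern is applied rather than a statement about Zuul's internals.

```python
import re

# Pattern quoted in the commit message.
PATTERN = re.compile(r".*/check")

for name in ("production/check", "eco/check", "eco/gate"):
    print(name, bool(PATTERN.fullmatch(name)))
```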

GitHub App not installed on terraform-opentelekomcloud-modules org; the resulting 403 on branch-protection GraphQL trips the GithubRateLimitHandler bogus rate-limit (3389s sleep) and wedges tenant load. Drop until App is installed on that org.

The eco2 production tenant was missing the opendev source, so jobs resolving roles like 'bindep', 'revoke-sudo', etc. from opendev.org/zuul/zuul-jobs failed with:

  ERROR! the role 'bindep' was not found

Mirror the legacy eco tenant (otcci cluster) by adding zuul-jobs, ansible-collections-openstack and refstack-client as untrusted-projects under the existing 'opendev' connection.

Shadow zuul-jobs only with opentelekomcloud-infra/zuul-infra so the local zuul-infra in-repo jobs win over upstream definitions where applicable. otc-zuul-jobs is intentionally NOT shadowed and not included as a project - new prod cluster does not use otc-zuul-jobs; everything must live in zuul-infra or system-config.

Also remove opentelekomcloud-infra/otc-zuul-jobs from the github untrusted-projects list for the same reason.

Fixes the bindep role resolution failure in publish-otc-docs-hc periodic job on eco2.

26 untrusted-projects in the eco2 tenant define their own build/lint jobs in their own .zuul.yaml (e.g. <repo>-build-image, tflint, simple, otc-api-ref-src). The tenant previously restricted include to only 'project', which caused those self-referential job definitions to be ignored, producing 'Job X not defined' config errors for each repo.

Add 'job' to the include list for the affected repos so Zuul loads their job definitions alongside their project pipeline.

Affected repos:
- opentelekomcloud-infra/{actions-runner,alerta,cloudmon-plugin-smtp,csm-test-utils,gitstyring,identity-deploy,octavia-proxy,otc-api-ref,rancher-integration-tests,vault-raft-backup,zuul-config}
- opentelekomcloud/{ansible-collection-gitcontrol,oms,python-otcextensions,swift-proxy,terraform-opentelekomcloud-modules,terraform-opentelekomcloud-project-factory,terraform-provider-opentelekomcloud,terraform-provider-rancher2}
- stackmon/{apimon,cloudmon,cloudmon-plugin-loadbalancer,globalmon,metrics-processor,otc-status-dashboard,status-dashboard}
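
The shape of the change can be pictured as one `include` list per project entry. The Python sketch below builds that structure as plain dictionaries and prints it as JSON; the helper name `with_job_include` is invented for this example, only three of the 26 repositories are shown, and the actual tenant file in zuul-infra is YAML and may be laid out differently.

```python
import json


def with_job_include(projects):
    """Return tenant entries that load both 'project' and 'job' items."""
    return [{name: {"include": ["project", "job"]}} for name in projects]


# Three of the affected repositories named in the commit message.
affected = [
    "opentelekomcloud-infra/otc-api-ref",
    "opentelekomcloud/swift-proxy",
    "stackmon/apimon",
]
print(json.dumps(with_job_include(affected), indent=2))
```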

Allow infra-prod-* job definitions in opentelekomcloud-infra/system-config to be loaded by the eco2 tenant. Without this, the octavia-proxy project references infra-prod-propose-octavia-proxy-image-stable but the job definition (in system-config/zuul.d/infra-prod.yaml) is silently dropped.

The 'infra-prod-propose-update' base job (and its descendants like infra-prod-propose-octavia-proxy-image-stable) reference a Zuul secret named 'github_propose_token' that was never defined anywhere. This caused a 'job ... uses secret github_propose_token which is not defined' config error.

* Define github_propose_token in zuul.d/secrets.yaml (user/email/token, encrypted with the eco2 tenant pubkey for opentelekomcloud-infra/system-config). Token comes from vault secret/github/otcbot.
* Update the eco2 tenant include for opentelekomcloud-infra/system-config to add 'secret' alongside 'project' and 'job' so the secret is loaded.

…cret for gitcontrol

The eco2 tenant was loading opentelekomcloud-infra/system-config jobs and secrets from the default branch (main), which still contains the old bridge-host based propose-update parent and lacks the github_propose_token secret. Set load-branch to Migration_of_zuul_to_prod so the new k8s-only stack is used.

Also include 'secret' for opentelekomcloud/ansible-collection-gitcontrol so its in-repo gitcontrol-secret can resolve.

load-branch on an untrusted project does not RESTRICT loading to that branch; Zuul still loads main as well. Combined with system-config defining the same image-build/infra-prod jobs that zuul-infra (config-project) also defines, this triggered ~40 'not permitted to shadow' errors. Reverting until we either (a) merge the migration branch into main, or (b) introduce include-branches with an explicit allow-list.

All system-config jobs (docker-image, infra-prod-*) have been moved to zuul-infra (config-project) where they belong. Loading 'job' from system-config@main now causes ~40 'not permitted to shadow' errors because the same job names also exist in zuul-infra. Drop 'job' and 'secret' from the include list -- system-config is now a pure project-stanza source for the eco2 tenant.
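
The shadowing errors come down to the same job name being defined in two repositories. A simplified way to spot candidates before a reconfigure is to intersect the job names from both sources, as in the Python sketch below. The job lists are hypothetical samples built from names mentioned in these commit messages, and the check deliberately ignores the finer points of Zuul's shadowing rules.

```python
def shadow_conflicts(config_project_jobs, untrusted_jobs):
    """Return job names defined in both sources, sorted for stable output."""
    return sorted(set(config_project_jobs) & set(untrusted_jobs))


# Hypothetical samples; the real lists live in zuul-infra and system-config.
zuul_infra = ["docker-image", "infra-prod-base", "infra-prod-propose-update"]
system_config = ["infra-prod-base", "infra-prod-propose-update"]

print(shadow_conflicts(zuul_infra, system_config))
```

An empty result for a given pair of repositories would be consistent with system-config acting as a pure project-stanza source.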

These jobs (infra-prod-bootstrap-bridge, infra-prod-service-bridge, infra-prod-gitea-sync) are not needed in this setup: the infra is managed via k8s/ArgoCD, not via an Ansible bridge host.

- Remove all three from the deploy pipeline
- Remove bootstrap-bridge and service-bridge from the periodic pipeline
- Remove the entire periodic-hourly section (only contained these jobs)
- Fix infra-prod-grafana-datasources: was depending on bootstrap-bridge (bridge-specific), now depends on infra-prod-base like all others

The python-tox image does not include LaTeX/wget and other packages required for PDF doc builds. All CI roles (prepare-build-pdf-docs, bindep package install) are designed for root-accessible nodes. Running as uid 1000 with no sudo prevents any system package installation. Change runAsUser/runAsGroup to 0 so the pod runs as root, matching the operational model assumed by the playbooks.

…tem/* projects

Set unauthenticated_metrics_access=false in vault listener telemetry config. Prometheus uses a bearer token (secret/prometheus/infra#vault_metrics_token) for authenticated scraping, so monitoring will continue to work.

git.tsi-dev.otc-service.com returns an SSO/privacyIDEA login page instead of API responses, causing Zuul full-reconfigure to hang indefinitely when fetching branches for ecosystem/zuul-project-config. Comment out the gitlab connection in zuul.conf and the gitlab section in the eco2 tenant config until the GitLab API token/auth is restored.

The repo has been transferred to stackmon/ansible-collection-apimon which is already in the tenant config. The stale opentelekomcloud/ entry causes a 403 on every startup (no GitHub App installation), falling back to unauthenticated requests that immediately exhaust the 60/hr rate limit.

Repo exists on Gitea but was missing from the tenant config. Also installed webhook (id=138) on the repo.

- Add helm chart under kubernetes/helm_charts/local/gitea-runner/
- Deploys act_runner connected to gitea-k8s.eco.tsi-dev.otc-service.com
- Runner name: ubuntu-gitea-runner-01 (labels: ubuntu, ubuntu-22.04)
- Uses dind sidecar for Docker-in-Docker builds, runner image: debian:bookworm
- Registration token sourced from Vault: gitea-k8s/gitea-runner
- Add ArgoCD app entry in prod values for otcinfra2 cluster (targetRevision: Migration_of_zuul_to_prod)

- Use instance-level registration token (Gitea site admin > Runners) instead of org-level, so runner serves ALL repos on gitea-k8s
- Rename runner to gitea-k8s-runner-01 (not system-config specific)
- Add more runner labels: debian, debian-bookworm in addition to ubuntu

…s/checkout)

debian:bookworm lacks node binary needed by JS actions like actions/checkout@v4

node:20-bookworm causes actions/setup-python@v5 to fail because it cannot find Python 3.12 for the OS (no match in python-versions manifest). catthehacker/ubuntu:act-22.04 is designed for act-based runners and supports actions/setup-python correctly.

Building Go providers with goreleaser (14 cross-compilation targets) OOM-kills the container. Increase memory limit from 2Gi to 4Gi to give goreleaser sufficient headroom. Also aligns better with the eco1 VMs (s2.large.2 = 8Gi RAM) used for the same job type.

Prod requested cpu:500m, preprod requested cpu:1 (==limit, no bursting). On the 5-node cluster, CPU requests were oversubscribed: real usage was only 11-36%, but the scheduler reported 'Insufficient cpu' and the cluster-autoscaler refused to scale ('pod didn't trigger scale-up'). This blocked periodic publish-otc-docs-hc jobs from getting nodes for 30+ hours, exhausting the docs-build semaphore (max=10). Reduce CPU requests to 200m in both overlays so more concurrent build pods can schedule onto existing nodes. Memory requests and all limits unchanged so heavy jobs (goreleaser, tox-functional, terraform) can still burst up to their original ceiling without OOM risk. CPU is compressible so over-packing only causes graceful throttling, never kills.
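
Because the scheduler packs pods by CPU request rather than by actual usage, lowering the request directly raises how many build pods fit on a node. The Python sketch below shows that relationship for the three request values mentioned above. The 4000m allocatable figure is a hypothetical node size chosen for illustration, and the calculation ignores other pods and memory.

```python
def pods_that_fit(allocatable_m, request_m):
    """Number of pods whose CPU requests fit on one otherwise empty node."""
    return allocatable_m // request_m


NODE_ALLOCATABLE_M = 4000  # hypothetical: 4 allocatable cores per node

for label, request in (("preprod before", 1000), ("prod before", 500), ("after", 200)):
    print(label, pods_that_fit(NODE_ALLOCATABLE_M, request))
```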

The python-tox image used for zuul-debian has no container tooling (no podman/buildah/docker/skopeo) and no systemd, so image-build jobs that ran ensure-docker on it produced RETRY_LIMIT. Introduce a dedicated builder pod label backed by quay.io/buildah/stable which ships buildah + podman + skopeo + fuse-overlayfs. Runs privileged so buildah can do rootful builds; uses STORAGE_DRIVER=vfs to avoid fuse-overlayfs requirements in the containers-storage emptyDir volume. zuul-infra image-build/upload/promote jobs are switched to this new nodeset in a companion change.

The zuul.d/infra-prod.yaml, project.yaml and secrets.yaml files were modified for the new k8s-only prod setup, but:

1. Secrets (otc_clouds_yaml, eco2_vault_approle) are defined in zuul-infra repo and cannot be used by jobs in system-config.
2. Changing these files in this PR modifies the old prod configuration which must remain unchanged during the migration.

New prod job definitions belong in zuul-infra/zuul.d/, not here.

Nightly docs build batch was triggering OOM on the zuul-debian pod when apt installed ~1.4GB of packages (due to recommends being enabled). The 4Gi limit was insufficient during peak concurrent job load. Increasing to 6Gi provides headroom for the PDF prereq install even after the install_recommends: false fix is deployed, as a safety net.

LukasCuperDT added 30 commits May 13, 2026 11:36

feat: Add oauth2-proxy preprod overlay from monitoring branch

f9d3edf

fix: Restore nodeSelector and headless service in zuul_ci overlay

d7ff779

- Add JSON 6902 patch: clusterIP: None on zuul-web service
- Add JSON 6902 patch: nodeSelector dedicated=zuul-ci on all zuul pods
- Matches live otcci cluster state exactly (zero diff with main)

Rename new-prod to prod in zuul overlay

38d0e1c

Rename all new-prod/new_prod references to prod across the production overlay: resource names, labels, cert names, swift container, log paths, k8s context, and tenant config path.

zuul prod: use csi-disk-topology for ZK and executor volumes

075e252

WaitForFirstConsumer binding mode ensures EVS disks are provisioned in the same AZ (eu-de-03) as the HCE worker nodes.

Fix tenant config (remove default zuul.d from extra-config-paths), ad…

9a763c4

…d nodepool-launcher volumes patch

Fix tenant config: use Migration_of_zuul_to_prod branch for all projects

d108bb0

Add ansible-collection-cloud to tenant config

976ecba

- Include opentelekomcloud/ansible-collection-cloud as untrusted project
- Only include project config (jobs/templates come from zuul-infra)
- Load from main branch

Add github-otc connection for opentelekomcloud org

403b8eb

- New connection github-otc (app_id 3404137) for opentelekomcloud org
- Add github-otc.key to zuul-ssh-keys secret
- Move ansible-collection-cloud under github-otc source in tenant config

Consolidate to single GitHub connection (app_id 3404137)

85894b0

- Remove github-otc connection, use single github connection for both orgs
- Move ansible-collection-cloud under github source in tenant config
- Remove github-otc.key from zuul-ssh-keys secret

Add main branch to Gitea include-branches for PR triggers

7b37cf8

Add sshkey to Gitea connection for private repo cloning

ae05d88

Set git_port=2222 for Gitea SSH connection

8754cdc

Remove missing publish-otc-docs-pti template from project config

93ed2e0

Remove main from Gitea include-branches until PR is merged

856f09d

Add main to gitea include-branches for prod zuul tenant

6def4d8

Set ZUUL_CONFIG_BRANCH to Migration_of_zuul_to_prod for zuul-infra

b034514

Restrict zuul-infra config-project to Migration_of_zuul_to_prod branc…

e8b40bb

…h only

Remove merge-mode squash-merge (not supported by GitHub connection)

dea0f9c

Use github_k8s Vault path for prod zuul GitHub connection

28acdd2

Separate prod (github_k8s) from preprod (github) AVP references so both environments can coexist with different GitHub App configs.

Add explicit branches matcher for prod check pipeline

67ea502

Override implied branch matcher so PRs targeting main are also matched by the prod project config.

Switch zuul target branch to Zuul_test

05461df

Update project.yaml branches matcher and tenant config include-branches to use Zuul_test instead of Migration_of_zuul_to_prod for testing.

Fix vault unsealer: use wget API instead of vault CLI

c502d73

vault status returns exit code 2 when sealed, causing the sidecar to think vault is unreachable. Switch to wget-based API calls which correctly detect sealed state and unseal via /v1/sys/unseal.

Rename zuul tenant from 'production' to 'eco'

4e51a62

GitHub check runs will now show as eco/check instead of production/check. Pipeline regex patterns already use .*/check so no other changes needed.

LukasCuperDT added 15 commits May 14, 2026 10:14

Document projects removed from eco2 tenant during migration

3f6afbe

Remove noop stub jobs from periodic pipeline

402d488

tenant: add gitea docs/*, docs-swiss/*, infra/* repos + gitlab ecosys…

b50e59c

…tem/* projects

Fix YAML anchor null value for infra-prod-base job

2cc2ff1

security: disable unauthenticated vault metrics endpoint

f2db155

Set unauthenticated_metrics_access=false in vault listener telemetry config. Prometheus uses a bearer token (secret/prometheus/infra#vault_metrics_token) for authenticated scraping, so monitoring will continue to work.

LukasCuperDT force-pushed the Migration_of_zuul_to_prod branch from cec68b4 to f2db155 Compare May 22, 2026 08:29

LukasCuperDT added 14 commits May 22, 2026 10:35

Upgrading Argocd

71ea3f7

zuul: add infra/otc-metadata-rework to eco2 tenant

9f20ee6

Repo exists on Gitea but was missing from the tenant config. Also installed webhook (id=138) on the repo.

Fix runner labels: use node:20-bookworm image (has node.js for action…

c503f9d

…s/checkout)

debian:bookworm lacks node binary needed by JS actions like actions/checkout@v4

LukasCuperDT wants to merge 82 commits into main from Migration_of_zuul_to_prod
