feat: Migrate Zuul to new production cluster (otcinfra2)#1605

Open
LukasCuperDT wants to merge 82 commits into main from Migration_of_zuul_to_prod
Summary

Migrate Zuul CI/CD to the new production cluster otcinfra2, managed by ArgoCD
(zuul-ci Application, namespace zuul-ci). Vault is vault-k8s (HA, KV v2
at secret/), accessed by ArgoCD Vault Plugin with
<path:secret/data/...#field> placeholders.
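
As a rough illustration of the placeholder mechanism, a manifest rendered through ArgoCD Vault Plugin might look like the sketch below. The Secret name and the `password` field are hypothetical; only the `secret/data/zuul/database` path follows from the KV v2 mount at `secret/` and the Vault paths listed in this description.

```yaml
# Sketch only: Secret name and field name are assumptions, not copied from the overlay.
apiVersion: v1
kind: Secret
metadata:
  name: zuul-database-example        # hypothetical name
  namespace: zuul-ci
type: Opaque
stringData:
  # At render time AVP substitutes the value of the (assumed) "password"
  # field stored under zuul/database in the KV v2 mount at secret/.
  password: <path:secret/data/zuul/database#password>
```

Note the `data/` segment in the placeholder: KV v2 exposes secret payloads under `secret/data/...`, which is why the paths in the manifests are longer than the logical paths listed under "Vault paths consumed".
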

This branch supersedes #1583 — same head/base, refreshed description that
matches the actual current state after several rounds of fixes.

Zuul prod overlay (kubernetes/kustomize/zuul/overlays/prod/)

  • Full kustomize overlay for Zuul on otcinfra2.
  • zuul.conf connections (validated against the running scheduler):
    • github → app_id 3404137, key /var/lib/zuul-ssh/github.key
    • gitlab → git.tsi-dev.otc-service.com
    • gitea → gitea-k8s.eco.tsi-dev.otc-service.com (the new k8s gitea, port 2222)
  • Repointed gitea connection from gitea.eco.tsi-dev.otc-service.com (old VM
    behind HAProxy) to gitea-k8s.eco.tsi-dev.otc-service.com. Removed the
    192.168.170.200 HAProxy hostAliases workaround — zuul now reaches gitea
    via public DNS / ingress (80.158.58.167) with valid TLS, since it lives in
    the same cluster.
  • Bumped Zuul images to 13.1.1-gitea-v6.
  • clouds.yaml: cloud_users/448_nodepool + clouds/otcci_nodepool_pool1,
    private: true endpoint (same VPC).
  • Patches for zookeeper, executor, merger, scheduler, web, nodepool-launcher.
  • Ingress for zuul-web and the webhook endpoint
    (/api/connection, /api/webhook on zuul.eco.tsi-dev.otc-service.com).
  • ZK and executor volumes use csi-disk-topology; scheduler keys use
    csi-disk-retain.
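
The ingress bullet above could be expressed roughly as follows. This is a sketch under assumptions: the resource name, the `nginx` ingress class, and the Service port 9000 are not stated in this PR; the host, the two paths, and the `zuul-web` Service name are.

```yaml
# Sketch only: name, ingress class, and port are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: zuul-web-example             # hypothetical name
  namespace: zuul-ci
spec:
  ingressClassName: nginx            # assumed ingress controller
  rules:
    - host: zuul.eco.tsi-dev.otc-service.com
      http:
        paths:
          - path: /api/connection    # webhook deliveries arrive under this prefix
            pathType: Prefix
            backend:
              service:
                name: zuul-web
                port:
                  number: 9000       # assumed zuul-web listen port
          - path: /api/webhook
            pathType: Prefix
            backend:
              service:
                name: zuul-web
                port:
                  number: 9000       # assumed zuul-web listen port
```
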

Zuul preprod overlay

Mirrors the prod layout (executor/merger/scheduler/web patches, ingress,
PV/PVC) so both environments share structure.

Vault integration

  • Vault TLS enabled with self-signed CA via cert-manager; CA replicated to
    zuul-ci namespace for the vault-agent sidecar.
  • vault-agent runs as zuul UID 10001 (token-file permissions).
  • VAULT_ADDR switched to the internal service; VAULT_API_ADDR set to the
    external ingress URL.
  • NetworkPolicy: allow zuul-ci → vault on 8200.
  • Auto-unsealer sidecar (uses wget against the API).
  • Image pull secret (quay-pull-secret) added on the openstack-plugin init
    container.
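
The NetworkPolicy bullet above might be written along these lines. The policy name, the `vault` namespace, and the pod label are assumptions; the source namespace `zuul-ci` and port 8200 come from this description.

```yaml
# Sketch only: policy name, Vault namespace, and pod label are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-zuul-ci-to-vault       # hypothetical name
  namespace: vault                   # assumed namespace of the Vault pods
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault  # assumed label on the Vault pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label that Kubernetes sets automatically on every namespace.
              kubernetes.io/metadata.name: zuul-ci
      ports:
        - protocol: TCP
          port: 8200
```
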

Vault paths consumed (all live)

  • zuul/database
  • zuul/connections/{gitea,github_k8s,gitlab}
  • zuul/auth/{zitadel,gitea-actions}
  • zuul/oauth2-proxy
  • zuul/merger (includes known_hosts — now contains both old and new gitea
    RSA keys so the merger can clone over SSH)
  • zuul/keystore_password, zuul/sshkey
  • zuul/logs/otc_technical_user

Tenant eco2

Tenant config lives in opentelekomcloud-infra/zuul-infra (loaded from
branch Migration_of_zuul_to_prod). After this branch, include-branches
restrictions were removed so PRs against any branch are processed.

Misc

  • Reduced resource requests across otcinfra2 apps.
  • Nodepool: max-hold-age set, pod resources reduced.
  • Removed missing publish-otc-docs-pti template from zuul.d/project.yaml.

Status

  • Scheduler / web / merger / executor running and reachable
  • gitea webhook reaches scheduler, gitea SSH cloning works (host key fix)
  • github webhook + recheck pipeline reaching scheduler
  • Final sign-off after recheck triggers a successful job run
  • Switch zuul-config PVC to csi-sfs dynamic provisioning (RWX)

Related

- Add zuul, swift-proxy, oauth2-proxy ArgoCD apps to argocd-k8s
- Add new_prod overlay for zuul kustomize
- Add new_prod overlay for oauth2-proxy kustomize
- Add swift-proxy helm chart with values-new-prod.yaml
- Add postgresql helm charts (upstream + local) with csi-disk-retain
- Fix AVP references: align with preprod secret structure
- Fix clouds.yaml: use cloud_users/448_nodepool for creds
- Set private: true for internal OTC endpoint access
…ename new_prod to prod

- Remove clusterIP: None from zuul-web service (breaks ingress routing)
- Remove nodeSelector dedicated=zuul-ci from all 8 base components
- Add preprod overlay from monitoring branch (kustomize-validated)
- Rename new_prod overlay to prod
- Rename swift-proxy values-new-prod.yaml to values-prod.yaml
- Add JSON 6902 patch: clusterIP: None on zuul-web service
- Add JSON 6902 patch: nodeSelector dedicated=zuul-ci on all zuul pods
- Matches live otcci cluster state exactly (zero diff with main)

Rename all new-prod/new_prod references to prod across the
production overlay: resource names, labels, cert names, swift
container, log paths, k8s context, and tenant config path.
- Fix oauth2-proxy ArgoCD path (new_prod → prod), rename app to oauth2-proxy-zuul
- Set SFS Turbo IP 192.168.170.45 for zuul PVs
- Fix StorageClass to csi-sfsturbo, add Retain reclaim policy
- Change namespace from zuul to zuul-ci (matches Vault roles)
- Add swift-proxy zuul-prod-logs tempauth user
- Add zuul executor swift auth credentials in site-vars.yaml
- Fix ZK cert SANs, zuul.conf, nodepool.yaml: zuul.svc → zuul-ci.svc
- Change zuul-config PV path from / to /zuul-config (avoid SFS Turbo root conflict)
- Add nodeSelector os.name=HCE_OS_2.0_x86_64 to all workloads (CentOS kernel 3.10 breaks bwrap)

WaitForFirstConsumer binding mode ensures EVS disks are provisioned
in the same AZ (eu-de-03) as the HCE worker nodes.
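
For readers unfamiliar with the binding mode, a StorageClass that delays provisioning until a pod is scheduled looks roughly like this. The class name is the one mentioned in the description; the provisioner name is an assumption about the cluster's CSI driver, and driver-specific parameters are deliberately left out.

```yaml
# Sketch only: provisioner is assumed; driver parameters are omitted.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk-topology
provisioner: everest-csi-provisioner   # assumed CSI provisioner name
# The volume is created only after the consuming pod has been scheduled,
# so the disk lands in the same availability zone as the chosen node.
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```
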
- Include opentelekomcloud/ansible-collection-cloud as untrusted project
- Only include project config (jobs/templates come from zuul-infra)
- Load from main branch
- New connection github-otc (app_id 3404137) for opentelekomcloud org
- Add github-otc.key to zuul-ssh-keys secret
- Move ansible-collection-cloud under github-otc source in tenant config
- Remove github-otc connection, use single github connection for both orgs
- Move ansible-collection-cloud under github source in tenant config
- Remove github-otc.key from zuul-ssh-keys secret

Separate prod (github_k8s) from preprod (github) AVP references
so both environments can coexist with different GitHub App configs.

Override implied branch matcher so PRs targeting main are
also matched by the prod project config.

Update project.yaml branches matcher and tenant config include-branches
to use Zuul_test instead of Migration_of_zuul_to_prod for testing.

Cluster CPU requests were at 130% of capacity (25.5/19.6 cores).
Adjusted requests to match actual usage with headroom:

- vault: server 500m→100m, injector/CSI 250m→50m
- zitadel: 250m→50m, 512Mi→256Mi
- dependencytrack: API 1000m→100m (mem 4G→5Gi for actual 4.6Gi usage)
- dependencytrack frontend: 1000m→100m
- mcaptcha: 500m→200m, redis 250m→50m
- postgresql: primary 250m→50m, metrics 100m→25m
- element: synapse/SFU/haproxy 100m→25m, redis 50m→10m
- cpn frontend: 200m→50m

Total CPU request savings: ~4785m

Adds a lightweight sidecar to the vault StatefulSet that:
- Checks seal status every 10s via localhost:8200
- If sealed, reads 3 unseal keys from vault-unseal-keys secret
- Runs vault operator unseal for each key

The RBAC for accessing vault-unseal-keys was already set up
in the local vault chart (unsealer-rbac.yaml).

vault status returns exit code 2 when sealed, causing the sidecar
to think vault is unreachable. Switch to wget-based API calls
which correctly detect sealed state and unseal via /v1/sys/unseal.

- Add max-hold-age: 7200 (2h) to zuul-debian label so stale
  pods/namespaces auto-expire instead of lingering indefinitely
- Reduce pod requests from 1CPU/1Gi to 500m/512Mi (CI jobs rarely
  need a full core)
- Increase limits to 2CPU/2Gi to allow burst during builds

GitHub check runs will now show as eco/check instead of
production/check. Pipeline regex patterns already use .*/check
so no other changes needed.

The GitHub App is not installed on the terraform-opentelekomcloud-modules org;
the resulting 403 on the branch-protection GraphQL query makes
GithubRateLimitHandler apply a bogus rate limit (3389s sleep), which wedges
tenant load. Drop it until the App is installed on that org.

The eco2 production tenant was missing the opendev source, so jobs
resolving roles like 'bindep', 'revoke-sudo', etc. from
opendev.org/zuul/zuul-jobs failed with:
  ERROR! the role 'bindep' was not found

Mirror the legacy eco tenant (otcci cluster) by adding zuul-jobs,
ansible-collections-openstack and refstack-client as untrusted-projects
under the existing 'opendev' connection.

Shadow zuul-jobs only with opentelekomcloud-infra/zuul-infra so the local
zuul-infra in-repo jobs win over upstream definitions where applicable.
otc-zuul-jobs is intentionally NOT shadowed and not included as a project:
the new prod cluster does not use otc-zuul-jobs; everything must live in
zuul-infra or system-config.

Also remove opentelekomcloud-infra/otc-zuul-jobs from the github
untrusted-projects list for the same reason.

Fixes the bindep role resolution failure in publish-otc-docs-hc periodic
job on eco2.
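
The shadowing described above could look roughly like this in the tenant file. It is a sketch, not the actual eco2 config: only `zuul/zuul-jobs` is shown because its full name appears in the error context, and the other two opendev projects are omitted since their org prefixes are not stated here.

```yaml
# Sketch only: a fragment of a tenant definition, not the real eco2 file.
- tenant:
    name: eco2
    source:
      opendev:
        untrusted-projects:
          - zuul/zuul-jobs:
              # Where definitions conflict, those from the named project
              # (loaded earlier) are used instead of the ones in zuul-jobs.
              shadow: opentelekomcloud-infra/zuul-infra
          # ansible-collections-openstack and refstack-client would be
          # listed here as well (org prefixes not given in this PR).
```
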
26 untrusted-projects in the eco2 tenant define their own build/lint
jobs in their own .zuul.yaml (e.g. <repo>-build-image, tflint, simple,
otc-api-ref-src). The tenant previously restricted include to only
'project', which caused those self-referential job definitions to be
ignored, producing 'Job X not defined' config errors for each repo.

Add 'job' to the include list for the affected repos so Zuul loads
their job definitions alongside their project pipeline.

Affected repos:
- opentelekomcloud-infra/{actions-runner,alerta,cloudmon-plugin-smtp,
  csm-test-utils,gitstyring,identity-deploy,octavia-proxy,otc-api-ref,
  rancher-integration-tests,vault-raft-backup,zuul-config}
- opentelekomcloud/{ansible-collection-gitcontrol,oms,python-otcextensions,
  swift-proxy,terraform-opentelekomcloud-modules,
  terraform-opentelekomcloud-project-factory,
  terraform-provider-opentelekomcloud,terraform-provider-rancher2}
- stackmon/{apimon,cloudmon,cloudmon-plugin-loadbalancer,globalmon,
  metrics-processor,otc-status-dashboard,status-dashboard}
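
For one of the affected repos, the change amounts to something like the fragment below. The choice of `octavia-proxy` as the example and the `github` source key are illustrative assumptions; `include` with `project` and `job` is what the commit describes.

```yaml
# Sketch only: one illustrative entry; the real tenant file lists all 26 repos.
- tenant:
    name: eco2
    source:
      github:                        # assumed connection name for this repo
        untrusted-projects:
          - opentelekomcloud-infra/octavia-proxy:
              include:
                - project            # pipeline attachments (already loaded before)
                - job                # in-repo job definitions (newly loaded)
```
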
Allow infra-prod-* job definitions in opentelekomcloud-infra/system-config
to be loaded by the eco2 tenant. Without this, the octavia-proxy project
references infra-prod-propose-octavia-proxy-image-stable but the job
definition (in system-config/zuul.d/infra-prod.yaml) is silently dropped.

The 'infra-prod-propose-update' base job (and its descendants like
infra-prod-propose-octavia-proxy-image-stable) reference a Zuul secret
named 'github_propose_token' that was never defined anywhere. This
caused a 'job ... uses secret github_propose_token which is not defined'
config error.

* Define github_propose_token in zuul.d/secrets.yaml (user/email/token,
  encrypted with the eco2 tenant pubkey for opentelekomcloud-infra/
  system-config). Token comes from vault secret/github/otcbot.
* Update the eco2 tenant include for opentelekomcloud-infra/system-config
  to add 'secret' alongside 'project' and 'job' so the secret is loaded.
…cret for gitcontrol

The eco2 tenant was loading opentelekomcloud-infra/system-config jobs
and secrets from the default branch (main), which still contains the
old bridge-host based propose-update parent and lacks the
github_propose_token secret. Set load-branch to
Migration_of_zuul_to_prod so the new k8s-only stack is used.

Also include 'secret' for opentelekomcloud/ansible-collection-gitcontrol
so its in-repo gitcontrol-secret can resolve.

load-branch on an untrusted project does not RESTRICT loading to that
branch — Zuul still loads main as well. Combined with system-config
defining the same image-build/infra-prod jobs that zuul-infra (config-
project) also defines, this triggered ~40 'not permitted to shadow'
errors. Reverting until we either (a) merge the migration branch into
main, or (b) introduce include-branches with an explicit allow-list.

All system-config jobs (docker-image, infra-prod-*) have been moved
to zuul-infra (config-project) where they belong. Loading 'job' from
system-config@main now causes ~40 'not permitted to shadow' errors
because the same job names also exist in zuul-infra. Drop 'job' and
'secret' from the include list -- system-config is now a pure
project-stanza source for the eco2 tenant.

These jobs (infra-prod-bootstrap-bridge, infra-prod-service-bridge,
infra-prod-gitea-sync) are not needed in this setup — the infra is
managed via k8s/ArgoCD, not via an Ansible bridge host.

- Remove all three from the deploy pipeline
- Remove bootstrap-bridge and service-bridge from the periodic pipeline
- Remove the entire periodic-hourly section (only contained these jobs)
- Fix infra-prod-grafana-datasources: was depending on bootstrap-bridge
  (bridge-specific), now depends on infra-prod-base like all others

The python-tox image does not include LaTeX/wget and other packages
required for PDF doc builds. All CI roles (prepare-build-pdf-docs,
bindep package install) are designed for root-accessible nodes. Running
as uid 1000 with no sudo prevents any system package installation.

Change runAsUser/runAsGroup to 0 so the pod runs as root, matching the
operational model assumed by the playbooks.

Set unauthenticated_metrics_access=false in vault listener telemetry config.
Prometheus uses a bearer token (secret/prometheus/infra#vault_metrics_token)
for authenticated scraping, so monitoring will continue to work.
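
An authenticated scrape of Vault's metrics endpoint could be configured along these lines. This is a sketch of a plain Prometheus scrape job, not the monitoring stack's actual config: the job name, target address, and file paths are hypothetical, and the token file is assumed to hold the value from secret/prometheus/infra#vault_metrics_token.

```yaml
# Sketch only: job name, target, and file paths are assumptions.
scrape_configs:
  - job_name: vault
    scheme: https
    metrics_path: /v1/sys/metrics
    params:
      format: ["prometheus"]         # ask Vault for Prometheus exposition format
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/vault-metrics-token  # hypothetical path
    tls_config:
      ca_file: /etc/prometheus/secrets/vault-ca.crt                  # hypothetical path
    static_configs:
      - targets: ["vault.vault.svc:8200"]                            # hypothetical target
```
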
LukasCuperDT force-pushed the Migration_of_zuul_to_prod branch from cec68b4 to f2db155 (May 22, 2026 08:29)

git.tsi-dev.otc-service.com returns an SSO/privacyIDEA login page instead
of API responses, causing Zuul full-reconfigure to hang indefinitely when
fetching branches for ecosystem/zuul-project-config.

Comment out the gitlab connection in zuul.conf and the gitlab section in
the eco2 tenant config until the GitLab API token/auth is restored.

The repo has been transferred to stackmon/ansible-collection-apimon which
is already in the tenant config. The stale opentelekomcloud/ entry causes
a 403 on every startup (no GitHub App installation), falling back to
unauthenticated requests that immediately exhaust the 60/hr rate limit.

Repo exists on Gitea but was missing from the tenant config.
Also installed webhook (id=138) on the repo.

- Add helm chart under kubernetes/helm_charts/local/gitea-runner/
- Deploys act_runner connected to gitea-k8s.eco.tsi-dev.otc-service.com
- Runner name: ubuntu-gitea-runner-01 (labels: ubuntu, ubuntu-22.04)
- Uses dind sidecar for Docker-in-Docker builds, runner image: debian:bookworm
- Registration token sourced from Vault: gitea-k8s/gitea-runner
- Add ArgoCD app entry in prod values for otcinfra2 cluster
  targetRevision: Migration_of_zuul_to_prod
- Use instance-level registration token (Gitea site admin > Runners)
  instead of org-level, so runner serves ALL repos on gitea-k8s
- Rename runner to gitea-k8s-runner-01 (not system-config specific)
- Add more runner labels: debian, debian-bookworm in addition to ubuntu
…s/checkout)

debian:bookworm lacks node binary needed by JS actions like actions/checkout@v4

node:20-bookworm causes actions/setup-python@v5 to fail because it
cannot find Python 3.12 for the OS (no match in python-versions manifest).
catthehacker/ubuntu:act-22.04 is designed for act-based runners and
supports actions/setup-python correctly.

Building Go providers with goreleaser (14 cross-compilation targets)
OOM-kills the container. Increase memory limit from 2Gi to 4Gi to give
goreleaser sufficient headroom. Also aligns better with the eco1 VMs
(s2.large.2 = 8Gi RAM) used for the same job type.

Prod requested cpu:500m, preprod requested cpu:1 (==limit, no bursting).
On the 5-node cluster, CPU requests were oversubscribed: real usage was
only 11-36%, but the scheduler reported 'Insufficient cpu' and the
cluster-autoscaler refused to scale ('pod didn't trigger scale-up').

This blocked periodic publish-otc-docs-hc jobs from getting nodes for
30+ hours, exhausting the docs-build semaphore (max=10).

Reduce CPU requests to 200m in both overlays so more concurrent build
pods can schedule onto existing nodes. Memory requests and all limits
unchanged so heavy jobs (goreleaser, tox-functional, terraform) can still
burst up to their original ceiling without OOM risk. CPU is compressible
so over-packing only causes graceful throttling, never kills.
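
Expressed as a container `resources` stanza, the intent is roughly the following. Only the 200m CPU request is stated by this commit; the other values are inferred from earlier commits in this PR and may differ between overlays.

```yaml
# Sketch only: values other than the CPU request are inferred, not confirmed.
resources:
  requests:
    cpu: 200m        # lowered so more build pods fit on existing nodes
    memory: 512Mi    # assumed unchanged from the earlier nodepool commit
  limits:
    cpu: "2"         # assumed burst ceiling, left unchanged
    memory: 4Gi      # assumed limit after the goreleaser OOM fix
```

Because the scheduler places pods by requests rather than limits, lowering only the request increases packing density without changing how far a single job may burst.
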
The python-tox image used for zuul-debian has no container tooling
(no podman/buildah/docker/skopeo) and no systemd, so image-build jobs
that ran ensure-docker on it produced RETRY_LIMIT.

Introduce a dedicated builder pod label backed by quay.io/buildah/stable
which ships buildah + podman + skopeo + fuse-overlayfs. Runs privileged
so buildah can do rootful builds; uses STORAGE_DRIVER=vfs to avoid
fuse-overlayfs requirements in the containers-storage emptyDir volume.

zuul-infra image-build/upload/promote jobs are switched to this new
nodeset in a companion change.
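
As a plain Pod, the builder described above would look roughly like the sketch below. The actual change defines a nodepool label rather than a standalone Pod, so the pod and volume names, the `sleep` command, and the mount path are assumptions; the image, privileged mode, `STORAGE_DRIVER=vfs`, and the emptyDir volume come from the commit text.

```yaml
# Sketch only: a standalone Pod equivalent, not the nodepool label itself.
apiVersion: v1
kind: Pod
metadata:
  name: builder-example              # hypothetical name
spec:
  containers:
    - name: builder
      image: quay.io/buildah/stable
      command: ["sleep", "infinity"] # assumed keep-alive so jobs can exec into the pod
      securityContext:
        privileged: true             # rootful buildah builds
      env:
        - name: STORAGE_DRIVER
          value: vfs                 # avoids fuse-overlayfs on the emptyDir
      volumeMounts:
        - name: containers-storage
          mountPath: /var/lib/containers   # assumed rootful storage location
  volumes:
    - name: containers-storage
      emptyDir: {}
```
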
The zuul.d/infra-prod.yaml, project.yaml and secrets.yaml files were
modified for the new k8s-only prod setup, but:
1. Secrets (otc_clouds_yaml, eco2_vault_approle) are defined in
   zuul-infra repo and cannot be used by jobs in system-config.
2. Changing these files in this PR modifies the old prod configuration
   which must remain unchanged during the migration.

New prod job definitions belong in zuul-infra/zuul.d/, not here.

Nightly docs build batch was triggering OOM on the zuul-debian pod when
apt installed ~1.4GB of packages (due to recommends being enabled). The
4Gi limit was insufficient during peak concurrent job load.

Increasing to 6Gi provides headroom for the PDF prereq install even
after the install_recommends: false fix is deployed, as a safety net.