feat: Migrate Zuul to new production cluster (otcinfra2)#1605
Open
LukasCuperDT wants to merge 82 commits into
- Add zuul, swift-proxy, oauth2-proxy ArgoCD apps to argocd-k8s
- Add new_prod overlay for zuul kustomize
- Add new_prod overlay for oauth2-proxy kustomize
- Add swift-proxy helm chart with values-new-prod.yaml
- Add postgresql helm charts (upstream + local) with csi-disk-retain
- Fix AVP references: align with preprod secret structure
- Fix clouds.yaml: use cloud_users/448_nodepool for creds
- Set private: true for internal OTC endpoint access
…ename new_prod to prod
- Remove clusterIP: None from zuul-web service (breaks ingress routing)
- Remove nodeSelector dedicated=zuul-ci from all 8 base components
- Add preprod overlay from monitoring branch (kustomize-validated)
- Rename new_prod overlay to prod
- Rename swift-proxy values-new-prod.yaml to values-prod.yaml
- Add JSON 6902 patch: clusterIP: None on zuul-web service
- Add JSON 6902 patch: nodeSelector dedicated=zuul-ci on all zuul pods
- Matches live otcci cluster state exactly (zero diff with main)
Rename all new-prod/new_prod references to prod across the production overlay: resource names, labels, cert names, swift container, log paths, k8s context, and tenant config path.
- Fix oauth2-proxy ArgoCD path (new_prod → prod), rename app to oauth2-proxy-zuul
- Set SFS Turbo IP 192.168.170.45 for zuul PVs
- Fix StorageClass to csi-sfsturbo, add Retain reclaim policy
- Change namespace from zuul to zuul-ci (matches Vault roles)
- Add swift-proxy zuul-prod-logs tempauth user
- Add zuul executor swift auth credentials in site-vars.yaml
- Fix ZK cert SANs, zuul.conf, nodepool.yaml: zuul.svc → zuul-ci.svc
- Change zuul-config PV path from / to /zuul-config (avoid SFS Turbo root conflict)
- Add nodeSelector os.name=HCE_OS_2.0_x86_64 to all workloads (CentOS kernel 3.10 breaks bwrap)
WaitForFirstConsumer binding mode ensures EVS disks are provisioned in the same AZ (eu-de-03) as the HCE worker nodes.
…d nodepool-launcher volumes patch
- Include opentelekomcloud/ansible-collection-cloud as untrusted project
- Only include project config (jobs/templates come from zuul-infra)
- Load from main branch
- New connection github-otc (app_id 3404137) for opentelekomcloud org
- Add github-otc.key to zuul-ssh-keys secret
- Move ansible-collection-cloud under github-otc source in tenant config
- Remove github-otc connection, use single github connection for both orgs
- Move ansible-collection-cloud under github source in tenant config
- Remove github-otc.key from zuul-ssh-keys secret
Separate prod (github_k8s) from preprod (github) AVP references so both environments can coexist with different GitHub App configs.
Override implied branch matcher so PRs targeting main are also matched by the prod project config.
Update project.yaml branches matcher and tenant config include-branches to use Zuul_test instead of Migration_of_zuul_to_prod for testing.
Cluster CPU requests were at 130% of capacity (25.5/19.6 cores). Adjusted
requests to match actual usage with headroom:
- vault: server 500m→100m, injector/CSI 250m→50m
- zitadel: 250m→50m, 512Mi→256Mi
- dependencytrack: API 1000m→100m (mem 4G→5Gi for actual 4.6Gi usage)
- dependencytrack frontend: 1000m→100m
- mcaptcha: 500m→200m, redis 250m→50m
- postgresql: primary 250m→50m, metrics 100m→25m
- element: synapse/SFU/haproxy 100m→25m, redis 50m→10m
- cpn frontend: 200m→50m
Total CPU request savings: ~4785m
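The 130% figure can be checked with a small sketch that converts Kubernetes CPU quantities to millicores and divides requests by capacity. The helper names are hypothetical, and the sketch does not try to reproduce the ~4785m total, since the commit message does not state replica counts.

```python
from decimal import Decimal


def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "500m", "1" or "25.5" to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        return int(q[:-1])
    # Whole or fractional cores: use Decimal to avoid float rounding surprises.
    return int(Decimal(q) * 1000)


def request_ratio(requested: str, allocatable: str) -> float:
    """Ratio of requested CPU to allocatable CPU (1.0 means fully booked)."""
    return cpu_to_millicores(requested) / cpu_to_millicores(allocatable)


# Figures taken from the commit message: 25.5 cores requested, 19.6 available.
ratio = request_ratio("25.5", "19.6")
print(f"{ratio:.0%}")  # 130%

# Per-replica saving for one adjusted component (vault server 500m -> 100m).
print(cpu_to_millicores("500m") - cpu_to_millicores("100m"))  # 400
```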
Adds a lightweight sidecar to the vault StatefulSet that:
- Checks seal status every 10s via localhost:8200
- If sealed, reads 3 unseal keys from vault-unseal-keys secret
- Runs vault operator unseal for each key
The RBAC for accessing vault-unseal-keys was already set up in the local
vault chart (unsealer-rbac.yaml).
vault status returns exit code 2 when sealed, causing the sidecar to think vault is unreachable. Switch to wget-based API calls which correctly detect sealed state and unseal via /v1/sys/unseal.
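The decision logic behind the API-based check can be sketched as follows. This is only an illustration, not the sidecar's actual script: it assumes the JSON body returned by Vault's /v1/sys/seal-status endpoint (which carries a boolean "sealed" field) and the {"key": ...} request body accepted by /v1/sys/unseal. The HTTP transport (wget in the sidecar) is left out, and the function names and key values are hypothetical.

```python
import json


def is_sealed(seal_status_body: str) -> bool:
    """Read the 'sealed' flag from a /v1/sys/seal-status response body."""
    return bool(json.loads(seal_status_body)["sealed"])


def unseal_payload(key: str) -> str:
    """Build the JSON request body for one /v1/sys/unseal call."""
    return json.dumps({"key": key})


def plan_unseal(seal_status_body: str, keys: list[str]) -> list[str]:
    """Return the unseal request bodies to send, or an empty list if already unsealed."""
    if not is_sealed(seal_status_body):
        return []
    return [unseal_payload(k) for k in keys]


# Placeholder keys; in the commit they come from the vault-unseal-keys secret.
sealed_body = '{"sealed": true, "t": 3, "n": 5, "progress": 0}'
print(len(plan_unseal(sealed_body, ["key-1", "key-2", "key-3"])))  # 3
print(plan_unseal('{"sealed": false}', ["key-1"]))  # []
```

Reading the flag from the response body avoids treating a sealed Vault as unreachable, which is the failure mode the commit describes for the exit-code-based check.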
- Add max-hold-age: 7200 (2h) to zuul-debian label so stale pods/namespaces
  auto-expire instead of lingering indefinitely
- Reduce pod requests from 1CPU/1Gi to 500m/512Mi (CI jobs rarely need a full core)
- Increase limits to 2CPU/2Gi to allow burst during builds
GitHub check runs will now show as eco/check instead of production/check. Pipeline regex patterns already use .*/check so no other changes needed.
GitHub App not installed on terraform-opentelekomcloud-modules org; the resulting 403 on branch-protection GraphQL trips the GithubRateLimitHandler bogus rate-limit (3389s sleep) and wedges tenant load. Drop until App is installed on that org.
The eco2 production tenant was missing the opendev source, so jobs resolving
roles like 'bindep', 'revoke-sudo', etc. from opendev.org/zuul/zuul-jobs
failed with:
ERROR! the role 'bindep' was not found
Mirror the legacy eco tenant (otcci cluster) by adding zuul-jobs,
ansible-collections-openstack and refstack-client as untrusted-projects under
the existing 'opendev' connection. Shadow zuul-jobs only with
opentelekomcloud-infra/zuul-infra so the local zuul-infra in-repo jobs win
over upstream definitions where applicable.
otc-zuul-jobs is intentionally NOT shadowed and not included as a project -
new prod cluster does not use otc-zuul-jobs; everything must live in
zuul-infra or system-config. Also remove opentelekomcloud-infra/otc-zuul-jobs
from the github untrusted-projects list for the same reason.
Fixes the bindep role resolution failure in publish-otc-docs-hc periodic job
on eco2.
26 untrusted-projects in the eco2 tenant define their own build/lint
jobs in their own .zuul.yaml (e.g. <repo>-build-image, tflint, simple,
otc-api-ref-src). The tenant previously restricted include to only
'project', which caused those self-referential job definitions to be
ignored, producing 'Job X not defined' config errors for each repo.
Add 'job' to the include list for the affected repos so Zuul loads
their job definitions alongside their project pipeline.
Affected repos:
- opentelekomcloud-infra/{actions-runner,alerta,cloudmon-plugin-smtp,
csm-test-utils,gitstyring,identity-deploy,octavia-proxy,otc-api-ref,
rancher-integration-tests,vault-raft-backup,zuul-config}
- opentelekomcloud/{ansible-collection-gitcontrol,oms,python-otcextensions,
swift-proxy,terraform-opentelekomcloud-modules,
terraform-opentelekomcloud-project-factory,
terraform-provider-opentelekomcloud,terraform-provider-rancher2}
- stackmon/{apimon,cloudmon,cloudmon-plugin-loadbalancer,globalmon,
metrics-processor,otc-status-dashboard,status-dashboard}
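The change described above amounts to extending the include list of each affected untrusted-project entry. A minimal sketch of that transformation on an in-memory copy of the tenant's untrusted-projects list follows; the data layout mirrors Zuul's tenant YAML as Python dicts, and the function name and sample entries are assumptions for illustration only.

```python
def add_include(projects: list[dict], affected: set[str], item: str = "job") -> list[dict]:
    """Return a copy of an untrusted-projects list with `item` added to the
    include list of every affected repo; other entries are copied unchanged."""
    updated = []
    for entry in projects:
        # Each entry is a single-key mapping: {repo_name: options}.
        (name, opts), = entry.items()
        if name in affected:
            include = list(opts.get("include", []))
            if item not in include:
                include.append(item)
            opts = {**opts, "include": include}
        updated.append({name: dict(opts)})
    return updated


# Illustrative subset; the real list has 26 affected repos.
projects = [
    {"opentelekomcloud/swift-proxy": {"include": ["project"]}},
    {"stackmon/apimon": {"include": ["project"]}},
    {"opentelekomcloud-infra/system-config": {"include": ["project"]}},
]
affected = {"opentelekomcloud/swift-proxy", "stackmon/apimon"}

for entry in add_include(projects, affected):
    print(entry)
```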
Allow infra-prod-* job definitions in opentelekomcloud-infra/system-config to be loaded by the eco2 tenant. Without this, the octavia-proxy project references infra-prod-propose-octavia-proxy-image-stable but the job definition (in system-config/zuul.d/infra-prod.yaml) is silently dropped.
The 'infra-prod-propose-update' base job (and its descendants like
infra-prod-propose-octavia-proxy-image-stable) reference a Zuul secret named
'github_propose_token' that was never defined anywhere. This caused a
'job ... uses secret github_propose_token which is not defined' config error.
* Define github_propose_token in zuul.d/secrets.yaml (user/email/token,
  encrypted with the eco2 tenant pubkey for
  opentelekomcloud-infra/system-config). Token comes from vault
  secret/github/otcbot.
* Update the eco2 tenant include for opentelekomcloud-infra/system-config to
  add 'secret' alongside 'project' and 'job' so the secret is loaded.
…cret for gitcontrol
The eco2 tenant was loading opentelekomcloud-infra/system-config jobs and
secrets from the default branch (main), which still contains the old
bridge-host based propose-update parent and lacks the github_propose_token
secret. Set load-branch to Migration_of_zuul_to_prod so the new k8s-only
stack is used.
Also include 'secret' for opentelekomcloud/ansible-collection-gitcontrol so
its in-repo gitcontrol-secret can resolve.
load-branch on an untrusted project does not RESTRICT loading to that branch;
Zuul still loads main as well. Combined with system-config defining the same
image-build/infra-prod jobs that zuul-infra (config-project) also defines,
this triggered ~40 'not permitted to shadow' errors. Reverting until we
either (a) merge the migration branch into main, or (b) introduce
include-branches with explicit allow-list.
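As a rough model of where those errors come from, the conflicting names are the job names defined both in the config-project and in the untrusted project. The sketch below only computes that overlap; it is not how Zuul itself validates configuration, and the job lists are illustrative rather than the real ~40 entries.

```python
def shadow_conflicts(config_project_jobs: list[str], untrusted_project_jobs: list[str]) -> list[str]:
    """Job names defined in both a config-project and an untrusted project."""
    return sorted(set(config_project_jobs) & set(untrusted_project_jobs))


# Illustrative job names only.
zuul_infra_jobs = ["infra-prod-base", "infra-prod-propose-update", "docker-image"]
system_config_main_jobs = ["infra-prod-base", "infra-prod-propose-update"]

print(shadow_conflicts(zuul_infra_jobs, system_config_main_jobs))
# ['infra-prod-base', 'infra-prod-propose-update']
```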
All system-config jobs (docker-image, infra-prod-*) have been moved to zuul-infra (config-project) where they belong. Loading 'job' from system-config@main now causes ~40 'not permitted to shadow' errors because the same job names also exist in zuul-infra. Drop 'job' and 'secret' from the include list -- system-config is now a pure project-stanza source for the eco2 tenant.
These jobs (infra-prod-bootstrap-bridge, infra-prod-service-bridge,
infra-prod-gitea-sync) are not needed in this setup: the infra is managed via
k8s/ArgoCD, not via an Ansible bridge host.
- Remove all three from the deploy pipeline
- Remove bootstrap-bridge and service-bridge from the periodic pipeline
- Remove the entire periodic-hourly section (only contained these jobs)
- Fix infra-prod-grafana-datasources: was depending on bootstrap-bridge
  (bridge-specific), now depends on infra-prod-base like all others
The python-tox image does not include LaTeX/wget and other packages required for PDF doc builds. All CI roles (prepare-build-pdf-docs, bindep package install) are designed for root-accessible nodes. Running as uid 1000 with no sudo prevents any system package installation. Change runAsUser/runAsGroup to 0 so the pod runs as root, matching the operational model assumed by the playbooks.
Set unauthenticated_metrics_access=false in vault listener telemetry config. Prometheus uses a bearer token (secret/prometheus/infra#vault_metrics_token) for authenticated scraping, so monitoring will continue to work.
Force-pushed: cec68b4 to f2db155
git.tsi-dev.otc-service.com returns an SSO/privacyIDEA login page instead of API responses, causing Zuul full-reconfigure to hang indefinitely when fetching branches for ecosystem/zuul-project-config. Comment out the gitlab connection in zuul.conf and the gitlab section in the eco2 tenant config until the GitLab API token/auth is restored.
The repo has been transferred to stackmon/ansible-collection-apimon which is already in the tenant config. The stale opentelekomcloud/ entry causes a 403 on every startup (no GitHub App installation), falling back to unauthenticated requests that immediately exhaust the 60/hr rate limit.
Repo exists on Gitea but was missing from the tenant config. Also installed webhook (id=138) on the repo.
- Add helm chart under kubernetes/helm_charts/local/gitea-runner/
- Deploys act_runner connected to gitea-k8s.eco.tsi-dev.otc-service.com
- Runner name: ubuntu-gitea-runner-01 (labels: ubuntu, ubuntu-22.04)
- Uses dind sidecar for Docker-in-Docker builds, runner image: debian:bookworm
- Registration token sourced from Vault: gitea-k8s/gitea-runner
- Add ArgoCD app entry in prod values for otcinfra2 cluster
  targetRevision: Migration_of_zuul_to_prod
- Use instance-level registration token (Gitea site admin > Runners) instead
  of org-level, so runner serves ALL repos on gitea-k8s
- Rename runner to gitea-k8s-runner-01 (not system-config specific)
- Add more runner labels: debian, debian-bookworm in addition to ubuntu
…s/checkout)
debian:bookworm lacks node binary needed by JS actions like actions/checkout@v4
node:20-bookworm causes actions/setup-python@v5 to fail because it cannot find Python 3.12 for the OS (no match in python-versions manifest). catthehacker/ubuntu:act-22.04 is designed for act-based runners and supports actions/setup-python correctly.
Building Go providers with goreleaser (14 cross-compilation targets) OOM-kills the container. Increase memory limit from 2Gi to 4Gi to give goreleaser sufficient headroom. Also aligns better with the eco1 VMs (s2.large.2 = 8Gi RAM) used for the same job type.
Prod requested cpu:500m, preprod requested cpu:1 (==limit, no bursting).
On the 5-node cluster, CPU requests were oversubscribed: real usage was
only 11-36%, but the scheduler reported 'Insufficient cpu' and the
cluster-autoscaler refused to scale ('pod didn't trigger scale-up').
This blocked periodic publish-otc-docs-hc jobs from getting nodes for
30+ hours, exhausting the docs-build semaphore (max=10).
Reduce CPU requests to 200m in both overlays so more concurrent build
pods can schedule onto existing nodes. Memory requests and all limits
unchanged so heavy jobs (goreleaser, tox-functional, terraform) can still
burst up to their original ceiling without OOM risk. CPU is compressible
so over-packing only causes graceful throttling, never kills.
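The scheduling effect of the lower request can be illustrated with simple integer arithmetic: the scheduler places pods by requests, so the number of build pods that fit in a node's unreserved CPU is the floor of free CPU divided by the per-pod request. The 1000m of free CPU below is a hypothetical figure, not a measurement from the cluster.

```python
def schedulable_pods(free_millicores: int, request_millicores: int) -> int:
    """How many pods with the given CPU request fit into the unreserved CPU of a node."""
    if request_millicores <= 0:
        raise ValueError("request must be positive")
    return free_millicores // request_millicores


free = 1000  # hypothetical unreserved CPU on one node, in millicores

print(schedulable_pods(free, 1000))  # preprod before (cpu: 1): 1
print(schedulable_pods(free, 500))   # prod before (500m): 2
print(schedulable_pods(free, 200))   # both overlays after (200m): 5
```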
The python-tox image used for zuul-debian has no container tooling (no podman/buildah/docker/skopeo) and no systemd, so image-build jobs that ran ensure-docker on it produced RETRY_LIMIT. Introduce a dedicated builder pod label backed by quay.io/buildah/stable which ships buildah + podman + skopeo + fuse-overlayfs. Runs privileged so buildah can do rootful builds; uses STORAGE_DRIVER=vfs to avoid fuse-overlayfs requirements in the containers-storage emptyDir volume. zuul-infra image-build/upload/promote jobs are switched to this new nodeset in a companion change.
The zuul.d/infra-prod.yaml, project.yaml and secrets.yaml files were modified
for the new k8s-only prod setup, but:
1. Secrets (otc_clouds_yaml, eco2_vault_approle) are defined in zuul-infra
   repo and cannot be used by jobs in system-config.
2. Changing these files in this PR modifies the old prod configuration which
   must remain unchanged during the migration.
New prod job definitions belong in zuul-infra/zuul.d/, not here.
Nightly docs build batch was triggering OOM on the zuul-debian pod when apt installed ~1.4GB of packages (due to recommends being enabled). The 4Gi limit was insufficient during peak concurrent job load. Increasing to 6Gi provides headroom for the PDF prereq install even after the install_recommends: false fix is deployed, as a safety net.
Summary
Migrate Zuul CI/CD to the new production cluster otcinfra2, managed by ArgoCD
(zuul-ci Application, namespace zuul-ci). Vault is vault-k8s (HA, KV v2 at
secret/), accessed by ArgoCD Vault Plugin with <path:secret/data/...#field>
placeholders.

This branch supersedes #1583: same head/base, refreshed description that
matches the actual current state after several rounds of fixes.
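To make the placeholder format concrete, here is a small sketch that extracts the Vault path and field from strings of the path-and-field form mentioned in the summary. It handles only that basic form, the regular expression is an assumption about it rather than the plugin's own parser, and the field name "password" is hypothetical.

```python
import re

# Matches "<path:SOME/PATH#FIELD>" and captures the two parts.
PLACEHOLDER = re.compile(r"<path:(?P<path>[^#>]+)#(?P<field>[^#>]+)>")


def parse_placeholders(text: str) -> list[tuple[str, str]]:
    """Return (vault_path, field) pairs for every placeholder found in text."""
    return [(m["path"], m["field"]) for m in PLACEHOLDER.finditer(text)]


# Hypothetical manifest line; zuul/database is one of the paths listed in this PR.
line = "password: <path:secret/data/zuul/database#password>"
print(parse_placeholders(line))
# [('secret/data/zuul/database', 'password')]
```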
Zuul prod overlay (kubernetes/kustomize/zuul/overlays/prod/)
- [...] otcinfra2.
- zuul.conf connections (validated against the running scheduler):
  - github → app_id 3404137, key /var/lib/zuul-ssh/github.key
  - gitlab → git.tsi-dev.otc-service.com
  - gitea → gitea-k8s.eco.tsi-dev.otc-service.com (the new k8s gitea, port 2222)
- [...] gitea.eco.tsi-dev.otc-service.com (old VM behind HAProxy) to
  gitea-k8s.eco.tsi-dev.otc-service.com. Removed the 192.168.170.200 HAProxy
  hostAliases workaround; zuul now reaches gitea via public DNS / ingress
  (80.158.58.167) with valid TLS, since it lives in the same cluster.
- [...] 13.1.1-gitea-v6.
- clouds.yaml: cloud_users/448_nodepool + clouds/otcci_nodepool_pool1,
  private: true endpoint (same VPC).
- [...] zuul-web and the webhook endpoint (/api/connection, /api/webhook on
  zuul.eco.tsi-dev.otc-service.com).
- [...] csi-disk-topology; scheduler keys use csi-disk-retain.

Zuul preprod overlay
Mirrors the prod layout (executor/merger/scheduler/web patches, ingress,
PV/PVC) so both environments share structure.
Vault integration
- [...] zuul-ci namespace for the vault-agent sidecar.
- vault-agent runs as zuul UID 10001 (token-file permissions).
- VAULT_ADDR switched to the internal service; VAULT_API_ADDR set to the
  external ingress URL.
- [...] zuul-ci → vault on 8200.
- [...] wget against the API).
- [...] quay-pull-secret) added on the openstack-plugin init container.
Vault paths consumed (all live)
- zuul/database
- zuul/connections/{gitea,github_k8s,gitlab}
- zuul/auth/{zitadel,gitea-actions}
- zuul/oauth2-proxy
- zuul/merger (includes known_hosts, which now contains both old and new gitea
  RSA keys so the merger can clone over SSH)
- zuul/keystore_password, zuul/sshkey
- zuul/logs/otc_technical_user

Tenant eco2
Tenant config lives in opentelekomcloud-infra/zuul-infra (loaded from branch
Migration_of_zuul_to_prod). After this branch, include-branches restrictions
were removed so PRs against any branch are processed.
Misc
- [...] max-hold-age set, pod resources reduced.
- [...] publish-otc-docs-pti template from zuul.d/project.yaml.

Status
- [...] zuul-config PVC to csi-sfs dynamic provisioning (RWX)

Related
- opentelekomcloud-infra/zuul-infra branch Migration_of_zuul_to_prod
- system-config Vault secret restructure on the same branch name