
Init of preprod branch #1458

Draft
LukasCuperDT wants to merge 435 commits into main from preprod


@LukasCuperDT

No description provided.

@gitguardian

gitguardian Bot commented Dec 1, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 26230584 | Triggered | Generic Private Key | 4117abe | services/gitea-api-adapter/certs/key.pem |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following secret-management best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

LukasCuperDT force-pushed the preprod branch 19 times, most recently from d87a2d9 to b2d6f30, on December 4, 2025 14:48
LukasCuperDT force-pushed the preprod branch 3 times, most recently from 0cdd762 to 1471590, on December 15, 2025 10:48
LukasCuperDT force-pushed the preprod branch 7 times, most recently from fd12694 to 1038c4d, on January 13, 2026 10:57
LukasCuperDT and others added 30 commits February 12, 2026 09:15
Match the actual cluster configuration to avoid StatefulSet immutable field error.
Change from 'default-*' to 'default-.*' to use proper regex matching.
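
The difference between the two patterns is easy to miss. Read as a regular expression, `default-*` means `default` followed by zero or more hyphens, while `default-.*` means `default-` followed by any characters. The following is a minimal sketch using Python's `re` module; the regex engine used by the actual component may differ, and `default-pool` is a made-up name used only for illustration.

```python
import re


def matches(pattern: str, name: str) -> bool:
    """Return True when the whole name matches the pattern."""
    return re.fullmatch(pattern, name) is not None


# 'default-pool' is a hypothetical name, not taken from the repository.
glob_style = matches(r"default-*", "default-pool")    # '-*' = zero or more hyphens
regex_style = matches(r"default-.*", "default-pool")  # '.*' = any trailing characters
print(glob_style, regex_style)
```

The glob-looking pattern only matches names such as `default` or `default--`, which is rarely what is intended.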
- Changes scheduler image tag from 13.1.1-gitea to 13.1.1-gitea-v3
- Includes fix for status reporting matching upstream implementation
- Context now initialized in __init__ as tenant/pipeline format
- Parameter order fixed: project, sha, state, url, description, context
- Description format includes sha: 'pipeline status: state (sha)'

This matches the working implementation from upstream commit 4cb0e16faca565637468e2e7f32c5d4432b26f18
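
As a rough illustration of the strings described above, the sketch below builds a context in `tenant/pipeline` form and a description in `pipeline status: state (sha)` form. It assumes each word in the quoted format is a placeholder. The function names and sample values are hypothetical and are not taken from the Zuul Gitea driver.

```python
def build_context(tenant: str, pipeline: str) -> str:
    """Build the status context in tenant/pipeline form."""
    return f"{tenant}/{pipeline}"


def build_description(pipeline: str, state: str, sha: str) -> str:
    """Build the description in the form 'pipeline status: state (sha)'."""
    return f"{pipeline} status: {state} ({sha})"


# Sample values are invented for illustration only.
context = build_context("example-tenant", "check")
description = build_description("check", "success", "abc1234")
print(context)
print(description)
```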
- zuul-scheduler: 13.1.1-gitea-v4 -> 13.1.1-gitea-v5
- zuul-web: 13.1.1-gitea-v4 -> 13.1.1-gitea-v5
- zuul-merger: 13.1.1-gitea-v4 -> 13.1.1-gitea-v5

Includes fix for comment event triggers (recheck command) by adding
PR fields (branch, ref, patchset, title) to comment events.
* adding new openapi version with oauth

* updating to 0.6.2 and adding extra vars

* Update zuul-merger-volumes-patch.yaml

fixing blank line

* Update zuul-merger-volumes-patch.yaml

* Update zuul-merger-volumes-patch.yaml

* feat: Run linter jobs on all branches

* feat: Add SSH config for all git services (GitHub, GitLab, Gitea)

Configure SSH authentication for all three git services in zuul-merger:
- GitHub (github.com)
- GitLab (gitlab.com)
- Gitea (gitea.eco.tsi-dev.otc-service.com:2222)

Each service now has its own Host entry with the appropriate identity
file from /var/lib/zuul-ssh/.

---------

Co-authored-by: l.cuper@t-systems.com <l.cuper@t-systems.com>
All Zuul component images rebuilt with fix for:
- GiteaSource.getRequireFilters() method signature (parse_context param)
- GiteaSource.getRejectFilters() method signature (parse_context param)

This resolves the error "GiteaSource.getRequireFilters() takes 2 positional
arguments but 3 were given", which prevented the gate pipeline from loading.
Check jobs should not trigger for PRs targeting main branch.
Gate pipeline handles main branch PRs after approval.
- zuul-scheduler: 13.1.1-gitea-v1
- zuul-web: 13.1.1-gitea-v1
- zuul-merger: 13.1.1-gitea-v1
- zuul-executor: 13.1.1-gitea-v1

Includes OpenTelemetry tracing, GiteaShaCache, and EventTuple.
Fix GiteaReporter NoneType error when reporting start status.
Fixes infinite loop caused by comment events triggering check pipeline.
GiteaEventFilter now properly checks action types and comment patterns.
- Add PV/PVC for scheduler keys on SFSTurbo share (100Mi)
- Use JSON 6902 patch to replace emptyDir with PVC for zuul-var-lib
- Keys in /var/lib/zuul/keys/ persist across pod restarts
- Required for pkcs1-oaep encrypted secrets to work after scheduler restart
* test: validate parent job resolution after zuul-jobs fix

* test: trigger linters job with shared NFS mount

* test: trigger zuul with main branch fix

* test: validate parent job resolution fix

* test: verify linters job with shared NFS mount

This commit tests:
- Zuul native git cloning in pods
- Shared NFS workspace at /var/lib/zuul/builds
- No file copying overhead
- Fixed dual PVC mount issue

* test: verify linters job execution

* test: trigger linters job with new config

* Force Zuul configuration reload

Remove cached zuul.d/prod references from preprod branch

* Trigger new GitHub status checks

* Test Gitea webhook via translator

* feat(vault): Add NetworkPolicy and security enhancements

- Add local Vault Helm chart for additional manifests
- Add NetworkPolicy restricting access to argocd-repo-server only
- Add Kubernetes auth configuration scripts for ArgoCD
- Add least-privilege policy (read-only on secret/data/preprod/*)
- Add optional auto-unseal CronJob
- Add vault and vault-additional-manifests ArgoCD applications

* refactor(vault): Simplify to NetworkPolicy only, keep KMS as optional

- Remove auto-unseal-cronjob.yaml (not needed with current Shamir setup)
- Remove vault-config-scripts.yaml (manual scripts not needed)
- Simplify local vault chart to NetworkPolicy only
- Add Prometheus scraping support to NetworkPolicy
- Keep OTC KMS auto-unseal as commented/disabled option in upstream values
- Document KMS migration requirements (seal migration needed)
- Revert ArgoCD apps to use existing vault helm charts

* feat(vault): Add auto-unseal CronJob for Shamir seal

- Add CronJob that checks Vault seal status every minute
- Automatically unseals using keys from K8s Secret
- Uses existing vault service account
- Configurable threshold and key count

Note: This is for preprod with Shamir seal. For production,
consider migrating to OTC KMS auto-unseal for better security.

* feat(vault): Replace CronJob with Bank-Vaults unsealer sidecar

- Remove auto-unseal-cronjob.yaml CronJob approach
- Add Bank-Vaults unsealer as sidecar container in upstream values
- The unsealer runs alongside Vault pods and auto-unseals using K8s secret
- Add unseal-keys-secret.yaml template (disabled by default)
- Update local values with documentation for secret format

Bank-Vaults unsealer is more reliable than CronJob:
- Runs continuously as sidecar (not every minute)
- Watches Vault seal status in real-time
- Battle-tested solution from Banzai Cloud

Required secret format:
  kubectl -n vault create secret generic vault-unseal-keys \
    --from-literal=vault-root=<token> \
    --from-literal=vault-unseal-0=<key1> \
    --from-literal=vault-unseal-1=<key2> \
    --from-literal=vault-unseal-2=<key3>

* fix(vault): update raft retry_join addresses for vault-preprod naming

* fix(vault): nest values under vault: for umbrella chart dependency

Values must be prefixed with vault: to be passed to the vault subchart.
This enables HA raft storage configuration to be applied correctly.

* fix(vault): use coredns label for DNS egress in NetworkPolicy

OTC CCE uses coredns with label k8s-app=coredns, not kube-dns.

* fix(vault): use dynamic version for openstack plugin download

* feat(vault): add RBAC for bank-vaults unsealer to access unseal keys secret

* fix(vault): use correct helper name vault-additional.labels in unsealer-rbac

* feat(vault): add external IP allowlist for Gitea in NetworkPolicy

- Add allowExternalIPs configuration to allow specific external IPs
- Whitelist Gitea runners IP (80.158.23.198) to access Vault
- Also allow ingress-nginx controller for internal routing
- Replace broad label-based allowedNamespaces (vault-access: true) with
  explicit per-namespace allowedConsumers list
- Merge separate allowedPods, allowPrometheus, and allowExternalIPs into
  single allowedConsumers with optional podSelector
- Remove allowExternalIPs (Gitea runners access via ingress-nginx instead)
- Only argocd namespace needs direct Vault access (vault-agent sidecars)
- All other apps use AVP in repo-server, no direct Vault connection needed
- Template now shared with Vault_migration (otcinfra2) branch
Replace full-repo linting with diff-aware run-linters.sh that uses
git diff HEAD~1 to detect files changed in the PR and runs each
linter (flake8, bashate, yamllint, etc.) only on affected files.
Skips linters entirely when no relevant files are changed.

The Zuul workspace is a fresh git clone (not Zuul merger output),
so HEAD~1 is the previous PR commit, not the target branch.
Use git merge-base with origin/<target_branch> to correctly
identify files changed in the PR.

Accept ZUUL_TARGET_BRANCH env var from the Zuul playbook,
with auto-detection fallback (preprod > main > master).

When no changed files are detected (e.g. due to workspace checkout
issues), fall back to full repository linting instead of exiting
with success.
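
The diff-aware selection described in these commits could look roughly like the sketch below. It is written in Python for illustration and is not the actual run-linters.sh. The extension-to-linter mapping, the function names, and the use of `.` to stand for "the whole repository" are assumptions.

```python
import subprocess
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical extension-to-linter mapping; the real script may differ.
LINTERS = {".py": "flake8", ".sh": "bashate", ".yaml": "yamllint", ".yml": "yamllint"}


def pick_target_branch(env: dict, remote_branches: set) -> Optional[str]:
    """Prefer ZUUL_TARGET_BRANCH, then fall back to preprod > main > master."""
    if env.get("ZUUL_TARGET_BRANCH"):
        return env["ZUUL_TARGET_BRANCH"]
    for candidate in ("preprod", "main", "master"):
        if candidate in remote_branches:
            return candidate
    return None


def changed_files(target_branch: str) -> list:
    """List files changed since the merge base with origin/<target_branch>."""
    base = subprocess.run(
        ["git", "merge-base", "HEAD", f"origin/{target_branch}"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    diff = subprocess.run(
        ["git", "diff", "--name-only", base, "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in diff.splitlines() if line]


def plan_linters(files: list) -> dict:
    """Group changed files by linter; an empty input means lint everything."""
    if not files:
        # Fallback: no changed files detected, so lint the full repository.
        return {linter: ["."] for linter in set(LINTERS.values())}
    plan: dict = {}
    for path in files:
        linter = LINTERS.get(PurePosixPath(path).suffix)
        if linter:
            plan.setdefault(linter, []).append(path)
    return plan
```

An empty plan for a non-empty file list corresponds to skipping linters when no relevant files changed, while an empty file list triggers the full-repository fallback.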
Reviewed-by: SebastianGode
Reviewed-by: LukasCuperDT
* feat: migrate monitoring stack to OTel + Loki + Prometheus

- Add OpenTelemetry Operator and Collector helm charts
- Add Loki 3.x with Swift storage backend
- Replace Victoria Metrics with kube-prometheus-stack Prometheus
- Switch Grafana auth from Keycloak to Zitadel OIDC
- Add ArgoCD applications for new observability stack
- Remove old Ansible playbooks (promtail, alerta, go-neb, grafana)
- Remove old prometheus/promtail chart templates
- Update Vault AVP paths to use /config instead of /preprod
- Fix tls_skip_verify_insecure: true -> false in Grafana

* fix: disable matrix/go-neb, fix loki compactor shared_store, fix grafana SSA, fix swift-creds presync

- Disable go-neb matrix bot (goNeb.enabled: false)
- Comment out matrix-webhook receiver and route from alertmanager
- Remove deprecated shared_store from Loki compactor config (Loki 3.x)
- Add ServerSideApply=true to grafana syncOptions (node-exporter-full CM too large)
- Add PreSync annotation to loki-swift-creds secret (must exist before init job runs)

* fix: update chart versions, fix kube-prometheus-stack no resources, fix OTel template errors

- Update kube-prometheus-stack 72.9.0 -> 82.12.0
- Update loki 6.27.0 -> 6.55.0
- Update opentelemetry-operator 0.78.1 -> 0.107.2
- Update grafana 9.2.1 -> 10.5.15
- Add goNeb.enabled:false to upstream kube-prometheus-stack values (fixes nil pointer)
- Add bucketNames to Loki storage config (required by loki 6.55.0)
- Rename containerSecurityContext -> securityContext for OTel operator (0.107.2 schema)
- Escape PromQL variables in OTel PrometheusRule template

* feat: add selfsigned-issuer ClusterIssuer, use it for OTel operator webhook certs

- Add selfSignedClusterIssuer template to cert-manager local chart
- Enable selfsigned-issuer in cert-manager values
- Switch OTel operator webhook certs from letsencrypt-prod to selfsigned-issuer

* fix: move selfsigned-issuer to upstream cert-manager chart, remove duplicate local chart

- Add selfsigned-issuer.yaml template to upstream cert-manager chart
- Remove cert-manager-additional-manifests ArgoCD app (was duplicating ClusterIssuers)
- Delete local cert-manager chart (no longer needed)

* fix: grafana namespace, prometheus datasource URL, disable embedded grafana

- Fix searchNamespace: grafana (was grafana-preprod)
- Fix Prometheus datasource URL to match kube-prometheus-stack service name
- Update ServiceMonitor release label to kube-prometheus-stack-preprod
- Disable embedded Grafana in kube-prometheus-stack (standalone deployment used)
- Use argocd-vault-plugin-helm-with-args for grafana app

* fix: disable grafana persistence (EVS volume quota exceeded)

* fix: re-enable grafana persistence (PVC quota freed)

* fix: reduce resource requests for preprod

- Loki: disable chunks-cache (9.8Gi) and results-cache (1.2Gi)
- Loki: reduce all replicas from 2 to 1
- Loki: reduce replication_factor to 1
- Loki: lower CPU/memory requests for read/write/backend
- Grafana: reduce memory from 2Gi to 1Gi

* fix: add PKCE and signout redirect for Zitadel OIDC

* fix: correct Swift proxy URL for Loki storage (swift-proxy-preprod:8080)

* fix: make loki-swift-creds a managed resource instead of PreSync hook

* fix: add sync-wave ordering for swift-creds secret before swift-init job

* fix: add Swift storage_url to bypass auth for no-auth Swift proxy

* fix: remove swift-init hook, keep swift-creds as managed secret

* feat: add tempauth to Swift proxy for Loki storage auth

- Add tempauth filter to Swift proxy pipeline
- Configure Loki user in tempauth with AVP secrets
- Remove invalid storage_url from Loki config
- Update Vault: split swift_username into account:user format for auth

* fix: set global.dnsService to coredns for Loki gateway nginx resolver

* fix: add memcached sidecar to Swift proxy for tempauth token storage

* fix: set swift proxy to 1 replica to avoid tempauth token isolation across pods

* fix: disable HPA and set minReplicas to 1 for swift proxy (tempauth token isolation)

* fix: use pre-computed htpasswd for Loki gateway, fix Grafana Loki datasource credentials

* fix: add deleteDatasources to ensure provisioning overwrites stale datasources

* fix: repair broken dashboards and add ArgoCD/NGINX/cert-manager dashboards

- Replace victoria-metrics datasource UID with prometheus in
  blackbox-exporter, hashicorp-vault, node-exporter-full dashboards
- Replace 000000001 datasource UID with prometheus in node-exporter-full
- Add new Infrastructure folder with dashboards:
  - ArgoCD (gnetId 14584)
  - NGINX Ingress Controller (gnetId 9614)
  - cert-manager (gnetId 20842)
- Enable ArgoCD metrics and ServiceMonitors for controller, server,
  repoServer, and applicationSet components

* fix: disable journald receiver in OTel node collector (journalctl not in image)

* fix: add open-telemetry helm repo to CMP plugin init

* feat: add Victoria Metrics cluster for preprod

- Enable victoria-metrics-cluster, additional-manifests, and auth ArgoCD apps
- Right-size VM cluster for preprod (1 replica, reduced resources, 50Gi storage)
- Fix ServiceMonitor labels to kube-prometheus-stack-preprod
- Add remoteWrite from Prometheus to VM via vmauth
- Add vmauth credentials to kube-prometheus-stack basic-auth extraSecret
- Add Victoria Metrics datasource to Grafana (set as default)
- Demote Prometheus datasource to non-default

* fix: correct prometheus-mon ingress hostname for preprod

* fix: remove duplicate PrometheusRules template from upstream VM chart

The PrometheusRules are deployed by the local additional-manifests chart.
The upstream chart should not contain custom templates.

* fix: reduce VM cluster resource requests to fit preprod capacity

vmselect/vminsert: 50m CPU, 128Mi memory requests
vmstorage: 50m CPU, 256Mi memory requests

* fix: add proxy-body-size annotation to vmauth ingress

Prometheus remoteWrite sends large batches that exceed nginx default 1m limit.

* fix: increase vminsert memory limit to 512Mi (was OOMKilled at 256Mi)

* fix: increase vminsert memory limit to 1Gi (still OOMKilled at 512Mi due to queued backlog)

* fix: reduce remoteWrite queue config for preprod

Lower maxShards from 30 to 5 and reduce batch size to prevent vminsert OOM.

* fix: use internal vminsert URL for remoteWrite

Bypass vmauth/ingress for in-cluster writes. Direct to vminsert service
eliminates auth, TLS overhead, nginx body size limits, and DNS issues.

* fix: reduce resource requests across preprod to free scheduling capacity

- zuul-executor: 8G→2G memory request
- zuul-merger: 3→1 replica, 600Mi→256Mi memory request
- zuul-web: 3→1 replica, 500Mi→256Mi memory request
- nodepool-launcher: 3→1 replica, 500Mi→256Mi memory request
- umami: 2→1 replica, 512Mi→256Mi memory request
- dependencytrack-frontend: 2→1 replica

Total savings: ~11.5Gi memory requests, freeing capacity for Grafana and VM.

* fix: reduce Grafana resource requests for preprod

- Grafana main: 1024Mi→256Mi memory request
- Grafana sidecar: 1Gi→128Mi memory request (was chart default)

* fix: increase loki-write PVC to 20Gi and memory limit to 2Gi

* refactor: move custom templates/dashboards from upstream to local charts

- Move grafana dashboards and template to local/grafana chart
- Move go-neb templates to local/kube-prometheus-stack chart
- Remove accidentally committed upstream loki default values.yaml
- Add ArgoCD application for local grafana-dashboards

* fix: rename grafana-dashboards app to grafana-additional-manifests, enable ServerSideApply

* fix(cert-manager): nest values under subchart key to enable ServiceMonitor

Values were at the top level and not reaching the cert-manager subchart.
This caused the ServiceMonitor to not be created, resulting in no
cert-manager metrics being scraped by Prometheus/Victoria Metrics.

* Add kubernetes, infrastructure dashboards and logs-by-level dashboard

- Add 6 kubernetes dashboards (API server, CoreDNS, global/namespace/node/pod views)
- Add 3 infrastructure dashboards (ArgoCD, nginx-ingress, cert-manager) with fixed datasource refs
- Fix ArgoCD dashboard: update datasource format from string to object (Grafana 10+ compat)
- Fix nginx-ingress dashboard: replace unresolved ${DS_PROMETHEUS} with actual datasource uid
- Add new ECO K8S Logs by Namespace & Level dashboard with namespace, level, and search filters

* Add dual datasource support (Prometheus+VictoriaMetrics) and admin overview dashboard

* Fix admin overview dashboard queries to match working k8s dashboards patterns

* Fix admin overview dashboard: namespace variable not loading on dashboard open

Root cause: namespace variable had refresh=2 (on time range change only) instead
of refresh=1 (on dashboard load), so it stayed empty on first load. All panels
filtering by namespace showed no data.

Fixes:
- namespace variable: refresh 2->1, add allValue='.*', use kube_pod_info query
  (matching working k8s-views-namespaces dashboard pattern)
- job variable: add allValue='.*', use node_cpu_seconds_total query
  (matching working k8s-views-global dashboard pattern)

* fix: revert loki-write PVC to 10Gi to fix ArgoCD sync failure

StatefulSet volumeClaimTemplates are immutable. The PVC was already expanded
to 20Gi manually, but the StatefulSet spec must match the original 10Gi.
The memory limit increase to 2Gi is kept.

* Fix Loki query parse error: change namespace allValue from .* to .+

Loki rejects {namespace=~".*"} because .* matches empty strings.
Changed allValue to .+ which Loki accepts and works equally for
Prometheus since namespace values are never empty.
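
The distinction between the two patterns can be checked quickly. The sketch below uses Python's `re` module rather than Loki's own regex engine, but the empty-string behaviour of `.*` and `.+` is the same in both. The namespace value is just an example.

```python
import re


def matches_empty(pattern: str) -> bool:
    """Return True when the pattern matches the empty string."""
    return re.fullmatch(pattern, "") is not None


def matches_value(pattern: str, value: str) -> bool:
    """Return True when the pattern matches the whole value."""
    return re.fullmatch(pattern, value) is not None


star_empty = matches_empty(".*")   # '.*' accepts zero characters
plus_empty = matches_empty(".+")   # '.+' requires at least one character
print(star_empty, plus_empty)
print(matches_value(".*", "monitoring"), matches_value(".+", "monitoring"))
```

For any non-empty namespace both patterns select the same values, which is why the change is safe for the Prometheus queries as well.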

* Fix Loki label names in all dashboards

Loki uses OpenTelemetry-style labels (k8s_namespace_name, service_name)
not the traditional Kubernetes labels (namespace, app, cluster).
Updated all Loki queries and variable definitions:
- namespace -> k8s_namespace_name
- app -> service_name
- Removed non-existent cluster variable from eco-k8-logs

* fix(zuul): fix Swift log upload 401 by adding tempauth authentication

- Add Zuul tempauth user to swift-proxy for account AUTH_zuul-preprod-logs
- Switch Zuul clouds.yaml from auth_type:none to v1password with tempauth credentials
- Credentials sourced from Vault secret/swift/users/zuul

* chore: point zuul app to feature branch for testing

* fix(zuul): use admin_token auth via tempauth pre-authentication

- Revert clouds.yaml to auth_type:none (v1password not in executor image)
- Add Swift tempauth credentials to site-vars.yaml for upload role
- Upload role will pre-authenticate and use obtained token

* fix(logs): Add severity extraction to OTel collector and fix logs-by-level dashboard

- Add severity_parser operator to filelog/pods receiver in the OTel
  daemonset collector. This extracts log level keywords (FATAL, ERROR,
  WARN, INFO, DEBUG, TRACE and variants) from the log body and sets
  SeverityText on the log record, which the Loki exporter maps to the
  'level' label on Loki streams.

- Update the ECO K8S Logs by Namespace & Level dashboard to use the
  'level' stream label selector instead of parsing JSON from the log
  body at query time. This makes the level dropdown populate correctly
  and the queries work for both JSON and plain-text logs.

* fix(grafana): Fix dashboards and remove duplicate Infrastructure folder

- eco-k8-logs: Add pod selector variable (k8s_pod_name) with multi-select,
  update workload variable to use k8s_container_name filtered by pod,
  update LogQL query to include pod and container label selectors

- grafana values-preprod: Remove duplicate 'grafana-dashboards-infrastructure'
  provider and its gnetId dashboard definitions (argocd, nginx-ingress,
  cert-manager) that created a second 'Infrastructure' folder. The local
  JSON dashboards in dashboards/infrastructure/ are already loaded via
  the sidecar ConfigMap mechanism.

- kube-prometheus-stack: Enable CoreDNS ServiceMonitor (coreDns: true)

- ingress-nginx: Enable metrics and ServiceMonitor for monitoring

* fix: loki-write crash, dashboard filtering, coredns metrics, otel OOM

- swift-proxy: increase storage PVC from 100Gi to 350Gi (disk was full,
  causing 503 on Swift writes and loki-write CrashLoopBackOff)
- grafana: fix ECO K8S Logs dashboard - use detected_level instead of
  non-existent level label, add pod selector variable
- kube-prometheus-stack: enable CoreDNS monitoring with correct OTC CCE
  selector (k8s-app: coredns instead of kube-dns)
- opentelemetry: increase node collector memory from 512Mi to 1Gi
  to fix OOMKilled on node 192.168.150.215

* fix(grafana): use Prometheus for pod variable to show only running pods

Changed pod dropdown in ECO K8S log dashboards to query
kube_pod_status_phase from Prometheus instead of Loki label values.
This ensures only currently running pods appear in the selector,
eliminating hundreds of stale/terminated pods from historical data.

* feat: add Tempo tracing backend and Grafana traces datasource

- Deploy Tempo single-binary (grafana/tempo 1.24.4) in tempo namespace
- Add Tempo datasource to Grafana with traces-to-logs/metrics links
- Add ArgoCD Application for Tempo in preprod
- Add OTLP receiver and Tempo exporter to OTel DaemonSet collector
- Add traces pipeline to OTel collector service
- Add tempo helm repo to ArgoCD AVP plugin init
- Fix pod variable in log dashboards (use Prometheus kube_pod_status_phase)

* fix: correct Tempo port 3100→3200, enable metrics generator and Prometheus remote write receiver

* feat: add OTel auto-instrumentation CR for Python, Node.js, Java, Go

* feat: enable native OTLP tracing for ArgoCD

* fix: use extraArgs for ArgoCD OTLP tracing instead of CM keys

* Add prometheus-blackbox-exporter for preprod cluster

Deploy blackbox-exporter to monitoring namespace on preprod to enable
endpoint probing. Configures ServiceMonitor targets for all preprod
services so the existing Grafana blackbox dashboard gets data.

- Add upstream helm chart wrapper (prometheus-blackbox-exporter v9.2.0)
- Add values-preprod.yaml with 16 probe targets (http_2xx + http_403)
- Add ArgoCD application entry in monitoring namespace

* Fix blackbox probes for auth-protected and Vault endpoints

- prometheus-mon: use http_auth module (accepts 200/401/403) since
  ingress has nginx basic auth
- vault-lb: probe /v1/sys/health instead of root (returns 307 redirect)
- swift-proxy: use http_auth module (accepts 200/401/403) instead of
  http_403 since root may return 401
- Add http_auth module for endpoints behind authentication
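
The module choice boils down to which HTTP status codes count as a successful probe. The sketch below models that decision in Python. The accepted codes for `http_auth` come from the commit message; treating `http_2xx` as "any 2xx" and `http_403` as "exactly 403" is an assumption based on the module names.

```python
def probe_succeeds(module: str, status: int) -> bool:
    """Decide whether a status code counts as success for a probe module."""
    if module == "http_2xx":
        return 200 <= status < 300
    if module == "http_403":
        # Assumed from the module name: only 403 is accepted.
        return status == 403
    if module == "http_auth":
        # Accepts 200/401/403, as stated in the commit message.
        return status in {200, 401, 403}
    raise ValueError(f"unknown module: {module}")
```

With this model, an endpoint answering 401 fails under `http_403` but passes under `http_auth`, which matches the swift-proxy change.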

* Fix blackbox dashboard: add legendFormat to all stat panels

Without legendFormat, stat panels display the full metric label set
(pod, prometheus, namespace, etc.) making them unreadable. Now all
panels show clean instance URLs or summary labels instead.

* Fix blackbox dashboard: set textMode=value on stat panels

With legendFormat set, textMode=auto causes stat panels to display
both the full instance URL and the value, making them unreadable.
Set textMode=value since the repeating row title already shows the
target URL.

* Improve blackbox dashboard readability

- Add regex to target variable to extract short hostname from full URL
  (e.g. 'grafana' instead of 'https://grafana.eco-preprod...')
- Increase stat panel height from 2 to 3 rows for better display

* Redesign blackbox dashboard layout for readability

Reorganize panel layout within each target section:
- All 7 stat panels now in a single horizontal row at the top
- Both graphs (HTTP Duration, Probe Duration) in a full-width row below
- Previously stats were stacked in a narrow 4-unit left column

* Fix blackbox dashboard: remove regex that broke all queries

The regex capture group on the target variable replaced the full URL
with just the subdomain, causing all queries to return no data since
instance labels contain full URLs.
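
The failure mode can be reproduced in a few lines. In the sketch below the instance labels, the regex, and the function names are all hypothetical; the point is only that an exact-match selector built from the shortened value no longer finds any series.

```python
import re

# Hypothetical instance labels; the real targets are full URLs as well.
INSTANCES = ["https://grafana.example.test", "https://vault.example.test/v1/sys/health"]

# Hypothetical capture-group regex of the kind described above.
SHORT_NAME = re.compile(r"https://([^.]+)\..*")


def variable_value(instance: str, use_regex: bool) -> str:
    """Return the value the dashboard variable would hold."""
    if use_regex:
        match = SHORT_NAME.fullmatch(instance)
        return match.group(1) if match else instance
    return instance


def query_hits(selected: str) -> list:
    """Emulate an exact-match label selector: instance="<selected>"."""
    return [inst for inst in INSTANCES if inst == selected]


with_regex = query_hits(variable_value(INSTANCES[0], use_regex=True))
without_regex = query_hits(variable_value(INSTANCES[0], use_regex=False))
print(with_regex, without_regex)
```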

* Enable OpenTelemetry auto-instrumentation for preprod services

Add OTel auto-instrumentation pod annotations to enable distributed
tracing via the OpenTelemetry Operator. The operator injects the
appropriate language agent as an init container.

Python (inject-python):
  - Zuul scheduler, executor, web, merger

Java (inject-java):
  - Dependency-Track API server

Node.js (inject-nodejs):
  - Outline wiki
  - Umami analytics
  - OpenAPI validator
  - Circle Partner Navigator (backend, frontend, mail gateway)

Skipped (no stable auto-instrumentation):
  - Go services (Vault, Grafana, Loki, Victoria Metrics, Zitadel)
  - Rust services (mCaptcha)
  - ArgoCD (already has native OTLP export)

* Mirror Node.js autoinstrumentation image to quay.io

Update Instrumentation CRD to use quay.io/opentelekomcloud/autoinstrumentation-nodejs:0.57.1
instead of ghcr.io (tag 0.57.0 does not exist; 0.57.1 is the patch release).

* Add quay.io imagePullSecret to Outline deployment

Required for pulling OTel autoinstrumentation init container from
private quay.io/opentelekomcloud registry.

* Remove runAsNonRoot from Outline podSecurityContext

The OTel Node.js init container runs as root. Moving runAsNonRoot
to container-level securityContext only (already set there).

* Use amd64 tag for Node.js autoinstrumentation image

Previous push was arm64 from Mac, causing exec format error on
amd64 cluster nodes. Using distinct tag to bypass node image cache.

* Revert OTel auto-instrumentation for Zuul components

Zuul bundles its own OpenTelemetry SDK (zuul.lib.tracing) which
conflicts with the operator's injected agent, causing ImportError
on startup (ReadableLogRecord API mismatch). Zuul already has
native OTel tracing support configured separately.

* Use quay.io/opentelekomcloud for all OTel autoinstrumentation images

Mirror Python and Java agent images to quay.io to avoid ghcr.io
pull issues. Go agent kept on ghcr.io (alpha, not currently used).

- python:0.51b0 -> quay.io/opentelekomcloud/autoinstrumentation-python:0.51b0
- java:2.12.0  -> quay.io/opentelekomcloud/autoinstrumentation-java:2.12.0
- nodejs:0.57.1-amd64 already on quay.io (unchanged)

* Add restricted securityContext to OTel init containers

Namespaces with PodSecurity 'restricted:latest' require init containers
to set allowPrivilegeEscalation=false and drop ALL capabilities.

* Upgrade OTel Operator to 0.109.0, revert unsupported securityContext

- Upgrade opentelemetry-operator chart 0.107.2 -> 0.109.0 (app 0.148.0)
- Remove securityContext from Instrumentation CRD (not supported in schema)
- Namespaces with OTel init containers use baseline PodSecurity instead

* Remove runAsNonRoot from pod-level securityContext for OTel compatibility

OTel init containers run as root. Moving runAsNonRoot enforcement to
container-level securityContext only. Affected: umami, openapi-validator,
circle-partner-navigator (backend/frontend/emg), dependency-track.

* Relax PodSecurity to baseline for OTel-instrumented namespaces

OTel Operator injects init containers without securityContext (not
supported in CRD schema). Change enforce from restricted to baseline
for: analytics, openapi-validator, circle-partner-navigator, dependencytrack.
Audit and warn remain restricted for visibility.

* Add imagePullSecrets for quay.io OTel images to all instrumented services

All deployments need imagePullSecrets to pull the OTel init container
from quay.io/opentelekomcloud (private registry).

* Fix node collector OOM: increase memory to 2Gi, reduce batch to 2048

The node collector on node 192.168.150.215 was constantly OOMing (23 restarts)
because Loki rejected batches exceeding gRPC 4MB limit (batch_size 8192 produced
~4.2MB payloads). Failed exports queued in memory, triggering memory_limiter
which dropped ALL data including trace spans.

Changes:
- Reduce send_batch_size 8192 -> 2048 (keeps batches under 4MB gRPC limit)
- Increase memory limit 1Gi -> 2Gi (headroom for log/trace queues)

This fixes missing traces from instrumented services (outline, umami, cpn,
openapi-validator, dependency-track) which were being silently dropped.
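
A back-of-the-envelope check of the new batch size, using the figures from this commit message. The sketch assumes payload size scales linearly with the number of records per batch and treats MB as MiB; both are simplifications.

```python
GRPC_LIMIT_BYTES = 4 * 1024 * 1024   # 4 MB gRPC message limit cited above
OBSERVED_PAYLOAD_MB = 4.2            # approximate payload at batch size 8192
OBSERVED_BATCH = 8192


def estimated_payload_bytes(batch_size: int) -> float:
    """Estimate payload size, assuming it grows linearly with batch size."""
    per_record = OBSERVED_PAYLOAD_MB * 1024 * 1024 / OBSERVED_BATCH
    return per_record * batch_size


for size in (8192, 2048):
    payload = estimated_payload_bytes(size)
    print(size, round(payload / (1024 * 1024), 2), payload < GRPC_LIMIT_BYTES)
```

Under these assumptions a batch of 2048 records lands at roughly a quarter of the observed payload, leaving ample headroom below the limit even if individual log records are larger than average.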

* Fix OTLP protocol mismatch: set OTEL_EXPORTER_OTLP_PROTOCOL=grpc

The Node.js and Python SDKs default to the http/protobuf protocol, but the collector
endpoint is port 4317 (gRPC). The SDK therefore sends HTTP/1.1 to a gRPC (HTTP/2)
port, and trace export fails silently.

Setting OTEL_EXPORTER_OTLP_PROTOCOL=grpc globally in the Instrumentation CR
ensures all language SDKs use gRPC to match the collector's port 4317.
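
A mismatch of this kind can be caught with a simple sanity check on the endpoint and protocol pair. The sketch below relies on the conventional OTLP ports (4317 for gRPC, 4318 for HTTP); the endpoint hostname and the function name are hypothetical.

```python
from urllib.parse import urlsplit

# Conventional OTLP ports and the protocols usually served on them.
PORT_PROTOCOLS = {
    4317: {"grpc"},
    4318: {"http/protobuf", "http/json"},
}


def protocol_mismatch(endpoint: str, protocol: str) -> bool:
    """Return True when the protocol is unusual for the endpoint's port."""
    port = urlsplit(endpoint).port
    expected = PORT_PROTOCOLS.get(port)
    # Unknown ports are not flagged; only the two conventional ones are checked.
    return expected is not None and protocol not in expected


# Hypothetical collector endpoint used for illustration.
ENDPOINT = "http://otel-collector.example:4317"
print(protocol_mismatch(ENDPOINT, "http/protobuf"))
print(protocol_mismatch(ENDPOINT, "grpc"))
```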

* Enable Zuul native OTLP tracing to OTel collector

Add [tracing] section to zuul.conf with:
- endpoint pointing to OTel node collector DaemonSet
- gRPC protocol (matching collector's 4317 port)
- insecure=true (internal cluster communication)

Zuul bundles its own OpenTelemetry SDK and cannot use
auto-instrumentation injection. Native tracing config is
the supported approach per Zuul docs.

* Enable alerting stack for preprod: go-neb, Matrix, alert rules

- Enable go-neb Matrix bot for Preprod room
- Enable Matrix webhook receiver and routing in Alertmanager
- Add additionalPrometheusRulesMap with alert groups:
  blackbox-exporter, cert-manager, postgresql, argocd,
  persistent-storage, pod-health, node-health

* Move PrometheusRules to local helm chart, add new alerts

- Move 22 custom alert rules from upstream additionalPrometheusRulesMap
  to local kube-prometheus-stack chart as a PrometheusRule template
- Remove duplicate NodeHighMemoryUsage (= default NodeMemoryHighUtilization)
- Add 7 new alerts based on actual available metrics:
  - PostgresDown (pg_up), PostgresDeadlocks, PostgresLongRunningTransactions
  - ArgoCDClusterConnectionLost
  - TempoDiscardedSpans, TempoIngesterFlushFailing, TempoIngesterHighLiveTraces
- Enable rules for both preprod and otcinfra clusters
- Total: 29 custom alerts across 9 groups

* Add ServiceMonitors and alert rules for Loki, Vault, Zitadel, Anubis

ServiceMonitors (preprod only):
- Loki: all components via http-metrics port 3100
- Vault: server via /v1/sys/metrics?format=prometheus port 8200
- Zitadel: via /debug/metrics port 8080
- Anubis: via metrics port 9090

New alert rules (15):
- Loki: LokiRequestErrors, LokiRequestPanics, LokiCompactorFailing
- Vault: VaultDown, VaultSealed, VaultNotActive,
  VaultAutopilotUnhealthy, VaultAutopilotNodeUnhealthy,
  VaultHighGoroutines
- Zitadel: ZitadelDown, ZitadelHighErrorRate, ZitadelSlowResponses
- Anubis: AnubisDown, AnubisHighDenyRate, AnubisHighChallengeRate

Total: 44 custom alerts across 13 groups, 4 ServiceMonitors

* Consolidate ServiceMonitors: disable native SMs for Loki and Vault

- Loki: disable native serviceMonitor (duplicate of local chart SM)
- Vault: disable serverTelemetry.prometheusOperator (SM managed in local chart)

ServiceMonitors now centralized in the local kube-prometheus-stack chart:
  Loki, Vault, Zitadel, Anubis

Native chart ServiceMonitors intentionally kept for tightly-coupled services:
  ArgoCD (4), Grafana, Tempo, PostgreSQL, VictoriaMetrics (3),
  cert-manager, OTel operator, Blackbox exporter (target-based)

* Fix vmstorage OOM (512Mi->1536Mi) and exclude loki gateway from ServiceMonitor

- Increase vmstorage memory limit from 512Mi to 1536Mi (was OOMKilled with 4B rows)
- Increase vmstorage memory request from 256Mi to 512Mi
- Exclude loki gateway component from loki ServiceMonitor (nginx returns 401 on /metrics)
- Fixes KubePodCrashLooping, PrometheusRemoteWriteDesiredShards, TargetDown alerts
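
The two changes above could look roughly like the following values and selector fragments. This is a sketch only: the `vmstorage.resources` path assumes the layout of the victoria-metrics-cluster chart, and the `app.kubernetes.io/component` label key is an assumption about how the Loki gateway Service is labeled.

```yaml
# Helm values fragment (assumed key path: vmstorage.resources)
vmstorage:
  resources:
    requests:
      memory: 512Mi    # was 256Mi
    limits:
      memory: 1536Mi   # was 512Mi, pod was OOMKilled
---
# Fragment of the Loki ServiceMonitor spec.
# The label key is an assumption; check the labels on the gateway Service.
selector:
  matchExpressions:
    - key: app.kubernetes.io/component
      operator: NotIn
      values: ["gateway"]   # nginx gateway returns 401 on /metrics
```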

* Fix vault NetworkPolicy DNS egress to use ipBlock instead of podSelector

CNI evaluates egress rules before kube-proxy DNAT on this cluster,
so targeting CoreDNS pods by label doesn't match traffic sent to the
ClusterIP (10.247.3.10). Use ipBlock with service CIDR instead.

This was blocking Raft leader election (DNS timeout between vault nodes).
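
A minimal sketch of the ipBlock-based DNS egress rule described above. The policy name, namespace, pod label, and the `10.247.0.0/16` CIDR are assumptions for illustration; only the ClusterIP 10.247.3.10 comes from this commit, so substitute the cluster's real service CIDR.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vault-allow-dns            # hypothetical name
  namespace: vault                 # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault   # assumed pod label
  policyTypes:
    - Egress
  egress:
    - to:
        # Match the DNS Service ClusterIP range instead of CoreDNS pod labels,
        # because the rule is evaluated before DNAT on this cluster.
        - ipBlock:
            cidr: 10.247.0.0/16    # assumed service CIDR containing 10.247.3.10
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```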

* Increase Loki ingestion rate limits and PostgreSQL CPU limit

- Double Loki ingestion rate (16->32 MB/s) and burst (32->64 MB/s)
- Double Loki per-stream rate (5->10 MB) and burst (15->30 MB)
  Fixes OTel collector 429 retry storm causing NodeSystemSaturation
- Increase PostgreSQL CPU limit from 1000m to 2000m (72% throttled at 883m)
  Fixes CPUThrottlingHigh and InfoInhibitor alerts in postgres namespace
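
For reference, the Loki side of this change might be expressed as below. The `loki.limits_config` path assumes the grafana/loki Helm chart; the option names are standard Loki `limits_config` settings, and the values are the ones listed above.

```yaml
# Helm values fragment (assumed key path: loki.limits_config)
loki:
  limits_config:
    ingestion_rate_mb: 32               # was 16
    ingestion_burst_size_mb: 64         # was 32
    per_stream_rate_limit: 10MB         # was 5MB
    per_stream_rate_limit_burst: 30MB   # was 15MB
```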

* Right-size resource requests to reduce overcommit alerts

CPU request reductions (~750m saved):
- redis: 200m -> 50m (uses ~1m)
- mcaptcha: 200m -> 100m (uses ~200m peak)
- cpn-front: 100m -> 30m (uses ~1m)
- gitea-runner: 200m -> 100m (uses ~5m idle)
- dependencytrack-frontend: 100m -> 20m (uses ~1m)
- backstage redis: 100m -> 20m (uses ~1m)

Memory request reductions (~7GiB saved):
- zookeeper x3: 2Gi -> 1Gi (uses ~1Gi, limit stays 2Gi)
- dependencytrack-api: 6Gi -> 3Gi (uses ~2.5Gi, limit 16Gi -> 8Gi)
- zuul-executor: 2G -> 1G (uses ~510Mi, limit stays 2G)

* Increase Loki write PVC size from 10Gi to 50Gi (disk full)

* Add Grafana dashboards for Loki, OTel Collector, and Zitadel

- Loki Operational dashboard (based on Grafana.com #14055)
- OpenTelemetry Collector dashboard (based on Grafana.com #15983)
- Zitadel Overview dashboard (custom, uses __name__ syntax for dot-notation metrics)

* Remove duplicate k8s dashboard provider from upstream grafana values

The same 6 Kubernetes dashboards were provisioned by both
'grafana-dashboards-kubernetes' (url-based) and 'sidecarProvider'
(ConfigMap-based from local chart). Duplicate UIDs caused Grafana to
revoke DB write permissions for the entire sidecarProvider, blocking
all new dashboard provisioning (loki, otel-collector, zitadel).

* Fix Vault ServiceMonitor selector: use vault-internal label

The selector had 'component: server' but Vault Helm chart services
don't have that label. Use 'vault-internal: true' to target the
headless internal service for individual pod scraping.

* Add job=vault relabeling to Vault ServiceMonitor

The Vault Grafana dashboard hardcodes job="vault" in all queries.
Without explicit relabeling, the job label defaults to the service
name (vault-preprod-internal), causing empty dashboard panels.
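
Taken together, the selector fix and the job relabeling could be sketched as the ServiceMonitor below. The resource name and the `http` port name are assumptions; the label, metrics path, and `job` value come from the commits above.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vault                    # hypothetical name
spec:
  selector:
    matchLabels:
      vault-internal: "true"     # headless internal Service, one target per pod
  endpoints:
    - port: http                 # assumed name of the 8200 port on the Service
      path: /v1/sys/metrics
      params:
        format: ["prometheus"]
      relabelings:
        # Overwrite the default job label (the Service name) so that
        # dashboard queries filtering on job="vault" return data.
        - targetLabel: job
          replacement: vault
```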

* Add ingress-nginx metrics Service and ServiceMonitor

The ingress-nginx controller exposes metrics on port 10254 but has
no Service or ServiceMonitor for it. Creates a ClusterIP Service
targeting port 10254 and a ServiceMonitor to scrape it, with
job=ingress-nginx relabeling for dashboard compatibility.
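
A possible shape for the two resources, as a sketch. The resource names, namespace, and the `app.kubernetes.io/*` labels are assumptions about how the controller is deployed; port 10254 and the `job` value come from the commit. Prometheus must also be configured to select this ServiceMonitor.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-metrics            # hypothetical name
  namespace: ingress-nginx               # assumed namespace
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: metrics
spec:
  type: ClusterIP
  selector:                              # assumed controller pod labels
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller
  ports:
    - name: metrics
      port: 10254
      targetPort: 10254
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx                    # hypothetical name
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: metrics
  endpoints:
    - port: metrics
      relabelings:
        # Fixed job label for dashboard compatibility
        - targetLabel: job
          replacement: ingress-nginx
```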

* Fix Loki operational dashboard broken queries

- log_messages_total{app=loki} -> loki_log_messages_total{namespace=loki}
- kube_pod_container_resource_limits_memory_bytes -> kube_pod_container_resource_limits{resource=memory}
- kube_pod_container_resource_requests_memory_bytes -> kube_pod_container_resource_requests{resource=memory}
- kube_pod_container_resource_limits_cpu_cores -> kube_pod_container_resource_limits{resource=cpu}
- kube_pod_container_resource_requests_cpu_cores -> kube_pod_container_resource_requests{resource=cpu}
- Promtail panels -> OTel Collector (pod=~otel-node-collector.*)

* Right-size resource requests and add CPU limits

Fix KubeCPUOvercommit, KubeMemoryOvercommit, NodeSystemSaturation:

CPU request reductions (~615m saved):
- OTel node collector: 100m->50m, add cpu limit 1000m (was unlimited)
- Loki gateway: 100m->50m, add cpu limit 500m (was unlimited)
- Vault server: 100m->30m (uses 24m)
- Vault injector: 50m->10m (uses 1m), mem 256Mi->64Mi
- Vault unsealer: 50m->10m (uses <1m)
- Vault CSI provider: 50m->10m (uses 1m), mem 256Mi->64Mi
- Zuul scheduler: 100m->20m (uses 2m)
- Swift proxy: 75m->30m (uses 37m)

Memory request reductions (~4.1 GiB saved):
- Vault server: 1Gi->256Mi (uses 88Mi)
- Zuul scheduler: 2Gi->512Mi (uses 236Mi)
- Swift proxy: 1Gi->512Mi (uses 286Mi)

CPU limits added to prevent node saturation:
- OTel node collector: unlimited->1000m
- Loki gateway: unlimited->500m
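
As an illustration, the OTel node collector change amounts to a resources fragment like the one below. Where this block lives (for example under `spec.resources` of the collector resource) depends on how the collector is deployed and is an assumption here.

```yaml
# Resources fragment for the OTel node collector container
resources:
  requests:
    cpu: 50m      # was 100m
  limits:
    cpu: 1000m    # previously no CPU limit
```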

* Further reduce CPU requests on low-usage services

- vmauth: 100m->10m × 2 pods (uses 1m each)
- tempo: 100m->20m (uses 3m)
- postgresql metrics sidecar: 100m->10m (uses 1m)
Saves additional ~350m CPU requests


* Override NodeSystemSaturation alert with threshold 8

Cloud VMs with EVS disks have inflated load averages due to I/O wait,
causing false positives at the default threshold of 2. Raise to 8.
Disable the upstream rule and add our own in the local chart.
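
A sketch of what the replacement rule might look like. The expression is modeled on the node-exporter mixin's load-per-core rule with the threshold changed to 8; the resource name, group name, `for` duration, and severity are assumptions, so compare against the upstream rule actually deployed.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-saturation-override       # hypothetical name
spec:
  groups:
    - name: node-health                # assumed group name
      rules:
        - alert: NodeSystemSaturation
          # 1-minute load average divided by the number of CPU cores
          expr: |
            node_load1{job="node-exporter"}
              / count without (cpu, mode) (node_cpu_seconds_total{job="node-exporter", mode="idle"})
              > 8
          for: 15m                     # assumed, mirrors the upstream default
          labels:
            severity: warning
          annotations:
            summary: System load per core has been above 8 for 15 minutes.
```

The upstream rule has to be disabled separately in the kube-prometheus-stack values, otherwise both thresholds fire.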

* Increase PostgreSQL max_connections from 100 to 200

Outline was crash-looping with 'remaining connection slots are reserved
for roles with the SUPERUSER attribute'. 73 idle connections, including
dependencytrack (20), zitadel (16), mcaptcha (12), and backstage (18),
were consuming nearly all 100 slots.

* Reduce Zuul log level from DEBUG to WARNING

Remove -d flag from zuul-scheduler, zuul-web, zuul-executor,
and zuul-merger to stop debug-level logging.

* Add 30-day log expiry for Swift storage and expand PVC to 500Gi

- Expand swift-proxy storage PVC from 350Gi to 500Gi (was 87% full)
- Add object-expirer sidecar to storage StatefulSet
- Add object-expirer ConfigMap with expirer daemon config
- Add daily CronJob that deletes .data files older than 30 days
- At ~3.75 GiB/day growth, 30-day retention caps usage at ~112 GiB

* fix: remove opendev connection from zuul.conf.hcl

The opendev.org connection is no longer needed since all tenant
references to opendev repos (zuul/zuul-jobs, ansible-collections-openstack,
osf/refstack-client) are being removed in zuul-config PR #271.

Related: opentelekomcloud-infra/zuul-config#271

* fix(swift-proxy): reduce log retention from 30 to 15 days

Disk at 487G/500Gi (99%+), writes stopped since Apr 12.
Reduce retention to free space faster once cronjob is deployed.

* fix(swift-proxy): use kubectl exec for cleanup cronjob instead of PVC mount

RWO PVC cannot be mounted by both the storage pod and cleanup job.
Use kubectl exec into the storage pod directly instead.
Add RBAC role for pods/exec permission.
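
The approach above could be sketched as follows. All names, the image, the schedule, the pod name, and the data path are hypothetical; the ServiceAccount and RoleBinding that tie the Role to the job are omitted for brevity.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: swift-log-cleanup                # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    # Some client/server versions authorize websocket exec as "get";
    # adjust the verbs if exec is denied.
    verbs: ["create"]
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: swift-log-cleanup                # hypothetical name
spec:
  schedule: "0 3 * * *"                  # assumed daily schedule
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: swift-log-cleanup   # must be bound to the Role
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: registry.example.com/kubectl:latest   # hypothetical image with kubectl
              command:
                - kubectl
                - exec
                - swift-storage-0        # hypothetical storage pod name
                - "--"
                - find
                - /srv/node              # hypothetical data path inside the pod
                - "-name"
                - "*.data"
                - "-mtime"
                - "+15"
                - "-delete"
```

Because the job only execs into the running storage pod, it never mounts the RWO volume itself.
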

All 12 cronjobs were crashing with 'Missing environment variable: GITEA_TOKEN'.
Uncomment the GITEA_TOKEN AVP placeholder and create Vault secret.

Reviewed-by: Yustina Kvrivishvili
Reviewed-by: SebastianGode
Reviewed-by: LukasCuperDT

- Add otc-ca-cert secret template with AVP reference
- Mount CA cert into Zitadel pods via extraVolumes/extraVolumeMounts
- Set SSL_CERT_FILE env to trust OTC_CA for LDAP connections
- Fixes: x509: certificate signed by unknown authority