Commit 28bb76e

Add Kyverno ArgoCD app and Cilium preflight
Make Kyverno a standalone ArgoCD Application (sync-wave 3) so its webhooks register before any app PVCs are created; add infrastructure/controllers/argocd/apps/kyverno-app.yaml, remove Kyverno from the Infrastructure AppSet, and register kyverno-app.yaml in the kustomization manifest. Add preflight checks to scripts/bootstrap-argocd.sh to verify the cilium CLI and expected Cilium version (1.19.0), warn/prompt on mismatches, and document repair steps for Hubble Relay cert issues. Update documentation: README.md clarifies that the cilium install CLI version must match the Helm chart and includes Hubble cert cleanup steps; CLAUDE.md updates sync-wave ordering and rationale (PVC Plumber → Kyverno → Infrastructure AppSet). Also expand .claude local settings to allow additional kubectl bash commands used during bootstrap.
1 parent 981779b commit 28bb76e

7 files changed

Lines changed: 115 additions & 12 deletions

.claude/settings.local.json

Lines changed: 3 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -13,7 +13,9 @@
1313
"Bash(kubectl api-resources:*)",
1414
"Bash(python3:*)",
1515
"Bash(helm repo list:*)",
16-
"WebFetch(domain:raw.githubusercontent.com)"
16+
"WebFetch(domain:raw.githubusercontent.com)",
17+
"Bash(kubectl run:*)",
18+
"Bash(kubectl rollout:*)"
1719
],
1820
"deny": [],
1921
"ask": []

CLAUDE.md

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -117,15 +117,17 @@ Applications deploy in strict order to prevent race conditions:
117117
| **0** | Foundation | Cilium (CNI), ArgoCD, 1Password Connect, External Secrets, AppProjects |
118118
| **1** | Storage | Longhorn, VolumeSnapshot Controller, VolSync |
119119
| **2** | PVC Plumber | Backup existence checker (FAIL-CLOSED gate: PVC creation denied if Plumber is down) |
120-
| **4** | Infrastructure AppSet | Deploys from explicit path list: cert-manager, external-dns, GPU operators, Kyverno, gateway, databases, etc. |
120+
| **3** | Kyverno | Policy engine (standalone App, must register webhooks before app PVCs are created) |
121+
| **4** | Infrastructure AppSet | Deploys from explicit path list: cert-manager, external-dns, GPU operators, gateway, databases, etc. |
121122
| **5** | Monitoring AppSet | Discovers `monitoring/*` applications |
122123
| **6** | My-Apps AppSet | Discovers `my-apps/*/*` applications |
123124

124125
**Why this matters**:
125126
- Longhorn won't deploy until Cilium + External Secrets are healthy
126-
- PVC Plumber (Wave 2) must run before Infrastructure AppSet (Wave 4) because Kyverno policies call PVC Plumber API
127+
- PVC Plumber (Wave 2) must run before Kyverno (Wave 3) because Kyverno policies call PVC Plumber API
128+
- Kyverno (Wave 3) is a **standalone Application** (not in the Infrastructure AppSet) to guarantee its webhooks are registered before any app PVCs are created. ApplicationSets are considered "healthy" immediately upon creation, so putting Kyverno in an AppSet would race with app deployment.
127129
- **FAIL-CLOSED**: If PVC Plumber is down, Kyverno denies creation of backup-labeled PVCs. Apps retry via ArgoCD backoff until Plumber is healthy. This prevents data loss during disaster recovery.
128-
- Kyverno, cert-manager, GPU operators etc. deploy via Infrastructure AppSet (Wave 4) before user apps (Wave 6)
130+
- cert-manager, GPU operators etc. deploy via Infrastructure AppSet (Wave 4) before user apps (Wave 6)
129131
- This prevents "chicken-and-egg" dependency issues and SSD thrashing
130132

131133
**Important**: The Infrastructure AppSet uses an explicit list of paths (not glob discovery). To add a new infrastructure component, you must add its path to `infrastructure/controllers/argocd/apps/infrastructure-appset.yaml`.

README.md

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,7 @@ Omni provisions Talos clusters without a CNI. Install Cilium to get networking f
8080

8181
```bash
8282
cilium install \
83+
--version 1.19.0 \
8384
--set cluster.name=talos-prod-cluster \
8485
--set ipam.mode=kubernetes \
8586
--set kubeProxyReplacement=true \
@@ -94,9 +95,16 @@ cilium install \
9495
--set gatewayAPI.enableAppProtocol=true
9596
```
9697

97-
> **Important:** `cluster.name` must match `infrastructure/networking/cilium/values.yaml` for Hubble certificate SANs. After ArgoCD deploys, it takes over Cilium management at Wave 0.
98+
> **Important — version must match:** The `cilium install` CLI version must match the Helm chart version in `infrastructure/networking/cilium/kustomization.yaml` (currently **1.19.0**). Use `cilium install --version 1.19.0` to pin it. If versions differ, ArgoCD upgrades Cilium at Wave 0 and regenerates some Hubble certs but not others, causing TLS handshake failures (`x509: certificate signed by unknown authority`) that block all sync waves.
9899
>
99-
> If `cilium install` is run without `--set cluster.name=talos-prod-cluster`, certificates are generated for `default` or `kind-kind`. When ArgoCD later configures Cilium to expect `talos-prod-cluster`, the certificates will not match, causing TLS handshake failures in Hubble Relay (`x509: certificate signed by unknown authority`).
100+
> **Important — cluster name must match:** `cluster.name` must match `infrastructure/networking/cilium/values.yaml` for Hubble certificate SANs. If `cilium install` is run without `--set cluster.name=talos-prod-cluster`, certificates are generated for `default` or `kind-kind`, causing the same TLS failures.
101+
>
102+
> **If Hubble Relay is crash-looping after bootstrap**, delete stale certs and restart:
103+
> ```bash
104+
> kubectl delete secret hubble-relay-client-certs hubble-server-certs -n kube-system
105+
> kubectl rollout restart deployment hubble-relay -n kube-system
106+
> kubectl rollout restart ds cilium -n kube-system
107+
> ```
100108
101109
### Step 2: Install Gateway API CRDs
102110
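
A minimal sketch of how the pinned version could be read from the kustomization file instead of being copied by hand. It assumes the chart is declared with a plain `version: 1.19.0` line (the contents of `kustomization.yaml` are not shown in this commit), and `chart_version` is a hypothetical helper name, not something the repository provides.

```shell
# Hypothetical helper: print the first `version:` value found on stdin.
# Assumes a line such as `    version: 1.19.0` (optionally quoted or
# prefixed with "v"); adjust the pattern if the real file differs.
chart_version() {
  sed -n -E 's/^[[:space:]]*version:[[:space:]]*"?v?([0-9]+\.[0-9]+\.[0-9]+)"?[[:space:]]*$/\1/p' | head -n 1
}

# Example usage (path taken from the README note above):
#   CILIUM_VERSION="$(chart_version < infrastructure/networking/cilium/kustomization.yaml)"
#   cilium install --version "$CILIUM_VERSION" --set cluster.name=talos-prod-cluster
#   (plus the remaining --set flags from the install command)
```

Deriving the value this way would keep the CLI pin and the Helm chart from drifting apart, at the cost of depending on the file layout assumed above.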

infrastructure/controllers/argocd/apps/infrastructure-appset.yaml

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,6 @@ spec:
2020
- path: infrastructure/controllers/nvidia-gpu-operator
2121
- path: infrastructure/controllers/postgres-operator
2222
- path: infrastructure/controllers/reloader
23-
- path: infrastructure/controllers/kyverno
2423
- path: infrastructure/networking/cloudflared
2524
- path: infrastructure/networking/coredns
2625
- path: infrastructure/networking/gateway

infrastructure/controllers/argocd/apps/kustomization.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ resources:
1111
- snapshot-controller-app.yaml # Wave 1 - VolumeSnapshot controller + CRDs
1212
- volsync-app.yaml # Wave 1 - PVC backup and replication (Kopia + NFS)
1313
- pvc-plumber-app.yaml # Wave 2 - Backup existence checker for restore
14+
- kyverno-app.yaml # Wave 3 - Policy engine (must be healthy before apps create PVCs)
1415
# ApplicationSets for automatic discovery
1516
- infrastructure-appset.yaml # Wave 4
1617
- monitoring-appset.yaml # Wave 5

infrastructure/controllers/argocd/apps/kyverno-app.yaml

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
---
2+
apiVersion: argoproj.io/v1alpha1
3+
kind: Application
4+
metadata:
5+
name: kyverno
6+
namespace: argocd
7+
annotations:
8+
argocd.argoproj.io/sync-wave: "3"
9+
finalizers:
10+
- resources-finalizer.argocd.argoproj.io
11+
spec:
12+
project: infrastructure
13+
source:
14+
repoURL: https://github.com/mitchross/talos-argocd-proxmox.git
15+
targetRevision: main
16+
path: infrastructure/controllers/kyverno
17+
destination:
18+
server: https://kubernetes.default.svc
19+
namespace: kyverno
20+
syncPolicy:
21+
automated:
22+
prune: true
23+
selfHeal: true
24+
syncOptions:
25+
- CreateNamespace=true
26+
- ServerSideApply=true
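
To sanity-check the ordering, the wave annotation can be read back from each manifest. This is a rough sketch, assuming the annotation is written on a single line as in the manifest above; `sync_wave` is a hypothetical helper and not part of the repository.

```shell
# Hypothetical helper: print the argocd.argoproj.io/sync-wave value found in a
# manifest read from stdin. Nothing is printed if the annotation is absent.
sync_wave() {
  sed -n -E 's#^[[:space:]]*argocd\.argoproj\.io/sync-wave:[[:space:]]*"?(-?[0-9]+)"?[[:space:]]*$#\1#p' | head -n 1
}

# Example: list every manifest in the apps directory with its wave, lowest first.
#   for f in infrastructure/controllers/argocd/apps/*.yaml; do
#     printf '%s\t%s\n' "$(sync_wave < "$f")" "$f"
#   done | sort -n
```

Files without the annotation (for example `kustomization.yaml`) would show an empty first column, which makes missing waves easy to spot.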

scripts/bootstrap-argocd.sh

Lines changed: 70 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -4,12 +4,75 @@ set -euo pipefail
44
# Bootstrap ArgoCD Script
55
# This script works around kustomize --enable-helm compatibility issues
66
# by using Helm directly, then letting ArgoCD self-manage
7+
#
8+
# Prerequisites:
9+
# 1. Cilium must be installed FIRST (provides CNI networking)
10+
# 2. Gateway API CRDs must be applied
11+
# 3. 1Password secrets must be pre-seeded
12+
#
13+
# See README.md for the full bootstrap sequence.
714

815
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
916
ROOT_DIR="$( cd "$SCRIPT_DIR/.." && pwd )"
1017

18+
# Expected Cilium version — must match infrastructure/networking/cilium/kustomization.yaml
19+
EXPECTED_CILIUM_VERSION="1.19.0"
20+
1121
echo "🚀 Bootstrapping ArgoCD with sync waves..."
1222

23+
# Pre-flight: Verify Cilium is installed and healthy at the correct version
24+
echo ""
25+
echo "🔍 Pre-flight: Checking Cilium..."
26+
27+
if ! command -v cilium &> /dev/null; then
28+
echo "❌ cilium CLI not found. Install it first: https://docs.cilium.io/en/stable/gettingstarted/k8s-install-default/"
29+
exit 1
30+
fi
31+
32+
if ! cilium status --wait --wait-duration 30s &> /dev/null; then
33+
echo "❌ Cilium is not healthy. Install Cilium first:"
34+
echo ""
35+
echo " cilium install \\"
36+
echo " --version $EXPECTED_CILIUM_VERSION \\"
37+
echo " --set cluster.name=talos-prod-cluster \\"
38+
echo " --set ipam.mode=kubernetes \\"
39+
echo " --set kubeProxyReplacement=true \\"
40+
echo " --set k8sServiceHost=localhost \\"
41+
echo " --set k8sServicePort=7445 \\"
42+
echo " --set gatewayAPI.enabled=true"
43+
echo ""
44+
exit 1
45+
fi
46+
47+
RUNNING_VERSION=$(kubectl get ds cilium -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}' 2>/dev/null | sed -E 's/.*:v([0-9]+\.[0-9]+\.[0-9]+).*/\1/' || true)
48+
49+
if [ -n "$RUNNING_VERSION" ] && [ "$RUNNING_VERSION" != "$EXPECTED_CILIUM_VERSION" ]; then
50+
echo "⚠️ WARNING: Cilium version mismatch!"
51+
echo " Running: $RUNNING_VERSION"
52+
echo " Expected: $EXPECTED_CILIUM_VERSION (from Helm chart)"
53+
echo ""
54+
echo " ArgoCD Wave 0 will upgrade Cilium $RUNNING_VERSION → $EXPECTED_CILIUM_VERSION"
55+
echo " This in-place upgrade can corrupt BPF state and break new pod networking."
56+
echo ""
57+
echo " Recommended: Reinstall Cilium at the correct version first:"
58+
echo " cilium uninstall"
59+
echo " cilium install --version $EXPECTED_CILIUM_VERSION \\"
60+
echo " --set cluster.name=talos-prod-cluster \\"
61+
echo " --set ipam.mode=kubernetes \\"
62+
echo " --set kubeProxyReplacement=true \\"
63+
echo " --set k8sServiceHost=localhost \\"
64+
echo " --set k8sServicePort=7445 \\"
65+
echo " --set gatewayAPI.enabled=true"
66+
echo ""
67+
read -p " Continue anyway? (y/N) " -n 1 -r
68+
echo ""
69+
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
70+
exit 1
71+
fi
72+
else
73+
echo "✅ Cilium ${RUNNING_VERSION:-(version not detected)} is healthy (expected Helm chart version: $EXPECTED_CILIUM_VERSION)"
74+
fi
75+
1376
# Step 1: Create namespace
1477
echo ""
1578
echo "📦 Creating argocd namespace..."
@@ -52,11 +115,13 @@ echo ""
52115
echo "✅ ArgoCD bootstrap complete!"
53116
echo ""
54117
echo "📊 ArgoCD will now sync applications in this order:"
55-
echo " Wave 0: Cilium (networking) & Secrets"
56-
echo " Wave 1: Longhorn (storage), Snapshot Controller & VolSync"
57-
echo " Wave 2: Infrastructure (core services)"
58-
echo " Wave 3: Monitoring (observability)"
59-
echo " Wave 4: My-Apps (workloads)"
118+
echo " Wave 0: Cilium (networking), 1Password Connect, External Secrets"
119+
echo " Wave 1: Longhorn (storage), Snapshot Controller, VolSync"
120+
echo " Wave 2: PVC Plumber (backup checker, FAIL-CLOSED gate)"
121+
echo " Wave 3: Kyverno (policy engine, must register webhooks before app PVCs)"
122+
echo " Wave 4: Infrastructure AppSet (cert-manager, GPU operators, gateway, etc.)"
123+
echo " Wave 5: Monitoring AppSet (Prometheus, Grafana, Loki)"
124+
echo " Wave 6: My-Apps AppSet (user workloads)"
60125
echo ""
61126
echo "🔍 Monitor progress with:"
62127
echo " kubectl get applications -n argocd -w"
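
The pre-flight version check depends on parsing the image tag of the `cilium` DaemonSet. The sketch below isolates that step as a function so it can be exercised without a cluster. `image_version` is a hypothetical name, the image references are illustrative, and unlike the inline `sed` in the script it prints nothing (rather than the unparsed image string) when no `:vX.Y.Z` tag is present.

```shell
# Hypothetical helper: extract X.Y.Z from an image reference such as
# quay.io/cilium/cilium:v1.19.0@sha256:<digest>.
# Prints nothing if the reference carries no :vX.Y.Z tag.
image_version() {
  printf '%s\n' "$1" | sed -n -E 's/.*:v([0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
}

# Example usage against a live cluster (same kubectl query as the script):
#   image="$(kubectl get ds cilium -n kube-system \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')"
#   [ "$(image_version "$image")" = "1.19.0" ] && echo "version matches"
```

Returning an empty string on a parse failure lets the caller tell "version unknown" apart from "version mismatch", which the `-n "$RUNNING_VERSION"` guard in the script appears to be aiming for.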
