Production-grade GitOps Kubernetes cluster on Talos OS with self-managing ArgoCD, Cilium, and zero-touch PVC backup/restore
A GitOps-driven Kubernetes cluster using Talos OS (secure, immutable Linux for K8s), ArgoCD, and Cilium, running on Proxmox. Managed via Omni (Sidero's Talos management platform) with the Proxmox Infrastructure Provider for automated node provisioning.
- Self-Managing ArgoCD - ArgoCD manages its own installation, upgrades, and ApplicationSets from Git
- Directory = Application - Apps discovered automatically by directory path, no manual Application manifests
- Sync Wave Ordering - Strict deployment ordering prevents race conditions
- Zero-Touch Backups - Add a label to a PVC, get automatic Kopia backups to NFS with disaster recovery
- Gateway API - Modern ingress via Cilium Gateway API (not legacy Ingress)
- GPU Support - Full NVIDIA GPU support via Talos system extensions and GPU Operator
- Zero SSH - All node management via Omni UI or Talos API
| Resource | Description |
|---|---|
| Omni | Talos cluster management platform |
| Proxmox Infra Provider | Proxmox infrastructure provider for Omni |
| Starter Repo | Full config & automation for Sidero Omni + Talos + Proxmox |
| Reference Guide | VirtualizationHowTo guide for Talos Omni on-prem setup |
graph TD;
subgraph "Bootstrap Process (Manual)"
User(["User"]) -- "kubectl apply -k" --> Kustomization["infrastructure/argocd/kustomization.yaml"];
Kustomization -- "Deploys" --> ArgoCD["ArgoCD<br/>(from Helm Chart)"];
Kustomization -- "Deploys" --> RootApp["Root Application<br/>(root.yaml)"];
end
subgraph "GitOps Self-Management Loop (Automatic)"
ArgoCD -- "1. Syncs" --> RootApp;
RootApp -- "2. Points to<br/>.../argocd/apps/" --> ArgoConfigDir["ArgoCD Config<br/>(Projects & AppSets)"];
ArgoCD -- "3. Deploys" --> AppSets["ApplicationSets"];
AppSets -- "4. Scans Repo for<br/>Application Directories" --> AppManifests["Application Manifests<br/>(e.g., my-apps/nginx/)"];
ArgoCD -- "5. Deploys" --> ClusterResources["Cluster Resources<br/>(Nginx, Prometheus, etc.)"];
end
style User fill:#a2d5c6,stroke:#333
style Kustomization fill:#5bc0de,stroke:#333
style RootApp fill:#f0ad4e,stroke:#333
style ArgoCD fill:#d9534f,stroke:#333
ArgoCD deploys applications in strict order to prevent dependency issues:
| Wave | Component | Purpose |
|---|---|---|
| 0 | Foundation | Cilium (CNI), ArgoCD, 1Password Connect, External Secrets, AppProjects |
| 1 | Storage + backup operator | Longhorn, VolumeSnapshot Controller, VolSync, pvc-plumber v2 operator |
| 2 | pvc-plumber webhook configs | FAIL-CLOSED PVC admission gate registers after the operator pod is healthy |
| 3 | CNPG Barman Plugin | Database backup plugin before DB clusters |
| 4 | Infrastructure AppSet + custom entrypoints | Cert-Manager, External-DNS, GPU Operators, Gateway, KEDA, Temporal Worker Controller |
| 4 | Database AppSet | CloudNativePG operators & instances (selfHeal: false for DR) |
| 5 | OTEL + Monitoring AppSet | OpenTelemetry Operator, Prometheus, Grafana, Loki |
| 6 | My-Apps AppSet | Discovers my-apps/*/* (user applications) |
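Wave placement comes from the standard ArgoCD sync-wave annotation on each Application (the same annotation queried in the verification step later). A minimal sketch, with placeholder repo URL, path, and project:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: longhorn                        # illustrative Wave 1 app
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # storage lands after the Wave 0 foundation
spec:
  project: default                      # placeholder project
  source:
    repoURL: https://github.com/example/cluster.git   # placeholder repo
    path: infrastructure/storage/longhorn             # placeholder path
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: longhorn-system
```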
- Omni deployed and accessible - See Omni Setup Guide
- Sidero Proxmox Provider configured - See proxmox provider config
- Cluster created in Omni - Talos cluster provisioned and healthy
- kubectl access - Download kubeconfig from Omni UI
- Local tools installed:
`kubectl`, `kustomize`, Cilium CLI (`cilium` or `cilium-cli`), 1Password CLI (`op`)
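A quick sanity check that the tools are on PATH (a sketch; `omnictl` is included because later steps use it):

```bash
# Report any missing CLI. The Cilium CLI may be packaged as either
# `cilium` or `cilium-cli` depending on the distro (see the
# Arch/CachyOS note in the Cilium step below).
for bin in kubectl kustomize op omnictl; do
  command -v "$bin" >/dev/null || echo "missing: $bin"
done
command -v cilium >/dev/null || command -v cilium-cli >/dev/null \
  || echo "missing: cilium CLI"
```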
Once your cluster is provisioned via Omni, follow these steps to install the GitOps stack.
You need kubectl access before anything else. The default OIDC kubeconfig expires and requires a browser — use the Omni service account for a stable bearer token instead.
Prerequisite: You must have the `OMNI_SERVICE_ACCOUNT_KEY` stored in 1Password (item: `talos-prod-sa`). See Cluster Access for how to create a service account if you don't have one yet.
# Sign in to 1Password
eval $(op signin)
# Set Omni endpoint
export OMNI_ENDPOINT=https://omni.vanillax.me:443
# Pull the service account key from 1Password
export OMNI_SERVICE_ACCOUNT_KEY="$(op read 'op://homelab-prod/talos-prod-sa/OMNI_SERVICE_ACCOUNT_KEY')"
# Generate bearer-token kubeconfig (not OIDC)
omnictl kubeconfig --cluster talos-prod-cluster --service-account --user talos-prod-sa --force
# Verify access
kubectl get nodes

Fish shell
set -x OMNI_ENDPOINT https://omni.vanillax.me:443
set -x OMNI_SERVICE_ACCOUNT_KEY (op read 'op://homelab-prod/talos-prod-sa/OMNI_SERVICE_ACCOUNT_KEY')
omnictl kubeconfig --cluster talos-prod-cluster --service-account --user talos-prod-sa --force
kubectl get nodes

Omni provisions Talos clusters without a CNI. Install Cilium to get networking functional:
cilium-cli install \
--version 1.19.3 \
--set cluster.name=talos-prod-cluster \
--set ipam.mode=kubernetes \
--set kubeProxyReplacement=true \
--set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
--set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
--set cgroup.autoMount.enabled=false \
--set cgroup.hostRoot=/sys/fs/cgroup \
--set k8sServiceHost=localhost \
--set k8sServicePort=7445 \
--set hubble.enabled=false \
--set hubble.relay.enabled=false \
--set hubble.ui.enabled=false \
--set gatewayAPI.enabled=true \
--set gatewayAPI.enableAlpn=true \
--set gatewayAPI.enableAppProtocol=true

Important: version must match. The `cilium install` CLI version must match the Helm chart version in `infrastructure/networking/cilium/kustomization.yaml` (currently 1.19.3). Use `cilium install --version 1.19.3` to pin it. If the versions differ, ArgoCD upgrades Cilium at Wave 0 and regenerates some Hubble certs but not others, causing TLS handshake failures (`x509: certificate signed by unknown authority`) that block all sync waves.

Important: Hubble is disabled at bootstrap on purpose. The CLI install only provides basic CNI networking. ArgoCD enables Hubble at Wave 0 via the full `values.yaml` (which has `hubble.enabled: true`). This ensures ArgoCD is the sole owner of Hubble TLS certificates, with no cert mismatch between the CLI install and ArgoCD's Helm render. The `ignoreDifferences` in `cilium-app.yaml` then preserves those certs on subsequent syncs.

Important: cluster name must match. `cluster.name` must match `infrastructure/networking/cilium/values.yaml` for Hubble certificate SANs. If `cilium install` is run without `--set cluster.name=talos-prod-cluster`, certificates are generated for `default` or `kind-kind`, causing TLS failures.
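For reference, a minimal sketch of the kind of `ignoreDifferences` stanza involved; the secret name and namespace here are assumptions, so check `cilium-app.yaml` for the authoritative list:

```yaml
# Sketch only: tells ArgoCD to ignore drift in Hubble TLS secret data so
# syncs don't regenerate certificates. Secret name/namespace are assumed.
spec:
  ignoreDifferences:
    - group: ""
      kind: Secret
      name: hubble-server-certs
      namespace: kube-system
      jsonPointers:
        - /data
```

With Cilium bootstrapped, install the Gateway API CRDs (standard plus experimental channels):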
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml
kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/experimental-install.yaml

Verify Cilium:
cilium status
kubectl get pods -n kube-system -l k8s-app=cilium

On Arch/CachyOS, the package often installs the binary as `cilium-cli` rather than `cilium`. The bootstrap script accepts either name.
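For a deeper end-to-end check than `cilium status`, the CLI ships a connectivity suite (also referenced under Troubleshooting); it deploys test workloads, so expect it to take several minutes:

```bash
# Deploys client/server test pods into a dedicated test namespace and
# runs datapath, DNS, and policy checks against the live cluster.
cilium connectivity test
```

Next, create the secrets namespaces and bootstrap credentials for 1Password Connect: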
kubectl create namespace 1passwordconnect
kubectl create namespace external-secrets
eval $(op signin)
export OP_CREDENTIALS=$(op read op://homelab-prod/1passwordconnect/1password-credentials.json)
export OP_CONNECT_TOKEN=$(op read 'op://homelab-prod/1password-operator-token/credential')
kubectl create secret generic 1password-credentials \
--namespace 1passwordconnect \
--from-literal=1password-credentials.json="$OP_CREDENTIALS"
kubectl create secret generic 1password-operator-token \
--namespace 1passwordconnect \
--from-literal=token="$OP_CONNECT_TOKEN"
kubectl create secret generic 1passwordconnect \
--namespace external-secrets \
--from-literal=token="$OP_CONNECT_TOKEN"

Option A: Bootstrap Script (Recommended)
./scripts/bootstrap-argocd.sh

Option B: Manual Steps
kubectl apply -f infrastructure/controllers/argocd/ns.yaml
helm upgrade --install argocd argo-cd \
--repo https://argoproj.github.io/argo-helm \
--version 9.4.15 \
--namespace argocd \
--values infrastructure/controllers/argocd/values.yaml \
--wait \
--timeout 10m
kubectl wait --for condition=established --timeout=60s crd/applications.argoproj.io
kubectl wait --for=condition=Available deployment/argocd-server -n argocd --timeout=300s
kubectl apply -f infrastructure/controllers/argocd/http-route.yaml
kubectl apply -f infrastructure/controllers/argocd/root.yaml

# Check ArgoCD pods
kubectl get pods -n argocd
# Watch applications sync (all should reach 'Synced')
kubectl get applications -n argocd -w
# View sync wave order
kubectl get applications -n argocd -o custom-columns=NAME:.metadata.name,WAVE:.metadata.annotations.argocd\\.argoproj\\.io/sync-wave,STATUS:.status.sync.status

kubectl port-forward svc/argocd-server -n argocd 8080:443
# Open https://localhost:8080
# Admin password is pre-configured via bootstrap Helm values

ArgoCD takes over and manages everything from Git:
- Wave 0: Cilium, 1Password Connect, External Secrets deploy in parallel
- Wave 1: Longhorn, Snapshot Controller, VolSync, and the pvc-plumber v2 operator deploy after networking + secrets are ready
- Wave 2: pvc-plumber webhook configurations register (FAIL-CLOSED admission gate for backup-labeled PVCs in app namespaces)
- Wave 3: CNPG Barman Plugin deploys (DB backup plugin readiness before DB clusters)
- Wave 4: Infrastructure AppSet plus custom standalone entrypoints deploy cert-manager, GPU operators, gateway, KEDA, Temporal Worker Controller, etc.
- Wave 4: Database AppSet deploys CloudNativePG operators and instances
- Wave 5: OpenTelemetry Operator and Monitoring AppSet deploy Prometheus, Grafana, Loki
- Wave 6: My-Apps AppSet deploys user applications
New applications are discovered automatically by directory structure - add a directory with a `kustomization.yaml` and push to Git.
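For example, with the Wave 6 pattern `my-apps/*/*`, a new directory such as the hypothetical `my-apps/media/jellyfin/` below is picked up on the next AppSet reconcile:

```yaml
# my-apps/media/jellyfin/kustomization.yaml (placeholder app and path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: jellyfin        # hypothetical target namespace
resources:
  - deployment.yaml        # plus any other manifests in this directory
  - service.yaml
```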
The default omnictl kubeconfig uses OIDC exec auth which expires and requires a browser login. For long-lived access, create a service account with a bearer token instead.
IMPORTANT: Use the CLI, not the Omni UI. UI-generated PGP keys are incompatible with the CLI's gopenpgp library (EdDSA verification failure).
# 1. Create the service account (1 year max TTL)
omnictl serviceaccount create talos-prod-sa --use-user-role
# 2. Save the output — OMNI_ENDPOINT and OMNI_SERVICE_ACCOUNT_KEY
# Store both values in 1Password immediately. The key is shown ONCE.
# 3. Generate a bearer-token kubeconfig (NOT OIDC)
OMNI_ENDPOINT=https://omni.vanillax.me:443 \
OMNI_SERVICE_ACCOUNT_KEY="<key-from-step-2>" \
omnictl kubeconfig --cluster talos-prod-cluster --service-account --user talos-prod-sa --force
# 4. Verify
kubectl get nodes

Renewal (expires after 1 year):
omnictl serviceaccount destroy talos-prod-sa
omnictl serviceaccount create talos-prod-sa --use-user-role
# Regenerate kubeconfig with step 3 above, update key in 1Password

Gotchas:
- Always create via CLI: UI-generated keys fail with `gopenpgp: EdDSA verification failure`
- The `--service-account` flag is what gives you a bearer token. Without it you get OIDC exec (the thing that expires)
- If the key fails with signature errors, write it to a file and use `$(cat /tmp/key.txt)` instead of inline quoting (see the sketch after this list)
- Node management is done through the Omni web UI (upgrades, configuration, patches)
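A sketch of that file-based workaround (the path `/tmp/key.txt` comes from the gotcha above and is arbitrary):

```bash
# Paste the key into a file to sidestep shell quoting issues, then read
# it back when generating the kubeconfig.
cat > /tmp/key.txt <<'EOF'
<key-from-step-2>
EOF

OMNI_ENDPOINT=https://omni.vanillax.me:443 \
OMNI_SERVICE_ACCOUNT_KEY="$(cat /tmp/key.txt)" \
omnictl kubeconfig --cluster talos-prod-cluster --service-account --user talos-prod-sa --force

rm /tmp/key.txt  # don't leave the credential on disk
```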
All PVC backups use Kopia on NFS via VolSync, automated by the pvc-plumber v2 operator. Add a `backup: "hourly"` or `backup: "daily"` label to any PVC and backups happen automatically, with zero-touch disaster recovery (see the example PVC after the list below).
- Backend: Kopia filesystem repository on TrueNAS NFS (`192.168.10.133:/mnt/BigTank/k8s/volsync-kopia-nfs`)
- Encryption: Kopia password from 1Password (`rustfs` item)
- Restore: Automatic on PVC recreation. The operator's mutating webhook calls Kopia live, and on `decision=restore` injects `dataSourceRef` so VolSync populates the PVC before the app gets a Bound volume. The validating webhook is fail-closed: it denies admission if Kopia truth is unknown
- Database backups: CNPG uses Barman to RustFS S3; abandoned database backup prefixes are cleaned by the GitOps-managed `rustfs-lifecycle` controller
- Details: See docs/volsync-storage-recovery.md and docs/cnpg-disaster-recovery.md
- AI-guided database recovery: Copy/paste prompts are in LLM Recovery Prompt Templates
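As a concrete illustration, a minimal opted-in PVC; the name, namespace, and size are placeholders, and the `backup` label is the only part the operator keys on:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data           # placeholder
  namespace: my-app           # placeholder
  labels:
    backup: "hourly"          # or "daily"; picked up by pvc-plumber
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn  # the cluster's Longhorn distributed storage
  resources:
    requests:
      storage: 10Gi           # placeholder
```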
The cluster is running Talos 1.13 (migrated from 1.12 in April 2026). A few things changed at 1.13 that you'll hit if you spin up or rebuild a cluster — read this before touching the cluster template.
Talos 1.13 replaced the old install/upgrade flow with the
`LifecycleService` API. Earlier versions could auto-detect a system
disk during `maintenanceUpgrade`; 1.13 requires an explicit
`machine.install.disk` in the machine config.
Symptom if missing: fresh VMs boot, but control planes stay stuck in
`stage=7 (UPGRADING)` with `configuptodate=false` forever. Resource
versions cycle into the hundreds. The LoadBalancer never goes healthy,
Kubernetes never bootstraps. No error surfaces anywhere; it silently
fails inside `maintenanceUpgrade`.
This repo ships the fix as a cluster-level config patch in
`omni/cluster-template/cluster-template.yaml`:
- name: install-disk
  inline:
    machine:
      install:
        disk: /dev/sda # Proxmox virtio-scsi-single + scsi0 presents as /dev/sda

All machine classes (CP / worker / GPU) use the same bus layout, so the
patch goes at cluster scope, not per-machineset. If you add a class
with a different disk presentation (e.g., NVMe passthrough →
/dev/nvme0n1), override it per-machineset instead, as sketched below.
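A sketch of what that per-machineset override could look like, assuming the NVMe pool is declared as its own Workers document in the Omni cluster template (the pool name `nvme-workers` is hypothetical):

```yaml
# Separate template document for the hypothetical NVMe worker pool.
kind: Workers
name: nvme-workers              # hypothetical additional worker pool
patches:
  - name: install-disk-nvme    # overrides the cluster-level install-disk value
    inline:
      machine:
        install:
          disk: /dev/nvme0n1   # NVMe passthrough presents differently than virtio-scsi
```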
Talos 1.13 is the target point for migrating the GPU worker from the proprietary NVIDIA kernel modules to the NVIDIA open kernel modules. Talos continues to own the host driver and the container toolkit via system extensions; the GPU Operator stays scoped to device plugin, GFD, validator, and runtime-class management.
Plan: docs/superpowers/plans/2026-04-19-talos-1.13-oss-nvidia-migration.md
Key files touched by the migration:
- `omni/cluster-template/cluster-template.yaml`: swap the extension from `nonfree-kmod-nvidia-production` to the OSS equivalent.
- `infrastructure/controllers/nvidia-gpu-operator/kustomization.yaml`: align with the Talos 1.13 beta OSS guide, especially `hostPaths.driverInstallDir`.
- `infrastructure/controllers/nvidia-gpu-operator/cluster-policy.yaml`: keep the dormant reference aligned with OSS assumptions.
Because there's only one GPU worker, this is a maintenance-window
migration with explicit rollback — not a canary. llama-cpp is offline
for the duration.
Omni 1.7 is required to provision/upgrade Talos 1.13 clusters. When upgrading:
- Take an Omni etcd snapshot (`omni/omni/README.md` → Backup/Recovery).
- Upgrade the Omni container to 1.7.x and restart. Verify the UI loads and existing clusters still show healthy.
- Upgrade `omnictl` on your workstation to match the server version; mismatched versions fail with obscure gRPC errors.
- Regenerate the service-account kubeconfig if it's older than 30 days (token rotation often lags server upgrades).
After the RustFS wipe in April 2026, every CNPG database was re-bootstrapped
from scratch via initdb (v1 of each overlay). Any database DR
runbook older than 2026-04-18 references the old WAL chain and will not
work. Current procedure is in
docs/cnpg-disaster-recovery.md — that
doc was rewritten against the new clean-slate pattern, so treat it as
authoritative over anything in docs/research/storage/.
Compute
├── AMD Threadripper 2950X (16c/32t)
├── 128GB ECC DDR4 RAM
├── 2x NVIDIA RTX 3090 24GB
└── Google Coral TPU
Storage
├── 4TB ZFS RAID-Z2
├── NVMe OS Drive
└── Longhorn distributed storage for K8s
Network
├── 2.5Gb Networking
├── Firewalla Gold
└── Internal DNS Resolution
| Issue | Steps |
|---|---|
| ArgoCD not syncing | kubectl get applicationsets -n argocd / kubectl describe applicationset infrastructure -n argocd / Check for stale operations before reverting Git: kubectl get application argocd -n argocd -o yaml |
| Cilium issues | cilium status / kubectl logs -n kube-system -l k8s-app=cilium / cilium connectivity test |
| Storage issues | kubectl get pvc -A / kubectl get pods -n longhorn-system |
| Secrets not syncing | kubectl get externalsecret -A / kubectl get pods -n 1passwordconnect / kubectl describe clustersecretstore 1password |
| GPU issues | kubectl get nodes -l feature.node.kubernetes.io/pci-0300_10de.present=true / kubectl get pods -n gpu-operator |
| Backup issues | kubectl get replicationsource -A / kubectl get pods -n volsync-system -l app.kubernetes.io/name=pvc-plumber |
# Remove finalizers and delete all applications
kubectl get applications -n argocd -o name | xargs -I{} kubectl patch {} -n argocd --type json -p '[{"op": "remove","path": "/metadata/finalizers"}]'
kubectl delete applications --all -n argocd
./scripts/bootstrap-argocd.sh

- CLAUDE.md - Full development guide and patterns for this repository
- docs/volsync-storage-recovery.md - PVC backup/restore (architecture, admission flow, scenarios, troubleshooting)
- docs/argocd.md - ArgoCD GitOps patterns
- docs/argocd-entrypoints.md - Root entrypoints, waves, and AppSet/custom-entrypoint decisions
- docs/network-topology.md - Network architecture
- docs/network-policy.md - Cilium network policies
- omni/ - Omni deployment configs, machine classes, and cluster templates
- omni/omni/README.md - Omni instance setup guide
- omni/docs/ - Architecture, operations, prerequisites, troubleshooting
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License