Talos ArgoCD Proxmox Cluster

Production-grade GitOps Kubernetes cluster on Talos OS with self-managing ArgoCD, Cilium, and zero-touch PVC backup/restore

A GitOps-driven Kubernetes cluster using Talos OS (secure, immutable Linux for K8s), ArgoCD, and Cilium, running on Proxmox. Managed via Omni (Sidero's Talos management platform) with the Proxmox Infrastructure Provider for automated node provisioning.

Key Features

  • Self-Managing ArgoCD - ArgoCD manages its own installation, upgrades, and ApplicationSets from Git
  • Directory = Application - Apps discovered automatically by directory path, no manual Application manifests
  • Sync Wave Ordering - Strict deployment ordering prevents race conditions
  • Zero-Touch Backups - Add a label to a PVC, get automatic Kopia backups to NFS with disaster recovery
  • Gateway API - Modern ingress via Cilium Gateway API (not legacy Ingress)
  • GPU Support - Full NVIDIA GPU support via Talos system extensions and GPU Operator
  • Zero SSH - All node management via Omni UI or Talos API

Repositories & Resources

  • Omni: Talos cluster management platform
  • Proxmox Infra Provider: Proxmox infrastructure provider for Omni
  • Starter Repo: full config & automation for Sidero Omni + Talos + Proxmox
  • Reference Guide: VirtualizationHowTo guide for Talos Omni on-prem setup

Architecture

graph TD;
    subgraph "Bootstrap Process (Manual)"
        User(["User"]) -- "kubectl apply -k" --> Kustomization["infrastructure/argocd/kustomization.yaml"];
        Kustomization -- "Deploys" --> ArgoCD["ArgoCD<br/>(from Helm Chart)"];
        Kustomization -- "Deploys" --> RootApp["Root Application<br/>(root.yaml)"];
    end

    subgraph "GitOps Self-Management Loop (Automatic)"
        ArgoCD -- "1. Syncs" --> RootApp;
        RootApp -- "2. Points to<br/>.../argocd/apps/" --> ArgoConfigDir["ArgoCD Config<br/>(Projects & AppSets)"];
        ArgoCD -- "3. Deploys" --> AppSets["ApplicationSets"];
        AppSets -- "4. Scans Repo for<br/>Application Directories" --> AppManifests["Application Manifests<br/>(e.g., my-apps/nginx/)"];
        ArgoCD -- "5. Deploys" --> ClusterResources["Cluster Resources<br/>(Nginx, Prometheus, etc.)"];
    end

    style User fill:#a2d5c6,stroke:#333
    style Kustomization fill:#5bc0de,stroke:#333
    style RootApp fill:#f0ad4e,stroke:#333
    style ArgoCD fill:#d9534f,stroke:#333

Sync Wave Architecture

ArgoCD deploys applications in strict order to prevent dependency issues:

Wave | Component | Purpose
0 | Foundation | Cilium (CNI), ArgoCD, 1Password Connect, External Secrets, AppProjects
1 | Storage + backup operator | Longhorn, VolumeSnapshot Controller, VolSync, pvc-plumber v2 operator
2 | pvc-plumber webhook configs | FAIL-CLOSED PVC admission gate; registers after the operator pod is healthy
3 | CNPG Barman Plugin | Database backup plugin before DB clusters
4 | Infrastructure AppSet + custom entrypoints | Cert-Manager, External-DNS, GPU Operators, Gateway, KEDA, Temporal Worker Controller
4 | Database AppSet | CloudNativePG operators & instances (selfHeal: false for DR)
5 | OTEL + Monitoring AppSet | OpenTelemetry Operator, Prometheus, Grafana, Loki
6 | My-Apps AppSet | Discovers my-apps/*/* (user applications)
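
Wave ordering is driven by the standard ArgoCD sync-wave annotation, which is also what the custom-columns command in Step 5 reads. A minimal illustration showing only the relevant metadata (the application name is hypothetical):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager                      # hypothetical example name
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "4"     # lands in Wave 4, after waves 0-3 are healthy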

Prerequisites

  1. Omni deployed and accessible - See Omni Setup Guide
  2. Sidero Proxmox Provider configured - See proxmox provider config
  3. Cluster created in Omni - Talos cluster provisioned and healthy
  4. kubectl access - Download kubeconfig from Omni UI
  5. Local tools installed: kubectl, kustomize, omnictl, Cilium CLI (cilium or cilium-cli), 1Password CLI (op)
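
A quick sanity check that the tools are on PATH (a convenience snippet, not part of the repo's scripts):

for tool in kubectl kustomize omnictl op; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
command -v cilium >/dev/null || command -v cilium-cli >/dev/null || echo "missing: cilium CLI"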

Bootstrap Process

Once your cluster is provisioned via Omni, follow these steps to install the GitOps stack.

Step 0: Get Cluster Access (kubectl)

You need kubectl access before anything else. The default OIDC kubeconfig expires and requires a browser — use the Omni service account for a stable bearer token instead.

Prerequisite: You must have the OMNI_SERVICE_ACCOUNT_KEY stored in 1Password (item: talos-prod-sa). See Cluster Access for how to create a service account if you don't have one yet.

# Sign in to 1Password
eval $(op signin)

# Set Omni endpoint
export OMNI_ENDPOINT=https://omni.vanillax.me:443

# Pull the service account key from 1Password
export OMNI_SERVICE_ACCOUNT_KEY="$(op read 'op://homelab-prod/talos-prod-sa/OMNI_SERVICE_ACCOUNT_KEY')"

# Generate bearer-token kubeconfig (not OIDC)
omnictl kubeconfig --cluster talos-prod-cluster --service-account --user talos-prod-sa --force

# Verify access
kubectl get nodes
Fish shell equivalent:
set -x OMNI_ENDPOINT https://omni.vanillax.me:443
set -x OMNI_SERVICE_ACCOUNT_KEY (op read 'op://homelab-prod/talos-prod-sa/OMNI_SERVICE_ACCOUNT_KEY')
omnictl kubeconfig --cluster talos-prod-cluster --service-account --user talos-prod-sa --force
kubectl get nodes

Step 1: Install Cilium CNI

Omni provisions Talos clusters without a CNI. Install Cilium to get networking functional:

cilium-cli install \
    --version 1.19.3 \
    --set cluster.name=talos-prod-cluster \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=true \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup \
    --set k8sServiceHost=localhost \
    --set k8sServicePort=7445 \
    --set hubble.enabled=false \
    --set hubble.relay.enabled=false \
    --set hubble.ui.enabled=false \
    --set gatewayAPI.enabled=true \
    --set gatewayAPI.enableAlpn=true \
    --set gatewayAPI.enableAppProtocol=true

Important — version must match: The cilium install CLI version must match the Helm chart version in infrastructure/networking/cilium/kustomization.yaml (currently 1.19.3). Use cilium install --version 1.19.3 to pin it. If versions differ, ArgoCD upgrades Cilium at Wave 0 and regenerates some Hubble certs but not others, causing TLS handshake failures (x509: certificate signed by unknown authority) that block all sync waves.

Important — Hubble is disabled at bootstrap on purpose: The CLI install only provides basic CNI networking. ArgoCD enables Hubble at Wave 0 via the full values.yaml (which has hubble.enabled: true). This ensures ArgoCD is the sole owner of Hubble TLS certificates — no cert mismatch between CLI install and ArgoCD's Helm render. The ignoreDifferences in cilium-app.yaml then preserves those certs on subsequent syncs.

Important — cluster name must match: cluster.name must match infrastructure/networking/cilium/values.yaml for Hubble certificate SANs. If cilium install is run without --set cluster.name=talos-prod-cluster, certificates are generated for default or kind-kind, causing TLS failures.
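
A quick way to compare the CLI-installed version against the version pinned in Git before and after bootstrap (the grep pattern is a guess at the kustomization layout; adjust to the actual field):

# Version reported by the CLI / running agents
cilium version

# Chart version pinned in Git (exact field name may differ in the file)
grep -n 'version' infrastructure/networking/cilium/kustomization.yaml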

Step 2: Install Gateway API CRDs

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml
kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/experimental-install.yaml

Verify Cilium:

cilium status
kubectl get pods -n kube-system -l k8s-app=cilium

On Arch/CachyOS, the package often installs the binary as cilium-cli rather than cilium. The bootstrap script accepts either name.

Step 3: Pre-Seed 1Password Secrets

kubectl create namespace 1passwordconnect
kubectl create namespace external-secrets

eval $(op signin)

export OP_CREDENTIALS=$(op read op://homelab-prod/1passwordconnect/1password-credentials.json)
export OP_CONNECT_TOKEN=$(op read 'op://homelab-prod/1password-operator-token/credential')

kubectl create secret generic 1password-credentials \
  --namespace 1passwordconnect \
  --from-literal=1password-credentials.json="$OP_CREDENTIALS"

kubectl create secret generic 1password-operator-token \
  --namespace 1passwordconnect \
  --from-literal=token="$OP_CONNECT_TOKEN"

kubectl create secret generic 1passwordconnect \
  --namespace external-secrets \
  --from-literal=token="$OP_CONNECT_TOKEN"

Step 4: Bootstrap ArgoCD

Option A: Bootstrap Script (Recommended)

./scripts/bootstrap-argocd.sh

Option B: Manual Steps

kubectl apply -f infrastructure/controllers/argocd/ns.yaml

helm upgrade --install argocd argo-cd \
  --repo https://argoproj.github.io/argo-helm \
  --version 9.4.15 \
  --namespace argocd \
  --values infrastructure/controllers/argocd/values.yaml \
  --wait \
  --timeout 10m

kubectl wait --for condition=established --timeout=60s crd/applications.argoproj.io
kubectl wait --for=condition=Available deployment/argocd-server -n argocd --timeout=300s

kubectl apply -f infrastructure/controllers/argocd/http-route.yaml
kubectl apply -f infrastructure/controllers/argocd/root.yaml

Step 5: Verify

# Check ArgoCD pods
kubectl get pods -n argocd

# Watch applications sync (all should reach 'Synced')
kubectl get applications -n argocd -w

# View sync wave order
kubectl get applications -n argocd -o custom-columns=NAME:.metadata.name,WAVE:.metadata.annotations.argocd\\.argoproj\\.io/sync-wave,STATUS:.status.sync.status

Step 6: Access ArgoCD UI (Optional)

kubectl port-forward svc/argocd-server -n argocd 8080:443
# Open https://localhost:8080
# Admin password is pre-configured via bootstrap Helm values

What Happens After Bootstrap

ArgoCD takes over and manages everything from Git:

  1. Wave 0: Cilium, 1Password Connect, External Secrets deploy in parallel
  2. Wave 1: Longhorn, Snapshot Controller, VolSync, and the pvc-plumber v2 operator deploy after networking + secrets are ready
  3. Wave 2: pvc-plumber webhook configurations register (FAIL-CLOSED admission gate for backup-labeled PVCs in app namespaces)
  4. Wave 3: CNPG Barman Plugin deploys (DB backup plugin readiness before DB clusters)
  5. Wave 4: Infrastructure AppSet plus custom standalone entrypoints deploy cert-manager, GPU operators, gateway, KEDA, Temporal Worker Controller, etc.
  6. Wave 4: Database AppSet deploys CloudNativePG operators and instances
  7. Wave 5: OpenTelemetry Operator and Monitoring AppSet deploy Prometheus, Grafana, Loki
  8. Wave 6: My-Apps AppSet deploys user applications

New applications are discovered automatically by directory structure - add a directory with a kustomization.yaml and push to Git.
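
For example, a new user app could look like this (names are hypothetical; the path just has to match the my-apps/*/* pattern the AppSet scans):

my-apps/
└── web/
    └── nginx-demo/
        ├── kustomization.yaml
        ├── namespace.yaml
        └── deployment.yaml

# my-apps/web/nginx-demo/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: nginx-demo
resources:
  - namespace.yaml
  - deployment.yaml

Once pushed, the My-Apps ApplicationSet (Wave 6) generates the Application automatically; no manual Application manifest is needed.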

Cluster Access (Omni Service Account)

The default omnictl kubeconfig uses OIDC exec auth which expires and requires a browser login. For long-lived access, create a service account with a bearer token instead.

IMPORTANT: Use the CLI, not the Omni UI. UI-generated PGP keys are incompatible with the CLI's gopenpgp library (EdDSA verification failure).

# 1. Create the service account (1 year max TTL)
omnictl serviceaccount create talos-prod-sa --use-user-role

# 2. Save the output — OMNI_ENDPOINT and OMNI_SERVICE_ACCOUNT_KEY
#    Store both values in 1Password immediately. The key is shown ONCE.

# 3. Generate a bearer-token kubeconfig (NOT OIDC)
OMNI_ENDPOINT=https://omni.vanillax.me:443 \
OMNI_SERVICE_ACCOUNT_KEY="<key-from-step-2>" \
omnictl kubeconfig --cluster talos-prod-cluster --service-account --user talos-prod-sa --force

# 4. Verify
kubectl get nodes

Renewal (expires after 1 year):

omnictl serviceaccount destroy talos-prod-sa
omnictl serviceaccount create talos-prod-sa --use-user-role
# Regenerate kubeconfig with step 3 above, update key in 1Password

Gotchas:

  • Always create via CLI — UI-generated keys fail with gopenpgp: EdDSA verification failure
  • The --service-account flag is what gives you a bearer token. Without it you get OIDC exec (the thing that expires)
  • If the key fails with signature errors, write it to a file and use $(cat /tmp/key.txt) instead of inline quoting (see the sketch after this list)
  • Node management is done through Omni web UI (upgrades, configuration, patches)
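
A sketch of the file-based workaround (the path is illustrative):

# Paste the key, then press Ctrl-D to finish
cat > /tmp/key.txt

export OMNI_ENDPOINT=https://omni.vanillax.me:443
export OMNI_SERVICE_ACCOUNT_KEY="$(cat /tmp/key.txt)"
omnictl kubeconfig --cluster talos-prod-cluster --service-account --user talos-prod-sa --force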

Backup System

All PVC backups use Kopia on NFS via VolSync, automated by the pvc-plumber v2 operator. Add a backup: "hourly" or backup: "daily" label to any PVC and backups happen automatically, with zero-touch disaster recovery (a labeled-PVC example follows the list below).

  • Backend: Kopia filesystem repository on TrueNAS NFS (192.168.10.133:/mnt/BigTank/k8s/volsync-kopia-nfs)
  • Encryption: Kopia password from 1Password (rustfs item)
  • Restore: Automatic on PVC recreation — the operator's mutating webhook calls Kopia live, and on decision=restore injects dataSourceRef so VolSync populates the PVC before the app gets a Bound volume. Validating webhook is fail-closed: denies admission if Kopia truth is unknown
  • Database backups: CNPG uses Barman to RustFS S3; abandoned database backup prefixes are cleaned by the GitOps-managed rustfs-lifecycle controller
  • Details: See docs/volsync-storage-recovery.md and docs/cnpg-disaster-recovery.md
  • AI-guided database recovery: Copy/paste prompts are in LLM Recovery Prompt Templates
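
A minimal, hypothetical PVC showing the opt-in label (names and storage class are illustrative, not taken from the repo):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
  namespace: my-app
  labels:
    backup: "hourly"            # or "daily"; pvc-plumber wires up VolSync + Kopia from this label
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn    # assumed; use whatever class the app normally requests
  resources:
    requests:
      storage: 10Gi

If this PVC is later deleted and recreated, the mutating webhook checks Kopia and, when a snapshot exists, injects a dataSourceRef so VolSync restores the data before the app binds the volume.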

Cluster Upgrades & Talos 1.13 Notes

The cluster is running Talos 1.13 (migrated from 1.12 in April 2026). A few things changed at 1.13 that you'll hit if you spin up or rebuild a cluster — read this before touching the cluster template.

machine.install.disk is now mandatory

Talos 1.13 replaced the old install/upgrade flow with the LifecycleService API. Earlier versions could auto-detect a system disk during maintenanceUpgrade; 1.13 requires an explicit machine.install.disk in the machine config.

Symptom if missing: fresh VMs boot, but control planes stay stuck in stage=7 (UPGRADING) with configuptodate=false forever. Resource versions cycle into the hundreds. The LoadBalancer never goes healthy, Kubernetes never bootstraps. No error surfaces anywhere — it silently fails inside maintenanceUpgrade.

This repo ships the fix as a cluster-level config patch in omni/cluster-template/cluster-template.yaml:

- name: install-disk
  inline:
    machine:
      install:
        disk: /dev/sda   # Proxmox virtio-scsi-single + scsi0 presents as /dev/sda

All machine classes (CP / worker / GPU) use the same bus layout, so the patch goes at cluster scope — not per-machineset. If you add a class with a different disk presentation (e.g., NVMe passthrough → /dev/nvme0n1), override it per-machineset instead.
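
A rough sketch of what such a per-machineset override could look like (the machine-set and class names are hypothetical, and the exact cluster-template schema should be checked against the Omni docs):

kind: Workers
name: nvme-workers              # hypothetical additional worker machine set
machineClass:
  name: proxmox-nvme            # hypothetical machine class
  size: 1
patches:
  - name: install-disk-nvme
    inline:
      machine:
        install:
          disk: /dev/nvme0n1    # NVMe passthrough presents differently than virtio-scsi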

NVIDIA driver migration (in progress)

Talos 1.13 is the target point for migrating the GPU worker from the proprietary NVIDIA kernel modules to the NVIDIA open kernel modules. Talos continues to own the host driver and the container toolkit via system extensions; the GPU Operator stays scoped to device plugin, GFD, validator, and runtime-class management.

Plan: docs/superpowers/plans/2026-04-19-talos-1.13-oss-nvidia-migration.md

Key files touched by the migration:

  • omni/cluster-template/cluster-template.yaml — swap extension from nonfree-kmod-nvidia-production to the OSS equivalent (sketched after this list).
  • infrastructure/controllers/nvidia-gpu-operator/kustomization.yaml — align with Talos 1.13 beta OSS guide, especially hostPaths.driverInstallDir.
  • infrastructure/controllers/nvidia-gpu-operator/cluster-policy.yaml — keep dormant reference aligned with OSS assumptions.
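
A hedged sketch of that extension swap (the machine-set name is hypothetical, and the extension names follow the Talos factory naming convention; confirm both against the migration plan):

kind: Workers
name: gpu-workers                                          # hypothetical GPU worker machine set
systemExtensions:
  - siderolabs/nvidia-open-gpu-kernel-modules-production   # replaces nonfree-kmod-nvidia-production
  - siderolabs/nvidia-container-toolkit-production         # toolkit remains Talos-owned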

Because there's only one GPU worker, this is a maintenance-window migration with explicit rollback — not a canary. llama-cpp is offline for the duration.

Upgrading Omni / omnictl to the 1.13 toolchain

Omni 1.7 is required to provision/upgrade Talos 1.13 clusters. When upgrading:

  1. Take an Omni etcd snapshot (omni/omni/README.md → Backup/Recovery).
  2. Upgrade the Omni container to 1.7.x, restart. Verify the UI loads and existing clusters still show healthy.
  3. Upgrade omnictl on your workstation to match the server version — mismatched versions fail with obscure gRPC errors.
  4. Regenerate the service-account kubeconfig if it's older than 30 days (token rotation often lags server upgrades).

CNPG clean-slate baseline (April 2026)

After the RustFS wipe in April 2026, every CNPG database was re-bootstrapped from scratch via initdb (v1 of each overlay). Any database DR runbook older than 2026-04-18 references the old WAL chain and will not work. Current procedure is in docs/cnpg-disaster-recovery.md — that doc was rewritten against the new clean-slate pattern, so treat it as authoritative over anything in docs/research/storage/.

Hardware

Compute
├── AMD Threadripper 2950X (16c/32t)
├── 128GB ECC DDR4 RAM
├── 2x NVIDIA RTX 3090 24GB
└── Google Coral TPU

Storage
├── 4TB ZFS RAID-Z2
├── NVMe OS Drive
└── Longhorn distributed storage for K8s

Network
├── 2.5Gb Networking
├── Firewalla Gold
└── Internal DNS Resolution

Troubleshooting

Issue | Steps
ArgoCD not syncing | kubectl get applicationsets -n argocd / kubectl describe applicationset infrastructure -n argocd / Check for stale operations before reverting Git: kubectl get application argocd -n argocd -o yaml
Cilium issues | cilium status / kubectl logs -n kube-system -l k8s-app=cilium / cilium connectivity test
Storage issues | kubectl get pvc -A / kubectl get pods -n longhorn-system
Secrets not syncing | kubectl get externalsecret -A / kubectl get pods -n 1passwordconnect / kubectl describe clustersecretstore 1password
GPU issues | kubectl get nodes -l feature.node.kubernetes.io/pci-0300_10de.present=true / kubectl get pods -n gpu-operator
Backup issues | kubectl get replicationsource -A / kubectl get pods -n volsync-system -l app.kubernetes.io/name=pvc-plumber
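
If the "ArgoCD not syncing" checks above show an operation stuck in progress, a commonly used workaround is to drop the stale operation field from the Application before changing anything in Git (verify against current ArgoCD docs; the application name mirrors the table row):

kubectl patch application argocd -n argocd --type json -p '[{"op": "remove", "path": "/operation"}]'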

Emergency Reset

# Remove finalizers and delete all applications
kubectl get applications -n argocd -o name | xargs -I{} kubectl patch {} -n argocd --type json -p '[{"op": "remove","path": "/metadata/finalizers"}]'
kubectl delete applications --all -n argocd
./scripts/bootstrap-argocd.sh

Documentation

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

License

MIT License
