mitchross
diff --git a/README.md b/README.md
Lines changed: 206 additions & 330 deletions
diff --git a/docs/domains/ai-gpu/3090-llm-optimization.md b/docs/domains/ai-gpu/3090-llm-optimization.md
Lines changed: 7 additions & 7 deletions
diff --git a/docs/domains/ai-gpu/model-catalog.md b/docs/domains/ai-gpu/model-catalog.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/domains/argocd/argocd.md b/docs/domains/argocd/argocd.md
Lines changed: 5 additions & 5 deletions
diff --git a/docs/domains/argocd/entrypoints.md b/docs/domains/argocd/entrypoints.md
Lines changed: 5 additions & 5 deletions
diff --git a/docs/domains/cnpg/disaster-recovery.md b/docs/domains/cnpg/disaster-recovery.md
Lines changed: 1 addition & 5 deletions
diff --git a/docs/domains/cnpg/explained.md b/docs/domains/cnpg/explained.md
Lines changed: 8 additions & 6 deletions
diff --git a/docs/domains/rustfs/credential-runbook.md b/docs/domains/rustfs/credential-runbook.md
Lines changed: 6 additions & 15 deletions
diff --git a/docs/index.md b/docs/index.md
Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@
   GPU1 = ComfyUI. **Pool both on-demand** (layer-split) only for the occasional
   256K research / full-context coding burst. Keep both 3090s in the box; do
   **not** redistribute a card to the gaming PC.
-- **Why not single-card + CPU offload:** the node is a **Xeon E5 v4 (Broadwell)
-  DL360 Gen9, DDR4-2400, no AVX-512, PCIe 3.0, under Proxmox**. MoE expert
+- **Why not single-card + CPU offload:** the node is an **AMD Threadripper 2950X
+  (16c/32t, Zen+), 128GB ECC DDR4, no AVX-512, PCIe 3.0, under Proxmox**. MoE expert
   offload is memory-bandwidth-bound and would land ~8–12 TPS on this CPU. The
   second 3090 is what keeps the long-context path on-GPU and fast — on this box
   48GB is close to essential, not a nice-to-have.
@@ -57,7 +57,7 @@ three ways to buy more usable context, in priority order **on this CPU**:
 
 1. **More VRAM** — pool the second 3090 (cleanest here).
 2. **Smaller KV** — quantize KV (q8→q4 ≈ half; TurboQuant `turbo3` ≈ ⅕, see below).
-3. **CPU expert offload** — *last resort* on Broadwell; avoid.
+3. **CPU expert offload** — *last resort* on the Threadripper 2950X; avoid.
 
 This ordering is **inverted** from a modern DDR5 / AVX-512 box, where single-card
 + offload is fine.
@@ -79,7 +79,7 @@ both, on-demand ── pool layer-split for 256K research / full-context coding
 - **Rejected: redistribute** (single 3090 + AMD 6800 in cluster + 3090 →
   gaming PC). The 6800 is RDNA2/ROCm and much of the stack is CUDA-locked
   (faster-whisper = CTranslate2, many ComfyUI nodes), and a single cluster 3090
-  would be stuck on the slow Broadwell offload path. Only revisit if the LLM
+  would be stuck on the slow Threadripper-2950X offload path. Only revisit if the LLM
   workload moves off this node.
 
 ## Daily driver + single-vs-dual (decision)
@@ -338,9 +338,9 @@ offload, no spill**:
    used by `my-apps/ai/llmfit/` dual-GPU jobs.)
 3. On that 2-GPU deployment: drop `GGML_CUDA_ENABLE_UNIFIED_MEMORY`, keep KV
    symmetric q8/q8, optionally raise `-ub 1024`.
-4. **Proxmox/DL360 checks:** NUMA-pin the VM to one socket with both 3090s on
-   that socket's PCIe lanes; confirm both cards are x16 Gen3; NVLink optional
-   (helps spec-decode, not layer-split much).
+4. **Proxmox/X399 checks:** confirm both 3090s are passed through on PCIe Gen3
+   x16 lanes; pin the VM's vCPUs to physical cores and enable hugepages; NVLink
+   optional (helps spec-decode, not layer-split much).
 
 ## Model download checklist (NFS: `192.168.10.133:/mnt/ai-pool/llama-cpp`)
 
 
@@ -78,7 +78,7 @@ coding, tool calls, and vision.
 - **For:** the occasional very long research dig or full-repo context, when both
   3090s are pooled (layer-split, 48GB → 256K KV resident).
 - **On a single 3090 it's a trap:** 256K KV (~14GB) spills ~7GB to host RAM →
-  CPU-driven attention on the Broadwell CPU = slow, *and* it's a distinct
+  CPU-driven attention on the Threadripper 2950X CPU = slow, *and* it's a distinct
   instance that thrashes against the daily driver. Perplexica's embedding filter
   keeps prompts <64K anyway, so it does **not** default here. Select manually
   only when you've pooled both cards (scale ComfyUI→0). See optimization doc.
 
@@ -19,7 +19,7 @@ Do not install Prometheus Operator CRDs early just to satisfy bootstrap apps. Se
 ArgoCD starts from the manually seeded root application:
 
 ```text
-infrastructure/controllers/argocd/bootstrap/root-application.yaml
+infrastructure/controllers/argocd/root.yaml
 ```
 
 The root application renders three layers:
@@ -35,14 +35,14 @@ See [ArgoCD entrypoints](entrypoints.md) for the concrete files.
 | Wave | Applications |
 |---|---|
 | `0` | ArgoCD projects/bootstrap, Cilium, 1Password Connect, External Secrets |
-| `1` | cert-manager, Longhorn, snapshot-controller, VolSync |
-| `2` | pvc-plumber core, VolSync backup-cluster wiring |
-| `3` | CNPG Barman plugin |
+| `1` | cert-manager, Longhorn, snapshot-controller |
+| `2` | kopiur operator (CRDs + controller + webhook; volume populator) |
+| `3` | CNPG Barman plugin, kopiur config (ClusterRepository `cluster-kopia` + credential fanout + VolumeSnapshotClass) |
 | `4` | KEDA core, Temporal worker, infrastructure and database AppSets |
 | `5` | OpenTelemetry operator core, monitoring AppSet including `kube-prometheus-stack` |
 | `6` | KEDA observability, OpenTelemetry operator observability, workload AppSet |
 
-cert-manager is intentionally Wave `1`: the CNPG Barman plugin depends on it. pvc-plumber Wave `2` is core-only. KEDA and OpenTelemetry ServiceMonitor resources render from Wave `6` observability overlays.
+cert-manager is intentionally Wave `1`: the CNPG Barman plugin depends on it. The kopiur operator is Wave `2` (CRDs + controller + webhook), with its repo/credential config at Wave `3`. KEDA and OpenTelemetry ServiceMonitor resources render from Wave `6` observability overlays.
 
 CNPG `enablePodMonitor: true` remains an accepted runtime soft-coupling. It can log transient errors before monitoring exists, but it is not an ArgoCD dry-run blocker.
 
 
@@ -25,14 +25,15 @@ This is the review map for everything directly rendered by the root Application
 | `core-dependencies/cert-manager-app.yaml` | Application | 1 | Certificate controller required before CNPG Barman plugin | No, required before cert-dependent apps |
 | `core-dependencies/longhorn-app.yaml` | Application | 1 | Storage foundation before PVC consumers | No, required before restore/app PVC flows |
 | `core-dependencies/snapshot-controller-app.yaml` | Application | 1 | VolumeSnapshot CRDs and controller | No, required by backup/restore flows |
-| `core-dependencies/volsync-app.yaml` | Application | 1 | Backup/restore engine | No, required before PVC Plumber and restore policies |
-| `core-dependencies/pvc-plumber-app.yaml` | Application | 2 | pvc-plumber v4.0.1 bootstrap-core: permissive RS/RD controller with no monitoring dependency | No, required before managed app PVCs |
-| `core-dependencies/volsync-backup-cluster-app.yaml` | Application | 2 | Shared Kopia credentials and VolSync backup-cluster wiring | No, required before managed app PVCs |
+| `core-dependencies/kopiur-operator-app.yaml` | Application | 2 | kopiur operator (Kopia-native backup): CRDs + controller + webhook + volume populator; no monitoring dependency | No, required before managed app PVCs |
+| `core-dependencies/kopiur-config-app.yaml` | Application | 3 | kopiur repo config: `ClusterRepository cluster-kopia` + credential fanout + `VolumeSnapshotClass longhorn-snapclass` | No, required before managed app PVCs |
 | `custom-entrypoints/cnpg-barman-plugin-app.yaml` | Application | 3 | CNPG clusters reference the plugin in wave 4 | Not now, dependency must precede database AppSet |
 | `custom-entrypoints/keda-app.yaml` | Application | 4 | Standalone after prior AppSet generator/render-cache loop | Maybe, after proving AppSet render stability |
+| `custom-entrypoints/vertical-pod-autoscaler-app.yaml` | Application | 4 | VPA controller (recommender/updater/admission) | Maybe, after proving AppSet render stability |
 | `custom-entrypoints/temporal-worker-controller-app.yaml` | Application | 4 | Same AppSet render-cache history as KEDA | Maybe, after proving AppSet render stability |
 | `custom-entrypoints/opentelemetry-operator-app.yaml` | Application | 5 | Core operator after cert-manager; ServiceMonitor removed from core | Maybe, if cert-manager dependency is otherwise enforced |
 | `custom-entrypoints/keda-observability-app.yaml` | Application | 6 | Optional KEDA ServiceMonitor resources after monitoring CRDs exist | No, keeps observability out of core |
+| `custom-entrypoints/vpa-recommendations-app.yaml` | Application | 6 | Optional VPA recommendation CRs after monitoring CRDs exist | No, keeps observability out of core |
 | `custom-entrypoints/opentelemetry-operator-observability-app.yaml` | Application | 6 | Optional OpenTelemetry ServiceMonitor after monitoring CRDs exist | No, keeps observability out of core |
 | `appsets/infrastructure-appset.yaml` | ApplicationSet | 4 | Explicit list of core infrastructure directories | N/A |
 | `appsets/database-appset.yaml` | ApplicationSet | 4 | Discovers `infrastructure/database/*/*`; uses `selfHeal: false` for DR | N/A |
@@ -45,8 +46,7 @@ This is the review map for everything directly rendered by the root Application
 PrometheusRule, Probe, AlertmanagerConfig) in its core kustomization.** Those CRDs don't exist until
 kube-prometheus-stack (Wave 5); an earlier-wave app shipping them fails dry-run and deadlocks the
 App-of-Apps wave gate (proven by the 2026-06-01 nuke drill). Put observability CRs in a **separate
-optional app that syncs after Wave 5** (e.g. `keda-observability` at Wave 6; pvc-plumber's were
-removed from its Wave-2 core). We deliberately do **not** install Prometheus Operator CRDs early —
+optional app that syncs after Wave 5** (e.g. `keda-observability` at Wave 6, split out of KEDA's Wave-4 core). We deliberately do **not** install Prometheus Operator CRDs early —
 `SkipDryRunOnMissingResource` is only an escape hatch / observability-app option, never a core fix.
 `cert-manager` is at **Wave 1** (not 4) so cert-dependent apps (cnpg-barman-plugin, Wave 3) can start.
 Full detail: [cluster DR nuke restore runbook](../../disaster-recovery.md).
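
The wave ordering described in the two ArgoCD hunks above can be checked mechanically. The following is a minimal Python sketch, not part of the repository: the wave numbers are copied from the tables above, while the app keys, the `MUST_PRECEDE` list, and the `wave_violations` helper are hypothetical names chosen for illustration.

```python
# Sync waves copied from the wave tables above. The dictionary keys are
# illustrative labels, not necessarily the real Application names.
WAVES = {
    "cert-manager": 1,
    "longhorn": 1,
    "snapshot-controller": 1,
    "kopiur-operator": 2,
    "cnpg-barman-plugin": 3,
    "kopiur-config": 3,
    "keda": 4,
    "database-appset": 4,
    "kube-prometheus-stack": 5,
    "keda-observability": 6,
    "opentelemetry-operator-observability": 6,
}

# (dependency, dependent) pairs named in the text: the dependency must sync
# in a strictly earlier wave than the app that needs it.
MUST_PRECEDE = [
    ("cert-manager", "cnpg-barman-plugin"),
    ("kopiur-operator", "kopiur-config"),
    ("cnpg-barman-plugin", "database-appset"),
    ("kube-prometheus-stack", "keda-observability"),
    ("kube-prometheus-stack", "opentelemetry-operator-observability"),
]


def wave_violations(waves, must_precede):
    """Return every (dependency, dependent) pair whose waves are out of order."""
    return [
        (dep, app)
        for dep, app in must_precede
        if waves[dep] >= waves[app]
    ]


if __name__ == "__main__":
    # An empty list means every listed dependency syncs before its dependent.
    print(wave_violations(WAVES, MUST_PRECEDE))
```

Setting `cert-manager` to Wave 4 in `WAVES` makes the check report the `cnpg-barman-plugin` pair, which is the situation the Wave 1 placement is meant to avoid.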
 
@@ -22,11 +22,7 @@ CNPG databases live in two layers:
 | **Postgres data** | inside the CNPG `Cluster` CR | Barman Cloud → RustFS S3 | `spec.bootstrap.recovery` + `externalClusters` |
 | **App state** | outside (ExternalSecret, ScheduledBackup) | committed to Git as declarative state | ArgoCD sync |
 
-**Barman ≠ PVC backups.** The PVC/Kopia system (pvc-plumber v4 operator + VolSync,
-writing to RustFS S3) handles *file-level* PVC backups. (Kyverno was removed from
-this path in 2026-05 and is no longer involved — see `docs/storage-architecture.md`.)
-CNPG has its own SQL-aware backup path: Barman Cloud → RustFS S3. The two never
-touch each other. See
+**Barman ≠ PVC backups.** The PVC/Kopia system (the **kopiur** operator, writing to RustFS S3) handles *file-level* PVC backups. CNPG has its own SQL-aware backup path: Barman Cloud → RustFS S3. The two never touch each other. See
 [docs/disaster-recovery.md](../../disaster-recovery.md) for why both exist.
 
 ### How recovery works (the 30-second version)
 
@@ -19,7 +19,7 @@ Two layers of database state, two backup paths, **never confuse them**:
 | **Postgres data** (the actual database content — tables, rows, WAL) | Barman Cloud → S3 | Cluster CR + PVCs |
 | **App-side stuff** (ExternalSecret, ScheduledBackup, Cluster YAML) | Git | The repo |
 
-When you do a disaster recovery, you're using **Barman** to restore Postgres data into a fresh PVC. **pvc-plumber/VolSync has nothing to do with this.** The two backup systems run side-by-side and never touch each other. PVC-level kopia backups would corrupt a running Postgres mid-snapshot — that's why CNPG PVCs explicitly do NOT carry the `backup` label.
+When you do a disaster recovery, you're using **Barman** to restore Postgres data into a fresh PVC. **kopiur (the PVC-backup system) has nothing to do with this.** The two backup systems run side-by-side and never touch each other. PVC-level kopia backups would corrupt a running Postgres mid-snapshot — that's why CNPG PVCs explicitly do NOT carry the `backup` label.
 
 ---
 
@@ -255,19 +255,21 @@ operator orchestration was ~2 minutes.
 
 ---
 
-## Why this is separate from pvc-plumber
+## Why this is separate from the kopiur PVC backups
 
-A reasonable question: pvc-plumber backs up PVCs to kopia. CNPG database files live on PVCs. Why not just label the CNPG PVCs and let pvc-plumber back them up?
+A reasonable question: kopiur backs up PVCs to kopia. CNPG database files live on PVCs. Why not just back the CNPG PVCs up with kopiur too?
 
-Three reasons, in increasing severity:
+kopiur only backs up PVCs that carry an explicit per-PVC `SnapshotPolicy`/`Restore` stub (via the `kopiur-backup` Kustomize component) in a namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`. CNPG PVCs deliberately get none of that, so kopiur never touches them. There is no admission webhook injecting `dataSourceRef` anymore — coverage is opt-in by the per-PVC stub, not enforced at admission.
+
+Three reasons it stays that way, in increasing severity:
 
 1. **Snapshot consistency.** A kopia snapshot of a running Postgres data directory is *not* a consistent backup. The on-disk state at any moment includes half-written files, WAL not yet flushed, etc. Restoring from a CSI snapshot of running Postgres almost works, but recovery is unsafe and Postgres might not even start. Barman uses `pg_basebackup` which IS Postgres-aware and produces a consistent backup.
 
 2. **WAL.** Postgres recovery requires both a base backup AND the WAL records that follow it. Barman archives WAL continuously. PVC snapshots don't archive WAL — they just snapshot whatever WAL was on disk at snapshot time, which could be hours old.
 
 3. **PITR.** With Barman + WAL archiving, you can restore to any point in time within retention. With PVC snapshots, you can only restore to whenever the last snapshot was taken (default 1h or 1d).
 
-So: CNPG PVCs are explicitly **NOT** labeled `backup: hourly|daily`. pvc-plumber's mutating webhook would refuse to inject `dataSourceRef` on those anyway (operator's `SYSTEM_NAMESPACES` excludes `cloudnative-pg`), but as defense-in-depth, the manifest convention is to omit the label.
+So: CNPG PVCs explicitly carry **no** kopiur backup stub and the `cloudnative-pg` namespace is **not** labeled `kopiur.home-operations.com/repo: cluster-kopia`, so kopiur never enrolls them.
 
 ---
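
The opt-in rule described above (a per-PVC stub plus a labeled namespace) can be modeled as a simple predicate. This is a sketch of the documented convention only, not kopiur's actual implementation: the label key and repository name come from the text, while `kopiur_enrolls` and its parameters are hypothetical.

```python
# Label key and value quoted in the text above.
REPO_LABEL = "kopiur.home-operations.com/repo"
REPO_NAME = "cluster-kopia"


def kopiur_enrolls(namespace_labels, pvc_has_stub):
    """Model of the documented rule: a PVC is backed up only when it carries a
    SnapshotPolicy/Restore stub AND its namespace has the repo label."""
    return bool(pvc_has_stub) and namespace_labels.get(REPO_LABEL) == REPO_NAME


if __name__ == "__main__":
    # A CNPG PVC: no stub, and the cloudnative-pg namespace is unlabeled.
    print(kopiur_enrolls({}, pvc_has_stub=False))
    # An opted-in app PVC in a labeled namespace.
    print(kopiur_enrolls({REPO_LABEL: REPO_NAME}, pvc_has_stub=True))
```

Under this model, both conditions must hold, so a CNPG PVC stays out of kopiur even if only one of the two safeguards were removed by mistake.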
 
@@ -302,6 +304,6 @@ Yes — CNPG supports `bootstrap.recovery` with a different `metadata.name`. The
 ## Where to go deeper
 
 - [docs/domains/cnpg/disaster-recovery.md](disaster-recovery.md) — the technical runbook (this doc's reference)
-- [docs/plans/cnpg-plugin-migration.md](../../disaster-recovery.md) — why this cluster uses the Barman Cloud Plugin instead of the deprecated `spec.backup.barmanObjectStore`
+- [disaster-recovery.md](disaster-recovery.md) — why this cluster uses the Barman Cloud Plugin instead of the deprecated `spec.backup.barmanObjectStore`
 - [docs/disaster-recovery.md](../../disaster-recovery.md) — the OTHER backup system (PVC-level, kopia, NEVER use on CNPG PVCs)
 - [docs/pvc-plumber-explained.md](https://github.com/mitchross/pvc-plumber#readme) — pvc-plumber walkthrough for comparison
@@ -110,12 +110,9 @@ These GitOps-managed ExternalSecrets read `rustfs-workload-access-key` and `rust
 | `monitoring/tempo-s3-credentials` | `tempo-s3-credentials` |
 | `posthog/posthog-secrets` | `posthog-secrets` |
 | `rustfs-lifecycle/rustfs-admin-credentials` | `rustfs-admin-credentials` |
-| Each chart-rendered `<ns>/volsync-<pvc>` per backed-up PVC | `volsync-<pvc>` |
+| `kopiur/kopiur-rustfs` (ClusterExternalSecret → every namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`) | `kopiur-rustfs` |
 
-The `volsync-system/pvc-plumber-kopia` ExternalSecret was removed
-2026-05-21 along with the pvc-plumber operator decommission. Per-PVC
-ExternalSecrets are now rendered by the `volsync-backup` Helm chart
-at `infrastructure/storage/volsync-backup/` rather than the operator.
+Per-PVC backup credentials are now delivered by the single `kopiur-rustfs` ClusterExternalSecret (`infrastructure/controllers/kopiur/externalsecret.yaml`), which fans the repo credentials into every namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`. The retired `volsync-backup` per-PVC ExternalSecrets are gone.
 
 Force ESO refresh after changing 1Password:
 
@@ -128,13 +125,9 @@ kubectl annotate externalsecret -n monitoring tempo-s3-credentials force-sync="$
 kubectl annotate externalsecret -n posthog posthog-secrets force-sync="$TS" --overwrite
 kubectl annotate externalsecret -n rustfs-lifecycle rustfs-admin-credentials force-sync="$TS" --overwrite
 
-# Also force every chart-rendered per-PVC ES:
-kubectl get externalsecret -A -l app.kubernetes.io/managed-by=volsync-backup-chart \
-  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | \
-  while read ns name; do
-    [ -z "$ns" ] && continue
-    kubectl annotate externalsecret -n "$ns" "$name" force-sync="$TS" --overwrite
-  done
+# Also refresh the kopiur repo-credential fanout (one ClusterExternalSecret
+# feeds the per-namespace kopiur-rustfs Secret into every backed-up namespace):
+kubectl annotate clusterexternalsecret kopiur-rustfs force-sync="$TS" --overwrite
 ```
 
 Restart consumers that load S3 credentials from environment variables:
@@ -153,8 +146,6 @@ kubectl rollout restart deploy/db deploy/feature-flags deploy/plugins deploy/web
                        -n posthog
 ```
 
-VolSync mover Jobs read the per-PVC Secret at Job creation time, so the
-NEXT scheduled (or manually triggered) backup run picks up the new
-credentials automatically — no restart of VolSync itself needed.
+kopiur mover Jobs read the namespace `kopiur-rustfs` Secret at Job creation time, so the next scheduled (or manually triggered) Snapshot picks up rotated credentials automatically — no operator restart needed.
 RustFS lifecycle Job is spawned by its CronJob — next scheduled run
 uses the refreshed Secret.
@@ -20,12 +20,12 @@ cluster can be destroyed and rebuilt **unattended** — restores included.
 - **OS**: Talos Linux on Proxmox VMs, provisioned via Omni / Sidero
 - **CNI**: Cilium with Gateway API + LoadBalancer
 - **GitOps**: ArgoCD (self-managing) + ApplicationSets for auto-discovery
-- **Storage**: Longhorn (V1 engine, 2× replicas)
+- **Storage**: Longhorn (V1 engine, 1 replica — single-node)
 - **Backup**: [kopiur](https://github.com/home-operations/kopiur) (Kopia-native) → RustFS S3, per-PVC `SnapshotPolicy`/`Restore` with restore-before-bind
 - **Database**: CloudNativePG (Postgres) with Barman backups to S3
 - **Secrets**: 1Password Connect + External Secrets Operator
 - **Observability**: kube-prometheus-stack, Loki, Tempo, OpenTelemetry
-- **AI**: llama-cpp (Qwen3.6-35B multimodal) + ComfyUI on dedicated GPUs
+- **AI**: vLLM (Qwen3.6-27B, default app inference) + llama-cpp (Qwen3.6-35B multimodal, for ComfyUI) on mutually-exclusive whole-card GPUs
 
 ## Documentation