* docs: overhaul root README around the real bootstrap sequence
Rewrite the root README to lead with the clean, linear provision →
bootstrap flow that's actually run end-to-end (machine classes →
template sync → Omni access → Gateway CRDs → Cilium → secrets →
bootstrap-argocd.sh), with a copy-paste "whole sequence" quick
reference plus annotated per-step gotchas.
Correct the backup system everywhere: the retired pvc-plumber + VolSync
stack is replaced by kopiur (sync-wave table, dedicated Backup System
section, troubleshooting row, and the bootstrap script's wave echo all
updated to match the live waves). Refresh version pins (Omni v1.9.0,
ArgoCD chart 9.7.0) and drop the duplicated cluster-access steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P936TdJY9XieDQYKRfq9g8
* docs: refresh stale references across docs/ and omni/ trees
Audit-and-fix pass aligning the documentation with the current cluster
state (same staleness class as the root README overhaul):
Backup system — replace the retired pvc-plumber + VolSync + Kyverno
references with kopiur where they were presented as the live system:
- argocd.md / entrypoints.md: wave table + entrypoint map now show the
kopiur operator (Wave 2) and kopiur config (Wave 3); dropped the
removed volsync/pvc-plumber entrypoints; added the two VPA entrypoints;
fixed the manual root seed path to root.yaml.
- cnpg/disaster-recovery.md + cnpg/explained.md: kopiur wording, removed
the obsolete mutating-webhook/SYSTEM_NAMESPACES mechanics, fixed a dead
docs/plans/ link.
- rustfs/credential-runbook.md: VolSync per-PVC ExternalSecrets ->
the kopiur-rustfs ClusterExternalSecret fanout.
- index.md: Longhorn replica count (single-node) + AI line now leads with
the vLLM default.
Hardware — correct the GPU host from a misattributed Xeon DL360 to the
actual AMD Threadripper 2950X / X399 in the ai-gpu docs.
Omni docs — version pins (Omni 1.8 -> 1.9, Cilium 1.19.4 -> 1.19.5),
kubeProxyReplacement=strict -> true, authoritative port numbers
(8090/8100/8091/50180-udp), multi-disk support note, and repointed
broken links (talos-configs/, examples/, proxmox-provider/README.md,
test-multi-disk.yaml) to real paths.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P936TdJY9XieDQYKRfq9g8
* docs: add root llms.txt with bootstrap/DR operational truths
Capture the non-inferable operational facts surfaced by the first
kopiur-only full nuke (2026-06-28): the two independent restore systems
(kopiur PVC restore-before-bind vs CNPG/Barman), self-healing vs.
real-intervention pod states, the now-automated gitea-actions runner
token, the cosmetic kopiur SnapshotPolicy OutOfSync, and the Omni/Proxmox
infra-provider gotchas (stuck finalizers, API-token format, USB-dongle
passthrough). Curated LLM-facing context, separate from the docs/ tree.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P936TdJY9XieDQYKRfq9g8
* docs: correct llms.txt against committed repo state
Two claims in the restore-context block didn't match the repo:
- gitea-actions runner token is still a MANUAL rebuild step, not
automated — externalsecret.yaml is committed but commented out of
gitea-actions/kustomization.yaml until the 1Password field exists.
Rewrote the section + headline DR claim to reflect the staged-but-
disabled automation and the real Secret name + token-gen command.
- No AppSet ignoreDifferences masks kopiur SnapshotPolicy defaults;
softened the cosmetic-OutOfSync note to drop the false mechanism.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P936TdJY9XieDQYKRfq9g8
---------
Co-authored-by: Claude <noreply@anthropic.com>
|`4`| KEDA core, Temporal worker, infrastructure and database AppSets |
|`5`| OpenTelemetry operator core, monitoring AppSet including `kube-prometheus-stack`|
|`6`| KEDA observability, OpenTelemetry operator observability, workload AppSet |

-cert-manager is intentionally Wave `1`: the CNPG Barman plugin depends on it. pvc-plumber Wave `2` is core-only. KEDA and OpenTelemetry ServiceMonitor resources render from Wave `6` observability overlays.
+cert-manager is intentionally Wave `1`: the CNPG Barman plugin depends on it. The kopiur operator is Wave `2` (CRDs + controller + webhook), with its repo/credential config at Wave `3`. KEDA and OpenTelemetry ServiceMonitor resources render from Wave `6` observability overlays.
CNPG `enablePodMonitor: true` remains an accepted runtime soft-coupling. It can log transient errors before monitoring exists, but it is not an ArgoCD dry-run blocker.
|`custom-entrypoints/temporal-worker-controller-app.yaml`| Application | 4 | Same AppSet render-cache history as KEDA | Maybe, after proving AppSet render stability |
|`custom-entrypoints/opentelemetry-operator-app.yaml`| Application | 5 | Core operator after cert-manager; ServiceMonitor removed from core | Maybe, if cert-manager dependency is otherwise enforced |
|`custom-entrypoints/keda-observability-app.yaml`| Application | 6 | Optional KEDA ServiceMonitor resources after monitoring CRDs exist | No, keeps observability out of core |
+|`custom-entrypoints/vpa-recommendations-app.yaml`| Application | 6 | Optional VPA recommendation CRs after monitoring CRDs exist | No, keeps observability out of core |
|`custom-entrypoints/opentelemetry-operator-observability-app.yaml`| Application | 6 | Optional OpenTelemetry ServiceMonitor after monitoring CRDs exist | No, keeps observability out of core |
|`appsets/infrastructure-appset.yaml`| ApplicationSet | 4 | Explicit list of core infrastructure directories | N/A |
|`appsets/database-appset.yaml`| ApplicationSet | 4 | Discovers `infrastructure/database/*/*`; uses `selfHeal: false` for DR | N/A |
@@ -45,8 +46,7 @@ This is the review map for everything directly rendered by the root Application
PrometheusRule, Probe, AlertmanagerConfig) in its core kustomization.** Those CRDs don't exist until
kube-prometheus-stack (Wave 5); an earlier-wave app shipping them fails dry-run and deadlocks the
App-of-Apps wave gate (proven by the 2026-06-01 nuke drill). Put observability CRs in a **separate
-optional app that syncs after Wave 5** (e.g. `keda-observability` at Wave 6; pvc-plumber's were
-removed from its Wave-2 core). We deliberately do **not** install Prometheus Operator CRDs early —
+optional app that syncs after Wave 5** (e.g. `keda-observability` at Wave 6, split out of KEDA's Wave-4 core). We deliberately do **not** install Prometheus Operator CRDs early —
`SkipDryRunOnMissingResource` is only an escape hatch / observability-app option, never a core fix.
`cert-manager` is at **Wave 1** (not 4) so cert-dependent apps (cnpg-barman-plugin, Wave 3) can start.
Full detail: [cluster DR nuke restore runbook](../../disaster-recovery.md).
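
The wave rule described in this hunk can be sketched as a small check. This is an illustrative sketch only: the `App` record, its `ships_monitoring_crs` flag, the `MONITORING_WAVE` constant, and the `bad-core-app` entry are hypothetical names invented for the example, not files or types in the repository. Only the wave numbers come from the text above.

```python
from dataclasses import dataclass

# Wave at which kube-prometheus-stack installs the monitoring CRDs (from the text above).
MONITORING_WAVE = 5


@dataclass(frozen=True)
class App:
    """Hypothetical record of one ArgoCD app; not a real ArgoCD type."""
    name: str
    wave: int
    ships_monitoring_crs: bool  # True if it renders ServiceMonitor, PrometheusRule, etc.


def wave_violations(apps: list[App]) -> list[str]:
    """Return names of apps that ship monitoring CRs at or before the monitoring wave.

    The app that installs the CRDs itself is not modeled here.
    """
    return [a.name for a in apps if a.ships_monitoring_crs and a.wave <= MONITORING_WAVE]


apps = [
    App("keda", wave=4, ships_monitoring_crs=False),
    App("keda-observability", wave=6, ships_monitoring_crs=True),
    App("bad-core-app", wave=2, ships_monitoring_crs=True),  # made-up offender
]
violations = wave_violations(apps)
```

Under these assumptions, only the made-up early-wave app is flagged; the Wave 6 observability app passes because it syncs after the CRDs exist.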
|**App state**| outside (ExternalSecret, ScheduledBackup) | committed to Git as declarative state | ArgoCD sync |

-**Barman ≠ PVC backups.** The PVC/Kopia system (pvc-plumber v4 operator + VolSync,
-writing to RustFS S3) handles *file-level* PVC backups. (Kyverno was removed from
-this path in 2026-05 and is no longer involved — see `docs/storage-architecture.md`.)
-CNPG has its own SQL-aware backup path: Barman Cloud → RustFS S3. The two never
-touch each other. See
+**Barman ≠ PVC backups.** The PVC/Kopia system (the **kopiur** operator, writing to RustFS S3) handles *file-level* PVC backups. CNPG has its own SQL-aware backup path: Barman Cloud → RustFS S3. The two never touch each other. See
[docs/disaster-recovery.md](../../disaster-recovery.md) for why both exist.
-When you do a disaster recovery, you're using **Barman** to restore Postgres data into a fresh PVC. **pvc-plumber/VolSync has nothing to do with this.** The two backup systems run side-by-side and never touch each other. PVC-level kopia backups would corrupt a running Postgres mid-snapshot — that's why CNPG PVCs explicitly do NOT carry the `backup` label.
+When you do a disaster recovery, you're using **Barman** to restore Postgres data into a fresh PVC. **kopiur (the PVC-backup system) has nothing to do with this.** The two backup systems run side-by-side and never touch each other. PVC-level kopia backups would corrupt a running Postgres mid-snapshot — that's why CNPG PVCs explicitly do NOT carry the `backup` label.
---
@@ -255,19 +255,21 @@ operator orchestration was ~2 minutes.
---

-## Why this is separate from pvc-plumber
+## Why this is separate from the kopiur PVC backups

-A reasonable question: pvc-plumber backs up PVCs to kopia. CNPG database files live on PVCs. Why not just label the CNPG PVCs and let pvc-plumber back them up?
+A reasonable question: kopiur backs up PVCs to kopia. CNPG database files live on PVCs. Why not just back the CNPG PVCs up with kopiur too?

-Three reasons, in increasing severity:
+kopiur only backs up PVCs that carry an explicit per-PVC `SnapshotPolicy`/`Restore` stub (via the `kopiur-backup` Kustomize component) in a namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`. CNPG PVCs deliberately get none of that, so kopiur never touches them. There is no admission webhook injecting `dataSourceRef` anymore — coverage is opt-in by the per-PVC stub, not enforced at admission.
+
+Three reasons it stays that way, in increasing severity:
1. **Snapshot consistency.** A kopia snapshot of a running Postgres data directory is *not* a consistent backup. The on-disk state at any moment includes half-written files, WAL not yet flushed, etc. Restoring from a CSI snapshot of running Postgres almost works, but recovery is unsafe and Postgres might not even start. Barman uses `pg_basebackup` which IS Postgres-aware and produces a consistent backup.
2. **WAL.** Postgres recovery requires both a base backup AND the WAL records that follow it. Barman archives WAL continuously. PVC snapshots don't archive WAL — they just snapshot whatever WAL was on disk at snapshot time, which could be hours old.
3. **PITR.** With Barman + WAL archiving, you can restore to any point in time within retention. With PVC snapshots, you can only restore to whenever the last snapshot was taken (default 1h or 1d).

-So: CNPG PVCs are explicitly **NOT** labeled `backup: hourly|daily`. pvc-plumber's mutating webhook would refuse to inject `dataSourceRef` on those anyway (operator's `SYSTEM_NAMESPACES` excludes `cloudnative-pg`), but as defense-in-depth, the manifest convention is to omit the label.
+So: CNPG PVCs explicitly carry **no** kopiur backup stub and the `cloudnative-pg` namespace is **not** labeled `kopiur.home-operations.com/repo: cluster-kopia`, so kopiur never enrolls them.
---
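
The opt-in enrollment rule stated in the added lines of this hunk can be sketched as a predicate. This is a minimal sketch under stated assumptions: `is_enrolled` and its `has_backup_stub` parameter are hypothetical names for illustration and do not reflect kopiur's actual implementation or API. Only the label key and value are taken from the text above.

```python
# Label key/value quoted from the doc text; everything else here is hypothetical.
REPO_LABEL = "kopiur.home-operations.com/repo"
REPO_VALUE = "cluster-kopia"


def is_enrolled(namespace_labels: dict[str, str], has_backup_stub: bool) -> bool:
    """Model the doc's rule: a PVC is backed up only if it carries the per-PVC
    stub AND its namespace carries the repo label with the expected value."""
    return has_backup_stub and namespace_labels.get(REPO_LABEL) == REPO_VALUE


# A labeled namespace with a stubbed PVC is enrolled.
app_pvc = is_enrolled({REPO_LABEL: REPO_VALUE}, has_backup_stub=True)

# The cloudnative-pg case from the text: no namespace label, no stub.
cnpg_pvc = is_enrolled({}, has_backup_stub=False)

# A stub alone is not enough if the namespace is unlabeled.
stub_only = is_enrolled({}, has_backup_stub=True)
```

Both conditions are required in this model, which is why CNPG PVCs, having neither, are never enrolled.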
@@ -302,6 +304,6 @@ Yes — CNPG supports `bootstrap.recovery` with a different `metadata.name`. The
## Where to go deeper
- [docs/domains/cnpg/disaster-recovery.md](disaster-recovery.md) — the technical runbook (this doc's reference)
-- [docs/plans/cnpg-plugin-migration.md](../../disaster-recovery.md) — why this cluster uses the Barman Cloud Plugin instead of the deprecated `spec.backup.barmanObjectStore`
+- [disaster-recovery.md](disaster-recovery.md) — why this cluster uses the Barman Cloud Plugin instead of the deprecated `spec.backup.barmanObjectStore`
- [docs/disaster-recovery.md](../../disaster-recovery.md) — the OTHER backup system (PVC-level, kopia, NEVER use on CNPG PVCs)
- [docs/pvc-plumber-explained.md](https://github.com/mitchross/pvc-plumber#readme) — pvc-plumber walkthrough for comparison
-|Each chart-rendered `<ns>/volsync-<pvc>` per backed-up PVC|`volsync-<pvc>`|
+|`kopiur/kopiur-rustfs` (ClusterExternalSecret → every namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`)|`kopiur-rustfs`|

-The `volsync-system/pvc-plumber-kopia` ExternalSecret was removed
-2026-05-21 along with the pvc-plumber operator decommission. Per-PVC
-ExternalSecrets are now rendered by the `volsync-backup` Helm chart
-at `infrastructure/storage/volsync-backup/` rather than the operator.
+Per-PVC backup credentials are now delivered by the single `kopiur-rustfs` ClusterExternalSecret (`infrastructure/controllers/kopiur/externalsecret.yaml`), which fans the repo credentials into every namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`. The retired `volsync-backup` per-PVC ExternalSecrets are gone.
-VolSync mover Jobs read the per-PVC Secret at Job creation time, so the
-NEXT scheduled (or manually triggered) backup run picks up the new
-credentials automatically — no restart of VolSync itself needed.
+kopiur mover Jobs read the namespace `kopiur-rustfs` Secret at Job creation time, so the next scheduled (or manually triggered) Snapshot picks up rotated credentials automatically — no operator restart needed.
RustFS lifecycle Job is spawned by its CronJob — next scheduled run