
Commit 53e7b23

up
1 parent d274f8b commit 53e7b23

3 files changed

Lines changed: 54 additions & 10 deletions


CLAUDE.md

Lines changed: 7 additions & 4 deletions
@@ -116,14 +116,15 @@ Applications deploy in strict order to prevent race conditions:
 |------|-----------|---------|
 | **0** | Foundation | Cilium (CNI), ArgoCD, 1Password Connect, External Secrets, AppProjects |
 | **1** | Storage | Longhorn, VolumeSnapshot Controller, VolSync |
-| **2** | PVC Plumber | Backup existence checker (must run before Kyverno policies in Wave 4) |
+| **2** | PVC Plumber | Backup existence checker (FAIL-CLOSED gate: PVC creation denied if Plumber is down) |
 | **4** | Infrastructure AppSet | Deploys from explicit path list: cert-manager, external-dns, GPU operators, Kyverno, gateway, databases, etc. |
 | **5** | Monitoring AppSet | Discovers `monitoring/*` applications |
 | **6** | My-Apps AppSet | Discovers `my-apps/*/*` applications |
 
 **Why this matters**:
 - Longhorn won't deploy until Cilium + External Secrets are healthy
 - PVC Plumber (Wave 2) must run before Infrastructure AppSet (Wave 4) because Kyverno policies call PVC Plumber API
+- **FAIL-CLOSED**: If PVC Plumber is down, Kyverno denies creation of backup-labeled PVCs. Apps retry via ArgoCD backoff until Plumber is healthy. This prevents data loss during disaster recovery.
 - Kyverno, cert-manager, GPU operators etc. deploy via Infrastructure AppSet (Wave 4) before user apps (Wave 6)
 - This prevents "chicken-and-egg" dependency issues and SSD thrashing
 
@@ -511,8 +512,9 @@ PVC populated from last backup
 **Location**: `infrastructure/controllers/kyverno/policies/`
 
 1. **volsync-pvc-backup-restore.yaml** - Main backup/restore automation
-- Generates ExternalSecret, ReplicationSource, ReplicationDestination
+- **FAIL-CLOSED**: Validate rule denies PVC creation if PVC Plumber is unreachable
 - Adds `dataSourceRef` if backup exists (via PVC Plumber)
+- Generates ExternalSecret, ReplicationSource, ReplicationDestination
 - Excludes system namespaces (kube-system, volsync-system, kyverno)
 
 2. **volsync-nfs-inject.yaml** - NFS mount injection
@@ -536,9 +538,10 @@ PVC populated from last backup
 ```
 
 **Kyverno uses this to**:
-- Call PVC Plumber API during PVC CREATE operation
+- First validate PVC Plumber is healthy (`/readyz`) — if not, PVC creation is **denied** (fail-closed)
+- Then call PVC Plumber API (`/exists`) during PVC CREATE operation
 - If backup exists, add `dataSourceRef` to auto-restore
-- Prevents data loss when recreating PVCs
+- Prevents data loss when recreating PVCs or during disaster recovery
 
 ### Manual Backup Operations
 
docs/backup-restore.md

Lines changed: 6 additions & 4 deletions
@@ -65,6 +65,7 @@ PVC Created ──▶ Kyverno Rule 0 ──▶ Calls pvc-plumber ──▶ Backu
 ```
 
 **Key protections:**
+- **Fail-closed gate:** PVC creation denied if PVC Plumber is unreachable (prevents empty PVCs during disaster recovery)
 - Backup ReplicationSource only created AFTER PVC is Bound (prevents backup/restore conflicts)
 - Restore uses VolumePopulator pattern (dataSourceRef) for atomic restore
 
@@ -88,10 +89,11 @@ PVC Created ──▶ Kyverno Rule 0 ──▶ Calls pvc-plumber ──▶ Backu
 
 ### 4. Kyverno ClusterPolicy
 - Triggers on PVCs with label `backup: hourly` or `backup: daily`
-- **Rule 0 (mutate):** Calls pvc-plumber; if backup exists, adds `dataSourceRef` to trigger restore
-- **Rule 1 (generate):** Creates ExternalSecret (fetches KOPIA_PASSWORD from 1Password)
-- **Rule 2 (generate):** Creates ReplicationSource (backup schedule) - only after PVC is Bound
-- **Rule 3 (generate):** Creates ReplicationDestination (restore capability)
+- **Rule 0 (validate, FAIL-CLOSED):** Calls pvc-plumber `/readyz`; if unreachable, **denies PVC creation** to prevent data loss during disaster recovery
+- **Rule 1 (mutate):** Calls pvc-plumber `/exists`; if backup exists, adds `dataSourceRef` to trigger restore
+- **Rule 2 (generate):** Creates ExternalSecret (fetches KOPIA_PASSWORD from 1Password)
+- **Rule 3 (generate):** Creates ReplicationSource (backup schedule) - only after PVC is Bound
+- **Rule 4 (generate):** Creates ReplicationDestination (restore capability)
 
 ### 5. VolSync
 - Performs actual backup/restore operations using **Kopia**
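
The net effect of Rule 0 and Rule 1 in the hunk above can be sketched as a small decision function. This is an illustrative Python model, not the Kyverno policy itself: `admit_pvc`, `Decision`, and the boolean inputs are hypothetical names standing in for the results of the `/readyz` and `/exists` calls. It models only the outcome for a single PVC CREATE, not the order in which the API server invokes mutating and validating webhooks.

```python
from dataclasses import dataclass
from typing import Optional

# Label values that put a PVC in scope of the policy, per the docs above.
BACKUP_LABEL_VALUES = {"hourly", "daily"}


@dataclass
class Decision:
    allowed: bool              # whether the PVC CREATE is admitted
    restore_from_backup: bool  # whether dataSourceRef would be added
    reason: str


def admit_pvc(labels: dict, plumber_ready: bool,
              backup_exists: Optional[bool]) -> Decision:
    """Hypothetical model of the fail-closed gate for one PVC CREATE."""
    # PVCs without a backup label are outside the policy and pass untouched.
    if labels.get("backup") not in BACKUP_LABEL_VALUES:
        return Decision(True, False, "no backup label; policy does not apply")
    # Validate rule: deny when pvc-plumber health cannot be confirmed.
    if not plumber_ready:
        return Decision(False, False, "pvc-plumber unreachable; denied (fail-closed)")
    # Mutate rule: restore only when /exists reported a backup.
    if backup_exists:
        return Decision(True, True, "backup found; dataSourceRef added")
    return Decision(True, False, "no backup found; PVC created empty")
```

The first branch reflects the trade-off the commit describes: only backup-labeled PVCs are blocked while PVC Plumber is down, and everything else is admitted as before.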

docs/pvc-plumber-full-flow.md

Lines changed: 41 additions & 2 deletions
@@ -180,8 +180,15 @@
 │ KYVERNO ADMISSION WEBHOOK INTERCEPTS │
 │ │
 │ "I see a PVC with backup: hourly (or daily)" │
-│ "Let me check if a backup exists..." │
 │ │
+│ Step 1: Validate rule checks PVC Plumber health (FAIL-CLOSED) │
+│ ┌────────────────────────────────────────────────────────────────────────────┐ │
+│ │ HTTP GET http://pvc-plumber.volsync-system/readyz │ │
+│ │ If unreachable -> DENY PVC creation (apps retry via ArgoCD backoff) │ │
+│ │ If healthy -> proceed to step 2 │ │
+│ └────────────────────────────────────────────────────────────────────────────┘ │
+│ │
+│ Step 2: Mutate rule checks if backup exists │
 │ ┌────────────────────────────────────────────────────────────────────────────┐ │
 │ │ HTTP GET http://pvc-plumber.volsync-system/exists/karakeep/data-pvc │ │
 │ └────────────────────────────────────────────────────────────────────────────┘ │
@@ -204,7 +211,9 @@
 │ Returns JSON to Kyverno: │
 │ {"exists": true} OR {"exists": false} │
 │ │
-│ On ANY error (timeout, network, parse) -> {"exists": false} (fail-open) │
+│ On ANY error (timeout, network, parse) -> {"exists": false} │
+│ NOTE: Kyverno validate rule DENIES PVC creation if PVC Plumber is unreachable │
+│ (fail-closed). See Scenario 5 below. │
 │ │
 └─────────────────────────────────────────────────────────────────────────────────────┘
 
@@ -407,6 +416,36 @@
 │ │
 └─────────────────────────────────────────────────────────────────────────────────────┘
 
+┌─────────────────────────────────────────────────────────────────────────────────────┐
+│ SCENARIO 5: PVC PLUMBER DOWN DURING DISASTER RECOVERY (FAIL-CLOSED) │
+├─────────────────────────────────────────────────────────────────────────────────────┤
+│ │
+│ Your cluster died. You rebuild from scratch. NFS has all your Kopia backups. │
+│ But PVC Plumber fails to start (bad config, NFS unreachable, etc.) │
+│ │
+│ 1. New cluster bootstrapped │
+│ 2. ArgoCD syncs apps │
+│ 3. PVC Plumber (Wave 2) is unhealthy │
+│ 4. Kyverno (Wave 4) deploys with validate rule │
+│ 5. Apps (Wave 6) attempt to create PVCs with backup labels │
+│ 6. Kyverno validate rule calls PVC Plumber /readyz -> UNREACHABLE │
+│ 7. PVC creation DENIED │
+│ 8. ArgoCD retries with exponential backoff (5s -> 10s -> 20s -> 40s -> 3m) │
+│ 9. Operator fixes PVC Plumber │
+│ 10. PVC Plumber starts, /readyz returns 200 │
+│ 11. ArgoCD retries -> PVC creates -> pvc-plumber finds backup -> data restored │
+│ │
+│ Result: Apps wait for PVC Plumber. Data safety over availability. │
+│ Human intervention required to fix PVC Plumber. │
+│ │
+│ Trade-off: Apps with backup labels CANNOT deploy until PVC Plumber is healthy. │
+│ Apps WITHOUT backup labels deploy normally and are unaffected. │
+│ │
+│ Why this matters: Without this, apps deploy with empty PVCs and the restore │
+│ window is permanently missed (Kyverno only checks on PVC CREATE). │
+│ │
+└─────────────────────────────────────────────────────────────────────────────────────┘
+
 
 ═══════════════════════════════════════════════════════════════════════════════════════
 COMPONENT SUMMARY
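
Step 8 of Scenario 5 abbreviates the retry delays as `5s -> 10s -> 20s -> 40s -> 3m`. As a rough sketch of how such a schedule could arise, the Python below assumes a 5-second base delay that doubles on each attempt and is capped at 3 minutes. The function name, the parameters, and the intermediate 80s and 160s steps are assumptions that follow from the doubling rule; the diff does not state them.

```python
def backoff_schedule(base_s: int = 5, factor: int = 2,
                     cap_s: int = 180, attempts: int = 7) -> list[int]:
    """Return capped exponential retry delays in seconds.

    Hypothetical helper: the defaults are assumptions chosen to match the
    delays listed in Scenario 5, not values read from the ArgoCD config.
    """
    # Delay for attempt i is base * factor^i, never exceeding the cap.
    return [min(base_s * factor ** i, cap_s) for i in range(attempts)]


if __name__ == "__main__":
    print(backoff_schedule())
```

Under these assumptions a backup-labeled app is retried at most every 3 minutes once the cap is reached, so it should pick up a recovered PVC Plumber within roughly that window.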
