11# Storage Architecture & Disaster Recovery
22
3- This document outlines the storage architecture for the cluster, focusing on data persistence, backup strategies, and disaster recovery workflows.
3+ This document outlines the storage architecture for the cluster, focusing on the **Zero-Touch Backup & Restore** system powered by VolSync and Kyverno.
44
55## Overview
66
7- The cluster uses a layered storage approach with **zero-touch backup and restore**:
8- - **Longhorn**: Distributed block storage for runtime replication (2 replicas per volume)
9- - **Snapshot Controller**: Manages VolumeSnapshot lifecycles and CRDs
10- - **VolSync + Kyverno**: Fully automated backup/restore - just label your PVC!
11- - **Database-native backups**: CloudNativePG and Crunchy Postgres backup directly to S3
7+ The cluster uses a "Smart Restore" strategy. When a PersistentVolumeClaim (PVC) is created, the system checks if a backup *already exists* in the offsite storage (TrueNAS S3) before deciding what to do.
128
13- ## Zero-Touch Architecture (Smart Restore)
9+ - **Backup Found?** -> Automatically trigger a Restore.
10+ - **Backup Missing?** -> Automatically configure a fresh Backup schedule.
11+
12+ ## Zero-Touch Architecture (Direct IP Strategy)
13+
14+ To ensure maximum stability and avoid GitOps sync issues, the system connects directly to the offsite storage IP.
1415
1516**User only needs to:**
16- 1. Add `backup: "hourly"` or `backup: "daily"` label to PVC
17- 2. Ensure namespace has `volsync.backube/privileged-movers: "true"` **label** (for base credentials)
18- 3. Ensure namespace has `volsync.backube/privileged-movers: "true"` **annotation** (for VolSync movers)
17+ 1. Add `backup: "hourly"` (or `daily`) label to any PVC.
18+ 2. That's it. (Credentials are auto-injected).
1919
2020**System automatically provides:**
21- - **Kyverno** makes a smart decision:
22-   - If backup exists? -> Creates Restore Job.
23-   - If new app? -> Creates Backup Job.
21+ 1. **Smart Mutation:** Kyverno checks `http://192.168.10.133:9000/.../config`.
22+ 2. **Conditional Logic:**
23+    * **200 OK:** Adds `dataSourceRef` to the PVC (forcing it to wait for restore).
24+    * **404 Not Found:** Lets PVC start empty.
25+ 3. **Resource Generation:**
26+    * **Backup Job:** Always created (`ReplicationSource`).
27+    * **Restore Job:** Created ONLY if data exists (`ReplicationDestination`).
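
The probe step described above can be sketched in Python. This is only an illustration of the 200/404 check, not the Kyverno policy itself. The name `backup_exists`, its `opener` parameter, and the choice to raise on any other status code are assumptions made for the example. The caller supplies the full config URL because its path is elided in this document.

```python
import urllib.error
import urllib.request


def backup_exists(config_url, opener=urllib.request.urlopen, timeout=5.0):
    """Return True if the config object answers 200, False if it answers 404.

    config_url is supplied by the caller; the real path is elided in this
    document, so nothing here guesses it.
    """
    try:
        with opener(config_url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; keep the code.
        status = err.code

    if status == 200:
        return True   # backup found: a restore should be wired up
    if status == 404:
        return False  # no backup: the PVC starts empty
    # Assumption: any other status means "unknown", not "missing".
    raise RuntimeError(f"unexpected status {status} from {config_url}")
```

Raising on unexpected statuses is a deliberate choice in this sketch: treating a storage outage as "no backup" would let an application start on an empty volume during a rebuild.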
2428
2529```
2630┌─────────────────────────────────────────────────────────────────────────────────┐
27- │ ZERO-TOUCH VOLSYNC ARCHITECTURE │
31+ │ SMART RESTORE ARCHITECTURE │
2832├─────────────────────────────────────────────────────────────────────────────────┤
2933│ │
3034│ USER PROVIDES: SYSTEM AUTO-GENERATES: │
3135│ ┌─────────────────────┐ ┌─────────────────────────────────────┐ │
32- │ │ PVC │ │ volsync-rustfs-base (per namespace) │ │
33- │ │ labels: │ │ (ClusterExternalSecret → Secret) │ │
36+ │ │ PVC │ │ volsync-smart-protection │ │
37+ │ │ labels: │ │ (ClusterPolicy) │ │
3438│ │ backup: hourly │ │ │ │
35- │ └─────────────────────┘ │ {pvc}-volsync-secret (per PVC) │ │
36- │ │ (Kyverno apiCall + base64 encode) │ │
37- │ │ │ │
38- │ │ ReplicationSource │ │
39- │ │ (hourly/daily backups to S3) │ │
40- │ │ │ │
41- │ │ ReplicationDestination │ │
42- │ │ (Creates ONLY if backup exists) │ │
39+ │ └─────────────────────┘ │ 1. Checks 192.168.10.133 (GET) │ │
40+ │ │ 2. Mutates PVC (if found) │ │
41+ │ │ 3. Creates Backup Job (always) │ │
42+ │ │ 4. Creates Restore Job (if found) │ │
4343│ └─────────────────────────────────────┘ │
4444│ │
4545└─────────────────────────────────────────────────────────────────────────────────┘
@@ -54,121 +54,89 @@ graph TD
5454 end
5555
5656 subgraph "Kubernetes Cluster"
57- subgraph "Volsync System"
58- B[("Service: rustfs")] --> S3
59- V[("VolSync Operator")]
60- end
61-
6257 subgraph "Policy Engine"
63- K[("Kyverno")]
64- P_Smart[("Policy: Smart Restore")]
58+ K[("Kyverno Policy<br/>(Smart Restore)")]
6559 end
6660
6761 subgraph "Application Namespace"
6862 PVC[("PVC: data-claim")]
69- RD[("ReplicationDestination<br/>(Restore Job)")]
70- RS[("ReplicationSource<br/>(Backup Job)")]
63+ RD[("Restore Job<br/>(ReplicationDestination)")]
64+ RS[("Backup Job<br/>(ReplicationSource)")]
7165 end
7266 end
7367
7468 %% Flows
75- K -- "1. apiCall (HEAD)" --> B
76- B -.-> S3
69+ K -- "1. apiCall (GET)" --> S3
7770
7871 %% Decision Logic
7972 K -- "2a. If 200 OK (Found)" --> RD
8073 RD -- "3. Pull Data" --> S3
81- RD -- "4. Populate" --> PVC
74+ RD -- "4. Link to PVC" --> PVC
8275
8376 K -- "2b. If 404 (Missing)" --> PVC
8477 PVC -- "5. Start Fresh" --> RS
8578 RS -- "6. Push Backups" --> S3
8679```
8780
88- ## Kyverno Policies
89-
90- ### Smart Restore Policy (`volsync-smart-restore`)
91- **The "Look Before You Leap" logic.**
92- - Trigger: PVC creation.
93- - Check: `apiCall` (HTTP HEAD) to S3 bucket.
94- - Action:
95-   - If HTTP 200: Generate `ReplicationDestination`.
96-   - If HTTP 404: Do nothing (Allow fresh install).
97-
98- ### Generate Policy (`generate-volsync-backup`)
99- **The "Backup Insurance".**
100- - Trigger: PVC creation (always).
101- - Action: Generate `ReplicationSource`.
102- - Result: Ensures all fresh apps eventually get backed up.
103-
104- ### Mutate Policy (`volsync-auto-restore`)
105- **The "Connection".**
106- - Trigger: PVC creation.
107- - Check: Does `ReplicationDestination` exist?
108- - Action: If yes, add `dataSourceRef` to PVC.
109- - Result: PVC waits for VolSync to restore data before binding.
81+ ## Kyverno Policy Details
82+
83+ All logic is consolidated into a single policy: **`infrastructure/controllers/kyverno/volsync-smart-restore.yaml`**.
84+
85+ ### Rules
86+ 1. **`link-restore-if-exists` (Mutate)**
87+    * **Check:** `GET http://192.168.10.133:9000/.../config`
88+    * **Action:** If 200, adds `dataSourceRef: ReplicationDestination` to the PVC.
89+    * **Why:** This pauses the PVC binding until VolSync populates the volume.
90+
91+ 2. **`generate-restore-job` (Generate)**
92+    * **Check:** Same as above.
93+    * **Action:** Creates the `ReplicationDestination` resource that actually performs the download.
94+
95+ 3. **`generate-backup-job` (Generate)**
96+    * **Check:** None (Always runs).
97+    * **Action:** Creates `ReplicationSource` to ensure the new data is backed up going forward.
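
How the three rules combine for a given probe result can be summarised in a small Python sketch. It models only the outcome described above; the function name `plan_resources` and the dictionary keys are hypothetical and do not appear in the policy file.

```python
def plan_resources(status_code):
    """Map the probe's HTTP status to what each of the three rules does.

    Only the 200-versus-anything-else distinction from the rule list is
    modelled; how the real policy handles errors is not covered here.
    """
    backup_found = status_code == 200
    return {
        # link-restore-if-exists (Mutate): only when a backup was found
        "add_data_source_ref": backup_found,
        # generate-restore-job (Generate): same check as the mutate rule
        "create_replication_destination": backup_found,
        # generate-backup-job (Generate): no check, always runs
        "create_replication_source": True,
    }
```

For example, `plan_resources(404)` leaves the PVC unmodified and schedules only the backup, which matches the new-app scenario.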
11098
11199## 4. Disaster Recovery Scenarios
112100
113101### Scenario 1: New App (First Deployment)
114102```
115- 1. User creates PVC with backup: hourly label.
116- 2. Kyverno checks S3: 404 Not Found.
117- 3. Kyverno allows PVC creation WITHOUT dataSourceRef.
118- 4. App starts with empty volume.
119- 5. Kyverno generates ReplicationSource.
120- 6. First backup runs.
103+ 1. User creates PVC.
104+ 2. Kyverno checks S3 -> 404.
105+ 3. PVC created empty (ready immediately).
106+ 4. Kyverno creates Backup Job.
107+ 5. First backup runs in 1 hour.
121108```
122109
123- ### Scenario 2: Cluster Rebuild (Disaster Recovery)
110+ ### Scenario 2: Cluster Rebuild (Total DR)
124111```
125- 1. Bootstrap new cluster.
126- 2. ArgoCD syncs app.
127- 3. Kyverno checks S3: 200 OK (Found backup!).
128- 4. Kyverno generates ReplicationDestination.
129- 5. Kyverno mutates PVC to add dataSourceRef.
130- 6. PVC waits (Pending) while VolSync pulls data.
131- 7. Volume restores -> Pod starts.
112+ 1. Cluster is rebuilt. ArgoCD installs app.
113+ 2. Kyverno checks S3 -> 200.
114+ 3. Kyverno creates Restore Job.
115+ 4. Kyverno links PVC to Restore Job.
116+ 5. Pod waits in Pending...
117+ 6. VolSync downloads data...
118+ 7. PVC binds. Pod starts.
132119```
133120
134- ### Scenario 3: Manual Restore (Specific Point in Time)
135- ```bash
136- # 1. Scale down the application
137- kubectl scale deployment <app> -n <namespace> --replicas=0
138-
139- # 2. Delete the PVC
140- kubectl delete pvc <pvc-name> -n <namespace>
141-
142- # 3. (Optional) Update ReplicationDestination to point to older snapshot if needed
143- # ArgoCD will typically re-sync the latest.
144-
145- # 4. ArgoCD recreates PVC
146- # Kyverno sees backup exists -> generates RD -> restores.
147- ```
148-
149- ## 5. Defense Layers Summary
150-
151- | Layer | Protects Against | Recovery Time | Manual Intervention |
152- | -------| ------------------| ---------------| ---------------------|
153- | Longhorn replicas | Node failure | Instant | None |
154- | VolSync + Kyverno | Cluster loss | ~ 5-15 minutes | None (Zero Touch) |
155-
156- ## 8. Configuration Files
121+ ## 5. Configuration Files
157122
158123| Component | Location |
159124| -----------| ----------|
160- | VolSync operator | `infrastructure/storage/volsync/` |
161- | **RustFS Service** | `infrastructure/storage/volsync/rustfs-service.yaml` |
162- | **Smart Restore Policy** | `infrastructure/controllers/kyverno/volsync-smart-restore.yaml` |
163- | Generate Policy | `infrastructure/controllers/kyverno/volsync-clusterpolicy.yaml` |
164- | Mutate Policy | `infrastructure/controllers/kyverno/volsync-auto-restore.yaml` |
125+ | **Smart Policy (The Brain)** | `infrastructure/controllers/kyverno/volsync-smart-restore.yaml` |
126+ | **VolSync Kustomization** | `infrastructure/storage/volsync/kustomization.yaml` |
127+ | **Credentials** | `infrastructure/storage/volsync/rustfs-credentials.yaml` |
128+
129+ ## 6. Troubleshooting
165130
166- ## 9. Troubleshooting
131+ ### "OutOfSync" in ArgoCD?
132+ This is usually cosmetic due to Kyverno status updates.
133+ We have applied `ignoreDifferences` in the AppSet to silence `ClusterPolicy` status fields.
134+ If it persists, verify your local Git matches the API Server.
167135
168- ### Kyverno apiCall Failed
169- If Kyverno cannot reach S3, check the Service Bridge:
136+ ### Manual Restore
137+ To force a restore to a specific point in time:
170138```bash
171- kubectl get svc rustfs -n volsync-system
172- kubectl describe endpoints rustfs -n volsync-system
173- # Should point to 192.168.10.133:9000
139+ kubectl delete pvc <pvc-name>
140+ # Update the ReplicationDestination spec if needed to point to a specific snapshot ID
141+ # Re-create PVC (ArgoCD will do this).
174142```