@@ -7,33 +7,49 @@ This document outlines the storage architecture for the cluster, focusing on dat
77The cluster uses a layered storage approach:
88- **Longhorn**: Distributed block storage for runtime replication (2 replicas per volume)
99- **Snapshot Controller**: Manages VolumeSnapshot lifecycles and CRDs
10- - **VolSync**: Daily backups of all PVCs to S3 using Restic
10+ - **VolSync**: Scheduled backups of all PVCs to S3 using Restic, plus automatic restore via Volume Populator
1111- **Database-native backups**: CloudNativePG and Crunchy Postgres back up directly to S3
1212
1313## Architecture Diagram
1414
1515```
16- ┌─────────────────────────────────────────────────────────────────┐
17- │ Talos Cluster │
18- │ ┌──────────────────┐ ┌──────────────────┐ │
19- │ │ App PVCs │ │ Postgres DBs │ │
20- │ │ (Longhorn) │ │ (CNPG/Crunchy) │ │
21- │ └────────┬─────────┘ └────────┬─────────┘ │
22- │ │ │ │
23- │ ▼ ▼ │
24- │ ┌──────────────────┐ ┌──────────────────┐ │
25- │ │ VolSync │ │ Native PG │ │
26- │ │ (Restic daily) │ │ WAL + Backups │ │
27- │ └────────┬─────────┘ └────────┬─────────┘ │
28- │ │ │ │
29- └───────────┼───────────────────────┼─────────────────────────────┘
30- │ │
31- ▼ ▼
32- ┌─────────────────────────────────────┐
33- │ RustFS (S3) on TrueNAS │
34- │ 192.168.10.133:30292 │
35- │ └── volsync/<app>/ │
36- └─────────────────────────────────────┘
16+ ┌─────────────────────────────────────────────────────────────────────────────┐
17+ │ Talos Cluster │
18+ │ │
19+ │ ┌────────────────────────────────────────────────────────────────────────┐ │
20+ │ │ Normal Operation │ │
21+ │ │ │ │
22+ │ │ App PVC ──► ReplicationSource ──► S3 (backup on schedule) │ │
23+ │ │ ▲ (hourly/daily) │ │
24+ │ │ │ │ │ │
25+ │ │ │ ReplicationDestination ◄───┘ │ │
26+ │ │ │ (syncs 30 min after backup) │ │
27+ │ │ │ │ │ │
28+ │ │ │ ▼ │ │
29+ │ │ │ latestImage │ │
30+ │ │ │ (VolumeSnapshot) │ │
31+ │ │ │ │ │ │
32+ │ │ └────────────────────┘ │ │
33+ │ │ (dataSourceRef for auto-restore) │ │
34+ │ └────────────────────────────────────────────────────────────────────────┘ │
35+ │ │
36+ │ ┌────────────────────────────────────────────────────────────────────────┐ │
37+ │ │ When PVC is Deleted │ │
38+ │ │ │ │
39+ │ │ 1. ArgoCD recreates PVC with dataSourceRef │ │
40+ │ │ 2. Volume Populator sees dataSourceRef → ReplicationDestination │ │
41+ │ │ 3. Creates PVC from latestImage snapshot │ │
42+ │ │ 4. Data is automatically restored! │ │
43+ │ └────────────────────────────────────────────────────────────────────────┘ │
44+ │ │
45+ └─────────────────────────────────────────────────────────────────────────────┘
46+ │
47+ ▼
48+ ┌─────────────────────────────────┐
49+ │ RustFS (S3) on TrueNAS │
50+ │ 192.168.10.133:30292 │
51+ │ └── volsync/<app>/ │
52+ └─────────────────────────────────┘
3753```
3854
3955## 1. Normal Operation (Write Path)
@@ -65,10 +81,11 @@ graph LR
6581- Runtime replication (survives single node failure)
6682- Fast replica rebuild
6783- Automatic rebalancing
84+ - VolumeSnapshots for VolSync
6885
6986**Longhorn does NOT provide:**
7087- Off-cluster backups (handled by VolSync)
71- - Point-in-time recovery (handled by VolSync)
88+ - Automatic restore on PVC deletion (handled by Volume Populator)
7289
7390## 2. Backup Strategy
7491
@@ -79,18 +96,20 @@ Application PVCs are backed up using VolSync with Restic, with a tiered schedule
7996**Critical Apps (Hourly):**
8097home-assistant, paperless-ngx, karakeep, meilisearch, n8n, immich, open-webui, khoj
8198
82- | Setting | Value |
83- | ---------| -------|
84- | Schedule | ` 0 * * * * ` (hourly) |
85- | Retention | 24 hourly + 7 daily |
99+ | Component | Schedule | Purpose |
100+ | -----------| ----------| ---------|
101+ | ReplicationSource | ` 0 * * * * ` (hourly at :00) | Backup PVC → S3 |
102+ | ReplicationDestination | ` 30 * * * * ` (hourly at :30) | Sync S3 → latestImage |
103+ | Retention | 24 hourly + 7 daily | |
86104
87105**Non-Critical Apps (Daily):**
88106container-registry, redis, mqtt, searxng, fizzy, nginx, jellyfin, nestmtx, homepage-dashboard, plex
89107
90- | Setting | Value |
91- | ---------| -------|
92- | Schedule | ` 0 2 * * * ` (daily at 2 AM) |
93- | Retention | 14 days |
108+ | Component | Schedule | Purpose |
109+ | -----------| ----------| ---------|
110+ | ReplicationSource | ` 0 2 * * * ` (daily at 2:00 AM) | Backup PVC → S3 |
111+ | ReplicationDestination | ` 30 2 * * * ` (daily at 2:30 AM) | Sync S3 → latestImage |
112+ | Retention | 14 days | |
94113
95114**Common Settings:**
96115
@@ -102,8 +121,9 @@ container-registry, redis, mqtt, searxng, fizzy, nginx, jellyfin, nestmtx, homep
102121| Copy Method | Snapshot |
103122
104123Each app has:
105- - ` ReplicationSource ` - Defines backup schedule and retention
106- - ` ReplicationDestination ` - Dormant restore definition (triggered manually when needed)
124+ - ` ReplicationSource ` - Backs up PVC to S3 on schedule
125+ - ` ReplicationDestination ` - Syncs from S3, maintains ` latestImage ` snapshot for auto-restore
126+ - ` PVC.dataSourceRef ` - Points to ReplicationDestination for Volume Populator
107127- ` ExternalSecret ` - Pulls S3 credentials from 1Password
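
A minimal sketch of what one app's hourly-tier pair might look like, assuming VolSync's Restic mover. The schedules, retention, and Snapshot copy method come from the tables above; the resource names, the Secret name, `10Gi`, `longhorn`, and `longhorn-snapclass` are illustrative assumptions, not values taken from this repo.

```yaml
# Hypothetical example for a single app; adjust names and sizes per app.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: home-assistant-config-backup
  namespace: home-assistant
spec:
  sourcePVC: home-assistant-config          # assumed PVC name
  trigger:
    schedule: "0 * * * *"                   # hourly at :00 (critical tier)
  restic:
    repository: home-assistant-volsync-secret   # assumed Secret holding the S3/Restic settings
    copyMethod: Snapshot
    cacheCapacity: 1Gi
    storageClassName: longhorn               # assumed storage class
    volumeSnapshotClassName: longhorn-snapclass  # assumed snapshot class
    retain:
      hourly: 24
      daily: 7
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: home-assistant-config-restore
  namespace: home-assistant
spec:
  trigger:
    schedule: "30 * * * *"                  # hourly at :30, after the backup
  restic:
    repository: home-assistant-volsync-secret   # same assumed Secret as the source
    copyMethod: Snapshot                     # publishes the result as latestImage
    capacity: 10Gi                           # assumed; should match the app PVC size
    accessModes:
      - ReadWriteOnce
    cacheCapacity: 1Gi
    storageClassName: longhorn
    volumeSnapshotClassName: longhorn-snapclass
```

Offsetting the destination by 30 minutes gives the backup time to finish before the sync reads the repository; a backup that runs longer than that would only be picked up on the next cycle.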
108128
109129### Database Backups (Native)
@@ -121,98 +141,121 @@ PostgreSQL databases use their native backup tools:
121141- Weekly full + daily differential
122142- 14-day retention
123143
124- ## 3. Disaster Recovery
144+ ## 3. Automatic Restore (Volume Populator)
145+
146+ ### How It Works
147+
148+ When a PVC is deleted (accidentally or intentionally) and ArgoCD recreates it:
149+
150+ 1. **PVC has `dataSourceRef`** pointing to ReplicationDestination
151+ 2. **Volume Populator** detects the dataSourceRef
152+ 3. **Looks up `latestImage`** from ReplicationDestination (a VolumeSnapshot)
153+ 4. **Creates PVC** from that snapshot
154+ 5. **Data is restored automatically** - no manual intervention!
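
The flow above hinges on the PVC manifest itself. A minimal sketch of such a PVC follows, assuming a ReplicationDestination named `home-assistant-config-restore` in the same namespace; the PVC name, size, and storage class are illustrative assumptions.

```yaml
# Hypothetical PVC; only the dataSourceRef block is specific to VolSync.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-assistant-config
  namespace: home-assistant
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn        # assumed storage class
  resources:
    requests:
      storage: 10Gi                 # assumed; should match the ReplicationDestination capacity
  dataSourceRef:
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: home-assistant-config-restore
```

As far as I understand the Volume Populator, a PVC like this stays `Pending` until the referenced ReplicationDestination reports a `latestImage`, so a missing snapshot shows up as a stuck PVC rather than as an empty volume.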
155+
156+ ### Storage Overhead
157+
158+ For automatic restore to work, each app needs:
159+ - ReplicationDestination cache PVC (~1Gi)
160+ - ReplicationDestination dest PVC (same size as app PVC)
161+ - latestImage VolumeSnapshot
162+
163+ **Total overhead**: ~2Gi + 1x PVC size per app
164+
165+ ### ArgoCD Configuration
166+
167+ PVC ` spec.dataSourceRef ` is immutable after creation. ArgoCD is configured to ignore this field:
168+
169+ ```yaml
170+ # In ApplicationSet
171+ ignoreDifferences:
172+   - group: ""
173+     kind: PersistentVolumeClaim
174+     jqPathExpressions:
175+       - .spec.dataSourceRef
176+       - .spec.dataSource
177+       - .spec.volumeName
178+ ```
179+
180+ This allows:
181+ - Existing PVCs: ArgoCD ignores the dataSourceRef difference
182+ - New PVCs: Created with dataSourceRef, Volume Populator triggers restore
183+
184+ ## 4. Disaster Recovery Scenarios
185+
186+ ### Scenario 1: App Deleted in ArgoCD UI
125187
126- ### Defense Layers
188+ 1. User deletes app in ArgoCD → PVC is deleted
189+ 2. ArgoCD recreates app (auto-sync)
190+ 3. New PVC created with `dataSourceRef`
191+ 4. Volume Populator restores from `latestImage`
192+ 5. **Data restored automatically!**
127193
128- - ** Layer 1 (Longhorn)** : Runtime replication across nodes - survives single node failure
129- - ** Layer 2 (VolSync)** : S3 backups to RustFS - survives complete cluster loss
194+ ### Scenario 2: Accidental PVC Deletion
130195
131- ### Restoring a PVC (VolSync)
196+ 1. `kubectl delete pvc <name> -n <namespace>`
197+ 2. ArgoCD detects missing PVC, recreates it
198+ 3. Volume Populator restores from `latestImage`
199+ 4. **Data restored automatically!**
132200
133- When you need to restore a PVC from backup:
201+ ### Scenario 3: Full Cluster Rebuild
134202
135- 1 . ** Scale down the application** (to release the PVC):
203+ 1. Bootstrap new Talos cluster
204+ 2. Deploy ArgoCD (GitOps)
205+ 3. ArgoCD deploys all apps from Git
206+ 4. ReplicationDestinations sync from S3, create `latestImage` snapshots
207+ 5. PVCs are created with `dataSourceRef`
208+ 6. Volume Populator creates PVCs from snapshots
209+ 7. **All data restored automatically!**
210+
211+ ### Scenario 4: Manual Restore (Override)
212+
213+ If you need to restore to a specific point in time:
214+
215+ 1. Scale down the application:
136216```bash
137217kubectl scale deployment <app> -n <namespace> --replicas=0
138218```
139219
140- 2 . ** Delete the existing PVC** (if corrupted/lost) :
220+ 2. Delete the existing PVC:
141221```bash
142222kubectl delete pvc <pvc-name> -n <namespace>
143223```
144224
145- 3 . ** Trigger the ReplicationDestination ** :
225+ 3. Optionally trigger a manual sync to pull the latest backup from S3:
146226```bash
147227kubectl patch replicationdestination <app>-restore -n <namespace> \
148228 --type merge \
149229 -p '{"spec":{"trigger":{"manual":"restore-'$(date +%s)'"}}}'
150230```
151231
152- 4 . ** Wait for restore to complete** (creates a new PVC with restored data):
153- ``` bash
154- kubectl get replicationdestination < app> -restore -n < namespace> -w
155- # Look for: latestImage showing the restored snapshot
156- ```
232+ 4. Wait for sync, then let ArgoCD recreate the PVC (or manually create it)
157233
158- 5 . ** Rename/recreate the PVC** to match what the app expects, or update app to use the restored PVC name.
159-
160- 6 . ** Scale up the application** :
234+ 5. Scale up the application:
161235```bash
162236kubectl scale deployment <app> -n <namespace> --replicas=1
163237```
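
The manual trigger in step 3 pulls the most recent backup. For an actual point-in-time restore, one possible approach is the Restic mover's `restoreAsOf` field, which limits the restore to snapshots taken at or before a given RFC 3339 timestamp. The fragment below is a sketch under that assumption; the timestamp and trigger string are made-up examples, and the field would need to be removed again afterwards so scheduled syncs return to the latest backup.

```yaml
# Hypothetical fragment of an existing ReplicationDestination spec.
spec:
  trigger:
    manual: restore-before-upgrade        # any new string value re-runs the sync
  restic:
    restoreAsOf: "2024-01-15T02:00:00Z"   # example timestamp: use snapshots at or before this time
```

Once `latestImage` updates, the remaining steps (recreate the PVC, scale up) proceed as listed above.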
164238
165- ### Full Cluster Rebuild
166-
167- After a complete cluster rebuild:
168-
169- 1 . Deploy infrastructure (ArgoCD, External Secrets, Longhorn, VolSync)
170- 2 . Deploy apps - PVCs will be created empty
171- 3 . For each app needing data, trigger ReplicationDestination restore
172- 4 . Apps will start with restored data
173-
174- ### Restoring a Database
175-
176- ** CloudNativePG:**
177- ``` yaml
178- spec :
179- bootstrap :
180- recovery :
181- source : <cluster-name>
182- # Optional: recoveryTarget for point-in-time
183- ```
184-
185- ** Crunchy Postgres:**
186- Use pgBackRest restore commands or recreate cluster with recovery settings.
187-
188- ### Full Cluster Rebuild
239+ ## 5. Defense Layers Summary
189240
190- After a complete cluster rebuild:
241+ | Layer | Protects Against | Recovery Time | Manual Intervention |
242+ |-------|------------------|---------------|---------------------|
243+ | Longhorn replicas | Node failure | Instant | None |
244+ | VolSync + Volume Populator | App/PVC deletion | ~1-2 minutes | None |
245+ | VolSync S3 backups | Cluster loss | ~5-15 minutes | Deploy GitOps |
191246
192- 1 . Deploy infrastructure (ArgoCD, External Secrets, Longhorn, VolSync)
193- 2 . VolSync operator syncs with S3
194- 3 . For each app, trigger ReplicationDestination to restore data
195- 4 . Deploy applications - they bind to restored PVCs
196-
197- ## 4. What Changed from Longhorn Backups
198-
199- | Feature | Before (Longhorn) | Now (VolSync) |
200- | ---------| -------------------| ---------------|
201- | Backup tool | Longhorn built-in | VolSync + Restic |
202- | Backup schedule | RecurringJobs (tiered) | Tiered: hourly (critical) + daily (non-critical) |
203- | Restore method | Hardcoded restore-job.yaml | Trigger ReplicationDestination manually |
204- | Database backups | PVC snapshots (inconsistent) | Native WAL archiving (consistent) |
205- | Complexity | Multiple tiers, shell scripts | Declarative YAML per app |
206-
207- ## 5. Monitoring
247+ ## 6. Monitoring
208248
209249### Check VolSync Status
210250```bash
211- # All ReplicationSources
251+ # All ReplicationSources (backups)
212252kubectl get replicationsource -A
213253
214- # Specific app
215- kubectl describe replicationsource home-assistant-config-backup -n home-assistant
254+ # All ReplicationDestinations (restore points)
255+ kubectl get replicationdestination -A
256+
257+ # Check if latestImage exists (required for auto-restore)
258+ kubectl get replicationdestination -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.status.latestImage.name}{"\n"}{end}'
216259```
217260
218261### Check Database Backups
@@ -234,14 +277,17 @@ mc ls rustfs/volsync/
234277mc ls rustfs/volsync/home-assistant/
235278```
236279
237- ## 6. Configuration Files
280+ ## 7. Configuration Files
238281
239282| Component | Location |
240283|-----------|----------|
241284| VolSync operator | `infrastructure/storage/volsync/` |
242285| Snapshot Controller | `infrastructure/storage/snapshot-controller/` |
243286| Longhorn (replication only) | `infrastructure/storage/longhorn/` |
244- | App VolSync configs | ` my-apps/<category>/<app>/replicationsource.yaml ` |
287+ | App ReplicationSource | `my-apps/<category>/<app>/replicationsource.yaml` |
288+ | App ReplicationDestination | `my-apps/<category>/<app>/replicationdestination.yaml` |
289+ | App PVC (with dataSourceRef) | `my-apps/<category>/<app>/pvc.yaml` |
290+ | ArgoCD ignoreDifferences | `infrastructure/controllers/argocd/apps/*-appset.yaml` |
245291| CNPG backup config | `infrastructure/database/cloudnative-pg/*/cluster.yaml` |
246292| Crunchy backup config | `infrastructure/database/crunchy-postgres/*/cluster.yaml` |
247293| 1Password setup | `docs/secrets/volsync-secrets.md` |