Commit 103dc6d

mitchross and claude committed
docs: update storage architecture for Volume Populator auto-restore
- Document automatic restore via Volume Populator pattern
- Add architecture diagram showing backup/restore flow
- Document tiered schedules (backup at :00, restore sync at :30)
- Add disaster recovery scenarios with automatic restore
- Document ArgoCD ignoreDifferences for PVC immutability
- Update monitoring commands for ReplicationDestinations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 163d15b commit 103dc6d

1 file changed

Lines changed: 142 additions & 96 deletions

docs/storage-architecture.md
@@ -7,33 +7,49 @@ This document outlines the storage architecture for the cluster, focusing on dat
77
The cluster uses a layered storage approach:
88
- **Longhorn**: Distributed block storage for runtime replication (2 replicas per volume)
99
- **Snapshot Controller**: Manages VolumeSnapshot lifecycles and CRDs
10-
- **VolSync**: Daily backups of all PVCs to S3 using Restic
10+
- **VolSync**: Scheduled backups of all PVCs to S3 using Restic + automatic restore via Volume Populator
1111
- **Database-native backups**: CloudNativePG and Crunchy Postgres backup directly to S3
1212

1313
## Architecture Diagram
1414

1515
```
16-
┌─────────────────────────────────────────────────────────────────┐
17-
│ Talos Cluster │
18-
│ ┌──────────────────┐ ┌──────────────────┐ │
19-
│ │ App PVCs │ │ Postgres DBs │ │
20-
│ │ (Longhorn) │ │ (CNPG/Crunchy) │ │
21-
│ └────────┬─────────┘ └────────┬─────────┘ │
22-
│ │ │ │
23-
│ ▼ ▼ │
24-
│ ┌──────────────────┐ ┌──────────────────┐ │
25-
│ │ VolSync │ │ Native PG │ │
26-
│ │ (Restic daily) │ │ WAL + Backups │ │
27-
│ └────────┬─────────┘ └────────┬─────────┘ │
28-
│ │ │ │
29-
└───────────┼───────────────────────┼─────────────────────────────┘
30-
│ │
31-
▼ ▼
32-
┌─────────────────────────────────────┐
33-
│ RustFS (S3) on TrueNAS │
34-
│ 192.168.10.133:30292 │
35-
│ └── volsync/<app>/ │
36-
└─────────────────────────────────────┘
16+
┌─────────────────────────────────────────────────────────────────────────────┐
17+
│ Talos Cluster │
18+
│ │
19+
│ ┌────────────────────────────────────────────────────────────────────────┐ │
20+
│ │ Normal Operation │ │
21+
│ │ │ │
22+
│ │ App PVC ──► ReplicationSource ──► S3 (backup on schedule) │ │
23+
│ │ ▲ (hourly/daily) │ │
24+
│ │ │ │ │ │
25+
│ │ │ ReplicationDestination ◄───┘ │ │
26+
│ │ │ (syncs 30 min after backup) │ │
27+
│ │ │ │ │ │
28+
│ │ │ ▼ │ │
29+
│ │ │ latestImage │ │
30+
│ │ │ (VolumeSnapshot) │ │
31+
│ │ │ │ │ │
32+
│ │ └────────────────────┘ │ │
33+
│ │ (dataSourceRef for auto-restore) │ │
34+
│ └────────────────────────────────────────────────────────────────────────┘ │
35+
│ │
36+
│ ┌────────────────────────────────────────────────────────────────────────┐ │
37+
│ │ When PVC is Deleted │ │
38+
│ │ │ │
39+
│ │ 1. ArgoCD recreates PVC with dataSourceRef │ │
40+
│ │ 2. Volume Populator sees dataSourceRef → ReplicationDestination │ │
41+
│ │ 3. Creates PVC from latestImage snapshot │ │
42+
│ │ 4. Data is automatically restored! │ │
43+
│ └────────────────────────────────────────────────────────────────────────┘ │
44+
│ │
45+
└─────────────────────────────────────────────────────────────────────────────┘
46+
47+
48+
┌─────────────────────────────────┐
49+
│ RustFS (S3) on TrueNAS │
50+
│ 192.168.10.133:30292 │
51+
│ └── volsync/<app>/ │
52+
└─────────────────────────────────┘
3753
```
3854

3955
## 1. Normal Operation (Write Path)
@@ -65,10 +81,11 @@ graph LR
6581
- Runtime replication (survives single node failure)
6682
- Fast replica rebuild
6783
- Automatic rebalancing
84+
- VolumeSnapshots for VolSync
6885

6986
**Longhorn does NOT provide:**
7087
- Off-cluster backups (handled by VolSync)
71-
- Point-in-time recovery (handled by VolSync)
88+
- Automatic restore on PVC deletion (handled by Volume Populator)
7289

7390
## 2. Backup Strategy
7491

@@ -79,18 +96,20 @@ Application PVCs are backed up using VolSync with Restic, with a tiered schedule
7996
**Critical Apps (Hourly):**
8097
home-assistant, paperless-ngx, karakeep, meilisearch, n8n, immich, open-webui, khoj
8198

82-
| Setting | Value |
83-
|---------|-------|
84-
| Schedule | `0 * * * *` (hourly) |
85-
| Retention | 24 hourly + 7 daily |
99+
| Component | Schedule | Purpose |
100+
|-----------|----------|---------|
101+
| ReplicationSource | `0 * * * *` (hourly at :00) | Backup PVC → S3 |
102+
| ReplicationDestination | `30 * * * *` (hourly at :30) | Sync S3 → latestImage |
103+
| Retention | 24 hourly + 7 daily | |
86104

87105
**Non-Critical Apps (Daily):**
88106
container-registry, redis, mqtt, searxng, fizzy, nginx, jellyfin, nestmtx, homepage-dashboard, plex
89107

90-
| Setting | Value |
91-
|---------|-------|
92-
| Schedule | `0 2 * * *` (daily at 2 AM) |
93-
| Retention | 14 days |
108+
| Component | Schedule | Purpose |
109+
|-----------|----------|---------|
110+
| ReplicationSource | `0 2 * * *` (daily at 2:00 AM) | Backup PVC → S3 |
111+
| ReplicationDestination | `30 2 * * *` (daily at 2:30 AM) | Sync S3 → latestImage |
112+
| Retention | 14 days | |
94113

95114
**Common Settings:**
96115

@@ -102,8 +121,9 @@ container-registry, redis, mqtt, searxng, fizzy, nginx, jellyfin, nestmtx, homep
102121
| Copy Method | Snapshot |
103122

104123
Each app has:
105-
- `ReplicationSource` - Defines backup schedule and retention
106-
- `ReplicationDestination` - Dormant restore definition (triggered manually when needed)
124+
- `ReplicationSource` - Backs up PVC to S3 on schedule
125+
- `ReplicationDestination` - Syncs from S3, maintains `latestImage` snapshot for auto-restore
126+
- `PVC.dataSourceRef` - Points to ReplicationDestination for Volume Populator
107127
- `ExternalSecret` - Pulls S3 credentials from 1Password
108128

109129
### Database Backups (Native)
@@ -121,98 +141,121 @@ PostgreSQL databases use their native backup tools:
121141
- Weekly full + daily differential
122142
- 14-day retention
123143

124-
## 3. Disaster Recovery
144+
## 3. Automatic Restore (Volume Populator)
145+
146+
### How It Works
147+
148+
When a PVC is deleted (accidentally or intentionally) and ArgoCD recreates it:
149+
150+
1. **PVC has `dataSourceRef`** pointing to ReplicationDestination
151+
2. **Volume Populator** detects the dataSourceRef
152+
3. **Looks up `latestImage`** from ReplicationDestination (a VolumeSnapshot)
153+
4. **Creates PVC** from that snapshot
154+
5. **Data is restored automatically** - no manual intervention!
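
The steps above hinge on the PVC manifest itself. A minimal sketch of such a PVC follows, assuming a ReplicationDestination named `example-app-restore` in the same namespace. The PVC name, size, and storage class are placeholders rather than values from this repository.

```yaml
# Hypothetical example: names, size, and storage class are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-app-data
  namespace: example-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
  # Tells the VolSync Volume Populator to fill this PVC from the
  # ReplicationDestination's latestImage snapshot when the PVC is created.
  dataSourceRef:
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: example-app-restore
```

Because `dataSourceRef` is only evaluated at creation time, it has no effect on a PVC that already exists. It matters only when the PVC is created or recreated.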
155+
156+
### Storage Overhead
157+
158+
For automatic restore to work, each app needs:
159+
- ReplicationDestination cache PVC (~1Gi)
160+
- ReplicationDestination dest PVC (same size as app PVC)
161+
- latestImage VolumeSnapshot
162+
163+
**Total overhead**: ~2Gi + 1x PVC size per app
164+
165+
### ArgoCD Configuration
166+
167+
PVC `spec.dataSourceRef` is immutable after creation. ArgoCD is configured to ignore this field:
168+
169+
```yaml
170+
# In ApplicationSet
171+
ignoreDifferences:
172+
- group: ""
173+
kind: PersistentVolumeClaim
174+
jqPathExpressions:
175+
- .spec.dataSourceRef
176+
- .spec.dataSource
177+
- .spec.volumeName
178+
```
179+
180+
This allows:
181+
- Existing PVCs: ArgoCD ignores the dataSourceRef difference
182+
- New PVCs: Created with dataSourceRef, Volume Populator triggers restore
183+
184+
## 4. Disaster Recovery Scenarios
185+
186+
### Scenario 1: App Deleted in ArgoCD UI
125187
126-
### Defense Layers
188+
1. User deletes app in ArgoCD → PVC is deleted
189+
2. ArgoCD recreates app (auto-sync)
190+
3. New PVC created with `dataSourceRef`
191+
4. Volume Populator restores from `latestImage`
192+
5. **Data restored automatically!**
127193

128-
- **Layer 1 (Longhorn)**: Runtime replication across nodes - survives single node failure
129-
- **Layer 2 (VolSync)**: S3 backups to RustFS - survives complete cluster loss
194+
### Scenario 2: Accidental PVC Deletion
130195

131-
### Restoring a PVC (VolSync)
196+
1. `kubectl delete pvc <name> -n <namespace>`
197+
2. ArgoCD detects missing PVC, recreates it
198+
3. Volume Populator restores from `latestImage`
199+
4. **Data restored automatically!**
132200

133-
When you need to restore a PVC from backup:
201+
### Scenario 3: Full Cluster Rebuild
134202

135-
1. **Scale down the application** (to release the PVC):
203+
1. Bootstrap new Talos cluster
204+
2. Deploy ArgoCD (GitOps)
205+
3. ArgoCD deploys all apps from Git
206+
4. ReplicationDestinations sync from S3, create `latestImage` snapshots
207+
5. PVCs are created with `dataSourceRef`
208+
6. Volume Populator creates PVCs from snapshots
209+
7. **All data restored automatically!**
210+
211+
### Scenario 4: Manual Restore (Override)
212+
213+
If you need to restore to a specific point in time:
214+
215+
1. Scale down the application:
136216
```bash
137217
kubectl scale deployment <app> -n <namespace> --replicas=0
138218
```
139219

140-
2. **Delete the existing PVC** (if corrupted/lost):
220+
2. Delete the existing PVC:
141221
```bash
142222
kubectl delete pvc <pvc-name> -n <namespace>
143223
```
144224

145-
3. **Trigger the ReplicationDestination**:
225+
3. Trigger manual restore (optional - to get latest from S3):
146226
```bash
147227
kubectl patch replicationdestination <app>-restore -n <namespace> \
148228
--type merge \
149229
-p '{"spec":{"trigger":{"manual":"restore-'$(date +%s)'"}}}'
150230
```
151231

152-
4. **Wait for restore to complete** (creates a new PVC with restored data):
153-
```bash
154-
kubectl get replicationdestination <app>-restore -n <namespace> -w
155-
# Look for: latestImage showing the restored snapshot
156-
```
232+
4. Wait for sync, then let ArgoCD recreate the PVC (or manually create it)
157233

158-
5. **Rename/recreate the PVC** to match what the app expects, or update app to use the restored PVC name.
159-
160-
6. **Scale up the application**:
234+
5. Scale up the application:
161235
```bash
162236
kubectl scale deployment <app> -n <namespace> --replicas=1
163237
```
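
The patch in step 3 pulls the most recent backup. To target an earlier point in time, VolSync's Restic mover accepts a `restoreAsOf` timestamp on the ReplicationDestination, which selects the newest snapshot taken at or before that time. The sketch below assumes the same placeholder names as a typical `<app>-restore` object; the timestamp, trigger string, capacity, and class names are illustrative only.

```yaml
# Hypothetical example: names, timestamp, and sizes are assumptions.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: example-app-restore
  namespace: example-app
spec:
  trigger:
    manual: restore-before-incident     # any new string starts a one-off sync
  restic:
    repository: example-app-restic-secret
    copyMethod: Snapshot
    restoreAsOf: "2025-01-15T02:00:00Z" # RFC 3339; newest backup at or before this time
    capacity: 10Gi
    accessModes:
      - ReadWriteOnce
    storageClassName: longhorn
    volumeSnapshotClassName: longhorn-snapclass
    cacheCapacity: 1Gi
```

If this object is managed from Git, the change would presumably need to be committed there (or auto-sync paused) so that ArgoCD does not revert it before the restore completes.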
164238

165-
### Full Cluster Rebuild
166-
167-
After a complete cluster rebuild:
168-
169-
1. Deploy infrastructure (ArgoCD, External Secrets, Longhorn, VolSync)
170-
2. Deploy apps - PVCs will be created empty
171-
3. For each app needing data, trigger ReplicationDestination restore
172-
4. Apps will start with restored data
173-
174-
### Restoring a Database
175-
176-
**CloudNativePG:**
177-
```yaml
178-
spec:
179-
bootstrap:
180-
recovery:
181-
source: <cluster-name>
182-
# Optional: recoveryTarget for point-in-time
183-
```
184-
185-
**Crunchy Postgres:**
186-
Use pgBackRest restore commands or recreate cluster with recovery settings.
187-
188-
### Full Cluster Rebuild
239+
## 5. Defense Layers Summary
189240

190-
After a complete cluster rebuild:
241+
| Layer | Protects Against | Recovery Time | Manual Intervention |
242+
|-------|------------------|---------------|---------------------|
243+
| Longhorn replicas | Node failure | Instant | None |
244+
| VolSync + Volume Populator | App/PVC deletion | ~1-2 minutes | None |
245+
| VolSync S3 backups | Cluster loss | ~5-15 minutes | Deploy GitOps |
191246

192-
1. Deploy infrastructure (ArgoCD, External Secrets, Longhorn, VolSync)
193-
2. VolSync operator syncs with S3
194-
3. For each app, trigger ReplicationDestination to restore data
195-
4. Deploy applications - they bind to restored PVCs
196-
197-
## 4. What Changed from Longhorn Backups
198-
199-
| Feature | Before (Longhorn) | Now (VolSync) |
200-
|---------|-------------------|---------------|
201-
| Backup tool | Longhorn built-in | VolSync + Restic |
202-
| Backup schedule | RecurringJobs (tiered) | Tiered: hourly (critical) + daily (non-critical) |
203-
| Restore method | Hardcoded restore-job.yaml | Trigger ReplicationDestination manually |
204-
| Database backups | PVC snapshots (inconsistent) | Native WAL archiving (consistent) |
205-
| Complexity | Multiple tiers, shell scripts | Declarative YAML per app |
206-
207-
## 5. Monitoring
247+
## 6. Monitoring
208248

209249
### Check VolSync Status
210250
```bash
211-
# All ReplicationSources
251+
# All ReplicationSources (backups)
212252
kubectl get replicationsource -A
213253
214-
# Specific app
215-
kubectl describe replicationsource home-assistant-config-backup -n home-assistant
254+
# All ReplicationDestinations (restore points)
255+
kubectl get replicationdestination -A
256+
257+
# Check if latestImage exists (required for auto-restore)
258+
kubectl get replicationdestination -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.status.latestImage.name}{"\n"}{end}'
216259
```
217260

218261
### Check Database Backups
@@ -234,14 +277,17 @@ mc ls rustfs/volsync/
234277
mc ls rustfs/volsync/home-assistant/
235278
```
236279

237-
## 6. Configuration Files
280+
## 7. Configuration Files
238281

239282
| Component | Location |
240283
|-----------|----------|
241284
| VolSync operator | `infrastructure/storage/volsync/` |
242285
| Snapshot Controller | `infrastructure/storage/snapshot-controller/` |
243286
| Longhorn (replication only) | `infrastructure/storage/longhorn/` |
244-
| App VolSync configs | `my-apps/<category>/<app>/replicationsource.yaml` |
287+
| App ReplicationSource | `my-apps/<category>/<app>/replicationsource.yaml` |
288+
| App ReplicationDestination | `my-apps/<category>/<app>/replicationdestination.yaml` |
289+
| App PVC (with dataSourceRef) | `my-apps/<category>/<app>/pvc.yaml` |
290+
| ArgoCD ignoreDifferences | `infrastructure/controllers/argocd/apps/*-appset.yaml` |
245291
| CNPG backup config | `infrastructure/database/cloudnative-pg/*/cluster.yaml` |
246292
| Crunchy backup config | `infrastructure/database/crunchy-postgres/*/cluster.yaml` |
247293
| 1Password setup | `docs/secrets/volsync-secrets.md` |
