@@ -4,11 +4,41 @@ This document outlines the storage architecture for the cluster, focusing on dat
44
55## Overview
66
7- The cluster uses **Longhorn** as the distributed block storage system. It provides highly available persistent storage for Kubernetes workloads and integrates with S3 for off-site backups.
7+ The cluster uses a layered storage approach:
8+ - **Longhorn**: Distributed block storage for runtime replication (2 replicas per volume)
9+ - **VolSync**: Daily backups of all PVCs to S3 using Kopia/Restic
10+ - **Database-native backups**: CloudNativePG and Crunchy Postgres back up directly to S3
11+
12+ ## Architecture Diagram
13+
14+ ```
15+ ┌─────────────────────────────────────────────────────────────────┐
16+ │ Talos Cluster │
17+ │ ┌──────────────────┐ ┌──────────────────┐ │
18+ │ │ App PVCs │ │ Postgres DBs │ │
19+ │ │ (Longhorn) │ │ (CNPG/Crunchy) │ │
20+ │ └────────┬─────────┘ └────────┬─────────┘ │
21+ │ │ │ │
22+ │ ▼ ▼ │
23+ │ ┌──────────────────┐ ┌──────────────────┐ │
24+ │ │ VolSync │ │ Native PG │ │
25+ │ │ (Kopia daily) │ │ WAL + Backups │ │
26+ │ └────────┬─────────┘ └────────┬─────────┘ │
27+ │ │ │ │
28+ └───────────┼───────────────────────┼─────────────────────────────┘
29+ │ │
30+ ▼ ▼
31+ ┌─────────────────────────────────────┐
32+ │ RustFS (S3) on TrueNAS │
33+ │ 192.168.10.133 │
34+ │ ├── volsync-backups/ │
35+ │ └── postgres-backups/ │
36+ └─────────────────────────────────────┘
37+ ```
838
939## 1. Normal Operation (Write Path)
1040
11- When an application writes data, it flows through the Kubernetes storage stack to Longhorn, which replicates it across nodes.
41+ When an application writes data, it flows through Kubernetes to Longhorn, which maintains 2 replicas:
1242
1343``` mermaid
1444graph LR
@@ -23,82 +53,150 @@ graph LR
2353
2454 subgraph "Longhorn Storage Engine"
2555 PV --> LH_Vol[Longhorn Volume]
26- LH_Vol --> Replica1[Replica 1 ( Node A) ]
27- LH_Vol --> Replica2[Replica 2 ( Node B) ]
56+ LH_Vol --> Replica1[Replica 1 Node A]
57+ LH_Vol --> Replica2[Replica 2 Node B]
2858 end
2959
3060 style App fill:#f9f,stroke:#333,stroke-width:2px
3161 style LH_Vol fill:#bbf,stroke:#333,stroke-width:2px
3262```
3363
34- ## 2. Backup Strategy (Automatic)
64+ **Longhorn provides:**
65+ - Runtime replication (survives single node failure)
66+ - Fast replica rebuild
67+ - Automatic rebalancing
3568
36- Backups are handled automatically by Longhorn's **Recurring Jobs**.
37- - **Snapshots**: Local, instant point-in-time copies (kept for hours/days).
38- - **Backups**: Deduplicated, compressed chunks sent to S3 (kept for days/weeks).
69+ **Longhorn does NOT provide:**
70+ - Off-cluster backups (handled by VolSync)
71+ - Point-in-time recovery (handled by VolSync)
3972
40- We use `RecurringJob` groups to assign policies:
41- - **default**: Daily snapshot, Weekly backup (Applied to ALL new volumes).
42- - **critical**: Hourly snapshot, Daily backup (Applied via `data-tier: critical` label).
73+ ## 2. Backup Strategy
4374
44- ``` mermaid
45- graph TD
46- subgraph "Cluster"
47- Volume[Longhorn Volume]
48- Job[Recurring Job] -- Triggers --> Snapshot[Local Snapshot]
49- end
75+ ### PVC Backups (VolSync)
5076
51- subgraph "Off-site Storage (S3)"
52- BackupStore[S3 Bucket]
53- end
77+ All application PVCs are backed up daily at 2 AM using VolSync with Kopia:
78+
79+ | Setting | Value |
80+ | ---------| -------|
81+ | Schedule | `0 2 * * *` (daily at 2 AM) |
82+ | Retention | 14 days |
83+ | Backend | Kopia (Restic-compatible) |
84+ | Target | RustFS S3 on TrueNAS |
85+ | Copy Method | Snapshot |
86+
87+ Each app has:
88+ - `ReplicationSource` - Defines backup schedule and retention
89+ - `ReplicationDestination` - Pre-provisioned for restore capability
90+ - `ExternalSecret` - Pulls S3 credentials from 1Password
91+
92+ ### Database Backups (Native)
93+
94+ PostgreSQL databases use their native backup tools:
95+
96+ **CloudNativePG (khoj, paperless)**
97+ - Barman for WAL archiving
98+ - Daily base backups at 3 AM
99+ - 14-day retention
100+ - Point-in-time recovery capable
101+
102+ **Crunchy Postgres (immich)**
103+ - pgBackRest for backups
104+ - Weekly full + daily differential
105+ - 14-day retention
106+
107+ ## 3. Disaster Recovery
108+
109+ ### Restoring a PVC (VolSync)
110+
111+ When you need to restore a PVC from backup:
54112
55- Snapshot -- "Deduplicated Upload" --> BackupStore
113+ 1. **Trigger the ReplicationDestination**:
114+ ``` bash
115+ kubectl patch replicationdestination <app>-restore -n <namespace> \
116+ --type merge \
117+ -p '{"spec":{"trigger":{"manual":"restore-'$(date +%s)'"}}}'
118+ ```
56119
57- style Job fill:#f96,stroke:#333,stroke-width:2px
58- style BackupStore fill:#6f6,stroke:#333,stroke-width:2px
120+ 2. **Wait for restore to complete**:
121+ ``` bash
122+ kubectl get replicationdestination <app>-restore -n <namespace> -w
59123```
60124
61- ## 3. Disaster Recovery (The "Magic" Restore)
125+ 3. **Update PVC to use restored data** (if needed):
126+ ``` yaml
127+ spec:
128+   dataSourceRef:
129+     kind: ReplicationDestination
130+     apiGroup: volsync.backube
131+     name: <app>-restore
132+ ```
62133
63- When the cluster is destroyed and rebuilt, data is restored from S3.
134+ ### Restoring a Database
64135
65- ### The "Magic" Explained
66- The "magic" is now powered by the **Automated Restore Job** (`restore-job.yaml`). It runs automatically when the cluster starts.
136+ **CloudNativePG:**
137+ ``` yaml
138+ spec:
139+   bootstrap:
140+     recovery:
141+       source: <cluster-name>
142+       # Optional: recoveryTarget for point-in-time
143+ ```
67144
68- 1. **Nuke & Rebuild**: Cluster is wiped. Longhorn installs.
69- 2. **Connect S3**: Longhorn connects to S3 and syncs backup metadata.
70- 3. **Dynamic Discovery**: The Restore Job scans **ALL** backups.
71- 4. **Match & Restore**: It finds the **LATEST** backup for your critical apps (e.g., `karakeep/data-pvc`) and creates a PV for it.
72- 5. **Bind**: Your App starts, sees the PV, and binds instantly.
145+ **Crunchy Postgres:**
146+ Use pgBackRest restore commands or recreate the cluster with recovery settings.
73147
74- ``` mermaid
75- sequenceDiagram
76- participant Admin
77- participant Longhorn
78- participant S3
79- participant Job as Restore Job
80- participant App
81-
82- Note over Admin, App: Cluster Rebuilt (Empty State)
83-
84- Longhorn->>S3: Sync Backup Metadata
85-
86- Job->>Longhorn: "Do we have backups for Karakeep?"
87- Longhorn-->>Job: "Yes, latest is from 2AM"
88-
89- Job->>Longhorn: Restore Volume from 2AM Backup
90- Longhorn->>S3: Download Data
91- Longhorn->>Longhorn: Reconstruct Volume
92-
93- Job->>App: Create PV "pv-restore-karakeep"
94-
95- App->>Longhorn: Mount Volume
96- Note over App: Application Starts with FRESH Data!
148+ ### Full Cluster Rebuild
149+
150+ After a complete cluster rebuild:
151+
152+ 1 . Deploy infrastructure (ArgoCD, External Secrets, Longhorn, VolSync)
153+ 2 . VolSync operator syncs with S3
154+ 3 . For each app, trigger ReplicationDestination to restore data
155+ 4 . Deploy applications - they bind to restored PVCs
156+
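Step 3 of the rebuild (triggering a restore for each app) could be scripted along the following lines. This is a minimal sketch and not part of the repository: the function names are hypothetical, and it assumes every app's `ReplicationDestination` follows the `<app>-restore` naming used in the restore commands above.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the repository.

# Build the merge patch that sets a new manual trigger value.
build_trigger_patch() {
  printf '{"spec":{"trigger":{"manual":"%s"}}}' "$1"
}

# Trigger a restore for one app; expects a "<namespace>/<app>" argument.
# Assumes the ReplicationDestination is named "<app>-restore".
trigger_restore() {
  local ns="${1%%/*}"
  local app="${1##*/}"
  kubectl patch replicationdestination "${app}-restore" -n "${ns}" \
    --type merge \
    -p "$(build_trigger_patch "restore-$(date +%s)")"
}

# Trigger restores for every "<namespace>/<app>" argument,
# stopping at the first failure.
restore_all() {
  local target
  for target in "$@"; do
    trigger_restore "${target}" || return 1
  done
}
```

After sourcing the file, `restore_all <namespace>/<app> ...` would patch each destination in turn. Watching each restore to completion is left to the `kubectl get ... -w` command from the PVC restore steps.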
157+ ## 4. What Changed from Longhorn Backups
158+
159+ | Feature | Before (Longhorn) | Now (VolSync) |
160+ | ---------| -------------------| ---------------|
161+ | Backup tool | Longhorn built-in | VolSync + Kopia |
162+ | Backup schedule | RecurringJobs (tiered) | Single daily schedule |
163+ | Restore method | Hardcoded restore-job.yaml | Declarative ReplicationDestination |
164+ | Database backups | PVC snapshots (inconsistent) | Native WAL archiving (consistent) |
165+ | Complexity | Multiple tiers, shell scripts | Simple, uniform config |
166+
167+ ## 5. Monitoring
168+
169+ ### Check VolSync Status
170+ ``` bash
171+ # All ReplicationSources
172+ kubectl get replicationsource -A
173+
174+ # Specific app
175+ kubectl describe replicationsource home-assistant-config-backup -n home-assistant
176+ ```
177+
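To spot backups that have silently stopped, the status check above could be extended with a staleness report. The sketch below is an illustration under stated assumptions: the helper names and the 26-hour threshold (the daily schedule plus some slack) are hypothetical, it reads the `.status.lastSyncTime` field of each `ReplicationSource`, and it relies on GNU `date` for timestamp parsing.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the 26-hour threshold are assumptions.

# Succeeds (exit 0) when the last sync is older than max_age seconds.
is_stale() {
  local last_epoch="$1"
  local now_epoch="$2"
  local max_age="$3"
  [ $(( now_epoch - last_epoch )) -gt "${max_age}" ]
}

# Print every ReplicationSource whose last sync is missing or older than 26 hours.
# Assumes GNU date (for `date -d`).
report_stale_sources() {
  local now ns name ts
  now="$(date +%s)"
  kubectl get replicationsource -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.lastSyncTime}{"\n"}{end}' |
  while read -r ns name ts; do
    if [ -z "${ts}" ]; then
      echo "${ns}/${name}: never synced"
    elif is_stale "$(date -d "${ts}" +%s)" "${now}" $(( 26 * 3600 )); then
      echo "${ns}/${name}: last sync ${ts}"
    fi
  done
}
```

Running `report_stale_sources` after sourcing the file prints one line per source that needs attention and nothing when every source has synced recently.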
178+ ### Check Database Backups
179+ ``` bash
180+ # CNPG
181+ kubectl get backup -n cloudnative-pg
182+
183+ # Crunchy
184+ kubectl exec -it <postgres-pod> -n postgres-operator -- pgbackrest info
185+ ```
186+
187+ ### S3 Bucket Contents
188+ ``` bash
189+ mc ls truenas/volsync-backups/
190+ mc ls truenas/postgres-backups/
97191```
98192
99- ### Why `longhorn-restore-karakeep` was "Out of Norm"
100- The `longhorn-restore-karakeep` StorageClass contained a hardcoded `fromBackup` parameter.
101- - **Pros**: Instant restore without scripts.
102- - **Cons**: It pins the volume to *that specific backup forever*. If you write new data and restart, it might revert to the old backup depending on reclaim policy. It's not meant for general use.
193+ ## 6. Configuration Files
103194
104- **Best Practice**: Use the standard `longhorn` class. The **Automated Restore Job** will handle the disaster recovery binding for you.
195+ | Component | Location |
196+ | -----------| ----------|
197+ | VolSync operator | `infrastructure/storage/volsync/` |
198+ | Longhorn (replication only) | `infrastructure/storage/longhorn/` |
199+ | App VolSync configs | `my-apps/<category>/<app>/replicationsource.yaml` |
200+ | CNPG backup config | `infrastructure/database/cloudnative-pg/*/cluster.yaml` |
201+ | Crunchy backup config | `infrastructure/database/crunchy-postgres/*/cluster.yaml` |
202+ | 1Password setup | `docs/secrets/volsync-secrets.md` |