Commit 364b653

mitchross and claude committed
docs: update storage architecture for VolSync migration
- Add volsync-secrets.md with 1Password setup instructions
- Rewrite storage-architecture.md for new backup approach:
  - Longhorn: runtime replication only
  - VolSync: daily PVC backups to S3
  - CNPG/Crunchy: native database backups
  - Declarative restore via ReplicationDestination

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 28fc628 commit 364b653

2 files changed

Lines changed: 221 additions & 60 deletions

docs/secrets/volsync-secrets.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
# VolSync Secrets Setup
2+
3+
## Required 1Password Items
4+
5+
Before VolSync backups will work, you need to create one new item in 1Password.
6+
7+
### volsync-kopia
8+
9+
Create a new **Password** item in your 1Password vault:
10+
11+
| Field | Value |
12+
|-------|-------|
13+
| **Item name** | `volsync-kopia` |
14+
| **Field name** | `password` |
15+
| **Value** | A strong random password (32+ characters) |
16+
17+
This password encrypts every Kopia backup repository stored in S3. Keep it safe: without it, the backups cannot be decrypted.
18+
19+
**Generate a secure password:**
20+
```bash
21+
openssl rand -base64 32
22+
```
23+
24+
The command prints a 44-character base64 string ending in `=`. Use that value as the password.
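
As a quick sanity check, the sketch below generates the password and confirms its length before you paste it into 1Password. It assumes `openssl` is on your `PATH`; the variable name `password` is only illustrative.

```bash
# Generate 32 random bytes, base64-encoded (44 characters including padding).
password="$(openssl rand -base64 32)"

# Print only the length so the secret itself stays out of terminal scrollback.
printf 'generated password length: %s\n' "${#password}"
```

Printing the length rather than the value is a small precaution; copy the value straight from the variable when you need it.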
25+
26+
### Existing Items Used
27+
28+
The VolSync configuration also uses your existing `minio` item:
29+
30+
| Item | Fields Used |
31+
|------|-------------|
32+
| `minio` | `minio_access_key`, `minio_secret_key` |
33+
34+
These should already exist from your Longhorn backup configuration.
35+
36+
## Verification
37+
38+
After creating the `volsync-kopia` item, verify the ExternalSecrets are syncing:
39+
40+
```bash
41+
# Check VolSync system secret
42+
kubectl get externalsecret -n volsync-system
43+
44+
# Check app-level secrets (example)
45+
kubectl get externalsecret -n home-assistant
46+
```
47+
48+
All ExternalSecrets should show `SecretSynced` status.
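
To spot failing syncs across the whole cluster rather than one namespace at a time, one option is to filter on the reason of the `Ready` condition. This is a minimal sketch, assuming the External Secrets Operator reports `SecretSynced` as that reason; the helper name `unsynced_secrets` is hypothetical.

```bash
# Reads "<namespace>/<name> <reason>" lines on stdin and prints every entry
# whose reason is anything other than SecretSynced (a missing reason counts).
unsynced_secrets() {
  awk 'NF > 0 && $2 != "SecretSynced" { print $1 }'
}

# Hypothetical usage against a live cluster (not run here):
#   kubectl get externalsecret -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}' \
#     | unsynced_secrets
```

An empty result means every ExternalSecret reported `SecretSynced`.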
49+
50+
## S3 Bucket Setup
51+
52+
Ensure these buckets exist in your RustFS/MinIO (192.168.10.133):
53+
54+
| Bucket | Purpose |
55+
|--------|---------|
56+
| `volsync-backups` | VolSync PVC backups (Kopia repositories) |
57+
| `postgres-backups` | CNPG and Crunchy database backups |
58+
59+
Create them if they don't exist:
60+
```bash
61+
mc mb truenas/volsync-backups
62+
mc mb truenas/postgres-backups
63+
```
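
If you would rather run the bucket step repeatedly without errors, a small wrapper around `mc mb --ignore-existing` is one way to do it. The function name `ensure_buckets` and the `MC` override are assumptions made for this sketch; the `truenas` alias is the one used above.

```bash
# Creates both buckets, skipping any that already exist.
# Set MC=echo to preview the commands without touching the server.
ensure_buckets() {
  local mc_bin="${MC:-mc}"
  local alias_name="${1:-truenas}"
  local bucket
  for bucket in volsync-backups postgres-backups; do
    "$mc_bin" mb --ignore-existing "${alias_name}/${bucket}"
  done
}
```

Call `ensure_buckets` with no arguments to use the `truenas` alias, or pass a different alias name as the first argument.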

docs/storage-architecture.md

Lines changed: 158 additions & 60 deletions
@@ -4,11 +4,41 @@ This document outlines the storage architecture for the cluster, focusing on dat
44

55
## Overview
66

7-
The cluster uses **Longhorn** as the distributed block storage system. It provides highly available persistent storage for Kubernetes workloads and integrates with S3 for off-site backups.
7+
The cluster uses a layered storage approach:
8+
- **Longhorn**: Distributed block storage for runtime replication (2 replicas per volume)
9+
- **VolSync**: Daily backups of all PVCs to S3 using Kopia
10+
- **Database-native backups**: CloudNativePG and Crunchy Postgres back up directly to S3
11+
12+
## Architecture Diagram
13+
14+
```
15+
┌─────────────────────────────────────────────────────────────────┐
16+
│                          Talos Cluster                          │
17+
│  ┌──────────────────┐    ┌──────────────────┐                   │
18+
│  │   App PVCs       │    │   Postgres DBs   │                   │
19+
│  │   (Longhorn)     │    │   (CNPG/Crunchy) │                   │
20+
│  └────────┬─────────┘    └────────┬─────────┘                   │
21+
│           │                       │                             │
22+
│           ▼                       ▼                             │
23+
│  ┌──────────────────┐    ┌──────────────────┐                   │
24+
│  │     VolSync      │    │    Native PG     │                   │
25+
│  │  (Kopia daily)   │    │  WAL + Backups   │                   │
26+
│  └────────┬─────────┘    └────────┬─────────┘                   │
27+
│           │                       │                             │
28+
└───────────┼───────────────────────┼─────────────────────────────┘
29+
            │                       │
30+
            ▼                       ▼
31+
     ┌─────────────────────────────────────┐
32+
     │       RustFS (S3) on TrueNAS        │
33+
     │           192.168.10.133            │
34+
     │  ├── volsync-backups/               │
35+
     │  └── postgres-backups/              │
36+
     └─────────────────────────────────────┘
37+
```
838

939
## 1. Normal Operation (Write Path)
1040

11-
When an application writes data, it flows through the Kubernetes storage stack to Longhorn, which replicates it across nodes.
41+
When an application writes data, it flows through Kubernetes to Longhorn, which maintains 2 replicas:
1242

1343
```mermaid
1444
graph LR
@@ -23,82 +53,150 @@ graph LR
2353
2454
subgraph "Longhorn Storage Engine"
2555
PV --> LH_Vol[Longhorn Volume]
26-
LH_Vol --> Replica1[Replica 1 (Node A)]
27-
LH_Vol --> Replica2[Replica 2 (Node B)]
56+
LH_Vol --> Replica1[Replica 1 Node A]
57+
LH_Vol --> Replica2[Replica 2 Node B]
2858
end
2959
3060
style App fill:#f9f,stroke:#333,stroke-width:2px
3161
style LH_Vol fill:#bbf,stroke:#333,stroke-width:2px
3262
```
3363

34-
## 2. Backup Strategy (Automatic)
64+
**Longhorn provides:**
65+
- Runtime replication (survives single node failure)
66+
- Fast replica rebuild
67+
- Automatic rebalancing
3568

36-
Backups are handled automatically by Longhorn's **Recurring Jobs**.
37-
- **Snapshots**: Local, instant point-in-time copies (kept for hours/days).
38-
- **Backups**: Deduplicated, compressed chunks sent to S3 (kept for days/weeks).
69+
**Longhorn does NOT provide:**
70+
- Off-cluster backups (handled by VolSync)
71+
- Point-in-time recovery (handled by VolSync)
3972

40-
We use `RecurringJob` groups to assign policies:
41-
- **default**: Daily snapshot, Weekly backup (Applied to ALL new volumes).
42-
- **critical**: Hourly snapshot, Daily backup (Applied via `data-tier: critical` label).
73+
## 2. Backup Strategy
4374

44-
```mermaid
45-
graph TD
46-
subgraph "Cluster"
47-
Volume[Longhorn Volume]
48-
Job[Recurring Job] -- Triggers --> Snapshot[Local Snapshot]
49-
end
75+
### PVC Backups (VolSync)
5076

51-
subgraph "Off-site Storage (S3)"
52-
BackupStore[S3 Bucket]
53-
end
77+
All application PVCs are backed up daily at 2 AM using VolSync with Kopia:
78+
79+
| Setting | Value |
80+
|---------|-------|
81+
| Schedule | `0 2 * * *` (daily at 2 AM) |
82+
| Retention | 14 days |
83+
| Backend | Kopia |
84+
| Target | RustFS S3 on TrueNAS |
85+
| Copy Method | Snapshot |
86+
87+
Each app has:
88+
- `ReplicationSource` - Defines backup schedule and retention
89+
- `ReplicationDestination` - Pre-provisioned for restore capability
90+
- `ExternalSecret` - Pulls S3 credentials from 1Password
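
To make the settings in the table concrete, here is a hedged sketch of what one app's `ReplicationSource` might look like, rendered by a small shell helper. The `kopia:` mover key and its fields are assumed to mirror the layout of VolSync's documented `restic` mover, and the helper name, the `-backup` suffix, and the `-volsync-secret` Secret name are hypothetical; check them against the manifests in `my-apps/`.

```bash
# Prints a ReplicationSource manifest for one PVC.
# Arguments: <app> <namespace> <pvc-name> (all placeholders).
render_replicationsource() {
  local app="$1"
  local namespace="$2"
  local pvc="$3"
  cat <<EOF
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: ${app}-backup
  namespace: ${namespace}
spec:
  sourcePVC: ${pvc}
  trigger:
    schedule: "0 2 * * *"   # daily at 2 AM
  kopia:                    # assumed mover key, modeled on the restic mover
    repository: ${app}-volsync-secret   # assumed Secret name
    copyMethod: Snapshot
    retain:
      daily: 14             # 14-day retention
EOF
}
```

For example, `render_replicationsource home-assistant-config home-assistant config | kubectl apply --dry-run=client -f -` would validate the rendered manifest without changing the cluster.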
91+
92+
### Database Backups (Native)
93+
94+
PostgreSQL databases use their native backup tools:
95+
96+
**CloudNativePG (khoj, paperless)**
97+
- Barman for WAL archiving
98+
- Daily base backups at 3 AM
99+
- 14-day retention
100+
- Point-in-time recovery capable
101+
102+
**Crunchy Postgres (immich)**
103+
- pgBackRest for backups
104+
- Weekly full + daily differential
105+
- 14-day retention
106+
107+
## 3. Disaster Recovery
108+
109+
### Restoring a PVC (VolSync)
110+
111+
When you need to restore a PVC from backup:
54112

55-
Snapshot -- "Deduplicated Upload" --> BackupStore
113+
1. **Trigger the ReplicationDestination**:
114+
```bash
115+
kubectl patch replicationdestination <app>-restore -n <namespace> \
116+
--type merge \
117+
-p '{"spec":{"trigger":{"manual":"restore-'$(date +%s)'"}}}'
118+
```
56119

57-
style Job fill:#f96,stroke:#333,stroke-width:2px
58-
style BackupStore fill:#6f6,stroke:#333,stroke-width:2px
120+
2. **Wait for restore to complete**:
121+
```bash
122+
kubectl get replicationdestination <app>-restore -n <namespace> -w
59123
```
60124

61-
## 3. Disaster Recovery (The "Magic" Restore)
125+
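3. **Recreate the PVC from the restored data** (if needed). `dataSourceRef` is only read when a PVC is created, so an existing PVC has to be deleted and recreated with: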
126+
```yaml
127+
spec:
128+
dataSourceRef:
129+
kind: ReplicationDestination
130+
apiGroup: volsync.backube
131+
name: <app>-restore
132+
```
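
The trigger and wait steps above can be tied together with a small helper. This is a sketch, not part of the repository: `restore_patch` is a hypothetical name, and the polling idea relies on VolSync copying the manual trigger value into `.status.lastManualSync` once the sync has finished.

```bash
# Builds the merge-patch body that sets a new manual trigger value.
restore_patch() {
  local tag="$1"
  printf '{"spec":{"trigger":{"manual":"%s"}}}\n' "$tag"
}

# Hypothetical usage (not run here): patch, then poll until the same tag
# appears in .status.lastManualSync, which marks the restore as finished.
#   tag="restore-$(date +%s)"
#   kubectl patch replicationdestination <app>-restore -n <namespace> \
#     --type merge -p "$(restore_patch "$tag")"
#   kubectl get replicationdestination <app>-restore -n <namespace> \
#     -o jsonpath='{.status.lastManualSync}'
```

Keeping the tag in a variable makes it possible to compare it against the status field instead of watching the resource by eye.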
62133
63-
When the cluster is destroyed and rebuilt, data is restored from S3.
134+
### Restoring a Database
64135
65-
### The "Magic" Explained
66-
The "magic" is now powered by the **Automated Restore Job** (`restore-job.yaml`). It runs automatically when the cluster starts.
136+
**CloudNativePG:**
137+
```yaml
138+
spec:
139+
bootstrap:
140+
recovery:
141+
source: <cluster-name>
142+
# Optional: recoveryTarget for point-in-time
143+
```
67144

68-
1. **Nuke & Rebuild**: Cluster is wiped. Longhorn installs.
69-
2. **Connect S3**: Longhorn connects to S3 and syncs backup metadata.
70-
3. **Dynamic Discovery**: The Restore Job scans **ALL** backups.
71-
4. **Match & Restore**: It finds the **LATEST** backup for your critical apps (e.g., `karakeep/data-pvc`) and creates a PV for it.
72-
5. **Bind**: Your App starts, sees the PV, and binds instantly.
145+
**Crunchy Postgres:**
146+
Use the pgBackRest restore commands, or recreate the cluster with recovery settings.
73147

74-
```mermaid
75-
sequenceDiagram
76-
participant Admin
77-
participant Longhorn
78-
participant S3
79-
participant Job as Restore Job
80-
participant App
81-
82-
Note over Admin, App: Cluster Rebuilt (Empty State)
83-
84-
Longhorn->>S3: Sync Backup Metadata
85-
86-
Job->>Longhorn: "Do we have backups for Karakeep?"
87-
Longhorn-->>Job: "Yes, latest is from 2AM"
88-
89-
Job->>Longhorn: Restore Volume from 2AM Backup
90-
Longhorn->>S3: Download Data
91-
Longhorn->>Longhorn: Reconstruct Volume
92-
93-
Job->>App: Create PV "pv-restore-karakeep"
94-
95-
App->>Longhorn: Mount Volume
96-
Note over App: Application Starts with FRESH Data!
148+
### Full Cluster Rebuild
149+
150+
After a complete cluster rebuild:
151+
152+
1. Deploy infrastructure (ArgoCD, External Secrets, Longhorn, VolSync)
153+
2. VolSync reaches the existing repositories in S3 once its ExternalSecrets have synced
154+
3. For each app, trigger ReplicationDestination to restore data
155+
4. Deploy applications - they bind to restored PVCs
156+
157+
## 4. What Changed from Longhorn Backups
158+
159+
| Feature | Before (Longhorn) | Now (VolSync) |
160+
|---------|-------------------|---------------|
161+
| Backup tool | Longhorn built-in | VolSync + Kopia |
162+
| Backup schedule | RecurringJobs (tiered) | Single daily schedule |
163+
| Restore method | Hardcoded restore-job.yaml | Declarative ReplicationDestination |
164+
| Database backups | PVC snapshots (inconsistent) | Native WAL archiving (consistent) |
165+
| Complexity | Multiple tiers, shell scripts | Simple, uniform config |
166+
167+
## 5. Monitoring
168+
169+
### Check VolSync Status
170+
```bash
171+
# All ReplicationSources
172+
kubectl get replicationsource -A
173+
174+
# Specific app
175+
kubectl describe replicationsource home-assistant-config-backup -n home-assistant
176+
```
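
Because backups run once a day, a simple staleness check on `.status.lastSyncTime` can catch a schedule that has silently stopped. The helper below is a sketch under two assumptions: GNU `date` is available (for `-d`), and 26 hours is an acceptable threshold. The name `is_stale` is hypothetical.

```bash
# Returns 0 (stale) when last_sync is older than max_age_hours, 1 otherwise,
# and 2 if the timestamp cannot be parsed.
# Arguments: <RFC 3339 timestamp> <current epoch seconds> [max age in hours]
is_stale() {
  local last_sync="$1"
  local now_epoch="$2"
  local max_age_hours="${3:-26}"
  local last_epoch
  last_epoch="$(date -u -d "$last_sync" +%s)" || return 2
  [ $(( now_epoch - last_epoch )) -gt $(( max_age_hours * 3600 )) ]
}

# Hypothetical usage (not run here):
#   last="$(kubectl get replicationsource home-assistant-config-backup \
#     -n home-assistant -o jsonpath='{.status.lastSyncTime}')"
#   is_stale "$last" "$(date +%s)" && echo "backup is stale"
```

The extra two hours over a full day leave room for a backup that simply took longer than usual.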
177+
178+
### Check Database Backups
179+
```bash
180+
# CNPG
181+
kubectl get backup -n cloudnative-pg
182+
183+
# Crunchy
184+
kubectl exec -it <postgres-pod> -n postgres-operator -- pgbackrest info
185+
```
186+
187+
### S3 Bucket Contents
188+
```bash
189+
mc ls truenas/volsync-backups/
190+
mc ls truenas/postgres-backups/
97191
```
98192

99-
### Why `longhorn-restore-karakeep` was "Out of Norm"
100-
The `longhorn-restore-karakeep` StorageClass contained a hardcoded `fromBackup` parameter.
101-
- **Pros**: Instant restore without scripts.
102-
- **Cons**: It pins the volume to *that specific backup forever*. If you write new data and restart, it might revert to the old backup depending on reclaim policy. It's not meant for general use.
193+
## 6. Configuration Files
103194

104-
**Best Practice**: Use the standard `longhorn` class. The **Automated Restore Job** will handle the disaster recovery binding for you.
195+
| Component | Location |
196+
|-----------|----------|
197+
| VolSync operator | `infrastructure/storage/volsync/` |
198+
| Longhorn (replication only) | `infrastructure/storage/longhorn/` |
199+
| App VolSync configs | `my-apps/<category>/<app>/replicationsource.yaml` |
200+
| CNPG backup config | `infrastructure/database/cloudnative-pg/*/cluster.yaml` |
201+
| Crunchy backup config | `infrastructure/database/crunchy-postgres/*/cluster.yaml` |
202+
| 1Password setup | `docs/secrets/volsync-secrets.md` |
