Commit ba6aedd

mitchross and claude committed
refactor(volsync): migrate to zero-touch backup architecture
Architecture changes:
- Kyverno now generates Secrets directly (not ExternalSecrets)
- ClusterExternalSecret creates volsync-rustfs-base per labeled namespace
- Kyverno apiCall reads base secret, computes RESTIC_REPOSITORY per PVC
- Removed 15+ legacy volsync-secret.yaml files from apps

Kyverno updates:
- Replace volsync-rbac.yaml with rbac-patch.yaml (Secrets instead of ExternalSecrets)
- Update clusterpolicy description to reflect Secret generation
- Add rustfs-credentials.yaml to volsync kustomization

Namespace requirements:
- Label: volsync.backube/privileged-movers: "true" (for ClusterExternalSecret)
- Annotation: volsync.backube/privileged-movers: "true" (for VolSync movers)

Documentation:
- Update storage-architecture.md with new two-layer secret system
- Add VolumeSnapshot troubleshooting (Longhorn backup target issues)
- Update configuration files table with correct paths
- Add Secret verification commands

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent edff902 commit ba6aedd

67 files changed

Lines changed: 193 additions & 726 deletions

docs/secrets/volsync-secrets.md

Lines changed: 5 additions & 5 deletions
@@ -16,7 +16,7 @@ Create a **Password** item in your 1Password vault:
 | **access_key** | RustFS access key |
 | **secret_key** | RustFS secret key |
 | **restic_password** | A strong random password (32+ characters) |
-| **restic_repository** | `s3:http://192.168.10.133:30292/volsync/` |
+| **restic_repository** | `s3:http://192.168.10.133:30292/volsync-backup/` |
 
 The `restic_password` encrypts all backup repositories stored in S3.
 
@@ -50,16 +50,16 @@ All ExternalSecrets should show `SecretSynced` status.
 
 ## S3 Bucket Setup
 
-Ensure the `volsync` bucket exists in RustFS (192.168.10.133:30292):
+Ensure the `volsync-backup` bucket exists in RustFS (192.168.10.133:30292):
 
 | Bucket | Purpose |
 |--------|---------|
-| `volsync` | VolSync PVC backups (Restic repositories) |
+| `volsync-backup` | VolSync PVC backups (Restic repositories) |
 
 Create it if it doesn't exist:
 ```bash
 mc alias set rustfs http://192.168.10.133:30292 <access_key> <secret_key>
-mc mb rustfs/volsync
+mc mb rustfs/volsync-backup
 ```
 
 ## Auto-Generated Secret Structure
@@ -74,7 +74,7 @@ metadata:
   namespace: <pvc-namespace>
 type: Opaque
 stringData:
-  RESTIC_REPOSITORY: s3:http://192.168.10.133:30292/volsync/<namespace>-<pvc>
+  RESTIC_REPOSITORY: s3:http://192.168.10.133:30292/volsync-backup/<namespace>-<pvc>
   RESTIC_PASSWORD: <from 1Password rustfs.restic_password>
   AWS_ACCESS_KEY_ID: <from 1Password rustfs.access_key>
   AWS_SECRET_ACCESS_KEY: <from 1Password rustfs.secret_key>
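
As a rough illustration of the per-PVC layout this file documents, the sketch below rebuilds the `RESTIC_REPOSITORY` value from a namespace and a PVC name. The helper name `volsync_repo_url`, the two variables, and the example namespace `karakeep` with PVC `data-pvc` are assumptions made for illustration; only the endpoint, bucket, and `<namespace>-<pvc>` pattern come from the diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of this commit: rebuild the per-PVC Restic
# repository URL from the endpoint and bucket shown in the diff.
VOLSYNC_S3_ENDPOINT="s3:http://192.168.10.133:30292"
VOLSYNC_BUCKET="volsync-backup"

# $1 = namespace, $2 = PVC name
volsync_repo_url() {
  printf '%s/%s/%s-%s\n' "$VOLSYNC_S3_ENDPOINT" "$VOLSYNC_BUCKET" "$1" "$2"
}

# Example with an assumed namespace and PVC name.
volsync_repo_url karakeep data-pvc
```

Comparing this string with the decoded value in a generated Secret is one quick way to confirm that a PVC points at the renamed bucket.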

docs/storage-architecture.md

Lines changed: 107 additions & 44 deletions
@@ -14,10 +14,14 @@ The cluster uses a layered storage approach with **zero-touch backup and restore
 
 **User only needs to:**
 1. Add `backup: "hourly"` or `backup: "daily"` label to PVC
-2. Ensure namespace has `volsync.backube/privileged-movers: "true"` annotation
+2. Ensure namespace has `volsync.backube/privileged-movers: "true"` **label** (for base credentials)
+3. Ensure namespace has `volsync.backube/privileged-movers: "true"` **annotation** (for VolSync movers)
+
+**System automatically provides:**
+- **ClusterExternalSecret** creates `volsync-rustfs-base` secret in labeled namespaces (base S3 credentials from 1Password)
 
 **Kyverno automatically generates:**
-- ExternalSecret (S3 credentials)
+- Secret with computed `RESTIC_REPOSITORY` (per-PVC path)
 - ReplicationSource (backups)
 - ReplicationDestination (restore points)
 - dataSourceRef on PVC (if backup exists)
@@ -27,18 +31,21 @@ The cluster uses a layered storage approach with **zero-touch backup and restore
 │ ZERO-TOUCH VOLSYNC ARCHITECTURE │
 ├─────────────────────────────────────────────────────────────────────────────────┤
 │ │
-│ USER PROVIDES: KYVERNO AUTO-GENERATES: │
+│ USER PROVIDES: SYSTEM AUTO-GENERATES:
 │ ┌─────────────────────┐ ┌─────────────────────────────────────┐ │
-│ │ PVC │ │ ExternalSecret │ │
-│ │ labels: │ ────────► │ (S3 creds from 1Password) │ │
+│ │ PVC │ │ volsync-rustfs-base (per namespace) │ │
+│ │ labels: │ (ClusterExternalSecret → Secret) │ │
 │ │ backup: hourly │ │ │ │
-│ └─────────────────────┘ │ ReplicationSource │ │
-│ │ (hourly backups to S3) │ │
-│ ┌─────────────────────┐ │ │ │
-│ │ Namespace │ │ ReplicationDestination │ │
-│ │ annotations: (persists for restore) │ │
+│ └─────────────────────┘ │ {pvc}-volsync-secret (per PVC) │ │
+│ │ (Kyverno apiCall + base64 encode) │ │
+│ ┌─────────────────────┐ ───────► │ │ │
+│ │ Namespace │ │ ReplicationSource │ │
+│ │ labels: (hourly/daily backups to S3) │ │
 │ │ volsync...true │ │ │ │
-│ └─────────────────────┘ │ dataSourceRef on PVC │ │
+│ │ annotations: │ │ ReplicationDestination │ │
+│ │ volsync...true │ │ (persists for restore) │ │
+│ └─────────────────────┘ │ │ │
+│ │ dataSourceRef on PVC │ │
 │ │ (only if backup exists) │ │
 │ └─────────────────────────────────────┘ │
 │ │
@@ -54,7 +61,7 @@ The cluster uses a layered storage approach with **zero-touch backup and restore
 │ ┌────────────────────────────────────────────────────────────────────────────┐ │
 │ │ NEW APP (First Deployment) │ │
 │ │ │ │
-│ │ PVC created ──► Kyverno generates: ExternalSecret + RS + RD │ │
+│ │ PVC created ──► Kyverno generates: Secret + RS + RD │ │
 │ │ │ │ │
 │ │ │ Kyverno mutate checks: Does RD have latestImage? │ │
 │ │ │ │ │ │
@@ -93,7 +100,7 @@ The cluster uses a layered storage approach with **zero-touch backup and restore
 ┌─────────────────────────────────┐
 │ RustFS (S3) on TrueNAS │
 │ 192.168.10.133:30292 │
-│ └── volsync/<ns>-<pvc>/ │
+│ └── volsync-backup/<ns>-<pvc>/ │
 └─────────────────────────────────┘
 ```
 
@@ -106,18 +113,19 @@ When Kyverno sees a PVC with `backup: hourly` or `backup: daily` label:
 ```mermaid
 graph TD
 A[PVC Created with backup label] --> B[Kyverno Generate Policy]
-B --> C[Generate ExternalSecret]
-B --> D[Generate ReplicationSource]
-B --> E[Generate ReplicationDestination]
+B --> C[Read volsync-rustfs-base via apiCall]
+C --> D[Generate Secret with computed RESTIC_REPOSITORY]
+B --> E[Generate ReplicationSource]
+B --> F[Generate ReplicationDestination]
 
-C --> F[S3 credentials from 1Password]
-D --> G[Hourly/Daily backups to S3]
-E --> H[Syncs from S3, creates latestImage]
+D --> G[Per-PVC S3 path: volsync-backup/namespace-pvcname]
+E --> H[Hourly/Daily backups to S3]
+F --> I[Syncs from S3, creates latestImage]
 
 style A fill:#90EE90
-style C fill:#FFB6C1
-style D fill:#87CEEB
-style E fill:#DDA0DD
+style D fill:#FFB6C1
+style E fill:#87CEEB
+style F fill:#DDA0DD
 ```
 
 ### Mutate Policy (`volsync-auto-restore`)
@@ -183,7 +191,7 @@ graph LR
 end
 
 subgraph "Kyverno Generated"
-PVC -.-> ES[ExternalSecret]
+PVC -.-> SEC[Secret]
 PVC -.-> RS[ReplicationSource]
 PVC -.-> RD[ReplicationDestination]
 end
@@ -222,17 +230,26 @@ spec:
 | Hourly | `backup: "hourly"` | Every hour at :00 | 24 hourly + 7 daily | Critical apps |
 | Daily | `backup: "daily"` | Daily at 2:00 AM | 14 days | Non-critical apps |
 
-### What Kyverno Creates
+### What Gets Created
+
+**Per Namespace (via ClusterExternalSecret):**
+
+| Resource | Name | Purpose |
+|----------|------|---------|
+| Secret | `volsync-rustfs-base` | Base S3 credentials from 1Password |
 
-For each labeled PVC, Kyverno generates:
+**Per PVC (via Kyverno Generate Policy):**
 
 | Resource | Name Pattern | Purpose |
 |----------|--------------|---------|
-| ExternalSecret | `{pvc-name}-volsync-secret` | S3 credentials from 1Password |
+| Secret | `{pvc-name}-volsync-secret` | S3 creds + computed RESTIC_REPOSITORY |
 | ReplicationSource | `{pvc-name}-backup` | Backs up to S3 on schedule |
 | ReplicationDestination | `{pvc-name}-restore` | Syncs from S3, maintains latestImage |
 
-**S3 Repository Path:** `s3://volsync/{namespace}-{pvc-name}/`
+**S3 Repository Path:** `s3://volsync-backup/{namespace}-{pvc-name}/`
+
+**Why direct Secret instead of ExternalSecret?**
+Kyverno can't pass ExternalSecret Go templates (`{{ .value }}`) without interpreting them as Kyverno variables. By using Kyverno's `apiCall` to read the base secret and `base64_encode()` to compute RESTIC_REPOSITORY, we avoid this conflict.
 
 ### Database Backups (Native)
 
@@ -288,13 +305,14 @@ This ensures backup data is always available for restore, even after app deletio
 
 ```
 1. User creates PVC with backup: hourly label
-2. Kyverno generates ExternalSecret, RS, RD
-3. Kyverno mutate: No latestImage → no dataSourceRef
-4. PVC created with empty volume
-5. App starts with empty data
-6. First backup runs at next :00
-7. Sync CronJob runs (every 15 min) → triggers RD → creates latestImage
-8. Future restores will work!
+2. ClusterExternalSecret ensures volsync-rustfs-base exists in namespace
+3. Kyverno generates Secret (with RESTIC_REPOSITORY), RS, RD
+4. Kyverno mutate: No latestImage → no dataSourceRef
+5. PVC created with empty volume
+6. App starts with empty data
+7. First backup runs at next :00 (creates Longhorn snapshot, runs restic)
+8. Sync CronJob runs (every 15 min) → triggers RD → creates latestImage
+9. Future restores will work!
 ```
 
 **Result:** App starts fresh, backups begin automatically.
@@ -369,15 +387,17 @@ kubectl scale deployment <app> -n <namespace> --replicas=1
 
 ## 6. Onboarding New Apps
 
-### Step 1: Add namespace annotation (once per namespace)
+### Step 1: Add namespace label AND annotation (once per namespace)
 
 ```yaml
 apiVersion: v1
 kind: Namespace
 metadata:
   name: my-app
+  labels:
+    volsync.backube/privileged-movers: "true" # Triggers ClusterExternalSecret
   annotations:
-    volsync.backube/privileged-movers: "true"
+    volsync.backube/privileged-movers: "true" # Allows VolSync privileged movers
 ```
 
 ### Step 2: Add backup label to PVC
@@ -397,7 +417,9 @@ spec:
   storageClassName: longhorn
 ```
 
-**Done!** Kyverno handles everything else.
+**Done!** The system handles everything else:
+1. ClusterExternalSecret creates `volsync-rustfs-base` in the namespace
+2. Kyverno generates per-PVC Secret, RS, and RD
 
 ## 7. Monitoring
 
@@ -417,31 +439,37 @@ NAMESPACE:.metadata.namespace,\
 NAME:.metadata.name,\
 LATEST_IMAGE:.status.latestImage.name
 
-# Check Kyverno-generated ExternalSecrets
-kubectl get externalsecret -A | grep volsync
+# Check Kyverno-generated Secrets
+kubectl get secret -A | grep volsync-secret
+
+# Check base secrets exist in namespaces
+kubectl get secret -A | grep volsync-rustfs-base
 ```
 
 ### Verify Backups in S3
 
 ```bash
 # List all backup repositories
 mc alias set rustfs http://192.168.10.133:30292 <access_key> <secret_key>
-mc ls rustfs/volsync/
+mc ls rustfs/volsync-backup/
 
 # Check specific app backup
-mc ls rustfs/volsync/karakeep-data-pvc/
+mc ls rustfs/volsync-backup/karakeep-data-pvc/
 ```
 
 ## 8. Configuration Files
 
 | Component | Location |
 |-----------|----------|
 | VolSync operator | `infrastructure/storage/volsync/` |
+| ClusterExternalSecret (base creds) | `infrastructure/storage/volsync/rustfs-credentials.yaml` |
 | Sync CronJob | `infrastructure/storage/volsync/sync-cronjob.yaml` |
 | Kyverno generate policy | `infrastructure/controllers/kyverno/volsync-clusterpolicy.yaml` |
 | Kyverno mutate policy | `infrastructure/controllers/kyverno/volsync-restore-mutate.yaml` |
+| Kyverno RBAC for Secrets | `infrastructure/controllers/kyverno/rbac-patch.yaml` |
 | Snapshot Controller | `infrastructure/storage/snapshot-controller/` |
 | Longhorn | `infrastructure/storage/longhorn/` |
+| Cilium network policy | `infrastructure/networking/cilium/policies/block-lan-access.yaml` |
 | ArgoCD ignoreDifferences | `infrastructure/controllers/argocd/apps/*-appset.yaml` |
 | 1Password secret (rustfs) | Configured in 1Password vault |
 
@@ -455,11 +483,17 @@ kubectl describe pvc <name> -n <namespace>
 kubectl get replicationdestination <name>-restore -n <namespace> -o yaml
 ```
 
-### ExternalSecret not creating Secret
+### Secret missing or incomplete
 
 ```bash
-kubectl get externalsecret <pvc>-volsync-secret -n <namespace> -o yaml
-kubectl describe externalsecret <pvc>-volsync-secret -n <namespace>
+# Check if base secret exists in namespace
+kubectl get secret volsync-rustfs-base -n <namespace>
+
+# Check if per-PVC secret was generated
+kubectl get secret <pvc>-volsync-secret -n <namespace> -o yaml
+
+# Verify RESTIC_REPOSITORY is set (should be base64 encoded)
+kubectl get secret <pvc>-volsync-secret -n <namespace> -o jsonpath='{.data.RESTIC_REPOSITORY}' | base64 -d
 ```
 
 ### ReplicationSource not backing up
@@ -469,10 +503,39 @@ kubectl get replicationsource <pvc>-backup -n <namespace> -o yaml
 kubectl logs -n volsync-system -l app.kubernetes.io/name=volsync
 ```
 
+### VolumeSnapshot stuck (READYTOUSE: false)
+
+This usually means Longhorn can't reach its backup target. Check:
+
+```bash
+# Check snapshot status and error
+kubectl describe volumesnapshot -n <namespace>
+
+# Common errors:
+# - "connection refused" to backup target → Check Cilium network policy
+# - "access denied" → Check Longhorn backup credentials
+
+# Fix: Delete stuck snapshots to let VolSync retry
+kubectl delete volumesnapshot -n <namespace> --all
+
+# Verify Cilium allows traffic to backup target (192.168.10.133:9000)
+kubectl get ciliumclusterwidenetworkpolicy default-deny-lan-egress -o yaml | grep -A5 "9000"
+```
+
 ### Force sync ReplicationDestination
 
 ```bash
 kubectl patch replicationdestination <pvc>-restore -n <namespace> \
   --type merge \
   -p '{"spec":{"trigger":{"manual":"sync-'$(date +%s)'"}}}'
 ```
+
+### Check backup job logs
+
+```bash
+# Find the backup job
+kubectl get jobs -n <namespace> | grep volsync
+
+# Check logs
+kubectl logs -n <namespace> -l job-name=volsync-src-<pvc>-backup
+```
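
The troubleshooting commands in this file all depend on the per-PVC naming patterns, so a small helper that prints the expected names can save typing. This is only a sketch: the function name `volsync_resource_names` and the example PVC `data-pvc` are hypothetical, while the name patterns are copied from the tables and commands in the diff above.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of this commit: print the resource names the
# documentation says are derived from a PVC name.
# $1 = PVC name
volsync_resource_names() {
  printf 'secret=%s-volsync-secret\n' "$1"
  printf 'replicationsource=%s-backup\n' "$1"
  printf 'replicationdestination=%s-restore\n' "$1"
  # Value used with the job-name label selector in "Check backup job logs".
  printf 'backup_job=volsync-src-%s-backup\n' "$1"
}

# Example with an assumed PVC name.
volsync_resource_names data-pvc
```

Each printed value can be pasted into the matching `kubectl get` or `kubectl logs` command in place of the `<pvc>`-based placeholder.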

infrastructure/controllers/kyverno/kustomization.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ kind: Kustomization
 namespace: kyverno
 resources:
 - namespace.yaml
-- volsync-rbac.yaml
+- rbac-patch.yaml
 - volsync-clusterpolicy.yaml
 - volsync-restore-mutate.yaml
 helmCharts:

infrastructure/controllers/kyverno/volsync-clusterpolicy.yaml

Lines changed: 4 additions & 4 deletions
@@ -6,9 +6,9 @@ metadata:
     argocd.argoproj.io/sync-wave: "2"
     policies.kyverno.io/title: Generate VolSync Backup Resources
     policies.kyverno.io/description: >-
-      Automatically generates ExternalSecret, ReplicationSource, and ReplicationDestination
+      Automatically generates Secret (with computed RESTIC_REPOSITORY), ReplicationSource, and ReplicationDestination
       for PVCs labeled with backup=hourly or backup=daily.
-      User only needs: 1) PVC with backup label, 2) namespace annotation volsync.backube/privileged-movers: "true"
+      User only needs: 1) PVC with backup label, 2) namespace label+annotation volsync.backube/privileged-movers: "true"
 spec:
   generateExisting: true
   rules:
@@ -45,7 +45,7 @@ spec:
 AWS_ACCESS_KEY_ID: "{{baseSecret.AWS_ACCESS_KEY_ID}}"
 AWS_SECRET_ACCESS_KEY: "{{baseSecret.AWS_SECRET_ACCESS_KEY}}"
 RESTIC_PASSWORD: "{{baseSecret.RESTIC_PASSWORD}}"
-RESTIC_REPOSITORY: "{{base64_encode('s3:http://192.168.10.133:30292/volsync/{{request.object.metadata.namespace}}-{{request.object.metadata.name}}')}}"
+RESTIC_REPOSITORY: "{{base64_encode('s3:http://192.168.10.133:30292/volsync-backup/{{request.object.metadata.namespace}}-{{request.object.metadata.name}}')}}"
 
 # Rule 1: Generate ReplicationSource for hourly backups
 - name: generate-hourly-replicationsource
@@ -155,7 +155,7 @@ spec:
 AWS_ACCESS_KEY_ID: "{{baseSecret.AWS_ACCESS_KEY_ID}}"
 AWS_SECRET_ACCESS_KEY: "{{baseSecret.AWS_SECRET_ACCESS_KEY}}"
 RESTIC_PASSWORD: "{{baseSecret.RESTIC_PASSWORD}}"
-RESTIC_REPOSITORY: "{{base64_encode('s3:http://192.168.10.133:30292/volsync/{{request.object.metadata.namespace}}-{{request.object.metadata.name}}')}}"
+RESTIC_REPOSITORY: "{{base64_encode('s3:http://192.168.10.133:30292/volsync-backup/{{request.object.metadata.namespace}}-{{request.object.metadata.name}}')}}"
 
 # Rule 3: Generate ReplicationSource for daily backups
 - name: generate-daily-replicationsource
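
Values under a Kubernetes Secret's `data` field must be base64-encoded, which is presumably why the policy wraps the computed repository string in `base64_encode()`. The sketch below mimics that encoding locally and decodes it again, roughly what the `base64 -d` verification command in the docs does. The helper names and the example namespace and PVC are assumptions; it relies on the coreutils `base64` command and strips the line wrapping that GNU `base64` adds to long output.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of this commit: emulate how a computed
# RESTIC_REPOSITORY string would appear in, and be read back from, Secret data.

# $1 = plain value; prints single-line base64 (line wrapping removed)
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# $1 = base64 value; prints the decoded plain text
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# Example repository string for an assumed namespace "karakeep" and PVC "data-pvc".
repo='s3:http://192.168.10.133:30292/volsync-backup/karakeep-data-pvc'
encoded="$(encode_secret_value "$repo")"
decode_secret_value "$encoded"
echo
```

If the decoded value of a real Secret still contains `/volsync/` rather than `/volsync-backup/`, the Secret was likely generated before this change and has not yet been regenerated.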
