Commit f72d40f

Migrate VolSync backup from Restic to Kopia with monitoring
Switches VolSync PVC backup/restore from Restic to Kopia for improved speed, compression, and deduplication. Updates documentation, Kyverno policy, and secret generation to use Kopia-specific configuration. Adds Prometheus alerting rules for VolSync and pvc-plumber health, backup failures, and maintenance. Includes a detailed comparison note with home-ops and documents new monitoring integration.
1 parent 2c2b9c4 commit f72d40f

5 files changed

Lines changed: 423 additions & 26 deletions

docs/backup-restore.md

Lines changed: 48 additions & 13 deletions
@@ -4,7 +4,13 @@ This document describes the automated backup and restore system for Kubernetes P
 
 ## Overview
 
-The system automatically backs up PVCs to S3-compatible storage (RustFS/MinIO) and restores them on disaster recovery or app re-deployment. It uses a "look-before-you-leap" pattern to conditionally restore only when backups exist.
+The system automatically backs up PVCs to S3-compatible storage (RustFS/TrueNAS) using **Kopia** and restores them on disaster recovery or app re-deployment. It uses a "look-before-you-leap" pattern to conditionally restore only when backups exist.
+
+### Why Kopia over Restic?
+
+- **Faster**: Parallel uploads, better compression (zstd)
+- **Efficient**: Content-defined chunking with deduplication
+- **Maintained**: Active development, used by VolSync maintainers
 
 ## Architecture
 
@@ -51,9 +57,10 @@ The system automatically backs up PVCs to S3-compatible storage (RustFS/MinIO) a
 - If backup exists: mutates PVC with `dataSourceRef` for auto-restore
 
 ### 4. VolSync
-- Performs actual backup/restore operations using Restic
+- Performs actual backup/restore operations using **Kopia**
 - Uses Longhorn snapshots for consistent backups
-- Stores data in S3 with Restic encryption
+- Stores data in S3 with Kopia encryption and zstd compression
+- Parallel uploads (parallelism: 2) for faster backups
 
 ## Sync Wave Order
 
@@ -121,25 +128,40 @@ The `rustfs` item in 1Password must contain:
 |-------|--------------|---------|
 | `k8s-admin-access-key` | `k8s-admin` | S3 access key ID |
 | `k8s-admin-secret-key` | (secret) | S3 secret access key |
-| `restic_password` | (password) | Restic encryption key |
-| `restic_repository` | `s3:http://192.168.10.133:30292/volsync-backup/` | Base S3 path |
+| `kopia_password` | (password) | Kopia repository encryption key |
 | `endpoint` | `http://192.168.10.133:30292` | S3 endpoint (for pvc-plumber) |
 | `bucket` | `volsync-backup` | S3 bucket (for pvc-plumber) |
 
+### Generated Secret Contents
+
+Kyverno generates a secret per-PVC with:
+
+| Key | Source | Purpose |
+|-----|--------|---------|
+| `KOPIA_PASSWORD` | 1Password | Repository encryption |
+| `KOPIA_S3_BUCKET` | Template | Bucket name |
+| `KOPIA_S3_ENDPOINT` | Template | S3 endpoint (without http://) |
+| `KOPIA_S3_PREFIX` | Template | `{namespace}/{pvc-name}` path |
+| `KOPIA_S3_DISABLE_TLS` | Template | `true` for http endpoints |
+| `KOPIA_S3_ACCESS_KEY_ID` | 1Password | S3 access key |
+| `KOPIA_S3_SECRET_ACCESS_KEY` | 1Password | S3 secret key |
+
 ## S3 Bucket Structure
 
 ```
 volsync-backup/
 ├── {namespace}/
 │   └── {pvc-name}/
-│       ├── config      # Restic repository config
-│       ├── data/       # Deduplicated backup data
-│       ├── index/      # Restic index files
-│       ├── keys/       # Encryption keys
-│       ├── locks/      # Lock files
-│       └── snapshots/  # Snapshot metadata
+│       ├── kopia.repository  # Kopia repository config
+│       ├── kopia.blobcfg     # Blob storage config
+│       ├── p/                # Pack files (deduplicated data)
+│       ├── q/                # Index blobs
+│       ├── n/                # Manifest blobs
+│       └── x/                # Session blobs
 ```
 
+Note: Kopia uses content-addressable storage with pack files for efficient deduplication.
+
 ## Critical Implementation Details
 
 ### Kyverno Policy: Operations Filter (Race Condition Fix)
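Note on the hunk above: the 1Password `endpoint` field keeps its `http://` scheme, while the generated secret stores the endpoint without a scheme plus a separate `KOPIA_S3_DISABLE_TLS` flag. A minimal sketch of that mapping, assuming a hypothetical helper named `split_endpoint` (it is not part of the commit, where both values are hard-coded in the policy template):

```shell
# Hypothetical helper, not part of the commit: derive the two secret values
# from a full endpoint URL such as the one stored in 1Password.
split_endpoint() {
  local url="$1"
  local host="${url#*://}"   # drop "http://" or "https://" if present
  host="${host%/}"           # drop a single trailing slash
  local disable_tls="false"
  case "$url" in
    http://*) disable_tls="true" ;;   # plain HTTP means TLS must be disabled
  esac
  printf 'KOPIA_S3_ENDPOINT=%s\nKOPIA_S3_DISABLE_TLS=%s\n' "$host" "$disable_tls"
}

split_endpoint "http://192.168.10.133:30292"
```

Keeping the two values derived from one source would avoid a mismatch if the endpoint ever moves to HTTPS; whether that is worth templating is a judgment call for the policy author.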
@@ -243,7 +265,7 @@ rules:
 **Fix:** Add `mergePolicy: Merge` to ExternalSecret template.
 
 **Verify:** `kubectl get secret volsync-<pvc-name> -n <namespace> -o json | jq '.data | keys'`
-Should show: `["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "RESTIC_PASSWORD", "RESTIC_REPOSITORY"]`
+Should show: `["KOPIA_PASSWORD", "KOPIA_S3_ACCESS_KEY_ID", "KOPIA_S3_BUCKET", "KOPIA_S3_DISABLE_TLS", "KOPIA_S3_ENDPOINT", "KOPIA_S3_PREFIX", "KOPIA_S3_SECRET_ACCESS_KEY"]`
 
 ### Backup Not Running
 1. Check ReplicationSource: `kubectl get replicationsource -n <namespace>`
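Note on the **Verify** step above: comparing seven key names by eye is error-prone, so the check could be scripted. A minimal sketch, assuming a hypothetical helper named `missing_kopia_keys` that reads key names (one per line) on stdin and prints any expected key that is absent:

```shell
# Hypothetical helper, not part of the commit: report which of the seven
# expected Kopia secret keys are missing from the names given on stdin.
missing_kopia_keys() {
  local expected="KOPIA_PASSWORD KOPIA_S3_ACCESS_KEY_ID KOPIA_S3_BUCKET KOPIA_S3_DISABLE_TLS KOPIA_S3_ENDPOINT KOPIA_S3_PREFIX KOPIA_S3_SECRET_ACCESS_KEY"
  local present key
  present="$(cat)"                      # all key names currently in the secret
  for key in $expected; do
    # -x matches the whole line, -F treats the key as a literal string
    printf '%s\n' "$present" | grep -qxF "$key" || printf '%s\n' "$key"
  done
}

# Example with a secret that still carries pre-migration Restic-style keys:
printf '%s\n' AWS_ACCESS_KEY_ID RESTIC_PASSWORD KOPIA_PASSWORD | missing_kopia_keys
```

Against a live cluster the input would come from the documented command with `jq -r '.data | keys[]'` in place of `jq '.data | keys'`; empty output means the secret is complete.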
@@ -264,11 +286,24 @@ The following namespaces are excluded from automatic backup:
 - `volsync-system`
 - `kyverno`
 
+## Prometheus Monitoring
+
+VolSync alerts are configured in `monitoring/prometheus-stack/volsync-alerts.yaml`:
+
+| Alert | Severity | Description |
+|-------|----------|-------------|
+| `VolSyncControllerDown` | Critical | VolSync controller unavailable |
+| `VolSyncVolumeOutOfSync` | Critical | Backup failed or never completed |
+| `VolSyncMissedScheduledBackup` | Warning | Scheduled backup was skipped |
+| `VolSyncDurationTooLong` | Warning | Backup taking > 1 hour |
+| `PvcPlumberDown` | Critical | pvc-plumber service unavailable |
+
 ## Files
 
 | File | Purpose |
 |------|---------|
 | `infrastructure/controllers/pvc-plumber/` | Backup existence checker service |
-| `infrastructure/controllers/kyverno/policies/volsync-pvc-backup-restore.yaml` | Kyverno policy |
+| `infrastructure/controllers/kyverno/policies/volsync-pvc-backup-restore.yaml` | Kyverno policy (Kopia) |
 | `infrastructure/storage/volsync/` | VolSync Helm chart + VolumeSnapshotClass |
 | `infrastructure/controllers/argocd/apps/pvc-plumber-app.yaml` | ArgoCD Application |
+| `monitoring/prometheus-stack/volsync-alerts.yaml` | Prometheus alerting rules |
Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
+# VolSync Implementation Comparison: Our Approach vs home-ops (onedr0p)
+
+**Date:** 2026-01-24
+**Purpose:** Reference for potential future reimplementation or improvements
+
+## Summary
+
+Our Kyverno + pvc-plumber approach is architecturally superior for DRY and ease of use. home-ops has better polish (monitoring, UI, Kopia speed). This note captures the full analysis.
+
+---
+
+## Our Approach
+
+### Architecture
+- **Policy Engine:** Kyverno ClusterPolicy
+- **Backup Engine:** Restic (could migrate to Kopia)
+- **Storage Backend:** S3 (TrueNAS RustFS)
+- **Restore Detection:** pvc-plumber service (S3 HEAD request)
+- **Opt-in Mechanism:** Label on PVC (`backup: "hourly"` or `backup: "daily"`)
+
+### How It Works
+1. Add `backup: "hourly"` label to any PVC
+2. Kyverno policy triggers on PVC CREATE
+3. pvc-plumber checks S3 for existing backup
+4. Kyverno generates: ExternalSecret, ReplicationSource, ReplicationDestination
+5. If backup exists, PVC gets `dataSourceRef` for auto-restore
+
+### Key Files
+- `infrastructure/controllers/kyverno/policies/volsync-pvc-backup-restore.yaml`
+- `infrastructure/controllers/pvc-plumber/deployment.yaml`
+- `infrastructure/storage/volsync/`
+
+### Strengths
+- **Zero per-app configuration** - just add a label
+- **Conditional restore** - fresh clusters start empty, DR restores data
+- **GitOps-agnostic** - works with ArgoCD, Flux, or anything
+- **Per-PVC repository isolation** - each PVC gets its own Restic repo
+- **S3 backend** - portable, resilient
+
+### Weaknesses
+- Restic is slower than Kopia
+- No backup UI
+- No Prometheus alerts/Grafana dashboards
+- No maintenance jobs for repository cleanup
+
+---
+
+## home-ops Approach (onedr0p)
+
+### Architecture
+- **Policy Engine:** Kustomize Components + MutatingAdmissionPolicy
+- **Backup Engine:** Kopia
+- **Storage Backend:** NFS (single server: `expanse.internal:/mnt/eros/VolsyncKopia`)
+- **Restore Detection:** None (always tries to restore via `IfNotPresent` label)
+- **Opt-in Mechanism:** Include component in Flux Kustomization
+
+### How It Works
+1. Each app's `ks.yaml` includes the volsync component:
+   ```yaml
+   spec:
+     components:
+       - ../../../../components/volsync
+     postBuild:
+       substitute:
+         APP: sonarr
+         VOLSYNC_CAPACITY: 5Gi
+   ```
+2. Component generates: ExternalSecret, PVC, ReplicationSource, ReplicationDestination
+3. MutatingAdmissionPolicy injects NFS volume into VolSync mover jobs
+4. MutatingAdmissionPolicy adds jitter (0-30s random sleep) to prevent backup storms
+
+### Key Files (in home-ops/)
+- `kubernetes/components/volsync/` - Kustomize component
+- `kubernetes/apps/volsync-system/volsync/app/mutatingadmissionpolicy.yaml` - Job injection
+- `kubernetes/apps/volsync-system/volsync/maintenance/` - Repository maintenance
+- `kubernetes/apps/volsync-system/kopia/` - Kopia Web UI
+
+### Strengths
+- Kopia is faster (parallel, better compression)
+- Kopia Web UI for browsing backups
+- Prometheus alerts for out-of-sync volumes
+- Grafana dashboard
+- Repository maintenance jobs (KopiaMaintenance)
+- Jitter prevents backup storms
+
+### Weaknesses
+- **DRY violation** - every app needs ~10 lines in ks.yaml
+- **No conditional restore** - may fail on fresh clusters
+- **Flux-specific** - tied to Flux Kustomizations
+- **NFS dependency** - single point of failure
+
+---
+
+## Feature Comparison
+
+| Feature | Our Solution | home-ops | Winner |
+|---------|-------------|----------|--------|
+| Lines of YAML per app | 1 (label) | ~10 (component + vars) | **Ours** |
+| Conditional restore | Yes | No | **Ours** |
+| Fresh cluster behavior | Works | May fail | **Ours** |
+| Backup engine speed | Restic (slower) | Kopia (faster) | home-ops |
+| Backup UI | None | Kopia Web UI | home-ops |
+| Monitoring | None | Prometheus + Grafana | home-ops |
+| Repository maintenance | None | KopiaMaintenance | home-ops |
+| Jitter for scheduling | None | MutatingAdmissionPolicy | home-ops |
+| GitOps tool independence | Yes | No (Flux-specific) | **Ours** |
+
+---
+
+## Improvements We Could Adopt
+
+### 1. Switch to Kopia (instead of Restic)
+Change `restic:` to `kopia:` in Kyverno policy. Kopia supports S3 natively.
+
+### 2. Add Prometheus Alerts
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+spec:
+  groups:
+    - name: volsync.rules
+      rules:
+        - alert: VolSyncVolumeOutOfSync
+          expr: volsync_volume_out_of_sync == 1
+          for: 5m
+          labels:
+            severity: critical
+```
+
+### 3. Add Grafana Dashboard
+home-ops has one at `volsync-system/volsync/app/grafanadashboard.yaml`
+
+### 4. Add Repository Maintenance
+Generate KopiaMaintenance jobs (or use Restic `prune` commands via CronJob)
+
+### 5. Jitter (Optional)
+Add MutatingAdmissionPolicy to inject random sleep into VolSync jobs. Less relevant for single-node clusters.
+
+---
+
+## home-ops Useful Commands (from mod.just)
+
+```bash
+# Manual snapshot all PVCs
+kubectl get replicationsources --no-headers -A | while read -r ns name _; do
+  kubectl -n "$ns" patch replicationsources "$name" --type merge \
+    -p '{"spec":{"trigger":{"manual":"'$(date +%s)'"}}}'
+done
+
+# Browse a PVC (requires kubectl-browse-pvc plugin)
+kubectl browse-pvc -n <namespace> -i alpine:latest <claim>
+
+# Suspend VolSync (for maintenance)
+kubectl -n volsync-system scale deployment volsync --replicas 0
+```
+
+---
+
+## Decision Record
+
+**2026-01-24:** After thorough comparison, our label-driven Kyverno approach is preferred for:
+- Simpler developer experience (just add a label)
+- True conditional restore (pvc-plumber checks S3)
+- GitOps tool independence
+
+Future improvements: Consider Kopia migration for speed, add monitoring.
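Note on "Jitter (Optional)" in the comparison above: the home-ops policy is described as injecting a 0-30s random sleep into mover jobs. The sleep itself is simple to express. A minimal sketch, assuming bash and a hypothetical helper named `jitter_seconds`; how it would be injected into VolSync jobs is not shown in the commit and is left out here:

```shell
# Hypothetical helper, not part of the commit: pick a delay of 0..30 seconds.
jitter_seconds() {
  # In bash, $RANDOM yields 0..32767; modulo 31 maps it onto 0..30.
  # Shells without $RANDOM fall back to 0, which means no jitter.
  printf '%s\n' "$(( ${RANDOM:-0} % 31 ))"
}

# A mover wrapper would sleep before starting the backup, for example:
#   sleep "$(jitter_seconds)"
echo "would sleep $(jitter_seconds)s before starting the backup"
```

As the note itself says, this matters less on a single-node cluster, where few backups start at the same instant.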

infrastructure/controllers/kyverno/policies/volsync-pvc-backup-restore.yaml

Lines changed: 23 additions & 13 deletions
@@ -5,11 +5,12 @@ metadata:
   name: volsync-pvc-backup-restore
   annotations:
     argocd.argoproj.io/sync-wave: "4"
-    policies.kyverno.io/title: VolSync PVC Backup and Restore
+    policies.kyverno.io/title: VolSync PVC Backup and Restore (Kopia)
     policies.kyverno.io/description: >-
       Automatically configures VolSync backup and restore for PVCs with the
-      label backup: "hourly" or backup: "daily". Checks S3 for existing backup
-      via pvc-plumber and conditionally enables restore via VolumePopulator.
+      label backup: "hourly" or backup: "daily". Uses Kopia for faster backups
+      with compression. Checks S3 for existing backup via pvc-plumber and
+      conditionally enables restore via VolumePopulator.
 spec:
   mutateExistingOnPolicyUpdate: false
   background: true
@@ -58,9 +59,9 @@ spec:
               kind: ReplicationDestination
               name: "{{request.object.metadata.name}}-restore"
 
-    # Rule 2: Generate ExternalSecret for per-PVC restic repository
+    # Rule 2: Generate ExternalSecret for per-PVC Kopia repository
     # IMPORTANT: Only trigger on CREATE to avoid race conditions during PVC deletion
-    - name: generate-restic-secret
+    - name: generate-kopia-secret
       match:
         any:
           - resources:
@@ -107,20 +108,25 @@ spec:
                     app.kubernetes.io/managed-by: kyverno
                     volsync.backup/pvc: "{{request.object.metadata.name}}"
                 data:
-                  RESTIC_REPOSITORY: "s3:http://192.168.10.133:30292/volsync-backup/{{request.object.metadata.namespace}}/{{request.object.metadata.name}}"
+                  # Kopia S3 repository configuration
+                  # Per-PVC path within bucket: namespace/pvc-name
+                  KOPIA_S3_BUCKET: "volsync-backup"
+                  KOPIA_S3_ENDPOINT: "192.168.10.133:30292"
+                  KOPIA_S3_PREFIX: "{{request.object.metadata.namespace}}/{{request.object.metadata.name}}"
+                  KOPIA_S3_DISABLE_TLS: "true"
             data:
-              - secretKey: AWS_ACCESS_KEY_ID
+              - secretKey: KOPIA_S3_ACCESS_KEY_ID
                 remoteRef:
                   key: rustfs
                   property: k8s-admin-access-key
-              - secretKey: AWS_SECRET_ACCESS_KEY
+              - secretKey: KOPIA_S3_SECRET_ACCESS_KEY
                 remoteRef:
                   key: rustfs
                   property: k8s-admin-secret-key
-              - secretKey: RESTIC_PASSWORD
+              - secretKey: KOPIA_PASSWORD
                 remoteRef:
                   key: rustfs
-                  property: restic_password
+                  property: kopia_password
 
     # Rule 3: Generate ReplicationSource (backup schedule)
     # IMPORTANT: Only trigger on CREATE to avoid race conditions during PVC deletion
@@ -160,14 +166,18 @@ spec:
             trigger:
               # Use label value for schedule: hourly = every hour, daily = once per day at 2am
               schedule: "{{ request.object.metadata.labels.backup == 'hourly' && '0 * * * *' || '0 2 * * *' }}"
-            restic:
-              pruneIntervalDays: 7
+            kopia:
               repository: "volsync-{{request.object.metadata.name}}"
+              # Kopia-specific optimizations
+              compression: zstd-fastest
+              parallelism: 2
+              # Retention policy
               retain:
                 hourly: 24
                 daily: 7
                 weekly: 4
                 monthly: 2
+              # Snapshot-based backup via Longhorn
               copyMethod: Snapshot
               storageClassName: longhorn
               volumeSnapshotClassName: longhorn-snapclass
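Note on the `schedule` expression in the hunk above: it picks the hourly cron string when the `backup` label equals `hourly` and falls through to the 2am daily schedule for every other value. The same decision can be sketched in shell to make the fallback explicit (`backup_schedule` is a hypothetical name used only for illustration):

```shell
# Hypothetical helper, not part of the commit: mirror the policy's
# label-to-cron mapping. Anything other than "hourly" gets the daily schedule.
backup_schedule() {
  case "$1" in
    hourly) printf '%s\n' '0 * * * *' ;;   # top of every hour
    *)      printf '%s\n' '0 2 * * *' ;;   # once per day at 02:00
  esac
}

backup_schedule hourly
backup_schedule daily
```

The fallback means a mistyped label value would silently get the daily schedule if it reached this rule; the rule's match block, which is outside this hunk, presumably limits which label values trigger it.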
@@ -213,7 +223,7 @@ spec:
           spec:
             trigger:
               manual: restore-once
-            restic:
+            kopia:
               repository: "volsync-{{request.object.metadata.name}}"
               copyMethod: Snapshot
               storageClassName: longhorn

monitoring/prometheus-stack/kustomization.yaml

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ resources:
   - custom-alerts.yaml
   - custom-servicemonitors.yaml
   - longhorn-backup-alerts.yaml
+  - volsync-alerts.yaml # VolSync backup/restore alerts
   - dcgm-exporter.yaml # DCGM GPU monitoring
   - gpu-alerts.yaml # GPU alerting rules
   - gpu-dashboard.yaml # Grafana dashboard
