Commit 60dd03a

committed
up
1 parent 6db419f commit 60dd03a

10 files changed

Lines changed: 390 additions & 1331 deletions

docs/backup-restore-architecture.md

Lines changed: 95 additions & 218 deletions
@@ -14,72 +14,82 @@ Add a label to your PVC → Backups happen automatically → Restores happen aut
1414

1515
That's it. No clicking buttons. No running restore commands. No editing configs.
1616

17-
### How It Works (Simple Version)
17+
### How It Works (Smart Restore)
18+
19+
The architecture uses a **"Look Before You Leap"** strategy to decide whether to restore data or start fresh.
1820

1921
When you deploy an app with `backup: "hourly"` on its PVC:
2022

21-
1. **If no backup exists** → App starts with fresh/empty storage, backups begin automatically
22-
2. **If backup exists in S3** → App automatically restores from the latest backup
23+
1. **Kyverno Intercepts**: Before the PVC is created, Kyverno pauses and checks S3.
24+
2. **S3 Check**: Kyverno pings the S3 bucket via an internal Service (`rustfs`).
25+
   - **Found (200 OK):** Kyverno says "Ah, this is a restore!" and creates a restore job.
26+
   - **Missing (404):** Kyverno says "New App!" and lets it start empty.
27+
3. **VolSync Response**: If a restore job was created, VolSync populates the PVC *before* the app starts.
2328

2429
The system figures out which scenario you're in and does the right thing.
2530

2631
---
2732

28-
## What Problems Does This Solve?
29-
30-
### Problem 1: "My cluster died, how do I restore?"
31-
32-
**Old way:** Manually restore each app's data from backups, one by one.
33-
34-
**This system:** Rebuild your cluster, deploy apps, data restores automatically.
35-
36-
### Problem 2: "I removed an app and want it back with my old data"
37-
38-
**Old way:** Hope you have backups, manually restore them.
39-
40-
**This system:** Re-add the app to ArgoCD, your old data comes back automatically.
41-
42-
### Problem 3: "I don't want to add backup boilerplate to every app"
43-
44-
**Old way:** Copy-paste 100+ lines of backup configuration for each app.
45-
46-
**This system:** Add one label: `backup: "hourly"`. Done.
47-
48-
---
49-
50-
## The User Experience
51-
52-
### For App Developers
33+
## Architecture
5334

54-
Just add this label to your PVC:
35+
### Components
5536

56-
```yaml
57-
apiVersion: v1
58-
kind: PersistentVolumeClaim
59-
metadata:
60-
  name: my-app-data
61-
  labels:
62-
    backup: "hourly" # <-- This is all you need
63-
spec:
64-
  accessModes: ["ReadWriteOnce"]
65-
  resources:
66-
    requests:
67-
      storage: 10Gi
37+
```mermaid
38+
graph TD
39+
subgraph "External Storage"
40+
S3[("S3 / TrueNAS<br/>(192.168.10.133)")]
41+
end
42+
43+
subgraph "Kubernetes Cluster"
44+
subgraph "Volsync System"
45+
B[("Service: rustfs")] --> S3
46+
V[("VolSync Operator")]
47+
end
48+
49+
subgraph "Policy Engine"
50+
K[("Kyverno")]
51+
P_Smart[("Policy: Smart Restore")]
52+
end
53+
54+
subgraph "Application Namespace"
55+
PVC[("PVC: data-claim")]
56+
RD[("ReplicationDestination<br/>(Restore Job)")]
57+
RS[("ReplicationSource<br/>(Backup Job)")]
58+
end
59+
end
60+
61+
%% Flows
62+
K -- "1. apiCall (HEAD)" --> B
63+
B -.-> S3
64+
65+
%% Decision Logic
66+
K -- "2a. If 200 OK (Found)" --> RD
67+
RD -- "3. Pull Data" --> S3
68+
RD -- "4. Populate" --> PVC
69+
70+
K -- "2b. If 404 (Missing)" --> PVC
71+
PVC -- "5. Start Fresh" --> RS
72+
RS -- "6. Push Backups" --> S3
73+
74+
classDef external fill:#f9f,stroke:#333,stroke-width:2px;
75+
classDef policy fill:#ffd,stroke:#333,stroke-width:2px;
76+
classDef storage fill:#dfd,stroke:#333,stroke-width:2px;
77+
classDef app fill:#eef,stroke:#333,stroke-width:2px;
78+
79+
class S3 external;
80+
class K,P_Smart policy;
81+
class B,V,RD,RS storage;
82+
class PVC app;
6883
```
6984

70-
Everything else is automatic:
71-
- Backups run hourly to S3
72-
- If you delete and re-add the app, data restores automatically
73-
- If you rebuild the cluster, data restores automatically
74-
- If it's a fresh install, the app starts with empty storage (as expected)
85+
### Component Roles
7586

76-
### What You Should NEVER Have To Do
77-
78-
- Click "restore" in any UI
79-
- Run restore commands manually
80-
- Edit configuration to switch between "fresh" and "restore" modes
81-
- Remember which apps have backups
82-
- Manually trigger backup jobs
87+
| Component | Resource | Purpose |
88+
| :--- | :--- | :--- |
89+
| **Kyverno** | `ClusterPolicy/volsync-smart-restore` | **The Brain.** Performs an `apiCall` to S3 to check if a backup exists. If yes, generates the `ReplicationDestination`. |
90+
| **Kyverno** | `ClusterPolicy/generate-volsync-backup` | **The Safety.** Automatically generates the `ReplicationSource` so future backups happen. |
91+
| **VolSync** | `Service/rustfs` | **The Bridge.** Maps the external TrueNAS IP (192.168.10.133) to an internal DNS name so Kyverno can reach it. |
92+
| **VolSync** | `ReplicationDestination` | **The Restore.** Instructs VolSync to populate a PVC from S3. Validates data integrity before binding. |
8393

8494
---
8595

@@ -90,188 +100,55 @@ Everything else is automatic:
90100
You're setting up a brand new cluster with no previous data.
91101

92102
**What happens:**
93-
1. You deploy your apps via ArgoCD
94-
2. Each app with `backup: "hourly"` starts with empty storage
95-
3. Backups begin automatically in the background
96-
4. S3 now has your data for future restores
103+
1. You deploy your apps via ArgoCD.
104+
2. Kyverno checks S3 for `s3://backups/ns/app`. Result: **404 Not Found**.
105+
3. Kyverno allows the PVC to be created *without* a restore source.
106+
4. App starts with empty storage.
107+
5. Kyverno generates a `ReplicationSource`, starting hourly backups.
97108

98-
**Result:** Apps work normally, backups are set up for the future.
109+
**Result:** Apps work normally, clean slate.
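
As a rough illustration of step 5, the generated `ReplicationSource` might look like the sketch below. The hourly schedule and the PVC name `data-claim` come from this document; the resource name `data-claim-backup`, the namespace `my-app`, the Secret name `data-claim-restic`, and the retention values are assumptions made for the example, not the contents of the actual policy output.

```yaml
# Sketch only: names marked "hypothetical" are not taken from the repository.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: data-claim-backup        # hypothetical name
  namespace: my-app              # hypothetical namespace
spec:
  sourcePVC: data-claim          # the PVC carrying the backup: "hourly" label
  trigger:
    schedule: "0 * * * *"        # cron expression: top of every hour
  restic:
    repository: data-claim-restic  # hypothetical Secret holding the restic repo URL and credentials
    copyMethod: Snapshot           # back up from a point-in-time snapshot of the PVC
    pruneIntervalDays: 7           # assumed value
    retain:
      hourly: 24                   # assumed retention
      daily: 7                     # assumed retention
```

With `copyMethod: Snapshot`, the storage class needs CSI snapshot support; `Direct` is the alternative if snapshots are not available.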
99110

100111
### Scenario 2: Cluster Rebuild (Disaster Recovery)
101112

102113
Your cluster died. You rebuild it from scratch.
103114

104115
**What happens:**
105-
1. You bootstrap ArgoCD and infrastructure
106-
2. The system discovers your existing backups in S3
107-
3. When apps deploy, they automatically restore from S3
108-
4. Your data is back without any manual steps
109-
110-
**Result:** Full recovery with zero manual intervention.
111-
112-
### Scenario 3: Add a New App
113-
114-
You add a new app to an existing cluster.
115-
116-
**What happens:**
117-
- Same as Scenario 1 - no backup exists for this app yet
118-
- Starts fresh, backups begin automatically
119-
120-
### Scenario 4: Remove and Re-add an App
121-
122-
You remove an app from ArgoCD (maybe to test, maybe by accident). A week later, you want it back.
123-
124-
**What happens:**
125-
1. When you removed the app, the S3 backup remained (backups are external)
126-
2. When you re-add the app, the system finds the old backup
127-
3. Your data is automatically restored
116+
1. You bootstrap ArgoCD.
117+
2. Kyverno checks S3. Result: **200 OK** (Backup found!).
118+
3. Kyverno generates a `ReplicationDestination` pointing to that backup.
119+
4. Kyverno mutates the PVC to add `dataSourceRef: ReplicationDestination`.
120+
5. The PVC stays "Pending" while VolSync pulls data from S3.
121+
6. Once restored, the PVC binds, and the Pod starts.
128122

129-
**Result:** Your bookmarks, settings, data - all back automatically.
130-
131-
---
132-
133-
## Architecture
134-
135-
### Components
136-
137-
```
138-
┌─────────────────┐
139-
│ │
140-
│ S3 (External) │ Backups live here
141-
│ │ Survives cluster death
142-
│ │
143-
└────────┬────────┘
144-
145-
┌─────────────────────┼─────────────────────┐
146-
│ │ │
147-
│ KUBERNETES CLUSTER │
148-
│ │ │
149-
│ ┌────────────────┼────────────────┐ │
150-
│ │ │ │ │
151-
│ │ ┌───────────▼───────────┐ │ │
152-
│ │ │ │ │ │
153-
│ │ │ VOLSYNC │ │ │
154-
│ │ │ │ │ │
155-
│ │ │ ReplicationSource │ │ │
156-
│ │ │ (backs up to S3) │ │ │
157-
│ │ │ │ │ │
158-
│ │ │ ReplicationDestination │ │
159-
│ │ │ (restores from S3) │ │ │
160-
│ │ │ │ │ │
161-
│ │ └───────────┬───────────┘ │ │
162-
│ │ │ │ │
163-
│ │ ┌───────────▼───────────┐ │ │
164-
│ │ │ │ │ │
165-
│ │ │ KYVERNO │ │ │
166-
│ │ │ (Policy Engine) │ │ │
167-
│ │ │ │ │ │
168-
│ │ │ - Auto-creates │ │ │
169-
│ │ │ backup resources │ │ │
170-
│ │ │ - Auto-configures │ │ │
171-
│ │ │ restore │ │ │
172-
│ │ │ │ │ │
173-
│ │ └───────────┬───────────┘ │ │
174-
│ │ │ │ │
175-
│ │ ┌───────────▼───────────┐ │ │
176-
│ │ │ │ │ │
177-
│ │ │ LONGHORN │ │ │
178-
│ │ │ (Storage Driver) │ │ │
179-
│ │ │ │ │ │
180-
│ │ │ - Provisions PVCs │ │ │
181-
│ │ │ - Restores from │ │ │
182-
│ │ │ snapshots │ │ │
183-
│ │ │ │ │ │
184-
│ │ └───────────────────────┘ │ │
185-
│ │ │ │
186-
│ └─────────────────────────────────┘ │
187-
│ │
188-
└───────────────────────────────────────────┘
189-
```
190-
191-
### How the Pieces Fit Together
192-
193-
| Component | Role |
194-
|-----------|------|
195-
| **S3 (RustFS)** | External storage for backups. Survives cluster rebuilds. |
196-
| **VolSync** | Kubernetes operator that handles backup/restore via restic |
197-
| **Kyverno** | Policy engine that auto-generates backup resources when it sees the `backup` label |
198-
| **Longhorn** | Storage driver that provisions volumes and supports restoring from snapshots |
199-
| **ArgoCD** | GitOps controller that deploys apps (not backup-specific, but orchestrates everything) |
200-
201-
---
202-
203-
## Why VolSync Instead of Longhorn Backup?
204-
205-
Longhorn has built-in backup to S3, but it requires clicking "Restore" in the Longhorn UI. That violates our "zero manual intervention" principle.
206-
207-
VolSync can be fully automated through Kubernetes resources - no UI clicks needed.
208-
209-
---
210-
211-
## Storage Requirements
212-
213-
The storage system must support:
214-
215-
1. **Pod migration** - Pods can move between nodes without losing data
216-
2. **CSI Volume Populator** - Ability to restore from external snapshots
217-
3. **No UI dependency** - All operations via Kubernetes resources
218-
219-
Currently using **Longhorn**. Alternatives like OpenEBS or Rook-Ceph could work if they meet these requirements.
123+
**Result:** Fully automated recovery. Zero interaction.
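
Steps 3 and 4 can be pictured as the pair of resources below: a `ReplicationDestination` that VolSync restores into, and the PVC after mutation, pointing at it through `dataSourceRef`. This is a minimal sketch, assuming the restic mover and VolSync's volume populator; the names `data-claim-restore`, `data-claim-restic`, the namespace `my-app`, and the 10Gi size are illustrative assumptions rather than values from the actual policy.

```yaml
# Sketch only: names marked "hypothetical" are not taken from the repository.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: data-claim-restore       # hypothetical name
  namespace: my-app              # hypothetical namespace
spec:
  trigger:
    manual: restore-once         # run a single restore rather than on a schedule
  restic:
    repository: data-claim-restic  # hypothetical Secret with the restic repo URL and credentials
    copyMethod: Snapshot
    capacity: 10Gi                 # assumed; should match the PVC request
    accessModes: ["ReadWriteOnce"]
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
  namespace: my-app              # hypothetical namespace
  labels:
    backup: "hourly"
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  dataSourceRef:                 # the field added by the mutation in step 4
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: data-claim-restore
```

The PVC stays `Pending` until the `ReplicationDestination` has a completed restore to populate from, which is what holds the Pod back in step 5.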
220124

221125
---
222126

223127
## Technical Details
224128

225-
For the technical deep-dive including:
226-
- The timing challenges with Kubernetes admission webhooks
227-
- Detailed sequence diagrams for each scenario
228-
- Problems encountered and solutions
229-
- Implementation steps
230-
231-
See: [volsync-implementation-plan.md](./volsync-implementation-plan.md)
232-
233-
---
234-
235-
## FAQ
236-
237-
### Q: What if I don't want an app to be backed up?
238-
239-
Don't add the `backup` label to its PVC. Simple.
240-
241-
### Q: How often do backups run?
242-
243-
With `backup: "hourly"`, backups run every hour. You can also use `backup: "daily"`.
244-
245-
### Q: How much storage do backups use?
246-
247-
Backups use restic which does deduplication. Only changed blocks are stored after the first backup.
248-
249-
### Q: Can I restore to a specific point in time?
250-
251-
Not automatically. The system always restores from the latest backup. For point-in-time recovery, you'd need manual intervention.
252-
253-
### Q: What happens if S3 is unavailable during restore?
254-
255-
The PVC will be stuck pending until S3 is available. The system doesn't fall back to fresh storage automatically (that could cause data loss).
256-
257-
### Q: Is database data safe to backup this way?
129+
### The "Bridge" (Service)
130+
Since policies run inside the cluster, they need to reach external storage reliably. We create a Service `rustfs` that points to `192.168.10.133`.
131+
- **Why?** It allows using `http://rustfs/...` in policies instead of hardcoded IPs.
132+
- **Benefit:** If the Storage IP changes, you update one Service file, not 50 policies.
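
One common way to build such a bridge is a Service without a selector plus a manually managed `Endpoints` object. The sketch below assumes that approach; the actual `rustfs-service.yaml` may differ. The Service name, namespace, port 9000, and IP are taken from this document, while the backend port on the storage host is an assumption.

```yaml
# Sketch only: assumes a selectorless Service backed by a static Endpoints object.
apiVersion: v1
kind: Service
metadata:
  name: rustfs
  namespace: volsync-system
spec:
  ports:
    - name: s3
      port: 9000
      targetPort: 9000
---
apiVersion: v1
kind: Endpoints
metadata:
  name: rustfs                   # must match the Service name exactly
  namespace: volsync-system
subsets:
  - addresses:
      - ip: 192.168.10.133       # external TrueNAS / S3 host
    ports:
      - name: s3                 # must match the Service port name
        port: 9000               # assumed backend port
```

Policies can then address the storage as `http://rustfs.volsync-system.svc:9000/...`, and an IP change only touches the `Endpoints` object.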
258133

259-
For simple databases, yes. For production databases like PostgreSQL, you should use database-native backup tools (like pgBackRest) that ensure consistency. This system is best for:
260-
- Application config files
261-
- Media libraries
262-
- Simple SQLite databases
263-
- User uploads
134+
### The "Smart" Policy (apiCall)
135+
Kyverno 1.10+ supports `apiCall` context variables. We use this to perform an HTTP HEAD request.
136+
- **Method:** `HEAD` (lightweight, just checks existence)
137+
- **Target:** `http://rustfs.volsync-system.svc:9000/volsync-backups/<namespace>/<pvc-name>`
138+
- **Logic:**
139+
  - `if response.code == 200` -> **RESTORE**
140+
  - `if response.code == 404` -> **FRESH**
264141

265142
---
266143

267144
## Current Implementation Status
268145

269-
| Component | Status |
270-
|-----------|--------|
271-
| VolSync Operator | Deployed |
272-
| Longhorn Storage | Deployed |
273-
| Kyverno Policies | Partially implemented (needs work) |
274-
| Pre-warm CronJob | Not yet implemented |
275-
| S3 Bucket | Configured |
146+
| Component | Status | Notes |
147+
|-----------|--------|-------|
148+
| VolSync Operator | Deployed | |
149+
| Longhorn Storage | Deployed | |
150+
| **Service Bridge** | ✅ Implemented | `rustfs-service.yaml` maps to 192.168.10.133 |
151+
| **Smart Policy** | ✅ Implemented | `volsync-smart-restore.yaml` uses apiCall |
152+
| Pre-warm CronJob | ❌ Removed | Replaced by Smart Policy |
276153

277-
See the [implementation plan](./volsync-implementation-plan.md) for next steps.
154+
See the [implementation plan](./volsync-implementation-plan.md) for deeper technical specs.
