Commit 521c5a1

docs: finalize storage architecture and cleanup plan
1 parent f873ce5 commit 521c5a1

2 files changed

Lines changed: 74 additions & 151 deletions

docs/storage-architecture.md

Lines changed: 74 additions & 106 deletions
Original file line number | Diff line number | Diff line change
@@ -1,45 +1,45 @@
11
# Storage Architecture & Disaster Recovery
22

3-
This document outlines the storage architecture for the cluster, focusing on data persistence, backup strategies, and disaster recovery workflows.
3+
This document outlines the storage architecture for the cluster, focusing on the **Zero-Touch Backup & Restore** system powered by VolSync and Kyverno.
44

55
## Overview
66

7-
The cluster uses a layered storage approach with **zero-touch backup and restore**:
8-
- **Longhorn**: Distributed block storage for runtime replication (2 replicas per volume)
9-
- **Snapshot Controller**: Manages VolumeSnapshot lifecycles and CRDs
10-
- **VolSync + Kyverno**: Fully automated backup/restore - just label your PVC!
11-
- **Database-native backups**: CloudNativePG and Crunchy Postgres backup directly to S3
7+
The cluster uses a "Smart Restore" strategy. When a PersistentVolumeClaim (PVC) is created, the system checks if a backup *already exists* in the offsite storage (TrueNAS S3) before deciding what to do.
128

13-
## Zero-Touch Architecture (Smart Restore)
9+
- **Backup Found?** -> Automatically trigger a Restore.
10+
- **Backup Missing?** -> Automatically configure a fresh Backup schedule.
11+
12+
## Zero-Touch Architecture (Direct IP Strategy)
13+
14+
To ensure maximum stability and avoid GitOps sync issues, the system connects directly to the offsite storage IP.
1415

1516
**User only needs to:**
16-
1. Add `backup: "hourly"` or `backup: "daily"` label to PVC
17-
2. Ensure namespace has `volsync.backube/privileged-movers: "true"` **label** (for base credentials)
18-
3. Ensure namespace has `volsync.backube/privileged-movers: "true"` **annotation** (for VolSync movers)
17+
1. Add `backup: "hourly"` (or `daily`) label to any PVC.
18+
2. That's it. (Credentials are auto-injected).
1919

2020
**System automatically provides:**
21-
- **Kyverno** makes a smart decision:
22-
- If backup exists? -> Creates Restore Job.
23-
- If new app? -> Creates Backup Job.
21+
1. **Smart Mutation:** Kyverno checks `http://192.168.10.133:9000/.../config`.
22+
2. **Conditional Logic:**
23+
* **200 OK:** Adds `dataSourceRef` to the PVC (forcing it to wait for restore).
24+
* **404 Not Found:** Lets PVC start empty.
25+
3. **Resource Generation:**
26+
* **Backup Job:** Always created (`ReplicationSource`).
27+
* **Restore Job:** Created ONLY if data exists (`ReplicationDestination`).
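
A minimal sketch of the conditional logic listed above, written in Python purely for illustration (the real policy is Kyverno YAML and is not reproduced in this diff). `RestorePlan` and `plan_for_status` are hypothetical names. Raising an error for any status other than 200 or 404 is an assumption, since the document does not say how other responses are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestorePlan:
    """What the policy is described as doing for one PVC (illustrative only)."""
    add_data_source_ref: bool   # mutate the PVC so it waits for the restore
    create_backup_job: bool     # generate a ReplicationSource
    create_restore_job: bool    # generate a ReplicationDestination


def plan_for_status(status_code: int) -> RestorePlan:
    """Map the HTTP status of the backup check to the documented actions."""
    if status_code == 200:
        # Backup found: link the PVC to the restore and create both jobs.
        return RestorePlan(
            add_data_source_ref=True,
            create_backup_job=True,
            create_restore_job=True,
        )
    if status_code == 404:
        # No backup: the PVC starts empty, only the backup job is created.
        return RestorePlan(
            add_data_source_ref=False,
            create_backup_job=True,
            create_restore_job=False,
        )
    # Assumption: anything else is treated as an error rather than guessed at.
    raise ValueError(f"unexpected status from backup check: {status_code}")
```

In both documented branches the backup job is created. Only the restore job and the `dataSourceRef` mutation depend on the response.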
2428

2529
```
2630
┌─────────────────────────────────────────────────────────────────────────────────┐
27-
ZERO-TOUCH VOLSYNC ARCHITECTURE │
31+
SMART RESTORE ARCHITECTURE
2832
├─────────────────────────────────────────────────────────────────────────────────┤
2933
│ │
3034
│ USER PROVIDES: SYSTEM AUTO-GENERATES: │
3135
│ ┌─────────────────────┐ ┌─────────────────────────────────────┐ │
32-
│ │ PVC │ │ volsync-rustfs-base (per namespace) │ │
33-
│ │ labels: │ │ (ClusterExternalSecret → Secret) │ │
36+
│ │ PVC │ │ volsync-smart-protection │ │
37+
│ │ labels: │ │ (ClusterPolicy) │ │
3438
│ │ backup: hourly │ │ │ │
35-
│ └─────────────────────┘ │ {pvc}-volsync-secret (per PVC) │ │
36-
│ │ (Kyverno apiCall + base64 encode) │ │
37-
│ │ │ │
38-
│ │ ReplicationSource │ │
39-
│ │ (hourly/daily backups to S3) │ │
40-
│ │ │ │
41-
│ │ ReplicationDestination │ │
42-
│ │ (Creates ONLY if backup exists) │ │
39+
│ └─────────────────────┘ │ 1. Checks 192.168.10.133 (GET) │ │
40+
│ │ 2. Mutates PVC (if found) │ │
41+
│ │ 3. Creates Backup Job (always) │ │
42+
│ │ 4. Creates Restore Job (if found) │ │
4343
│ └─────────────────────────────────────┘ │
4444
│ │
4545
└─────────────────────────────────────────────────────────────────────────────────┘
@@ -54,121 +54,89 @@ graph TD
5454
end
5555
5656
subgraph "Kubernetes Cluster"
57-
subgraph "Volsync System"
58-
B[("Service: rustfs")] --> S3
59-
V[("VolSync Operator")]
60-
end
61-
6257
subgraph "Policy Engine"
63-
K[("Kyverno")]
64-
P_Smart[("Policy: Smart Restore")]
58+
K[("Kyverno Policy<br/>(Smart Restore)")]
6559
end
6660
6761
subgraph "Application Namespace"
6862
PVC[("PVC: data-claim")]
69-
RD[("ReplicationDestination<br/>(Restore Job)")]
70-
RS[("ReplicationSource<br/>(Backup Job)")]
63+
RD[("Restore Job<br/>(ReplicationDestination)")]
64+
RS[("Backup Job<br/>(ReplicationSource)")]
7165
end
7266
end
7367
7468
%% Flows
75-
K -- "1. apiCall (HEAD)" --> B
76-
B -.-> S3
69+
K -- "1. apiCall (GET)" --> S3
7770
7871
%% Decision Logic
7972
K -- "2a. If 200 OK (Found)" --> RD
8073
RD -- "3. Pull Data" --> S3
81-
RD -- "4. Populate" --> PVC
74+
RD -- "4. Link to PVC" --> PVC
8275
8376
K -- "2b. If 404 (Missing)" --> PVC
8477
PVC -- "5. Start Fresh" --> RS
8578
RS -- "6. Push Backups" --> S3
8679
```
8780

88-
## Kyverno Policies
89-
90-
### Smart Restore Policy (`volsync-smart-restore`)
91-
**The "Look Before You Leap" logic.**
92-
- Trigger: PVC creation.
93-
- Check: `apiCall` (HTTP HEAD) to S3 bucket.
94-
- Action:
95-
- If HTTP 200: Generate `ReplicationDestination`.
96-
- If HTTP 404: Do nothing (Allow fresh install).
97-
98-
### Generate Policy (`generate-volsync-backup`)
99-
**The "Backup Insurance".**
100-
- Trigger: PVC creation (always).
101-
- Action: Generate `ReplicationSource`.
102-
- Result: Ensures all fresh apps eventually get backed up.
103-
104-
### Mutate Policy (`volsync-auto-restore`)
105-
**The "Connection".**
106-
- Trigger: PVC creation.
107-
- Check: Does `ReplicationDestination` exist?
108-
- Action: If yes, add `dataSourceRef` to PVC.
109-
- Result: PVC waits for VolSync to restore data before binding.
81+
## Kyverno Policy Details
82+
83+
All logic is consolidated into a single policy: **`infrastructure/controllers/kyverno/volsync-smart-restore.yaml`**.
84+
85+
### Rules
86+
1. **`link-restore-if-exists` (Mutate)**
87+
* **Check:** `GET http://192.168.10.133:9000/.../config`
88+
* **Action:** If 200, adds `dataSourceRef: ReplicationDestination` to the PVC.
89+
* **Why:** This pauses the PVC binding until VolSync populates the volume.
90+
91+
2. **`generate-restore-job` (Generate)**
92+
* **Check:** Same as above.
93+
* **Action:** Creates the `ReplicationDestination` CRD that actually performs the download.
94+
95+
3. **`generate-backup-job` (Generate)**
96+
* **Check:** None (Always runs).
97+
* **Action:** Creates `ReplicationSource` to ensure the new data is backed up going forward.
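
To make the `link-restore-if-exists` mutation concrete, here is a hedged Python sketch that applies the equivalent change to a PVC represented as a plain dict. The `apiGroup`/`kind` pair follows VolSync's volume populator convention. The function name and the ReplicationDestination name passed in are hypothetical, because the policy's naming scheme is not shown in this diff.

```python
import copy


def link_restore_if_exists(pvc: dict, status_code: int, destination_name: str) -> dict:
    """Return a copy of the PVC, with dataSourceRef added when a backup was found.

    `destination_name` is a hypothetical ReplicationDestination name; the real
    policy's naming scheme is not visible in this document.
    """
    patched = copy.deepcopy(pvc)  # never modify the caller's object
    if status_code == 200:
        patched.setdefault("spec", {})["dataSourceRef"] = {
            "apiGroup": "volsync.backube",
            "kind": "ReplicationDestination",
            "name": destination_name,
        }
    return patched


if __name__ == "__main__":
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "data-claim", "labels": {"backup": "hourly"}},
        "spec": {"accessModes": ["ReadWriteOnce"]},
    }
    # "data-claim-restore" is an illustrative name only.
    print(link_restore_if_exists(pvc, 200, "data-claim-restore")["spec"])
```

With a 404 the PVC comes back unchanged, which matches the "Lets PVC start empty" branch described earlier.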
11098

11199
## 4. Disaster Recovery Scenarios
112100

113101
### Scenario 1: New App (First Deployment)
114102
```
115-
1. User creates PVC with backup: hourly label.
116-
2. Kyverno checks S3: 404 Not Found.
117-
3. Kyverno allows PVC creation WITHOUT dataSourceRef.
118-
4. App starts with empty volume.
119-
5. Kyverno generates ReplicationSource.
120-
6. First backup runs.
103+
1. User creates PVC.
104+
2. Kyverno checks S3 -> 404.
105+
3. PVC created empty (ready immediately).
106+
4. Kyverno creates Backup Job.
107+
5. First backup runs in 1 hour.
121108
```
122109

123-
### Scenario 2: Cluster Rebuild (Disaster Recovery)
110+
### Scenario 2: Cluster Rebuild (Total DR)
124111
```
125-
1. Bootstrap new Cluster.
126-
2. ArgoCD syncs app.
127-
3. Kyverno checks S3: 200 OK (Found backup!).
128-
4. Kyverno generates ReplicationDestination.
129-
5. Kyverno mutates PVC to add dataSourceRef.
130-
6. PVC waits (Pending) while VolSync pulls data.
131-
7. Volume restores -> Pod starts.
112+
1. Cluster is rebuilt. ArgoCD installs app.
113+
2. Kyverno checks S3 -> 200.
114+
3. Kyverno creates Restore Job.
115+
4. Kyverno links PVC to Restore Job.
116+
5. Pod waits in Pending...
117+
6. VolSync downloads data...
118+
7. PVC binds. Pod starts.
132119
```
133120

134-
### Scenario 3: Manual Restore (Specific Point in Time)
135-
```bash
136-
# 1. Scale down the application
137-
kubectl scale deployment <app> -n <namespace> --replicas=0
138-
139-
# 2. Delete the PVC
140-
kubectl delete pvc <pvc-name> -n <namespace>
141-
142-
# 3. (Optional) Update ReplicationDestination to point to older snapshot if needed
143-
# ArgoCD will typically re-sync the latest.
144-
145-
# 4. ArgoCD recreates PVC
146-
# Kyverno sees backup exists -> generates RD -> restores.
147-
```
148-
149-
## 5. Defense Layers Summary
150-
151-
| Layer | Protects Against | Recovery Time | Manual Intervention |
152-
|-------|------------------|---------------|---------------------|
153-
| Longhorn replicas | Node failure | Instant | None |
154-
| VolSync + Kyverno | Cluster loss | ~5-15 minutes | None (Zero Touch) |
155-
156-
## 8. Configuration Files
121+
## 5. Configuration Files
157122

158123
| Component | Location |
159124
|-----------|----------|
160-
| VolSync operator | `infrastructure/storage/volsync/` |
161-
| **RustFS Service** | `infrastructure/storage/volsync/rustfs-service.yaml` |
162-
| **Smart Restore Policy** | `infrastructure/controllers/kyverno/volsync-smart-restore.yaml` |
163-
| Generate Policy | `infrastructure/controllers/kyverno/volsync-clusterpolicy.yaml` |
164-
| Mutate Policy | `infrastructure/controllers/kyverno/volsync-auto-restore.yaml` |
125+
| **Smart Policy (The Brain)** | `infrastructure/controllers/kyverno/volsync-smart-restore.yaml` |
126+
| **VolSync Kustomization** | `infrastructure/storage/volsync/kustomization.yaml` |
127+
| **Credentials** | `infrastructure/storage/volsync/rustfs-credentials.yaml` |
128+
129+
## 6. Troubleshooting
165130

166-
## 9. Troubleshooting
131+
### "OutOfSync" in ArgoCD?
132+
This is usually cosmetic due to Kyverno status updates.
133+
We have applied `ignoreDifferences` in the AppSet to silence `ClusterPolicy` status fields.
134+
If it persists, verify your local Git matches the API Server.
167135

168-
### Kyverno apiCall Failed
169-
If Kyverno cannot reach S3, check the Service Bridge:
136+
### Manual Restore
137+
To force a restore to a specific point in time:
170138
```bash
171-
kubectl get svc rustfs -n volsync-system
172-
kubectl describe endpoints rustfs -n volsync-system
173-
# Should point to 192.168.10.133:9000
139+
kubectl delete pvc <pvc-name>
140+
# Update the ReplicationDestination spec if needed to point to a specific snapshot ID
141+
# Re-create PVC (ArgoCD will do this).
174142
```

docs/volsync-implementation-plan.md

Lines changed: 0 additions & 45 deletions
This file was deleted.
