Commit caddb04

Create volsync-troubleshooting.md
1 parent aa2a625 commit caddb04

1 file changed

Lines changed: 252 additions & 0 deletions

docs/volsync-troubleshooting.md

@@ -0,0 +1,252 @@
1+
# VolSync Backup System Troubleshooting
2+
3+
## Architecture Overview
4+
5+
Once a PVC is labeled `backup=hourly`, the backup system is **fully automatic**; no further user intervention is required:
6+
7+
```
8+
┌─────────────────────────────────────────────────────────────────────────────┐
9+
│ AUTOMATIC BACKUP FLOW │
10+
├─────────────────────────────────────────────────────────────────────────────┤
11+
│ │
12+
│ 1. User creates PVC with label: backup=hourly │
13+
│ ↓ │
14+
│ 2. Kyverno ClusterPolicy detects labeled PVC │
15+
│ ↓ │
16+
│ 3. Kyverno generates THREE resources automatically: │
17+
│ ├── Secret (per-PVC S3 credentials + repo path) │
18+
│ ├── ReplicationSource (hourly backup job) │
19+
│ └── ReplicationDestination (one-time restore on PVC creation) │
20+
│ ↓ │
21+
│ 4. VolSync runs backup every hour (0 * * * *) │
22+
│ ↓ │
23+
│ 5. Data stored in RustFS S3: volsync-backup/<namespace>/<pvc-name> │
24+
│ │
25+
└─────────────────────────────────────────────────────────────────────────────┘
26+
```
27+
28+
## Components
29+
30+
| Component | Purpose | Location |
31+
|-----------|---------|----------|
32+
| Kyverno ClusterPolicy | Auto-generates VolSync resources | `infrastructure/controllers/kyverno/volsync-smart-restore.yaml` |
33+
| VolSync Operator | Runs restic backup/restore jobs | `volsync-system` namespace |
34+
| ClusterExternalSecret | Copies base S3 creds to namespaces | Creates `volsync-rustfs-base` secret |
35+
| RustFS | S3-compatible backup storage | TrueNAS @ 192.168.10.133:30292 |
36+
37+
## Prerequisites
38+
39+
For automatic backups to work, a namespace needs:
40+
41+
1. **Label on namespace**: `volsync.backube/privileged-movers=true`
42+
2. **Base secret present**: `volsync-rustfs-base` (created by ClusterExternalSecret)
43+
3. **PVC label**: `backup=hourly`
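
The three prerequisites above can be checked in one pass. The following is a minimal sketch, assuming `kubectl` is pointed at the cluster; `check_backup_prereqs` is a hypothetical helper name, not part of VolSync or Kyverno.

```bash
# Sketch: check the three prerequisites for one PVC.
# check_backup_prereqs is a hypothetical helper, not a VolSync or Kyverno command.
check_backup_prereqs() {
  local ns="$1" pvc="$2" ok=0 value

  # 1. Namespace label (dots inside the label key are escaped for jsonpath)
  value=$(kubectl get namespace "$ns" \
    -o jsonpath='{.metadata.labels.volsync\.backube/privileged-movers}' 2>/dev/null)
  if [ "$value" = "true" ]; then
    echo "OK      namespace label"
  else
    echo "MISSING namespace label"; ok=1
  fi

  # 2. Base secret copied in by the ClusterExternalSecret
  if kubectl get secret volsync-rustfs-base -n "$ns" >/dev/null 2>&1; then
    echo "OK      base secret"
  else
    echo "MISSING base secret"; ok=1
  fi

  # 3. PVC label
  value=$(kubectl get pvc "$pvc" -n "$ns" \
    -o jsonpath='{.metadata.labels.backup}' 2>/dev/null)
  if [ "$value" = "hourly" ]; then
    echo "OK      pvc label"
  else
    echo "MISSING pvc label"; ok=1
  fi

  # Non-zero exit status if any prerequisite is missing
  return "$ok"
}
```

Usage: `check_backup_prereqs <namespace> <pvc-name>`. Each prerequisite is reported on its own line, so a single run shows which one to fix.
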
44+
45+
## Quick Status Check
46+
47+
```bash
48+
# Check all backup jobs
49+
kubectl get replicationsource -A
50+
51+
# Check for stuck pods
52+
kubectl get pods -A | grep volsync | grep -v Running | grep -v Completed
53+
54+
# Check Longhorn volume health
55+
kubectl get volumes.longhorn.io -n longhorn-system -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness | grep -E "(faulted|unknown)"
56+
57+
# Check Kyverno policy status
58+
kubectl get clusterpolicy volsync-smart-protection
59+
```
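
To narrow the first check down to backups that have never completed, the output can be filtered on `status.lastSyncTime`. This is a sketch only; `never_synced_sources` is a hypothetical helper name, and it relies on `kubectl` printing `<none>` for unset custom-column fields.

```bash
# Sketch: list ReplicationSources that have never completed a sync.
# never_synced_sources is a hypothetical helper name.
never_synced_sources() {
  # custom-columns prints "<none>" when status.lastSyncTime is unset
  kubectl get replicationsource -A --no-headers \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LAST_SYNC:.status.lastSyncTime |
    awk '$3 == "<none>" { print $1 "/" $2 }'
}
```

Each output line has the form `<namespace>/<replicationsource-name>`; an empty result means every source has synced at least once.
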
60+
61+
## Known Issues & Solutions
62+
63+
### Issue 1: Kyverno JMESPath Function Error
64+
65+
**Symptom**: Per-PVC secrets not being generated
66+
67+
**Cause**: The policy called `concat()`, which is not a valid JMESPath function. String concatenation should use `join('', [...])` instead.
68+
69+
**Fix Applied**: Changed in `volsync-smart-restore.yaml`:
70+
```yaml
71+
# WRONG
72+
RESTIC_REPOSITORY: "{{ base64_encode(concat(...)) }}"
73+
74+
# CORRECT
75+
RESTIC_REPOSITORY: "{{ base64_encode(join('', [base64_decode(baseSecret.RESTIC_REPOSITORY_BASE), request.object.metadata.namespace, '/', request.object.metadata.name])) }}"
76+
```
77+
78+
**Verify**:
79+
```bash
80+
kubectl get secret -A | grep volsync-secret
81+
# Should see <pvc-name>-volsync-secret in each namespace
82+
```
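
Beyond checking that the secret exists, it can help to confirm that the joined repository path is what the policy intended. The sketch below decodes `RESTIC_REPOSITORY` and checks that it ends in `<namespace>/<pvc-name>`; `repo_path_ok` is a hypothetical helper name.

```bash
# Sketch: decode RESTIC_REPOSITORY from the generated secret and check its suffix.
# repo_path_ok is a hypothetical helper name.
repo_path_ok() {
  local ns="$1" pvc="$2" repo
  repo=$(kubectl get secret "${pvc}-volsync-secret" -n "$ns" \
    -o jsonpath='{.data.RESTIC_REPOSITORY}' | base64 --decode) || return 1
  echo "$repo"
  # The policy joins the base repository with <namespace>/<pvc-name>
  case "$repo" in
    *"${ns}/${pvc}") return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: `repo_path_ok <namespace> <pvc-name>`. The decoded repository is printed either way, and the exit status is non-zero when the suffix does not match.
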
83+
84+
---
85+
86+
### Issue 2: Longhorn Volumes Faulted After Mass Pod Restart
87+
88+
**Symptom**: VolSync pods stuck in `ContainerCreating`, error: `volume is not ready for workloads`
89+
90+
**Cause**: When all pods are killed simultaneously, the Longhorn engines die unexpectedly and the affected volumes enter a faulted or unknown state
91+
92+
**Events showing this**:
93+
```
94+
Warning DetachedUnexpectedly Engine of volume pvc-xxx dead unexpectedly, setting v.Status.Robustness to faulted
95+
```
96+
97+
**Solution Options**:
98+
99+
1. **Wait for auto-recovery** (Longhorn auto-salvage is enabled)
100+
```bash
101+
# Monitor recovery
102+
watch kubectl get volumes.longhorn.io -n longhorn-system -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness
103+
```
104+
105+
2. **Force delete faulted volumes** (if they're VolSync cache volumes)
106+
```bash
107+
# Delete faulted Longhorn volumes (cache/restore PVCs only!)
108+
for vol in $(kubectl get volumes.longhorn.io -n longhorn-system -o json | jq -r '.items[] | select(.status.robustness == "faulted") | .metadata.name'); do
109+
echo "Deleting: $vol"
110+
kubectl delete volumes.longhorn.io "$vol" -n longhorn-system
111+
done
112+
```
113+
114+
3. **Trigger re-evaluation** by annotating PVCs
115+
```bash
116+
kubectl annotate pvc <pvc-name> -n <namespace> kyverno.io/trigger="$(date +%s)" --overwrite
117+
```
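
The delete loop in option 2 selects every faulted volume, not only cache or restore volumes, so it is worth reviewing which PVC each faulted volume backs before running it. The following is a sketch under the assumption that your Longhorn version populates `status.kubernetesStatus.namespace` and `status.kubernetesStatus.pvcName` on the Volume resource; `faulted_volumes_with_pvc` is a hypothetical helper name.

```bash
# Sketch: show each faulted Longhorn volume together with the PVC it backs.
# faulted_volumes_with_pvc is a hypothetical helper name; the
# status.kubernetesStatus fields are assumed to be populated by Longhorn.
faulted_volumes_with_pvc() {
  kubectl get volumes.longhorn.io -n longhorn-system --no-headers \
    -o custom-columns=NAME:.metadata.name,ROBUSTNESS:.status.robustness,NAMESPACE:.status.kubernetesStatus.namespace,PVC:.status.kubernetesStatus.pvcName |
    awk '$2 == "faulted" { print $1, $3 "/" $4 }'
}
```

Each output line has the form `<volume-name> <namespace>/<pvc-name>`, which makes it easier to spot application data volumes that should not be deleted.
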
118+
119+
---
120+
121+
### Issue 3: ReplicationDestination Pods Stuck
122+
123+
**Symptom**: `volsync-dst-*-restore-*` pods stuck in ContainerCreating
124+
125+
**Cause**: Restore jobs try to create cache volumes which may conflict with existing ones or fail to provision
126+
127+
**Solution**: Restore jobs matter only for disaster recovery, and backups run independently of them, so stuck restore jobs can be deleted. Note that the commands below act on every namespace.
128+
129+
```bash
130+
# Delete stuck restore jobs (backups continue working)
131+
kubectl delete replicationdestination -A --all
132+
133+
# Clean up orphaned restore PVCs
134+
kubectl delete pvc -A -l volsync.backube/replicationdestination
135+
```
136+
137+
---
138+
139+
### Issue 4: No Base Secret in Namespace
140+
141+
**Symptom**: Kyverno policy precondition fails, no resources generated
142+
143+
**Check**:
144+
```bash
145+
kubectl get secret volsync-rustfs-base -n <namespace>
146+
```
147+
148+
**Cause**: The namespace is missing the label that the ClusterExternalSecret uses to decide where to create the base secret
149+
150+
**Fix**:
151+
```bash
152+
kubectl label namespace <namespace> volsync.backube/privileged-movers=true
153+
```
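
To find every affected namespace at once, the per-namespace check can be run over all namespaces that hold labeled PVCs. This is a minimal sketch; `namespaces_missing_base_secret` is a hypothetical helper name.

```bash
# Sketch: find namespaces that contain backup=hourly PVCs but lack the base secret.
# namespaces_missing_base_secret is a hypothetical helper name.
namespaces_missing_base_secret() {
  local ns
  kubectl get pvc -A -l backup=hourly \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}' |
    sort -u |
    while read -r ns; do
      [ -n "$ns" ] || continue
      # Print the namespace only when the secret lookup fails
      kubectl get secret volsync-rustfs-base -n "$ns" >/dev/null 2>&1 </dev/null ||
        echo "$ns"
    done
}
```

Any namespace printed here has at least one PVC waiting for backup resources that Kyverno cannot generate until the base secret appears.
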
154+
155+
---
156+
157+
### Issue 5: Backup Never Runs
158+
159+
**Symptom**: ReplicationSource exists but its `LAST SYNC` column (`status.lastSyncTime`) is empty
160+
161+
**Check schedule**:
162+
```bash
163+
kubectl get replicationsource <name> -n <namespace> -o yaml | grep -A5 trigger
164+
```
165+
166+
**Manual trigger**:
167+
```bash
168+
kubectl patch replicationsource <name> -n <namespace> --type merge -p '{"spec":{"trigger":{"manual":"'$(date +%s)'"}}}'
169+
```
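
After a manual trigger, it is useful to wait for the sync to finish instead of re-checking by hand. The sketch below assumes that VolSync copies the value of `spec.trigger.manual` into `status.lastManualSync` once the sync completes, as its trigger documentation describes. `trigger_and_wait` is a hypothetical helper name, and the default timeout and poll interval are arbitrary.

```bash
# Sketch: trigger a manual sync and wait until VolSync reports it finished.
# trigger_and_wait is a hypothetical helper; timeout and interval defaults are arbitrary.
trigger_and_wait() {
  local name="$1" ns="$2" timeout="${3:-600}" interval="${4:-10}"
  local tag waited=0 last
  tag="manual-$(date +%s)"

  kubectl patch replicationsource "$name" -n "$ns" --type merge \
    -p "{\"spec\":{\"trigger\":{\"manual\":\"${tag}\"}}}" >/dev/null || return 1

  # Poll until status.lastManualSync matches the tag we just set
  while [ "$waited" -lt "$timeout" ]; do
    last=$(kubectl get replicationsource "$name" -n "$ns" \
      -o jsonpath='{.status.lastManualSync}')
    if [ "$last" = "$tag" ]; then
      echo "sync ${tag} completed"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done

  echo "timed out waiting for sync ${tag}" >&2
  return 1
}
```

Usage: `trigger_and_wait <name> <namespace> [timeout-seconds] [poll-seconds]`. A timeout usually points back to one of the issues above, such as a mover pod stuck in `ContainerCreating`.
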
170+
171+
---
172+
173+
## Longhorn Settings Reference
174+
175+
Current settings that affect VolSync:
176+
177+
| Setting | Value | Impact |
178+
|---------|-------|--------|
179+
| `auto-salvage` | true | Automatically recovers faulted volumes |
180+
| `default-replica-count` | 3 | Requires 3 nodes for full redundancy |
181+
| `replica-soft-anti-affinity` | false | Replicas must be on different nodes |
182+
183+
## RustFS Bucket Structure
184+
185+
```
186+
volsync-backup/
187+
├── home-assistant/
188+
│ └── config/ # Home Assistant config PVC
189+
├── karakeep/
190+
│ ├── data-pvc/ # Karakeep data
191+
│ └── meilisearch-pvc/ # Karakeep search index
192+
├── khoj/
193+
│ └── config/
194+
├── n8n/
195+
│ └── data/
196+
├── open-webui/
197+
│ ├── data/
198+
│ └── storage/
199+
├── paperless-ngx/
200+
│ ├── data/
201+
│ └── media/
202+
└── redis-instance/
203+
└── redis-master-0/
204+
```
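
The layout above maps each PVC to the prefix `volsync-backup/<namespace>/<pvc-name>/`. As a sketch, the prefix can be built and listed with the AWS CLI, assuming it is installed and configured with the RustFS credentials. `backup_prefix`, `list_backup`, and the `RUSTFS_ENDPOINT` variable are hypothetical names, and the `http://` scheme is an assumption about how the endpoint is exposed.

```bash
# Sketch: print the bucket prefix that holds one PVC's restic repository.
# backup_prefix and list_backup are hypothetical helper names.
backup_prefix() {
  local ns="$1" pvc="$2"
  echo "volsync-backup/${ns}/${pvc}/"
}

# List the repository objects with the AWS CLI.
# RUSTFS_ENDPOINT is a hypothetical override; the http:// scheme is an assumption.
list_backup() {
  aws s3 ls "s3://$(backup_prefix "$1" "$2")" \
    --endpoint-url "${RUSTFS_ENDPOINT:-http://192.168.10.133:30292}"
}
```

Usage: `list_backup karakeep data-pvc`. An empty listing for a PVC marked as working would suggest the backup is writing to a different path than expected.
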
205+
206+
## Session Notes: 2026-01-18
207+
208+
### Problems Observed
209+
210+
1. **Kyverno policy had invalid JMESPath**: `concat()` doesn't exist, changed to `join('', [...])`
211+
212+
2. **Mass pod restart caused Longhorn volume faults**: After killing all pods, Longhorn engines died unexpectedly causing volumes to enter faulted/unknown state
213+
214+
3. **VolSync cache volumes stuck**: New PVCs for caches were created but couldn't attach due to Longhorn recovery state
215+
216+
### Current Status
217+
218+
| Namespace | PVC | Backup Status |
219+
|-----------|-----|---------------|
220+
| home-assistant | config | ✅ Working |
221+
| karakeep | data-pvc | ✅ Working |
222+
| karakeep | meilisearch-pvc | ✅ Working |
223+
| khoj | config | ✅ Working |
224+
| n8n | data | ✅ Working |
225+
| open-webui | data | ✅ Working |
226+
| open-webui | storage | ❌ Stuck (Longhorn recovery) |
227+
| paperless-ngx | data | ❌ Stuck (Longhorn recovery) |
228+
| paperless-ngx | media | ❌ Stuck (Longhorn recovery) |
229+
| redis-instance | redis-master-0 | ❌ Stuck (Longhorn recovery) |
230+
| volsync-test | volsync-test-data | ✅ Working |
231+
232+
### Files Modified
233+
234+
- `infrastructure/controllers/kyverno/volsync-smart-restore.yaml` - Fixed JMESPath function
235+
236+
### Next Steps
237+
238+
1. Monitor Longhorn volume recovery: `watch 'kubectl get volumes.longhorn.io -n longhorn-system | grep -E "(faulted|unknown)"'`
239+
2. Once volumes recover, stuck backups should resume automatically at the next scheduled run
240+
3. If volumes don't recover, delete the faulted cache/restore volumes and let Kyverno regenerate the VolSync resources
241+
242+
### Root Cause Analysis
243+
244+
The VolSync/Kyverno system works correctly. The issues were:
245+
246+
1. **One-time bug**: JMESPath syntax error in Kyverno policy (now fixed)
247+
2. **Transient issue**: Longhorn volume recovery after mass pod restart (expected to self-heal through auto-salvage)
248+
249+
The system IS fully automatic when:
250+
- Longhorn is healthy
251+
- Kyverno is running
252+
- Base secrets are in place
