| 1 | +# VolSync Backup System Troubleshooting |
| 2 | + |
| 3 | +## Architecture Overview |
| 4 | + |
| 5 | +The backup system is **fully automatic**: once a PVC carries the `backup=hourly` label, no further user intervention is required: |
| 6 | + |
| 7 | +``` |
| 8 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 9 | +│ AUTOMATIC BACKUP FLOW │ |
| 10 | +├─────────────────────────────────────────────────────────────────────────────┤ |
| 11 | +│ │ |
| 12 | +│ 1. User creates PVC with label: backup=hourly │ |
| 13 | +│ ↓ │ |
| 14 | +│ 2. Kyverno ClusterPolicy detects labeled PVC │ |
| 15 | +│ ↓ │ |
| 16 | +│ 3. Kyverno generates THREE resources automatically: │ |
| 17 | +│ ├── Secret (per-PVC S3 credentials + repo path) │ |
| 18 | +│ ├── ReplicationSource (hourly backup job) │ |
| 19 | +│ └── ReplicationDestination (one-time restore on PVC creation) │ |
| 20 | +│ ↓ │ |
| 21 | +│ 4. VolSync runs backup every hour (0 * * * *) │ |
| 22 | +│ ↓ │ |
| 23 | +│ 5. Data stored in RustFS S3: volsync-backup/<namespace>/<pvc-name> │ |
| 24 | +│ │ |
| 25 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 26 | +``` |
| 27 | + |
| 28 | +## Components |
| 29 | + |
| 30 | +| Component | Purpose | Location | |
| 31 | +|-----------|---------|----------| |
| 32 | +| Kyverno ClusterPolicy | Auto-generates VolSync resources | `infrastructure/controllers/kyverno/volsync-smart-restore.yaml` | |
| 33 | +| VolSync Operator | Runs restic backup/restore jobs | `volsync-system` namespace | |
| 34 | +| ClusterExternalSecret | Copies base S3 creds to namespaces | Creates `volsync-rustfs-base` secret | |
| 35 | +| RustFS | S3-compatible backup storage | TrueNAS @ 192.168.10.133:30292 | |
| 36 | + |
| 37 | +## Prerequisites |
| 38 | + |
| 39 | +For automatic backups to work, a namespace needs: |
| 40 | + |
| 41 | +1. **Label on namespace**: `volsync.backube/privileged-movers=true` |
| 42 | +2. **Base secret present**: `volsync-rustfs-base` (created by ClusterExternalSecret) |
| 43 | +3. **PVC label**: `backup=hourly` |
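
The three prerequisites above can be checked in one pass. The following is a minimal sketch, not part of the repository: the function name `check_backup_prereqs` and the `KUBECTL` override variable are hypothetical names introduced for illustration, while the label keys and the secret name are taken from the list above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which backup prerequisites a PVC is missing.
# KUBECTL may point at a wrapper or stub; it defaults to plain kubectl.
KUBECTL="${KUBECTL:-kubectl}"

check_backup_prereqs() {
  local ns="$1" pvc="$2" missing=0 value

  # 1. Namespace label volsync.backube/privileged-movers=true
  value=$($KUBECTL get namespace "$ns" \
    -o jsonpath='{.metadata.labels.volsync\.backube/privileged-movers}' 2>/dev/null)
  if [ "$value" != "true" ]; then
    echo "MISSING namespace label volsync.backube/privileged-movers=true on $ns"
    missing=1
  fi

  # 2. Base secret volsync-rustfs-base present in the namespace
  if ! $KUBECTL get secret volsync-rustfs-base -n "$ns" >/dev/null 2>&1; then
    echo "MISSING secret volsync-rustfs-base in $ns"
    missing=1
  fi

  # 3. PVC label backup=hourly
  value=$($KUBECTL get pvc "$pvc" -n "$ns" \
    -o jsonpath='{.metadata.labels.backup}' 2>/dev/null)
  if [ "$value" != "hourly" ]; then
    echo "MISSING label backup=hourly on pvc $ns/$pvc"
    missing=1
  fi

  [ "$missing" -eq 0 ] && echo "OK $ns/$pvc"
  return "$missing"
}
```

Calling `check_backup_prereqs <namespace> <pvc-name>` prints one line per missing prerequisite and returns non-zero if any is missing, so it can gate a script.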
| 44 | + |
| 45 | +## Quick Status Check |
| 46 | + |
| 47 | +```bash |
| 48 | +# Check all backup jobs |
| 49 | +kubectl get replicationsource -A |
| 50 | + |
| 51 | +# Check for stuck pods |
| 52 | +kubectl get pods -A | grep volsync | grep -v Running | grep -v Completed |
| 53 | + |
| 54 | +# Check Longhorn volume health |
| 55 | +kubectl get volumes.longhorn.io -n longhorn-system -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness | grep -E "(faulted|unknown)" |
| 56 | + |
| 57 | +# Check Kyverno policy status |
| 58 | +kubectl get clusterpolicy volsync-smart-protection |
| 59 | +``` |
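
To narrow the first command above to sources that have never completed a sync, the output can be filtered. This is a sketch under assumptions: `never_synced` is a hypothetical helper name, and it relies on `kubectl` printing `<none>` for a missing `.status.lastSyncTime` field in custom-columns output.

```bash
# Hypothetical helper: print "namespace/name" for every ReplicationSource
# whose third column is "<none>" (kubectl's placeholder for a missing field).
# Expects custom-columns output (header row, then NS NAME LAST rows) on stdin.
never_synced() {
  awk 'NR > 1 && $3 == "<none>" { print $1 "/" $2 }'
}

# Usage against a live cluster:
# kubectl get replicationsource -A \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,LAST:.status.lastSyncTime \
#   | never_synced
```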
| 60 | + |
| 61 | +## Known Issues & Solutions |
| 62 | + |
| 63 | +### Issue 1: Kyverno JMESPath Function Error |
| 64 | + |
| 65 | +**Symptom**: Per-PVC secrets not being generated |
| 66 | + |
| 67 | +**Cause**: Invalid JMESPath function `concat()` - should be `join('', [...])` |
| 68 | + |
| 69 | +**Fix Applied**: Changed in `volsync-smart-restore.yaml`: |
| 70 | +```yaml |
| 71 | +# WRONG |
| 72 | +RESTIC_REPOSITORY: "{{ base64_encode(concat(...)) }}" |
| 73 | + |
| 74 | +# CORRECT |
| 75 | +RESTIC_REPOSITORY: "{{ base64_encode(join('', [base64_decode(baseSecret.RESTIC_REPOSITORY_BASE), request.object.metadata.namespace, '/', request.object.metadata.name])) }}" |
| 76 | +``` |
| 77 | + |
| 78 | +**Verify**: |
| 79 | +```bash |
| 80 | +kubectl get secret -A | grep volsync-secret |
| 81 | +# Should see <pvc-name>-volsync-secret in each namespace |
| 82 | +``` |
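
To sanity-check what the corrected expression should produce, its logic can be reproduced locally. The sketch below is a hypothetical shell emulation (the function name `build_restic_repository` is not from the policy): it decodes the base value, appends `<namespace>/<pvc-name>`, and re-encodes the result. Note that the expression inserts no separator between the base and the namespace, so the base value presumably ends with `/`.

```bash
# Hypothetical emulation of the corrected expression:
#   base64_encode(join('', [base64_decode(base), namespace, '/', name]))
build_restic_repository() {
  local base_b64="$1" ns="$2" pvc="$3" base
  base=$(printf '%s' "$base_b64" | base64 -d)
  # tr strips the line wrapping that some base64 implementations add.
  printf '%s%s/%s' "$base" "$ns" "$pvc" | base64 | tr -d '\n'
}

# Compare with the value Kyverno actually generated:
# kubectl get secret <pvc-name>-volsync-secret -n <namespace> \
#   -o jsonpath='{.data.RESTIC_REPOSITORY}' | base64 -d
```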
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +### Issue 2: Longhorn Volumes Faulted After Mass Pod Restart |
| 87 | + |
| 88 | +**Symptom**: VolSync pods stuck in `ContainerCreating`, error: `volume is not ready for workloads` |
| 89 | + |
| 90 | +**Cause**: When all pods are killed simultaneously, Longhorn engines die unexpectedly and the affected volumes enter a faulted or unknown state |
| 91 | + |
| 92 | +**Events showing this**: |
| 93 | +``` |
| 94 | +Warning DetachedUnexpectedly Engine of volume pvc-xxx dead unexpectedly, setting v.Status.Robustness to faulted |
| 95 | +``` |
| 96 | + |
| 97 | +**Solution Options**: |
| 98 | + |
| 99 | +1. **Wait for auto-recovery** (Longhorn auto-salvage is enabled) |
| 100 | + ```bash |
| 101 | + # Monitor recovery |
| 102 | + watch kubectl get volumes.longhorn.io -n longhorn-system -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness |
| 103 | + ``` |
| 104 | + |
| 105 | +2. **Force delete faulted volumes** (if they're VolSync cache volumes) |
| 106 | + ```bash |
| 107 | + # Delete faulted Longhorn volumes (cache/restore PVCs only!) |
| 108 | + for vol in $(kubectl get volumes.longhorn.io -n longhorn-system -o json | jq -r '.items[] | select(.status.robustness == "faulted") | .metadata.name'); do |
| 109 | + echo "Deleting: $vol" |
| 110 | + kubectl delete volume.longhorn.io "$vol" -n longhorn-system |
| 111 | + done |
| 112 | + ``` |
| 113 | + |
| 114 | +3. **Trigger re-evaluation** by annotating PVCs |
| 115 | + ```bash |
| 116 | + kubectl annotate pvc <pvc-name> -n <namespace> kyverno.io/trigger="$(date +%s)" --overwrite |
| 117 | + ``` |
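
Because option 2 is only safe for cache and restore volumes, it helps to see which PVC each faulted volume belongs to before deleting anything. The following is a read-only sketch: `faulted_volumes` is a hypothetical helper, and the `.status.kubernetesStatus.pvcName` column path is an assumption about the Longhorn Volume resource that should be confirmed against the installed Longhorn version.

```bash
# Hypothetical helper: from custom-columns output (header row, then
# NAME PVC ROBUSTNESS rows) on stdin, print the name and PVC of every
# volume whose robustness is faulted or unknown. Deletes nothing.
faulted_volumes() {
  awk 'NR > 1 && ($3 == "faulted" || $3 == "unknown") { print $1, $2 }'
}

# Usage against a live cluster (the PVC column path is an assumption):
# kubectl get volumes.longhorn.io -n longhorn-system \
#   -o custom-columns=NAME:.metadata.name,PVC:.status.kubernetesStatus.pvcName,ROBUSTNESS:.status.robustness \
#   | faulted_volumes
```

Reviewing this list first keeps application data volumes out of the delete loop in option 2.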
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +### Issue 3: ReplicationDestination Pods Stuck |
| 122 | + |
| 123 | +**Symptom**: `volsync-dst-*-restore-*` pods stuck in ContainerCreating |
| 124 | + |
| 125 | +**Cause**: Restore jobs try to create cache volumes which may conflict with existing ones or fail to provision |
| 126 | + |
| 127 | +**Solution**: The restore jobs only matter for disaster recovery. Backups work independently. |
| 128 | + |
| 129 | +```bash |
| 130 | +# Delete stuck restore jobs (backups continue working) |
| 131 | +kubectl delete replicationdestination -A --all |
| 132 | + |
| 133 | +# Clean up orphaned restore PVCs |
| 134 | +kubectl delete pvc -A -l volsync.backube/replicationdestination |
| 135 | +``` |
| 136 | + |
| 137 | +--- |
| 138 | + |
| 139 | +### Issue 4: No Base Secret in Namespace |
| 140 | + |
| 141 | +**Symptom**: Kyverno policy precondition fails, no resources generated |
| 142 | + |
| 143 | +**Check**: |
| 144 | +```bash |
| 145 | +kubectl get secret volsync-rustfs-base -n <namespace> |
| 146 | +``` |
| 147 | + |
| 148 | +**Cause**: Namespace missing label for ClusterExternalSecret |
| 149 | + |
| 150 | +**Fix**: |
| 151 | +```bash |
| 152 | +kubectl label namespace <namespace> volsync.backube/privileged-movers=true |
| 153 | +``` |
| 154 | + |
| 155 | +--- |
| 156 | + |
| 157 | +### Issue 5: Backup Never Runs |
| 158 | + |
| 159 | +**Symptom**: ReplicationSource exists but its `LAST SYNC` column is empty |
| 160 | + |
| 161 | +**Check schedule**: |
| 162 | +```bash |
| 163 | +kubectl get replicationsource <name> -n <namespace> -o yaml | grep -A5 trigger |
| 164 | +``` |
| 165 | + |
| 166 | +**Manual trigger**: |
| 167 | +```bash |
| 168 | +kubectl patch replicationsource <name> -n <namespace> --type merge -p '{"spec":{"trigger":{"manual":"'$(date +%s)'"}}}' |
| 169 | +``` |
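
The quoting in the manual trigger command is easy to get wrong, so the patch body can be built separately and the result confirmed afterwards. This is a minimal sketch: `manual_trigger_patch` is a hypothetical helper, and the confirmation step assumes VolSync copies the trigger value into `.status.lastManualSync` once the sync completes.

```bash
# Hypothetical helper: build the merge-patch body for a manual trigger.
# $1 is the trigger value; any string that differs from the previous one.
manual_trigger_patch() {
  printf '{"spec":{"trigger":{"manual":"%s"}}}' "$1"
}

# Usage against a live cluster:
# tag=$(date +%s)
# kubectl patch replicationsource <name> -n <namespace> --type merge \
#   -p "$(manual_trigger_patch "$tag")"
# kubectl get replicationsource <name> -n <namespace> \
#   -o jsonpath='{.status.lastManualSync}'
# The sync has finished once the last command prints the same value as $tag.
```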
| 170 | + |
| 171 | +--- |
| 172 | + |
| 173 | +## Longhorn Settings Reference |
| 174 | + |
| 175 | +Current settings that affect VolSync: |
| 176 | + |
| 177 | +| Setting | Value | Impact | |
| 178 | +|---------|-------|--------| |
| 179 | +| `auto-salvage` | true | Automatically recovers faulted volumes | |
| 180 | +| `default-replica-count` | 3 | Requires 3 nodes for full redundancy | |
| 181 | +| `replica-soft-anti-affinity` | false | Replicas must be on different nodes | |
| 182 | + |
| 183 | +## RustFS Bucket Structure |
| 184 | + |
| 185 | +``` |
| 186 | +volsync-backup/ |
| 187 | +├── home-assistant/ |
| 188 | +│ └── config/ # Home Assistant config PVC |
| 189 | +├── karakeep/ |
| 190 | +│ ├── data-pvc/ # Karakeep data |
| 191 | +│ └── meilisearch-pvc/ # Karakeep search index |
| 192 | +├── khoj/ |
| 193 | +│ └── config/ |
| 194 | +├── n8n/ |
| 195 | +│ └── data/ |
| 196 | +├── open-webui/ |
| 197 | +│ ├── data/ |
| 198 | +│ └── storage/ |
| 199 | +├── paperless-ngx/ |
| 200 | +│ ├── data/ |
| 201 | +│ └── media/ |
| 202 | +└── redis-instance/ |
| 203 | + └── redis-master-0/ |
| 204 | +``` |
| 205 | + |
| 206 | +## Session Notes: 2026-01-18 |
| 207 | + |
| 208 | +### Problems Observed |
| 209 | + |
| 210 | +1. **Kyverno policy had invalid JMESPath**: `concat()` doesn't exist, changed to `join('', [...])` |
| 211 | + |
| 212 | +2. **Mass pod restart caused Longhorn volume faults**: After all pods were killed, Longhorn engines died unexpectedly, causing volumes to enter a faulted or unknown state |
| 213 | + |
| 214 | +3. **VolSync cache volumes stuck**: New PVCs for caches were created but couldn't attach due to Longhorn recovery state |
| 215 | + |
| 216 | +### Current Status |
| 217 | + |
| 218 | +| Namespace | PVC | Backup Status | |
| 219 | +|-----------|-----|---------------| |
| 220 | +| home-assistant | config | ✅ Working | |
| 221 | +| karakeep | data-pvc | ✅ Working | |
| 222 | +| karakeep | meilisearch-pvc | ✅ Working | |
| 223 | +| khoj | config | ✅ Working | |
| 224 | +| n8n | data | ✅ Working | |
| 225 | +| open-webui | data | ✅ Working | |
| 226 | +| open-webui | storage | ❌ Stuck (Longhorn recovery) | |
| 227 | +| paperless-ngx | data | ❌ Stuck (Longhorn recovery) | |
| 228 | +| paperless-ngx | media | ❌ Stuck (Longhorn recovery) | |
| 229 | +| redis-instance | redis-master-0 | ❌ Stuck (Longhorn recovery) | |
| 230 | +| volsync-test | volsync-test-data | ✅ Working | |
| 231 | + |
| 232 | +### Files Modified |
| 233 | + |
| 234 | +- `infrastructure/controllers/kyverno/volsync-smart-restore.yaml` - Fixed JMESPath function |
| 235 | + |
| 236 | +### Next Steps |
| 237 | + |
| 238 | +1. Monitor Longhorn volume recovery: `watch kubectl get volumes.longhorn.io -n longhorn-system | grep -E "(faulted|unknown)"` |
| 239 | +2. Once volumes recover, stuck backups should auto-resume at next scheduled time |
| 240 | +3. If volumes don't recover, delete the faulted cache/restore volumes (see Issue 2) and let Kyverno regenerate the VolSync resources |
| 241 | + |
| 242 | +### Root Cause Analysis |
| 243 | + |
| 244 | +The VolSync/Kyverno system works correctly. The issues were: |
| 245 | + |
| 246 | +1. **One-time bug**: JMESPath syntax error in Kyverno policy (now fixed) |
| 247 | +2. **Transient issue**: Longhorn volume recovery after mass pod restart (expected to self-heal via auto-salvage) |
| 248 | + |
| 249 | +The system IS fully automatic when: |
| 250 | +- Longhorn is healthy |
| 251 | +- Kyverno is running |
| 252 | +- Base secrets are in place |