|
| 1 | +# 🗄️ Longhorn Backup Implementation Summary |
| 2 | + |
| 3 | +## 🎯 What We Built |
| 4 | + |
| 5 | +A **production-grade Longhorn backup and disaster recovery system** with TrueNAS Scale integration, featuring: |
| 6 | + |
| 7 | +### 📦 Components Created |
| 8 | + |
| 9 | +1. **Backup Configuration** (`infrastructure/storage/longhorn/backup-settings.yaml`) |
| 10 | + - NFS and S3 backup target support |
| 11 | + - Compression and concurrent backup settings |
| 12 | + - Snapshot data integrity checks |
| 13 | + |
| 14 | +2. **Recurring Jobs** (`infrastructure/storage/longhorn/recurring-jobs.yaml`) |
| 15 | + - **Critical Data**: Hourly snapshots + Daily backups (30-day retention) |
| 16 | + - **Important Data**: 4-hour snapshots + Daily backups (14-day retention) |
| 17 | + - **Standard Data**: Daily snapshots + Weekly backups (4-week retention) |
| 18 | + |
| 19 | +3. **Backup Management Script** (`scripts/longhorn-backup-management.sh`) |
| 20 | + - Interactive menu-driven backup operations |
| 21 | + - TrueNAS NFS/S3 configuration |
| 22 | + - Manual backup/restore operations |
| 23 | + - Volume labeling by data tier |
| 24 | + - Disaster recovery procedures |
| 25 | + |
| 26 | +4. **Monitoring & Alerting** (`monitoring/prometheus-stack/longhorn-backup-alerts.yaml`) |
| 27 | + - 12 comprehensive Prometheus alert rules |
| 28 | + - Backup failure detection |
| 29 | + - Storage capacity monitoring |
| 30 | + - Volume health alerts |
| 31 | + |
| 32 | +5. **Comprehensive Documentation** |
| 33 | + - **[Longhorn Backup Guide](docs/longhorn-backup-guide.md)** - Complete setup and operations |
| 34 | + - **[Emergency Procedures](docs/runbooks/longhorn-emergency-procedures.md)** - Critical incident response |
| 35 | + |
| 36 | +## 🏗️ Architecture Overview |
| 37 | + |
| 38 | +``` |
| 39 | +┌─────────────────────────────────────────────────────────────────┐ |
| 40 | +│ Kubernetes Cluster │ |
| 41 | +│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ |
| 42 | +│ │ Critical │ │ Important │ │ Standard │ │ |
| 43 | +│ │ Data Tier │ │ Data Tier │ │ Data Tier │ │ |
| 44 | +│ │ │ │ │ │ │ │ |
| 45 | +│ │ • Databases │ │ • Media Files │ │ • Logs │ │ |
| 46 | +│ │ • User Data │ │ • Configs │ │ • Cache │ │ |
| 47 | +│ │ • Immich │ │ • Home Auto │ │ • Temp Data │ │ |
| 48 | +│ │ │ │ │ │ │ │ |
| 49 | +│ │ Hourly Snaps │ │ 4hr Snaps │ │ Daily Snaps │ │ |
| 50 | +│ │ Daily Backups │ │ Daily Backups │ │ Weekly Backups │ │ |
| 51 | +│ │ 30d Retention │ │ 14d Retention │ │ 4w Retention │ │ |
| 52 | +│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ |
| 53 | +│ │ │ |
| 54 | +│ ▼ │ |
| 55 | +│ ┌─────────────────┐ │ |
| 56 | +│ │ Longhorn Storage │ │ |
| 57 | +│ │ Snapshots & │ │ |
| 58 | +│ │ Backups │ │ |
| 59 | +│ └─────────────────┘ │ |
| 60 | +└─────────────────────────────────────────────────────────────────┘ |
| 61 | + │ |
| 62 | + ▼ |
| 63 | +┌─────────────────────────────────────────────────────────────────┐ |
| 64 | +│ TrueNAS Scale │ |
| 65 | +│ ┌─────────────────┐ ┌─────────────────┐ │ |
| 66 | +│ │ NFS Share │ │ MinIO S3 │ │ |
| 67 | +│ │ (Primary Target)│ │ (Alternative) │ │ |
| 68 | +│ │ │ │ │ │ |
| 69 | +│ │ /mnt/tank/ │ │ longhorn- │ │ |
| 70 | +│ │ longhorn-backups│ │ backups bucket │ │ |
| 71 | +│ └─────────────────┘ └─────────────────┘ │ |
| 72 | +│ │ │ |
| 73 | +│ ▼ │ |
| 74 | +│ ┌─────────────────┐ │ |
| 75 | +│ │ ZFS Snapshots │ │ |
| 76 | +│ │ (Additional │ │ |
| 77 | +│ │ Protection) │ │ |
| 78 | +│ └─────────────────┘ │ |
| 79 | +└─────────────────────────────────────────────────────────────────┘ |
| 80 | +``` |
| 81 | + |
| 82 | +## 🎯 Data Tier Strategy |
| 83 | + |
| 84 | +### Critical Data (RTO: 1 hour, RPO: 1 hour) |
| 85 | +- **Namespaces**: `cloudnative-pg`, `immich`, `khoj`, `paperless-ngx` |
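| 86 | +- **Schedule**: Hourly snapshots, daily backups (snapshots stay in the cluster, so the 1-hour RPO holds only while the cluster survives; off-cluster recovery can lose up to 24 hours) |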
| 87 | +- **Retention**: 24 snapshots, 30 backups |
| 88 | +- **Examples**: PostgreSQL databases, user photos, documents |
| 89 | + |
| 90 | +### Important Data (RTO: 4 hours, RPO: 4 hours) |
| 91 | +- **Namespaces**: `frigate`, `jellyfin`, `plex`, `home-assistant`, `hoarder` |
| 92 | +- **Schedule**: 4-hour snapshots, daily backups (the 4-hour RPO depends on in-cluster snapshots; off-cluster recovery can lose up to 24 hours) |
| 93 | +- **Retention**: 12 snapshots, 14 backups |
| 94 | +- **Examples**: Media libraries, security footage, configurations |
| 95 | + |
| 96 | +### Standard Data (RTO: 24 hours, RPO: 24 hours) |
| 97 | +- **Namespaces**: All others |
| 98 | +- **Schedule**: Daily snapshots, weekly backups (the 24-hour RPO depends on in-cluster snapshots; off-cluster recovery can lose up to 7 days) |
| 99 | +- **Retention**: 7 snapshots, 4 backups |
| 100 | +- **Examples**: Logs, cache, temporary data |
| 101 | + |
| 102 | +## 🚀 Quick Start Guide |
| 103 | + |
| 104 | +### 1. Deploy Backup Configuration |
| 105 | +```bash |
| 106 | +# Apply backup settings and recurring jobs |
| 107 | +kubectl apply -f infrastructure/storage/longhorn/backup-settings.yaml |
| 108 | +kubectl apply -f infrastructure/storage/longhorn/recurring-jobs.yaml |
| 109 | +``` |
| 110 | + |
| 111 | +### 2. Configure TrueNAS Backup Target |
| 112 | +```bash |
| 113 | +# Run interactive backup management script |
| 114 | +chmod +x scripts/longhorn-backup-management.sh |
| 115 | +./scripts/longhorn-backup-management.sh |
| 116 | + |
| 117 | +# Select option 1: Configure backup target (NFS/S3) |
| 118 | +# Enter your TrueNAS IP and NFS path |
| 119 | +``` |
| 120 | + |
| 121 | +### 3. Label Volumes by Data Tier |
| 122 | +```bash |
| 123 | +# Auto-label volumes based on namespace |
| 124 | +./scripts/longhorn-backup-management.sh |
| 125 | +# Select option 10: Label volumes by data tier |
| 126 | +``` |
| 127 | + |
| 128 | +### 4. Verify Backup Health |
| 129 | +```bash |
| 130 | +# Check backup system status |
| 131 | +./scripts/longhorn-backup-management.sh |
| 132 | +# Select option 9: Check backup system health |
| 133 | +``` |
| 134 | + |
| 135 | +## 🔧 TrueNAS Scale Setup |
| 136 | + |
| 137 | +### 1. Create NFS Dataset |
| 138 | +```bash |
| 139 | +# On TrueNAS Scale |
| 140 | +zfs create tank/longhorn-backups |
| 141 | +chmod 755 /mnt/tank/longhorn-backups |
| 142 | +chown root:root /mnt/tank/longhorn-backups |
| 143 | +``` |
| 144 | + |
| 145 | +### 2. Configure NFS Share |
| 146 | +- **Path**: `/mnt/tank/longhorn-backups` |
| 147 | +- **Networks**: Your Kubernetes subnet (e.g., `10.0.0.0/24`) |
| 148 | +- **Maproot User**: `root` |
| 149 | +- **Maproot Group**: `root` (TrueNAS Scale is Linux-based and has no `wheel` group) |
| 150 | + |
| 151 | +### 3. Optional: ZFS Auto-Snapshots |
| 152 | +```bash |
| 153 | +zfs set com.sun:auto-snapshot=true tank/longhorn-backups |
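| 154 | +zfs set com.sun:auto-snapshot:hourly=true tank/longhorn-backups # true/false only; read by the zfs-auto-snapshot tool, retention (e.g. 48) is set with its --keep option |
| 155 | +zfs set com.sun:auto-snapshot:daily=true tank/longhorn-backups # e.g. --keep=30; without zfs-auto-snapshot, use a TrueNAS Periodic Snapshot Task instead |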
| 156 | +``` |
| 157 | + |
| 158 | +## 📊 Monitoring & Alerting |
| 159 | + |
| 160 | +### Prometheus Alerts Configured (6 of the 12 rules shown) |
| 161 | +- **LonghornBackupTargetUnavailable** - Critical backup target issues |
| 162 | +- **LonghornBackupFailed** - Failed backup jobs |
| 163 | +- **LonghornNoRecentBackup** - Missing backups for critical data |
| 164 | +- **LonghornBackupStorageHigh** - Storage capacity warnings |
| 165 | +- **LonghornVolumeFaulted** - Critical volume failures |
| 166 | +- **LonghornNodeStorageLow** - Node storage capacity warnings |
| 167 | + |
| 168 | +### Grafana Dashboards |
| 169 | +- Volume health and performance metrics |
| 170 | +- Backup job status and progress |
| 171 | +- Storage utilization across nodes |
| 172 | +- Snapshot chain length monitoring |
| 173 | + |
| 174 | +## 🚨 Emergency Procedures |
| 175 | + |
| 176 | +### Volume Faulted (CRITICAL - 15 min RTO) |
| 177 | +1. Stop applications using the volume |
| 178 | +2. Assess volume state and replicas |
| 179 | +3. Create emergency backup if possible |
| 180 | +4. Restore from most recent backup |
| 181 | +5. Update PVC to use recovered volume |
| 182 | + |
| 183 | +### Backup Target Down (HIGH - 30 min RTO) |
| 184 | +1. Test NFS/S3 connectivity |
| 185 | +2. Check TrueNAS service status |
| 186 | +3. Verify network connectivity |
| 187 | +4. Reconfigure backup target if needed |
| 188 | + |
| 189 | +### Complete Cluster Loss (CRITICAL - 4 hour RTO) |
| 190 | +1. Deploy Longhorn on new cluster |
| 191 | +2. Configure backup target |
| 192 | +3. Restore critical volumes first |
| 193 | +4. Recreate PVCs and deploy applications |
| 194 | +5. Verify data integrity |
| 195 | + |
| 196 | +## 📈 Backup Scheduling |
| 197 | + |
| 198 | +| **Time** | **Action** | **Target** | |
| 199 | +|----------|------------|------------| |
| 200 | +| Every hour | Snapshot | Critical data | |
| 201 | +| Every 4 hours | Snapshot | Important data | |
| 202 | +| Daily 2 AM | Backup | Critical data | |
| 203 | +| Daily 3 AM | Backup | Important data | |
| 204 | +| Daily 4 AM | Snapshot | Standard data | |
| 205 | +| Weekly Sunday 1 AM | Full backup | All data | |
| 206 | +| Weekly Sunday 5 AM | Backup | Standard data | |
| 207 | + |
| 208 | +## 🛠️ Management Operations |
| 209 | + |
| 210 | +### Script Features (`scripts/longhorn-backup-management.sh`) |
| 211 | +1. **Configure backup target** (NFS/S3) |
| 212 | +2. **List all volumes** with status |
| 213 | +3. **Create manual snapshot** for any volume |
| 214 | +4. **Create manual backup** for any volume |
| 215 | +5. **List snapshots** for specific volume |
| 216 | +6. **List backups** for specific volume |
| 217 | +7. **Restore from backup** to new volume |
| 218 | +8. **Disaster recovery backup** (all critical) |
| 219 | +9. **Check backup system health** |
| 220 | +10. **Label volumes by data tier** |
| 221 | +11. **Cleanup old backups** |
| 222 | + |
| 223 | +### Key Commands |
| 224 | +```bash |
| 225 | +# Quick health check (fully qualified resource names avoid clashes with other CRDs) |
| 226 | +kubectl get volumes.longhorn.io -n longhorn-system -o custom-columns="NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness" |
| 227 | + |
| 228 | +# Check backup jobs, oldest first |
| 229 | +kubectl get backups.longhorn.io -n longhorn-system --sort-by=.metadata.creationTimestamp |
| 230 | + |
| 231 | +# Monitor recurring jobs |
| 232 | +kubectl get recurringjobs.longhorn.io -n longhorn-system |
| 233 | + |
| 234 | +# Check backup target (if your Longhorn release manages targets as BackupTarget resources, query backuptargets.longhorn.io instead) |
| 235 | +kubectl get settings.longhorn.io backup-target -n longhorn-system -o yaml |
| 236 | +``` |
| 237 | + |
| 238 | +## 🎯 Production Benefits |
| 239 | + |
| 240 | +1. **Automated Protection**: Snapshots and backups run on a schedule without manual intervention |
| 241 | +2. **Tiered Strategy**: Different protection levels based on data criticality |
| 242 | +3. **TrueNAS Integration**: Leverages enterprise-grade ZFS storage |
| 243 | +4. **Comprehensive Monitoring**: Proactive alerting on backup failures |
| 244 | +5. **Emergency Procedures**: Documented recovery processes |
| 245 | +6. **Scriptable Operations**: Automation-friendly management tools |
| 246 | + |
| 247 | +## 📋 Next Steps |
| 248 | + |
| 249 | +1. **Test the backup system**: |
| 250 | + ```bash |
| 251 | + ./scripts/longhorn-backup-management.sh |
| 252 | + # Create test backup and verify restore |
| 253 | + ``` |
| 254 | + |
| 255 | +2. **Configure TrueNAS**: |
| 256 | + - Set up NFS share |
| 257 | + - Configure ZFS snapshots |
| 258 | + - Test connectivity |
| 259 | + |
| 260 | +3. **Monitor backup health**: |
| 261 | + - Check Grafana dashboards |
| 262 | + - Verify Prometheus alerts |
| 263 | + - Test emergency procedures |
| 264 | + |
| 265 | +4. **Schedule regular testing**: |
| 266 | + - Monthly restore tests |
| 267 | + - Quarterly DR drills |
| 268 | + - Annual full cluster recovery |
| 269 | + |
| 270 | +This implementation provides **tiered, scheduled backup and documented disaster recovery** for your Longhorn storage: critical data is protected by in-cluster snapshots, off-cluster backups on TrueNAS, and optional ZFS snapshots of the backup dataset. 🛡️ |
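
The namespace-to-tier mapping described under Data Tier Strategy could be sketched in shell as follows. This is a minimal sketch under stated assumptions, not the logic of `scripts/longhorn-backup-management.sh`, which this diff does not show. The function names `tier_for_namespace` and `label_command` are hypothetical, and the sketch assumes the recurring-job groups in `recurring-jobs.yaml` are named `critical`, `important`, and `standard`. The label key follows Longhorn's `recurring-job-group.longhorn.io/<group>=enabled` convention for attaching a volume to a job group.

```bash
#!/usr/bin/env bash
# Sketch only: function names and group names are assumptions, not taken
# from the real management script.

# Print the backup tier for a namespace, using the namespace lists from
# the Data Tier Strategy section. Anything unlisted falls back to standard.
tier_for_namespace() {
  case "$1" in
    cloudnative-pg|immich|khoj|paperless-ngx)     echo "critical" ;;
    frigate|jellyfin|plex|home-assistant|hoarder) echo "important" ;;
    *)                                            echo "standard" ;;
  esac
}

# Print (do not run) the kubectl command that would attach a Longhorn
# volume to the recurring-job group named after its tier. Printing keeps
# the sketch safe to try without a cluster.
label_command() {
  local volume="$1" namespace="$2" tier
  tier="$(tier_for_namespace "$namespace")"
  echo "kubectl -n longhorn-system label volumes.longhorn.io/${volume} recurring-job-group.longhorn.io/${tier}=enabled --overwrite"
}
```

For example, `label_command pvc-1234 immich` prints a command that labels volume `pvc-1234` with `recurring-job-group.longhorn.io/critical=enabled`. Review the printed command before running it against a cluster.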