| 1 | +# Longhorn Node Removal Procedure |
| 2 | + |
| 3 | +## Problem You Experienced |
| 4 | + |
| 5 | +When nodes `talos-blj-72f` and `talos-kyk-7ek` were removed from your cluster without proper Longhorn evacuation: |
| 6 | + |
| 7 | +1. **40+ volumes became faulted** because they had replicas on the removed nodes |
| 8 | +2. **PVCs stuck in Terminating state** due to finalizers |
| 9 | +3. **Applications couldn't start** (Prometheus, Loki, Gitea) |
| 10 | +4. **Manual intervention required** to patch PVCs and delete pods |
| 11 | + |
| 12 | +## Root Causes Fixed |
| 13 | + |
| 14 | +### 1. Configuration Changes Applied |
| 15 | + |
| 16 | +#### Added `node-failure-settings.yaml`: |
| 17 | +- `node-down-pod-deletion-policy`: Changed from `do-nothing` to `delete-both-statefulset-and-deployment-pod` |
| 18 | +- `orphan-auto-deletion`: Enabled to automatically clean up orphaned data |
| 19 | +- `storage-reserved-percentage-default`: Increased from 10% to 25% |
| 20 | + |
| 21 | +#### Updated `values.yaml`: |
| 22 | +- `storageMinimalAvailablePercentage`: 10% → 25% |
| 23 | + |
| 24 | +These changes are in Git and will be applied by ArgoCD on next sync. |
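Once ArgoCD has synced, it can be worth confirming that the live settings match what is in Git. The sketch below is a minimal, hypothetical helper (`check_setting` is not part of Longhorn or kubectl). The commented example assumes each Longhorn setting is exposed as a `settings.longhorn.io` object with a top-level `value` field.

```bash
#!/usr/bin/env bash
# check_setting NAME EXPECTED ACTUAL
# Hypothetical helper: prints OK/DIFF and returns non-zero on a mismatch.
check_setting() {
  local name=$1 expected=$2 actual=$3
  if [[ $actual == "$expected" ]]; then
    echo "OK   $name=$actual"
  else
    echo "DIFF $name: expected '$expected', got '${actual:-<unset>}'"
    return 1
  fi
}

# Example against a live cluster (assumes kubectl access to longhorn-system):
#   actual=$(kubectl get settings.longhorn.io node-down-pod-deletion-policy \
#     -n longhorn-system -o jsonpath='{.value}')
#   check_setting node-down-pod-deletion-policy \
#     delete-both-statefulset-and-deployment-pod "$actual"
```

Keeping the comparison separate from the `kubectl` call means the same function can be reused for each setting listed above.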
| 25 | + |
| 26 | +--- |
| 27 | + |
| 28 | +## PROPER Node Removal Procedure |
| 29 | + |
| 30 | +**ALWAYS follow these steps BEFORE removing a node from Kubernetes!** |
| 31 | + |
| 32 | +### Step 1: Check Node Health in Longhorn |
| 33 | + |
| 34 | +```bash |
| 35 | +# View Longhorn node status |
| 36 | +kubectl get nodes.longhorn.io -n longhorn-system |
| 37 | + |
| 38 | +# Check replica distribution on the node you want to remove |
| 39 | +NODE_NAME="talos-xyz-abc" # Replace with actual node name |
| 40 | +kubectl get nodes.longhorn.io $NODE_NAME -n longhorn-system \ |
| 41 | + -o jsonpath='{.status.diskStatus.*.scheduledReplica}' | jq |
| 42 | +``` |
| 43 | + |
| 44 | +### Step 2: Disable Scheduling on the Node |
| 45 | + |
| 46 | +This prevents new replicas from being scheduled to the node: |
| 47 | + |
| 48 | +```bash |
| 49 | +kubectl patch node.longhorn.io $NODE_NAME -n longhorn-system \ |
| 50 | + --type=merge -p '{"spec":{"allowScheduling":false}}' |
| 51 | +``` |
| 52 | + |
| 53 | +Verify: |
| 54 | +```bash |
| 55 | +kubectl get nodes.longhorn.io $NODE_NAME -n longhorn-system -o jsonpath='{.spec.allowScheduling}{"\n"}' # should print false |
| 56 | +``` |
| 57 | + |
| 58 | +### Step 3: Request Replica Eviction |
| 59 | + |
| 60 | +This migrates all replicas off the node: |
| 61 | + |
| 62 | +```bash |
| 63 | +kubectl patch node.longhorn.io $NODE_NAME -n longhorn-system \ |
| 64 | + --type=merge -p '{"spec":{"evictionRequested":true}}' |
| 65 | +``` |
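If an eviction has to be aborted (for example, because the remaining nodes lack space), the same two `spec` fields can be flipped back. This is a minimal sketch, assuming only the `allowScheduling` and `evictionRequested` fields used in Steps 2 and 3; `node_patch` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# node_patch ALLOW_SCHEDULING EVICTION_REQUESTED
# Builds the JSON merge patch for a Longhorn node; both arguments must be
# the literal strings "true" or "false".
node_patch() {
  local arg
  for arg in "$1" "$2"; do
    if [[ $arg != true && $arg != false ]]; then
      echo "node_patch: expected true|false, got '$arg'" >&2
      return 2
    fi
  done
  printf '{"spec":{"allowScheduling":%s,"evictionRequested":%s}}\n' "$1" "$2"
}

# Start evacuation (Steps 2 and 3 in one call):
#   kubectl patch nodes.longhorn.io "$NODE_NAME" -n longhorn-system \
#     --type=merge -p "$(node_patch false true)"
# Abort and return the node to service:
#   kubectl patch nodes.longhorn.io "$NODE_NAME" -n longhorn-system \
#     --type=merge -p "$(node_patch true false)"
```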
| 66 | + |
| 67 | +### Step 4: Monitor Replica Migration |
| 68 | + |
| 69 | +**This is critical** - wait for ALL replicas to migrate: |
| 70 | + |
| 71 | +```bash |
| 72 | +# Watch the migration process |
| 73 | +watch kubectl get nodes.longhorn.io $NODE_NAME -n longhorn-system |
| 74 | + |
| 75 | +# Check scheduled replica count per disk (each number should become 0) |
| 76 | +kubectl get nodes.longhorn.io $NODE_NAME -n longhorn-system \ |
| 77 | + -o jsonpath='{.status.diskStatus.*.scheduledReplica}' | jq '. | length' |
| 78 | + |
| 79 | +# View migration progress for all volumes |
| 80 | +kubectl get volumes -n longhorn-system \ |
| 81 | + -o jsonpath='{range .items[?(@.status.kubernetesStatus.lastPodRefAt!="")]}{.metadata.name}{"\t"}{.status.robustness}{"\n"}{end}' |
| 82 | +``` |
| 83 | + |
| 84 | +**Wait until:** |
| 85 | +- Scheduled replica count = 0 |
| 86 | +- All volumes show `robustness: healthy` |
| 87 | +- No replica rebuilding in progress |
| 88 | + |
| 89 | +This may take **10-30 minutes** depending on data size. |
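Rather than re-running the count by hand, the wait can be scripted. The following is a sketch, not a tested production script: `sum_counts` and `wait_until_zero` are hypothetical helper names, and the commented usage assumes the same `jsonpath` and `jq` pipeline shown in this step.

```bash
#!/usr/bin/env bash
# sum_counts: adds up one integer per line from stdin (e.g. per-disk
# replica counts) and prints the total. Non-numeric lines are ignored.
sum_counts() {
  local total=0 n
  while read -r n || [[ -n $n ]]; do
    if [[ $n =~ ^[0-9]+$ ]]; then
      total=$(( total + n ))
    fi
  done
  echo "$total"
}

# wait_until_zero TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Polls COMMAND until it prints exactly "0"; returns 1 on timeout.
wait_until_zero() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout )) remaining
  while :; do
    remaining=$("$@") || remaining=""
    if [[ $remaining == 0 ]]; then
      echo "No replicas left"
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "Timed out with ${remaining:-unknown} replica(s) remaining" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (assumes NODE_NAME is set and kubectl/jq are available):
#   remaining() {
#     kubectl get nodes.longhorn.io "$NODE_NAME" -n longhorn-system \
#       -o jsonpath='{.status.diskStatus.*.scheduledReplica}' | jq 'length' | sum_counts
#   }
#   wait_until_zero 1800 15 remaining
```

A zero replica count alone does not replace the other two checks above; volume robustness still needs to be confirmed before moving on.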
| 90 | + |
| 91 | +### Step 5: Remove Node from Longhorn |
| 92 | + |
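| 93 | +Only after ALL replicas are migrated. If Longhorn rejects the deletion because the node is still `Ready`, complete Step 6 first and then rerun this command: |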
| 94 | + |
| 95 | +```bash |
| 96 | +kubectl delete node.longhorn.io $NODE_NAME -n longhorn-system |
| 97 | +``` |
| 98 | + |
| 99 | +### Step 6: Drain and Remove Kubernetes Node |
| 100 | + |
| 101 | +Now it's safe to remove the node from Kubernetes: |
| 102 | + |
| 103 | +```bash |
| 104 | +# Drain the node (this will evict all pods) |
| 105 | +kubectl drain $NODE_NAME --ignore-daemonsets --delete-emptydir-data --timeout=10m |
| 106 | + |
| 107 | +# Verify no pods are running (except DaemonSets) |
| 108 | +kubectl get pods --all-namespaces --field-selector spec.nodeName=$NODE_NAME |
| 109 | + |
| 110 | +# Delete the node |
| 111 | +kubectl delete node $NODE_NAME |
| 112 | +``` |
| 113 | + |
| 114 | +### Step 7: Verify Cluster Health |
| 115 | + |
| 116 | +```bash |
| 117 | +# Check all nodes |
| 118 | +kubectl get nodes |
| 119 | + |
| 120 | +# Verify all volumes are healthy |
| 121 | +kubectl get volumes -n longhorn-system | grep -v healthy |
| 122 | + |
| 123 | +# Check for any faulted volumes (should be none) |
| 124 | +kubectl get volumes -n longhorn-system | grep faulted |
| 125 | +``` |
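The `grep -v healthy` check above also lists detached volumes, which typically report a robustness of `unknown` rather than `healthy`. A stricter filter can flag only `faulted` and `degraded` volumes and return a usable exit code. This is a sketch under that assumption; `report_unhealthy` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# report_unhealthy: reads "NAME ROBUSTNESS" lines on stdin, prints the
# faulted or degraded ones, and returns 1 if any were found.
report_unhealthy() {
  awk '$2 == "faulted" || $2 == "degraded" { print $1 "\t" $2; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Example (assumes kubectl access to longhorn-system):
#   kubectl get volumes.longhorn.io -n longhorn-system --no-headers \
#     -o custom-columns=NAME:.metadata.name,ROBUSTNESS:.status.robustness \
#     | report_unhealthy && echo "No faulted or degraded volumes"
```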
| 126 | + |
| 127 | +--- |
| 128 | + |
| 129 | +## Emergency Recovery (If Node Already Removed) |
| 130 | + |
| 131 | +If you've already removed a node and have faulted volumes: |
| 132 | + |
| 133 | +### 1. Identify Faulted Volumes |
| 134 | + |
| 135 | +```bash |
| 136 | +kubectl get volumes -n longhorn-system -o json | \ |
| 137 | + jq -r '.items[] | select(.status.robustness=="faulted") | |
| 138 | + {name: .metadata.name, state: .status.state, pv: .status.kubernetesStatus.pvName}' |
| 139 | +``` |
| 140 | + |
| 141 | +### 2. For Detached Faulted Volumes (Safe to Delete) |
| 142 | + |
| 143 | +```bash |
| 144 | +# List them |
| 145 | +kubectl get volumes -n longhorn-system -o json | \ |
| 146 | + jq -r '.items[] | select(.status.state=="detached" and .status.robustness=="faulted") | .metadata.name' |
| 147 | + |
| 148 | +# Delete them (they're not in use) |
| 149 | +for vol in $(kubectl get volumes -n longhorn-system -o json | \ |
| 150 | + jq -r '.items[] | select(.status.state=="detached" and .status.robustness=="faulted") | .metadata.name'); do |
| 151 | + echo "Deleting faulted volume: $vol" |
| 152 | + kubectl delete volume $vol -n longhorn-system |
| 153 | +done |
| 154 | +``` |
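Because deleting a Longhorn volume is irreversible, it may be safer to preview the deletions first. The sketch below defaults to a dry run and only calls `kubectl delete` when `DRY_RUN=0` is set. `delete_volumes` and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# delete_volumes: reads one Longhorn volume name per line on stdin.
# With DRY_RUN unset or 1 it only prints what it would delete;
# with DRY_RUN=0 it deletes each volume.
delete_volumes() {
  local vol
  while read -r vol || [[ -n $vol ]]; do
    if [[ -z $vol ]]; then
      continue
    fi
    if [[ ${DRY_RUN:-1} == 1 ]]; then
      echo "would delete: $vol"
    else
      # </dev/null keeps kubectl from consuming the list on stdin
      kubectl delete volumes.longhorn.io "$vol" -n longhorn-system </dev/null
    fi
  done
}

# Example: feed it the list produced by the jq filter above, review the
# output, then repeat with DRY_RUN=0:
#   <list command> | delete_volumes
#   <list command> | DRY_RUN=0 delete_volumes
```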
| 155 | + |
| 156 | +### 3. For Attached Faulted Volumes (More Complex) |
| 157 | + |
| 158 | +These require manual intervention: |
| 159 | + |
| 160 | +```bash |
| 161 | +# Find the PVC using the volume (a Terminating PVC keeps phase Bound, so also check deletionTimestamp) |
| 162 | +PV_NAME="pvc-xxxxx" |
| 163 | +kubectl get pvc -A -o json | \ |
| 164 | + jq -r --arg pv "$PV_NAME" '.items[] | select(.spec.volumeName==$pv) | |
| 165 | + {namespace: .metadata.namespace, name: .metadata.name, status: .status.phase, terminating: (.metadata.deletionTimestamp != null)}' |
| 166 | +``` |
| 167 | + |
| 168 | +If PVC is `Terminating`: |
| 169 | + |
| 170 | +```bash |
| 171 | +# Remove finalizers |
| 172 | +kubectl patch pvc <pvc-name> -n <namespace> \ |
| 173 | + -p '{"metadata":{"finalizers":null}}' --type=merge |
| 174 | + |
| 175 | +# Delete the pod using it |
| 176 | +kubectl delete pod <pod-name> -n <namespace> |
| 177 | +``` |
| 178 | + |
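| 179 | +If the pod belongs to a StatefulSet, the controller recreates the pod and a new PVC from its `volumeClaimTemplates`. The new volume starts empty, so restore data from backup if needed. For other workloads, the PVC must be recreated first (for example, by an ArgoCD sync). |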
| 180 | + |
| 181 | +--- |
| 182 | + |
| 183 | +## Monitoring and Alerting |
| 184 | + |
| 185 | +### Check Longhorn Health Regularly |
| 186 | + |
| 187 | +Add this to your maintenance routine: |
| 188 | + |
| 189 | +```bash |
| 190 | +#!/bin/bash |
| 191 | +# longhorn-health-check.sh |
| 192 | + |
| 193 | +echo "=== Longhorn Nodes ===" |
| 194 | +kubectl get nodes.longhorn.io -n longhorn-system |
| 195 | + |
| 196 | +echo -e "\n=== Faulted Volumes ===" |
| 197 | +kubectl get volumes -n longhorn-system | grep faulted || echo "None" |
| 198 | + |
| 199 | +echo -e "\n=== Degraded Volumes ===" |
| 200 | +kubectl get volumes -n longhorn-system | grep degraded || echo "None" |
| 201 | + |
| 202 | +echo -e "\n=== Storage Capacity ===" |
| 203 | +kubectl get nodes.longhorn.io -n longhorn-system \ |
| 204 | + -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.diskStatus.*.storageAvailable}{"\t/\t"}{.status.diskStatus.*.storageMaximum}{"\n"}{end}' | \ |
| 205 | + awk '{printf "%s\t%.2f GB / %.2f GB\n", $1, $2/1e9, $4/1e9}' |
| 206 | +``` |
| 207 | + |
| 208 | +### Prometheus Alerts |
| 209 | + |
| 210 | +Your [monitoring/prometheus-stack/longhorn-backup-alerts.yaml](../monitoring/prometheus-stack/longhorn-backup-alerts.yaml) should include: |
| 211 | + |
| 212 | +```yaml |
| 213 | +apiVersion: monitoring.coreos.com/v1 |
| 214 | +kind: PrometheusRule |
| 215 | +metadata: |
| 216 | + name: longhorn-health-alerts |
| 217 | + namespace: prometheus-stack |
| 218 | + labels: |
| 219 | + release: kube-prometheus-stack |
| 220 | +spec: |
| 221 | + groups: |
| 222 | + - name: longhorn.health |
| 223 | + interval: 30s |
| 224 | + rules: |
| 225 | + - alert: LonghornVolumeFaulted |
| 226 | + expr: longhorn_volume_robustness == 3 |
| 227 | + for: 5m |
| 228 | + labels: |
| 229 | + severity: critical |
| 230 | + annotations: |
| 231 | + summary: "Longhorn volume {{ $labels.volume }} is faulted" |
| 232 | + description: "Volume has been in faulted state for 5 minutes - data may be lost" |
| 233 | + |
| 234 | + - alert: LonghornNodeDown |
| 235 | + expr: longhorn_node_status{condition="ready"} == 0 |
| 236 | + for: 10m |
| 237 | + labels: |
| 238 | + severity: warning |
| 239 | + annotations: |
| 240 | + summary: "Longhorn node {{ $labels.node }} is down" |
| 241 | + description: "Check node health and migrate replicas if needed" |
| 242 | + |
| 243 | + - alert: LonghornDiskSpaceLow |
| 244 | + expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) > 0.75 |
| 245 | + for: 15m |
| 246 | + labels: |
| 247 | + severity: warning |
| 248 | + annotations: |
| 249 | + summary: "Longhorn disk space low on {{ $labels.node }}" |
| 250 | + description: "Disk usage is above 75% - consider adding storage or cleaning up" |
| 251 | +``` |
| 252 | + |
| 253 | +--- |
| 254 | + |
| 255 | +## Best Practices Summary |
| 256 | + |
| 257 | +### DO: |
| 258 | +- ✅ Always disable scheduling before removing a node |
| 259 | +- ✅ Request eviction and wait for migration to complete |
| 260 | +- ✅ Monitor replica migration progress |
| 261 | +- ✅ Verify all volumes are healthy before final removal |
| 262 | +- ✅ Keep at least 25% free space on Longhorn disks |
| 263 | +- ✅ Use replica count of 3 for critical data (already configured) |
| 264 | +- ✅ Test your backup/restore procedures regularly |
| 265 | + |
| 266 | +### DON'T: |
| 267 | +- ❌ Remove a Kubernetes node without evacuating Longhorn first |
| 268 | +- ❌ Force delete nodes with `kubectl delete node --force` |
| 269 | +- ❌ Ignore faulted or degraded volumes |
| 270 | +- ❌ Let disk space drop below 25% available |
| 271 | +- ❌ Skip the monitoring step during migration |
| 272 | + |
| 273 | +--- |
| 274 | + |
| 275 | +## Your Current Configuration |
| 276 | + |
| 277 | +- **Replica Count:** 3 (good for resilience) |
| 278 | +- **Storage Reserved:** 25% (prevents "insufficient storage" errors) |
| 279 | +- **Auto-balance:** best-effort (distributes replicas evenly) |
| 280 | +- **Node Down Policy:** delete-both-statefulset-and-deployment-pod (auto-recovery) |
| 281 | +- **Orphan Auto-Delete:** true (cleans up orphaned replica data) |
| 282 | + |
| 283 | +These settings are now in your Git repo and will be applied automatically. |
| 284 | + |
| 285 | +--- |
| 286 | + |
| 287 | +## Quick Reference |
| 288 | + |
| 289 | +```bash |
| 290 | +# Check if node is safe to remove |
| 291 | +kubectl get nodes.longhorn.io <node> -n longhorn-system -o jsonpath='{.status.diskStatus.*.scheduledReplica}' | jq '. | length' |
| 292 | + |
| 293 | +# jq prints one count per disk; if every count is 0, the node is empty and safe to remove |
| 294 | +# If any count is > 0, follow the evacuation procedure above |
| 295 | +``` |
| 296 | + |
| 297 | +**Remember:** Patience during replica migration saves hours of recovery work! |