Commit 9829f9a

committed
cleanup
1 parent 6ab86b6 commit 9829f9a

17 files changed

Lines changed: 1396 additions & 146 deletions

README.md

Lines changed: 12 additions & 2 deletions
@@ -111,8 +111,17 @@ talosctl apply-config --nodes <node-ip-2> --file iac/talos/clusterconfig/<node-2
111111
### 4. Install Gateway API CRDs
112112
This is a prerequisite for Cilium's Gateway API integration.
113113
```bash
114-
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
115-
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/experimental-install.yaml
114+
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
115+
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/experimental-install.yaml
116+
kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/experimental-install.yaml
117+
```
118+
### Install Cilium CNI with Gateway API support
119+
```bash
120+
kubectl kustomize infrastructure/networking/cilium --enable-helm | kubectl apply -f -
121+
122+
123+
# Verify Cilium is running
124+
kubectl get pods -n kube-system -l k8s-app=cilium
116125
```
117126

118127
### 5. Configure Secret Management
@@ -128,6 +137,7 @@ This cluster uses [1Password Connect](https://developer.1password.com/docs/conne
128137

129138
3. **Create Kubernetes Secrets**:
130139
```bash
140+
eval $(op signin)
131141
export OP_CREDENTIALS=$(op read op://homelabproxmox/1passwordconnect/1password-credentials.json | base64 | tr -d '\n')
132142
export OP_CONNECT_TOKEN=$(op read 'op://homelabproxmox/1password-operator-token/credential')
133143

docs/LONGHORN-NODE-REMOVAL.md

Lines changed: 297 additions & 0 deletions
@@ -0,0 +1,297 @@
1+
# Longhorn Node Removal Procedure
2+
3+
## Problem You Experienced
4+
5+
When nodes `talos-blj-72f` and `talos-kyk-7ek` were removed from your cluster without proper Longhorn evacuation:
6+
7+
1. **40+ volumes became faulted** because they had replicas on the removed nodes
8+
2. **PVCs stuck in Terminating state** due to finalizers
9+
3. **Applications couldn't start** (Prometheus, Loki, Gitea)
10+
4. **Manual intervention required** to patch PVCs and delete pods
11+
12+
## Root Causes Fixed
13+
14+
### 1. Configuration Changes Applied
15+
16+
#### Added `node-failure-settings.yaml`:
17+
- `node-down-pod-deletion-policy`: Changed from `do-nothing` to `delete-both-statefulset-and-deployment-pod`
18+
- `orphan-auto-deletion`: Enabled to automatically clean up orphaned data
19+
- `storage-reserved-percentage-default`: Increased from 10% to 25%
20+
21+
#### Updated `values.yaml`:
22+
- `storageMinimalAvailablePercentage`: 10% → 25%
23+
24+
These changes are in Git and will be applied by ArgoCD on next sync.
25+
26+
---
27+
28+
## PROPER Node Removal Procedure
29+
30+
**ALWAYS follow these steps BEFORE removing a node from Kubernetes!**
31+
32+
### Step 1: Check Node Health in Longhorn
33+
34+
```bash
35+
# View Longhorn node status
36+
kubectl get nodes.longhorn.io -n longhorn-system
37+
38+
# Check replica distribution on the node you want to remove
39+
NODE_NAME="talos-xyz-abc" # Replace with actual node name
40+
kubectl get nodes.longhorn.io $NODE_NAME -n longhorn-system \
41+
-o jsonpath='{.status.diskStatus.*.scheduledReplica}' | jq
42+
```
43+
44+
### Step 2: Disable Scheduling on the Node
45+
46+
This prevents new replicas from being scheduled to the node:
47+
48+
```bash
49+
kubectl patch node.longhorn.io $NODE_NAME -n longhorn-system \
50+
--type=merge -p '{"spec":{"allowScheduling":false}}'
51+
```
52+
53+
Verify:
54+
```bash
55+
kubectl get nodes.longhorn.io $NODE_NAME -n longhorn-system -o jsonpath='{.spec.allowScheduling}{"\n"}'   # expect: false
56+
```
57+
58+
### Step 3: Request Replica Eviction
59+
60+
This migrates all replicas off the node:
61+
62+
```bash
63+
kubectl patch node.longhorn.io $NODE_NAME -n longhorn-system \
64+
--type=merge -p '{"spec":{"evictionRequested":true}}'
65+
```
66+
67+
### Step 4: Monitor Replica Migration
68+
69+
**This is critical** - wait for ALL replicas to migrate:
70+
71+
```bash
72+
# Watch the migration process
73+
watch kubectl get nodes.longhorn.io $NODE_NAME -n longhorn-system
74+
75+
# Check scheduled replica count (prints one count per disk; each should become 0)
76+
kubectl get nodes.longhorn.io $NODE_NAME -n longhorn-system \
77+
-o jsonpath='{.status.diskStatus.*.scheduledReplica}' | jq '. | length'
78+
79+
# View migration progress for all volumes
80+
kubectl get volumes -n longhorn-system \
81+
-o jsonpath='{range .items[?(@.status.kubernetesStatus.lastPodRefAt!="")]}{.metadata.name}{"\t"}{.status.robustness}{"\n"}{end}'
82+
```
83+
84+
**Wait until:**
85+
- Scheduled replica count = 0
86+
- All volumes show `robustness: healthy`
87+
- No replica rebuilding in progress
88+
89+
This may take **10-30 minutes** depending on data size.
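
Rather than re-running the count by hand, the wait can be scripted. The following is a minimal sketch, not part of the original procedure: `wait_for_zero` and `scheduled_replica_counts` are hypothetical helper names, and the wrapper assumes `NODE_NAME` is set and that `kubectl` and `jq` are available. It sums the per-disk counts and treats an empty result as an error, so a failed query is never mistaken for "no replicas left".

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a command until the sum of the numbers it prints is 0.
# Usage: wait_for_zero <interval_seconds> <max_attempts> <command> [args...]
# Returns 0 when the sum reaches 0, 1 on timeout, 2 if the command fails or prints nothing.
wait_for_zero() {
  local interval=$1 max_attempts=$2
  shift 2
  local attempt=1 output total
  while [ "$attempt" -le "$max_attempts" ]; do
    if ! output=$("$@") || [ -z "$output" ]; then
      echo "Query failed or returned nothing; check the node manually" >&2
      return 2
    fi
    # One number per disk is expected, so add them up.
    total=$(printf '%s\n' "$output" | awk '{ sum += $1 } END { print sum + 0 }')
    if [ "$total" -eq 0 ]; then
      echo "No scheduled replicas remain (attempt $attempt)"
      return 0
    fi
    echo "Attempt $attempt/$max_attempts: $total replica(s) still scheduled"
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for replicas to migrate" >&2
  return 1
}

# Assumed wrapper around the replica-count query used in this step.
scheduled_replica_counts() {
  kubectl get nodes.longhorn.io "$NODE_NAME" -n longhorn-system \
    -o jsonpath='{.status.diskStatus.*.scheduledReplica}' | jq 'length'
}

# Example (requires cluster access): poll every 30s for up to 60 attempts.
#   wait_for_zero 30 60 scheduled_replica_counts
```

A zero count only covers the first condition in the list above; volume robustness and rebuild status still need to be checked separately.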
90+
91+
### Step 5: Remove Node from Longhorn
92+
93+
Only after ALL replicas are migrated:
94+
95+
```bash
96+
kubectl delete node.longhorn.io $NODE_NAME -n longhorn-system
97+
```
98+
99+
### Step 6: Drain and Remove Kubernetes Node
100+
101+
Now it's safe to remove the node from Kubernetes:
102+
103+
```bash
104+
# Drain the node (this will evict all pods)
105+
kubectl drain $NODE_NAME --ignore-daemonsets --delete-emptydir-data --timeout=10m
106+
107+
# Verify no pods are running (except DaemonSets)
108+
kubectl get pods --all-namespaces --field-selector spec.nodeName=$NODE_NAME
109+
110+
# Delete the node
111+
kubectl delete node $NODE_NAME
112+
```
113+
114+
### Step 7: Verify Cluster Health
115+
116+
```bash
117+
# Check all nodes
118+
kubectl get nodes
119+
120+
# Verify all volumes are healthy
121+
kubectl get volumes -n longhorn-system | grep -v healthy
122+
123+
# Check for any faulted volumes (should be none)
124+
kubectl get volumes -n longhorn-system | grep faulted
125+
```
126+
127+
---
128+
129+
## Emergency Recovery (If Node Already Removed)
130+
131+
If you've already removed a node and have faulted volumes:
132+
133+
### 1. Identify Faulted Volumes
134+
135+
```bash
136+
kubectl get volumes -n longhorn-system -o json | \
137+
jq -r '.items[] | select(.status.robustness=="faulted") |
138+
{name: .metadata.name, state: .status.state, pv: .status.kubernetesStatus.pvName}'
139+
```
140+
141+
### 2. For Detached Faulted Volumes (Safe to Delete)
142+
143+
```bash
144+
# List them
145+
kubectl get volumes -n longhorn-system -o json | \
146+
jq -r '.items[] | select(.status.state=="detached" and .status.robustness=="faulted") | .metadata.name'
147+
148+
# Delete them (they're not attached to any workload; deletion permanently discards their data)
149+
for vol in $(kubectl get volumes -n longhorn-system -o json | \
150+
jq -r '.items[] | select(.status.state=="detached" and .status.robustness=="faulted") | .metadata.name'); do
151+
echo "Deleting faulted volume: $vol"
152+
kubectl delete volume $vol -n longhorn-system
153+
done
154+
```
155+
156+
### 3. For Attached Faulted Volumes (More Complex)
157+
158+
These require manual intervention:
159+
160+
```bash
161+
# Find the PVC using the volume
162+
PV_NAME="pvc-xxxxx"
163+
kubectl get pvc -A -o json | \
164+
jq -r --arg pv "$PV_NAME" '.items[] | select(.spec.volumeName==$pv) |
165+
{namespace: .metadata.namespace, name: .metadata.name, status: .status.phase}'
166+
```
167+
168+
If PVC is `Terminating`:
169+
170+
```bash
171+
# Remove finalizers
172+
kubectl patch pvc <pvc-name> -n <namespace> \
173+
-p '{"metadata":{"finalizers":null}}' --type=merge
174+
175+
# Delete the pod using it
176+
kubectl delete pod <pod-name> -n <namespace>
177+
```
178+
179+
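If the pod belongs to a StatefulSet with a `volumeClaimTemplates` entry, the controller recreates both the pod and a new, empty PVC. For other workloads, recreate the PVC yourself (or let ArgoCD sync it) before the pod can start. Either way, data on the faulted volume is not carried over.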
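
To see every PVC that is stuck before patching finalizers, one option is to filter the plain `kubectl get pvc` listing. This is a minimal sketch under stated assumptions: `list_terminating_pvcs` is a hypothetical name, and the filter relies on the default column order printed by `kubectl get pvc -A --no-headers` (namespace, name, status).

```bash
#!/usr/bin/env bash
# Hypothetical filter: reads `kubectl get pvc -A --no-headers` output on stdin
# and prints "<namespace>/<name>" for every PVC whose STATUS column is Terminating.
list_terminating_pvcs() {
  awk '$3 == "Terminating" { print $1 "/" $2 }'
}

# Example (requires cluster access):
#   kubectl get pvc -A --no-headers | list_terminating_pvcs
```

Reviewing this list first keeps the finalizer patch limited to PVCs that are actually stuck.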
180+
181+
---
182+
183+
## Monitoring and Alerting
184+
185+
### Check Longhorn Health Regularly
186+
187+
Add this to your maintenance routine:
188+
189+
```bash
190+
#!/bin/bash
191+
# longhorn-health-check.sh
192+
193+
echo "=== Longhorn Nodes ==="
194+
kubectl get nodes.longhorn.io -n longhorn-system
195+
196+
echo -e "\n=== Faulted Volumes ==="
197+
kubectl get volumes -n longhorn-system | grep faulted || echo "None"
198+
199+
echo -e "\n=== Degraded Volumes ==="
200+
kubectl get volumes -n longhorn-system | grep degraded || echo "None"
201+
202+
echo -e "\n=== Storage Capacity ==="
203+
kubectl get nodes.longhorn.io -n longhorn-system \
204+
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.diskStatus.*.storageAvailable}{"\t/\t"}{.status.diskStatus.*.storageMaximum}{"\n"}{end}' | \
205+
awk '{printf "%s\t%.2f GB / %.2f GB\n", $1, $2/1e9, $4/1e9}'
206+
```
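
The storage capacity figures can also be compared against the 25% free-space target used elsewhere in this document. The sketch below is illustrative only: `low_space_nodes` is a hypothetical name, and it assumes one disk per node so that each input line has exactly three fields (node name, available bytes, maximum bytes).

```bash
#!/usr/bin/env bash
# Hypothetical filter: reads "<node> <available_bytes> <maximum_bytes>" lines on stdin
# and prints the nodes whose free space is below a minimum percentage (default 25).
low_space_nodes() {
  local min_pct=${1:-25}
  awk -v min="$min_pct" '
    $3 > 0 {
      pct = ($2 / $3) * 100
      if (pct < min) printf "%s\t%.1f%% free\n", $1, pct
    }'
}

# Example (requires cluster access; assumes a single disk per node):
#   kubectl get nodes.longhorn.io -n longhorn-system \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.diskStatus.*.storageAvailable}{" "}{.status.diskStatus.*.storageMaximum}{"\n"}{end}' \
#     | low_space_nodes 25
```

Nodes with several disks would need the per-disk values split out first, since the wildcard jsonpath prints them on one line.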
207+
208+
### Prometheus Alerts
209+
210+
Your [monitoring/prometheus-stack/longhorn-backup-alerts.yaml](../monitoring/prometheus-stack/longhorn-backup-alerts.yaml) should include:
211+
212+
```yaml
213+
apiVersion: monitoring.coreos.com/v1
214+
kind: PrometheusRule
215+
metadata:
216+
name: longhorn-health-alerts
217+
namespace: prometheus-stack
218+
labels:
219+
release: kube-prometheus-stack
220+
spec:
221+
groups:
222+
- name: longhorn.health
223+
interval: 30s
224+
rules:
225+
- alert: LonghornVolumeFaulted
226+
expr: longhorn_volume_robustness == 3
227+
for: 5m
228+
labels:
229+
severity: critical
230+
annotations:
231+
summary: "Longhorn volume {{ $labels.volume }} is faulted"
232+
description: "Volume has been in faulted state for 5 minutes - data may be lost"
233+
234+
- alert: LonghornNodeDown
235+
expr: longhorn_node_status{condition="ready"} == 0
236+
for: 10m
237+
labels:
238+
severity: warning
239+
annotations:
240+
summary: "Longhorn node {{ $labels.node }} is down"
241+
description: "Check node health and migrate replicas if needed"
242+
243+
- alert: LonghornDiskSpaceLow
244+
expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) > 0.75
245+
for: 15m
246+
labels:
247+
severity: warning
248+
annotations:
249+
summary: "Longhorn disk space low on {{ $labels.node }}"
250+
description: "Disk usage is above 75% - consider adding storage or cleaning up"
251+
```
252+
253+
---
254+
255+
## Best Practices Summary
256+
257+
### DO:
258+
✅ Always disable scheduling before removing a node
259+
✅ Request eviction and wait for migration to complete
260+
✅ Monitor replica migration progress
261+
✅ Verify all volumes are healthy before final removal
262+
✅ Keep at least 25% free space on Longhorn disks
263+
✅ Use replica count of 3 for critical data (already configured)
264+
✅ Test your backup/restore procedures regularly
265+
266+
### DON'T:
267+
❌ Remove a Kubernetes node without evacuating Longhorn first
268+
❌ Force delete nodes with `kubectl delete node --force`
269+
❌ Ignore faulted or degraded volumes
270+
❌ Let disk space drop below 25% available
271+
❌ Skip the monitoring step during migration
272+
273+
---
274+
275+
## Your Current Configuration
276+
277+
**Replica Count:** 3 (good for resilience)
278+
**Storage Reserved:** 25% (prevents "insufficient storage" errors)
279+
**Auto-balance:** best-effort (distributes replicas evenly)
280+
**Node Down Policy:** delete-both-statefulset-and-deployment-pod (auto-recovery)
281+
**Orphan Auto-Delete:** true (cleans up orphaned replica data left on disks)
282+
283+
These settings are now in your Git repo and will be applied automatically.
284+
285+
---
286+
287+
## Quick Reference
288+
289+
```bash
290+
# Check if node is safe to remove
291+
kubectl get nodes.longhorn.io <node> -n longhorn-system -o jsonpath='{.status.diskStatus.*.scheduledReplica}' | jq '. | length'
292+
293+
# If output is 0, node is empty and safe to remove
294+
# If output is > 0, follow the evacuation procedure above
295+
```
296+
297+
**Remember:** Patience during replica migration saves hours of recovery work!

infrastructure/networking/cilium/kustomization.yaml

Lines changed: 7 additions & 5 deletions
@@ -4,10 +4,12 @@ kind: Kustomization
44
# kubePrism (localhost:7445) handles control plane HA via Omni's SideroLink
55
# Only need service LoadBalancer resources for applications
66

7-
resources:
8-
- ip-pool.yaml
9-
- l2-policy.yaml
10-
# REMOVED: l2-announcement.yaml (duplicate of l2-policy.yaml - was causing ARP conflicts)
7+
# NOTE: Apply ip-pool.yaml and l2-policy.yaml AFTER Cilium is running
8+
# These require Cilium CRDs to be installed first
9+
10+
resources:
11+
- ip-pool.yaml
12+
- l2-policy.yaml
1113

1214
helmCharts:
1315
- name: cilium
@@ -19,4 +21,4 @@ helmCharts:
1921
valuesFile: values.yaml
2022

2123
generatorOptions:
22-
disableNameSuffixHash: true
24+
disableNameSuffixHash: true
