Commit 1c6a776

switch to omni
1 parent f09c3e2 commit 1c6a776

15 files changed

Lines changed: 1495 additions & 160 deletions

README.md

Lines changed: 1 addition & 31 deletions
@@ -20,7 +20,6 @@ A GitOps-driven Kubernetes cluster using **Talos OS** (secure, immutable Linux f
 - [MinIO S3 Backup Configuration](#-minio-s3-backup-configuration)
 - [Documentation](#-documentation)
 - [Troubleshooting](#-troubleshooting)
-- [Upgrade](#upgrade)

 ## 📋 Prerequisites

@@ -439,33 +438,4 @@ The patterns and structure remain the same - this is **production-grade GitOps**

 ## 📜 License

-MIT License - See [LICENSE](LICENSE) for details
-
-## Upgrade
-
-This repo includes a guided, repeatable process to upgrade Longhorn to v1.10.x safely.
-
-- Read the runbook: `docs/runbooks/longhorn-1.10-upgrade.md`
-- Key steps:
-  - Normalize CRD conversion spec (older installs may leave webhook fields)
-  - Migrate all Longhorn CRDs to stored version `v1beta2` (mandatory for v1.10)
-  - Sync the Longhorn Helm release via ArgoCD and validate
-
-Quick commands from repo root:
-
-```bash
-# 1) Fix legacy CRD conversion blocks atomically
-./scripts/longhorn-fix-crd-conversion.sh
-
-# 2) Migrate CRD storedVersions to v1beta2 (safe to re-run)
-./scripts/longhorn-v110-crd-migration.sh
-
-# 3) Verify only v1beta2 is present
-kubectl get crd -l app.kubernetes.io/name=longhorn -o=jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.storedVersions}{"\n"}{end}'
-
-# 4) Re-sync Longhorn in ArgoCD and verify pods in longhorn-system
-```
-
-Notes:
-- The chart is pinned in `infrastructure/storage/longhorn/kustomization.yaml` and values in `infrastructure/storage/longhorn/values.yaml`.
-- We avoid per-engine JSON booleans in values to sidestep a known 1.10.0 parsing issue; revisit when broadly enabling the V2 data engine.
+MIT License - See [LICENSE](LICENSE) for details

docs/CILIUM-QUICKSTART.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Quick Start: Install Cilium on Omni Cluster

## Current Status
- ✅ Cluster managed by Omni (192.168.10.15 / omni.vanillax.me)
- ✅ Nodes are up but NotReady (no CNI)
- ✅ Ready to install Cilium

## One-Command Install

```bash
# Install Gateway API CRDs first
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/experimental-install.yaml

# Install Cilium with VIP configuration
kubectl kustomize infrastructure/networking/cilium --enable-helm | kubectl apply -f -

# Watch it come up (takes 2-3 minutes)
kubectl get pods -n kube-system -l k8s-app=cilium -w
```
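
For a scripted install, the interactive watch above can be swapped for blocking waits. The following is a minimal sketch, assuming the agent DaemonSet is named `cilium` in `kube-system` (the name the `ds/cilium` commands in this guide rely on). The function name `wait_for_cni` and the default timeout are illustrative and not part of the repo.

```bash
# Hypothetical helper: block until the Cilium agent is rolled out and nodes are Ready.
# Runs in a subshell so its variables do not leak into the caller.
wait_for_cni() (
  timeout=${1:-300s}
  # rollout status exits non-zero if the DaemonSet is not fully rolled out in time
  kubectl rollout status daemonset/cilium -n kube-system --timeout="$timeout" || exit 1
  # Nodes only report Ready once the CNI is functional
  kubectl wait --for=condition=Ready nodes --all --timeout="$timeout" || exit 1
  echo "cni-ready"
)
```

Calling `wait_for_cni 180s` right after the `kubectl apply` step would fail fast instead of leaving a watch open.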

## Quick Verification

```bash
# 1. Check Cilium pods
kubectl get pods -n kube-system | grep cilium

# 2. Verify nodes are Ready
kubectl get nodes

# 3. Check Cilium status
kubectl exec -n kube-system ds/cilium -- cilium-dbg status --brief
```

## Correct Config for Omni! ✅

Your `infrastructure/networking/cilium/values.yaml` has been updated with the right settings:

```yaml
# ✅ kubePrism handles control plane HA via Omni's SideroLink
k8sServiceHost: localhost
k8sServicePort: 7445

# ✅ Native routing for better performance (same L2 network)
routingMode: native
ipv4NativeRoutingCIDR: 10.14.0.0/16

# ✅ L2 announcements for service LoadBalancers
l2announcements:
  enabled: true

# ✅ Updated cluster name
cluster:
  name: talos-proxmox-prod
```

**Why kubePrism?** It runs on every node and automatically load balances API requests to all 3 control planes via Omni's SideroLink network. This is the Talos/Omni way!

## What Happens

1. **Gateway API CRDs** installed → Cilium can use Gateway API
2. **Cilium Helm chart** deployed → CNI, operator, hubble all start
3. **Cilium connects via kubePrism** → localhost:7445 load balances to all 3 control planes
4. **L2 announcements** enabled → For service LoadBalancers
5. **Nodes become Ready** → CNI is working, pods can schedule

## After Installation

### Verify Installation

1. Check Cilium is using kubePrism:
   ```bash
   kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i k8s
   # Should show: localhost:7445
   ```

2. Open Omni UI: http://192.168.10.15 or https://omni.vanillax.me
3. Verify all 3 control plane nodes are healthy
4. kubePrism on each node automatically load balances to all control planes!

### Bootstrap ArgoCD

```bash
# Once nodes are Ready, bootstrap GitOps
kustomize build infrastructure/controllers/argocd --enable-helm | kubectl apply -f -
kubectl wait --for condition=established --timeout=60s crd/applications.argoproj.io
kubectl wait --for=condition=Available deployment/argocd-server -n argocd --timeout=300s
kubectl apply -f infrastructure/controllers/argocd/root.yaml
```

## Troubleshooting

### Cilium pods stuck in Init

**Check API connectivity**:
```bash
kubectl logs -n kube-system -l k8s-app=cilium --tail=20
```

**Fix**: Verify kubePrism is running on nodes:
```bash
talosctl --context omni -n <node-ip> service kubePrism
# Should show: STATE: Running
```

### Nodes still NotReady

**Check Cilium status**:
```bash
kubectl exec -n kube-system ds/cilium -- cilium-dbg status
```

**Verify native routing**:
```bash
kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i "routing mode"
# Should show: native
```

## Summary

Your Cilium configuration is **ready for Omni**! The key settings:

- `k8sServiceHost: localhost` (kubePrism handles control plane HA)
- `k8sServicePort: 7445` (kubePrism port)
- `routingMode: native` (better performance on same L2 network)
- `ipv4NativeRoutingCIDR: 10.14.0.0/16` (pod CIDR specified)
- `cluster.name: talos-proxmox-prod` (updated name)
- ✅ L2 announcements for service LoadBalancers
- ✅ Removed control plane VIP resources (kubePrism handles this)

**kubePrism FTW!** It automatically load balances API requests to all 3 control planes via Omni's SideroLink. 🚀

docs/CILIUM-SUCCESS.md

Lines changed: 215 additions & 0 deletions
@@ -0,0 +1,215 @@
# ✅ Cilium Successfully Installed!

## Installation Summary

**Date**: October 12, 2025
**Cilium Version**: 1.18.2
**Status**: ✅ **SUCCESS**

## Verification Results

### ✅ All Nodes Ready
```
NAME            STATUS   ROLES           AGE   VERSION
talos-071-5jz   Ready    control-plane   33m   v1.34.1
talos-971-dpt   Ready    control-plane   33m   v1.34.1
talos-c7r-dgh   Ready    control-plane   33m   v1.34.1
talos-blj-72f   Ready    <none>          32m   v1.34.1
talos-kyk-7ek   Ready    <none>          32m   v1.34.1
talos-o31-0s1   Ready    <none>          32m   v1.34.1
talos-w4s-zts   Ready    <none>          32m   v1.34.1
```

**3 Control Plane Nodes + 4 Worker Nodes = 7 Total** 🎯
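
The tally above can also be checked mechanically. This is a small sketch that reads `kubectl get nodes --no-headers` output on stdin and counts Ready nodes by role. `count_nodes` is a hypothetical helper name, and the sketch assumes the default column order shown in the listing (NAME, STATUS, ROLES, ...).

```bash
# Hypothetical helper: summarize Ready nodes by role from `kubectl get nodes --no-headers`.
# A node whose STATUS is anything other than exactly "Ready"
# (for example "Ready,SchedulingDisabled") is not counted.
count_nodes() {
  awk '
    $2 == "Ready" { ready++ }
    $2 == "Ready" && $3 ~ /control-plane/ { cp++ }
    END { printf "ready=%d control-plane=%d workers=%d\n", ready, cp, ready - cp }
  '
}

# Usage against a live cluster:
# kubectl get nodes --no-headers | count_nodes
```

Fed the seven lines from the listing, the summary should agree with the 3 + 4 split stated above.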

### ✅ Cilium Pods Running
```
- cilium DaemonSet: 7/7 pods Running
- cilium-envoy DaemonSet: 7/7 pods Running
- cilium-operator: 1/1 Running
- hubble-relay: Running
- hubble-ui: 2/2 Running
```

### ✅ Cilium Status: OK

**Key Configuration Verified**:
- **Routing Mode**: Native (better performance!)
- **kube-proxy Replacement**: True
- **API Connectivity**: localhost:7445 (kubePrism) ✨
- **Masquerading**: BPF (10.14.0.0/16)
- **Pod CIDR**: 10.14.0.0/16
- **Gateway API**: Enabled
- **Hubble**: OK (observability ready)
- **Cluster Health**: 6/7 reachable (normal during initial sync)

## What's Working

1. **CNI Operational** - All nodes have network connectivity
2. **Native Routing** - Direct pod-to-pod communication (no tunneling overhead)
3. **kubePrism Load Balancing** - API requests balanced across 3 control planes
4. **kube-proxy Replacement** - Cilium handling all service load balancing
5. **Hubble Observability** - Network visibility and monitoring ready
6. **Gateway API Support** - Ready for modern ingress/routing
7. **L2 Announcements** - LoadBalancer services will get IPs from pool

## Network Details

- **Cluster Pod CIDR**: 10.14.0.0/16
- **Service CIDR**: 10.15.0.0/16 (from cluster config)
- **LoadBalancer IP Pool**: 192.168.10.50-192.168.10.99 (for services)
- **Control Plane Access**: Via kubePrism at localhost:7445
- **Routing Mode**: Native (same L2 network)
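
When reading logs or Hubble flows it helps to know which of these ranges an address belongs to. The sketch below classifies an IPv4 address against the three ranges listed above using plain integer arithmetic. The helper names (`ip_to_int`, `in_cidr`, `classify_ip`) are hypothetical, and the ranges are hard-coded from this section rather than read from the cluster.

```bash
# Convert a dotted-quad IPv4 address to an integer (subshell keeps IFS change local).
ip_to_int() (
  IFS=.
  set -- $1   # split "a.b.c.d" into $1..$4
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if address $1 lies inside CIDR $2, e.g. in_cidr 10.14.3.7 10.14.0.0/16
in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
  [ $(( $ip & $mask )) -eq $(( $net & $mask )) ]
)

# Classify an address against the ranges from "Network Details".
classify_ip() (
  n=$(ip_to_int "$1")
  lo=$(ip_to_int 192.168.10.50)
  hi=$(ip_to_int 192.168.10.99)
  if in_cidr "$1" 10.14.0.0/16; then echo pod
  elif in_cidr "$1" 10.15.0.0/16; then echo service
  elif [ "$n" -ge "$lo" ] && [ "$n" -le "$hi" ]; then echo loadbalancer
  else echo other
  fi
)
```

For example, `classify_ip 192.168.10.15` reports `other`, since the Omni address sits outside the LoadBalancer pool.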

## Next Steps

### 1. Verify Gateway API CRDs

```bash
kubectl get crd | grep gateway
```

If not installed yet:
```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/experimental-install.yaml
```

### 2. Bootstrap ArgoCD

Now that CNI is working and nodes are Ready, deploy the GitOps stack:

```bash
cd /Users/mitchross/Documents/Programming/k3s-argocd-proxmox

# Bootstrap ArgoCD
kustomize build infrastructure/controllers/argocd --enable-helm | kubectl apply -f -

# Wait for CRDs
kubectl wait --for condition=established --timeout=60s crd/applications.argoproj.io

# Wait for ArgoCD server
kubectl wait --for=condition=Available deployment/argocd-server -n argocd --timeout=300s

# Apply root application (starts GitOps self-management)
kubectl apply -f infrastructure/controllers/argocd/root.yaml

# Watch applications sync
kubectl get applications -n argocd -w
```
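
Watching the application list works interactively, but a one-shot check is easier to script. The following is a rough sketch that prints every Application that is not both `Synced` and `Healthy`. The helper name `unsynced_apps` is hypothetical, and the status field paths in the usage comment assume the standard Argo CD `Application` resource.

```bash
# Hypothetical helper: print Applications that are not both Synced and Healthy.
# Input: tab-separated "name<TAB>sync-status<TAB>health-status" lines on stdin.
unsynced_apps() {
  awk -F '\t' '$2 != "Synced" || $3 != "Healthy" { print $1 }'
}

# Usage against a live cluster:
# kubectl get applications -n argocd \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.sync.status}{"\t"}{.status.health.status}{"\n"}{end}' \
#   | unsynced_apps
```

Empty output means every Application has converged; any printed name is a candidate for a closer look in the ArgoCD UI.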

### 3. Test LoadBalancer IP Pool

Create a test service to verify L2 announcements work:

```bash
# Create test deployment
kubectl create deployment nginx --image=nginx --replicas=2

# Expose as LoadBalancer
kubectl expose deployment nginx --port=80 --type=LoadBalancer

# Check if it gets an IP from pool (192.168.10.50-99)
kubectl get svc nginx -w
```
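
The same test can run unattended by polling the Service instead of watching it. This is a sketch only: `wait_for_lb_ip` is a hypothetical helper, the retry count and delay are arbitrary defaults, and it assumes the Service lives in the current namespace as in the commands above.

```bash
# Hypothetical helper: poll until the Service reports a LoadBalancer IP, or give up.
# Usage: wait_for_lb_ip <service> [tries] [delay-seconds]
wait_for_lb_ip() (
  svc=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Empty output means no address has been assigned yet
    ip=$(kubectl get svc "$svc" -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)
    if [ -n "$ip" ]; then
      echo "$ip"
      exit 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no LoadBalancer IP for $svc after $tries attempts" >&2
  exit 1
)
```

Once `wait_for_lb_ip nginx` prints an address, `kubectl delete svc,deployment nginx` removes the test resources again.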

### 4. Access Hubble UI (Optional)

```bash
# Port forward to Hubble UI
kubectl port-forward -n kube-system svc/hubble-ui 8080:80

# Open in browser: http://localhost:8080
```

## Monitoring

### Check Cilium Health
```bash
kubectl exec -n kube-system ds/cilium -- cilium-dbg status --brief
```

### View Hubble Flows (Network Traffic)
```bash
kubectl exec -n kube-system ds/cilium -- hubble observe --follow
```

### Check LoadBalancer IP Pools
```bash
kubectl get ciliumloadbalancerippool -n kube-system
```

### Check L2 Announcement Policies
```bash
kubectl get ciliuml2announcementpolicy -n kube-system
```

## Configuration Files Used

- `infrastructure/networking/cilium/values.yaml`
  - Cluster: talos-proxmox-prod
  - Routing: native
  - API: localhost:7445 (kubePrism)
  - Pod CIDR: 10.14.0.0/16

- `infrastructure/networking/cilium/ip-pool.yaml`
  - LoadBalancer IPs: 192.168.10.50-192.168.10.99

- `infrastructure/networking/cilium/l2-policy.yaml`
  - L2 announcements for services

## Troubleshooting Commands

If you encounter issues:

```bash
# Check Cilium logs
kubectl logs -n kube-system ds/cilium --tail=50

# Check Cilium operator logs
kubectl logs -n kube-system deployment/cilium-operator --tail=50

# Verify node connectivity
kubectl exec -n kube-system ds/cilium -- cilium-dbg node list

# Check BPF maps
kubectl exec -n kube-system ds/cilium -- cilium-dbg bpf lb list

# Verify routing
kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i routing
```
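
When an issue needs to be shared or compared over time, it can help to capture these outputs in one place. A minimal sketch follows, assuming the same `ds/cilium` and `deployment/cilium-operator` names used above. The function name `collect_cilium_diag`, the file names, and the `--tail=200` value are illustrative choices.

```bash
# Hypothetical helper: save the troubleshooting outputs above into a directory.
# Usage: collect_cilium_diag [output-dir]   (default: ./cilium-diag)
collect_cilium_diag() (
  out_dir=${1:-cilium-diag}
  mkdir -p "$out_dir" || exit 1
  # Each command writes stdout and stderr to its own file, so one failure
  # does not stop the remaining captures.
  kubectl logs -n kube-system ds/cilium --tail=200 > "$out_dir/agent.log" 2>&1
  kubectl logs -n kube-system deployment/cilium-operator --tail=200 > "$out_dir/operator.log" 2>&1
  kubectl exec -n kube-system ds/cilium -- cilium-dbg status > "$out_dir/status.txt" 2>&1
  kubectl exec -n kube-system ds/cilium -- cilium-dbg node list > "$out_dir/nodes.txt" 2>&1
  echo "$out_dir"
)
```

Running `collect_cilium_diag "diag-$(date +%Y%m%d-%H%M%S)"` keeps each capture in its own timestamped directory.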

## Success Metrics

- **All 7 nodes**: Ready
- **Cilium pods**: 7/7 Running
- **Cilium status**: OK
- **Routing mode**: Native ✨
- **API connectivity**: kubePrism ✨
- **Hubble**: Operational
- **Controller health**: 29/29

## What Made This Work

1. **kubePrism** - Used localhost:7445 for API access (correct for Omni!)
2. **Native routing** - Better performance on same L2 network
3. **Correct Pod CIDR** - 10.14.0.0/16 specified for native mode
4. **Clean config** - Removed unnecessary control plane VIP resources

## Congratulations! 🎉

Your Talos cluster with Omni management now has:
- ✅ Full CNI functionality via Cilium
- ✅ High-performance native routing
- ✅ Control plane HA via kubePrism
- ✅ Network observability via Hubble
- ✅ Ready for production workloads

**Time to deploy your applications!** 🚀

---

**Cluster Name**: talos-proxmox-prod
**Management**: Sidero Omni (192.168.10.15)
**CNI**: Cilium 1.18.2
**Status**: Production Ready ✅
