mitchross
diff --git a/README.md b/README.md
Lines changed: 1 addition & 31 deletions
diff --git a/docs/CILIUM-QUICKSTART.md b/docs/CILIUM-QUICKSTART.md
Lines changed: 131 additions & 0 deletions
diff --git a/docs/CILIUM-SUCCESS.md b/docs/CILIUM-SUCCESS.md
Lines changed: 215 additions & 0 deletions
@@ -20,7 +20,6 @@ A GitOps-driven Kubernetes cluster using **Talos OS** (secure, immutable Linux f
 - [MinIO S3 Backup Configuration](#-minio-s3-backup-configuration)
 - [Documentation](#-documentation)
 - [Troubleshooting](#-troubleshooting)
-- [Upgrade](#upgrade)
 
 ## 📋 Prerequisites
 
@@ -439,33 +438,4 @@ The patterns and structure remain the same - this is **production-grade GitOps**
 
 ## 📜 License
 
-MIT License - See [LICENSE](LICENSE) for details
-
-## Upgrade
-
-This repo includes a guided, repeatable process to upgrade Longhorn to v1.10.x safely.
-
-- Read the runbook: `docs/runbooks/longhorn-1.10-upgrade.md`
-- Key steps:
-  - Normalize CRD conversion spec (older installs may leave webhook fields)
-  - Migrate all Longhorn CRDs to stored version `v1beta2` (mandatory for v1.10)
-  - Sync the Longhorn Helm release via ArgoCD and validate
-
-Quick commands from repo root:
-
-```bash
-# 1) Fix legacy CRD conversion blocks atomically
-./scripts/longhorn-fix-crd-conversion.sh
-
-# 2) Migrate CRD storedVersions to v1beta2 (safe to re-run)
-./scripts/longhorn-v110-crd-migration.sh
-
-# 3) Verify only v1beta2 is present
-kubectl get crd -l app.kubernetes.io/name=longhorn -o=jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.storedVersions}{"\n"}{end}'
-
-# 4) Re-sync Longhorn in ArgoCD and verify pods in longhorn-system
-```
-
-Notes:
-- The chart is pinned in `infrastructure/storage/longhorn/kustomization.yaml` and values in `infrastructure/storage/longhorn/values.yaml`.
-- We avoid per-engine JSON booleans in values to sidestep a known 1.10.0 parsing issue; revisit when broadly enabling the V2 data engine.
+MIT License - See [LICENSE](LICENSE) for details
@@ -0,0 +1,131 @@
+# Quick Start: Install Cilium on Omni Cluster
+
+## Current Status
+- ✅ Cluster managed by Omni (192.168.10.15 / omni.vanillax.me)
+- ✅ Nodes are up but NotReady (no CNI)
+- ✅ Ready to install Cilium
+
+## One-Command Install
+
+```bash
+# Install Gateway API CRDs first
+kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
+kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/experimental-install.yaml
+
+# Install Cilium (Helm chart rendered through Kustomize; no control plane VIP needed)
+kubectl kustomize infrastructure/networking/cilium --enable-helm | kubectl apply -f -
+
+# Watch it come up (takes 2-3 minutes)
+kubectl get pods -n kube-system -l k8s-app=cilium -w
+```
+
+## Quick Verification
+
+```bash
+# 1. Check Cilium pods
+kubectl get pods -n kube-system | grep cilium
+
+# 2. Verify nodes are Ready
+kubectl get nodes
+
+# 3. Check Cilium status
+kubectl exec -n kube-system ds/cilium -- cilium-dbg status --brief
+```
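The node check in step 2 can be scripted. This is a minimal sketch, assuming the default column layout of `kubectl get nodes --no-headers` (name first, status second). `count_not_ready` is a hypothetical helper name, not something shipped in this repo.

```shell
# Hypothetical helper: count nodes whose STATUS column is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
# Note: cordoned nodes ("Ready,SchedulingDisabled") are counted as not ready.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Usage (assumes kubectl is pointed at the cluster):
#   kubectl get nodes --no-headers | count_not_ready
```

A result of `0` means every node reports Ready; anything else suggests the CNI has not finished coming up on some node.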
+
+## Correct Config for Omni! ✅
+
+Your `infrastructure/networking/cilium/values.yaml` has been updated with the right settings:
+
+```yaml
+# ✅ kubePrism (Talos' per-node API load balancer) handles control plane HA
+k8sServiceHost: localhost
+k8sServicePort: 7445
+
+# ✅ Native routing for better performance (same L2 network)
+routingMode: native
+ipv4NativeRoutingCIDR: 10.14.0.0/16
+
+# ✅ L2 announcements for service LoadBalancers
+l2announcements:
+  enabled: true
+
+# ✅ Updated cluster name
+cluster:
+  name: talos-proxmox-prod
+```
+
+**Why kubePrism?** It is a Talos feature that runs on every node and load balances API requests sent to `localhost:7445` across all 3 control plane nodes, so Cilium needs no external VIP to reach the API server. This is the Talos/Omni way!
+
+## What Happens
+
+1. **Gateway API CRDs** installed → Cilium can use Gateway API
+2. **Cilium Helm chart** deployed → CNI, operator, hubble all start
+3. **Cilium connects via kubePrism** → localhost:7445 load balances to all 3 control planes
+4. **L2 announcements** enabled → For service LoadBalancers
+5. **Nodes become Ready** → CNI is working, pods can schedule
+
+## After Installation
+
+### Verify Installation
+
+1. Check Cilium is using kubePrism:
+   ```bash
+   kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i k8s
+   # Should show: localhost:7445
+   ```
+
+2. Open Omni UI: http://192.168.10.15 or https://omni.vanillax.me
+3. Verify all 3 control plane nodes are healthy
+4. kubePrism on each node automatically load balances to all control planes!
+
+### Bootstrap ArgoCD
+
+```bash
+# Once nodes are Ready, bootstrap GitOps
+kustomize build infrastructure/controllers/argocd --enable-helm | kubectl apply -f -
+kubectl wait --for condition=established --timeout=60s crd/applications.argoproj.io
+kubectl wait --for=condition=Available deployment/argocd-server -n argocd --timeout=300s
+kubectl apply -f infrastructure/controllers/argocd/root.yaml
+```
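The bootstrap above assumes the nodes are already Ready. Here is a minimal sketch of gating it on that condition with `kubectl wait`. The `wait_nodes_ready` function and the `KUBECTL` override variable are hypothetical names introduced for illustration.

```shell
# Hypothetical gate: block until every node reports the Ready condition.
# KUBECTL can be overridden (e.g. for a dry run); it defaults to kubectl.
# $1 is the timeout passed to `kubectl wait` (default: 300s).
wait_nodes_ready() {
  "${KUBECTL:-kubectl}" wait --for=condition=Ready nodes --all --timeout="${1:-300s}"
}

# Usage: run the gate, then continue with the ArgoCD bootstrap commands.
#   wait_nodes_ready 300s && echo "nodes Ready, safe to bootstrap"
```

Running the gate first avoids applying the ArgoCD manifests while pods still cannot be scheduled.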
+
+## Troubleshooting
+
+### Cilium pods stuck in Init
+
+**Check API connectivity**:
+```bash
+kubectl logs -n kube-system -l k8s-app=cilium --tail=20
+```
+
+**Fix**: Verify kubePrism is running on nodes:
+```bash
+talosctl --context omni -n <node-ip> service kubePrism
+# Should show: STATE: Running
+```
+
+### Nodes still NotReady
+
+**Check Cilium status**:
+```bash
+kubectl exec -n kube-system ds/cilium -- cilium-dbg status
+```
+
+**Verify native routing**:
+```bash
+kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i routing
+# Should show: Network: Native
+```
+
+## Summary
+
+Your Cilium configuration is **ready for Omni**! The key settings:
+
+- ✅ `k8sServiceHost: localhost` (kubePrism handles control plane HA)
+- ✅ `k8sServicePort: 7445` (kubePrism port)
+- ✅ `routingMode: native` (better performance on same L2 network)
+- ✅ `ipv4NativeRoutingCIDR: 10.14.0.0/16` (pod CIDR specified)
+- ✅ `cluster.name: talos-proxmox-prod` (updated name)
+- ✅ L2 announcements for service LoadBalancers
+- ✅ Removed control plane VIP resources (kubePrism handles this)
+
+**kubePrism FTW!** It automatically load balances API requests from every node across all 3 control planes. 🚀
@@ -0,0 +1,215 @@
+# ✅ Cilium Successfully Installed!
+
+## Installation Summary
+
+**Date**: October 12, 2025  
+**Cilium Version**: 1.18.2  
+**Status**: ✅ **SUCCESS**
+
+## Verification Results
+
+### ✅ All Nodes Ready
+```
+NAME            STATUS   ROLES           AGE   VERSION
+talos-071-5jz   Ready    control-plane   33m   v1.34.1
+talos-971-dpt   Ready    control-plane   33m   v1.34.1
+talos-c7r-dgh   Ready    control-plane   33m   v1.34.1
+talos-blj-72f   Ready    <none>          32m   v1.34.1
+talos-kyk-7ek   Ready    <none>          32m   v1.34.1
+talos-o31-0s1   Ready    <none>          32m   v1.34.1
+talos-w4s-zts   Ready    <none>          32m   v1.34.1
+```
+
+**3 Control Plane Nodes + 4 Worker Nodes = 7 Total** 🎯
+
+### ✅ Cilium Pods Running
+```
+- cilium DaemonSet: 7/7 pods Running
+- cilium-envoy DaemonSet: 7/7 pods Running
+- cilium-operator: 1/1 Running
+- hubble-relay: Running
+- hubble-ui: 2/2 Running
+```
+
+### ✅ Cilium Status: OK
+
+**Key Configuration Verified**:
+- ✅ **Routing Mode**: Native (better performance!)
+- ✅ **kube-proxy Replacement**: True
+- ✅ **API Connectivity**: localhost:7445 (kubePrism) ✨
+- ✅ **Masquerading**: BPF (10.14.0.0/16)
+- ✅ **Pod CIDR**: 10.14.0.0/16
+- ✅ **Gateway API**: Enabled
+- ✅ **Hubble**: OK (observability ready)
+- ✅ **Cluster Health**: 6/7 reachable (normal during initial sync)
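The `6/7 reachable` figure should converge to `7/7` once the initial sync finishes. This is a minimal sketch for checking that, assuming the status output contains a line of the form `Cluster health: 6/7 reachable` (the exact layout is an assumption and may differ between Cilium versions). `all_reachable` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only when the "Cluster health" line reports
# N/N reachable. Reads status text on stdin; fails if no such line is found.
all_reachable() {
  awk '
    /Cluster health/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9]+\/[0-9]+$/) {
          split($i, a, "/")
          found = 1
          ok = (a[1] + 0 == a[2] + 0)
        }
    }
    END { exit !(found && ok) }
  '
}

# Usage (assumes kubectl is pointed at the cluster):
#   kubectl exec -n kube-system ds/cilium -- cilium-dbg status | all_reachable \
#     && echo "all nodes reachable"
```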
+
+## What's Working
+
+1. ✅ **CNI Operational** - All nodes have network connectivity
+2. ✅ **Native Routing** - Direct pod-to-pod communication (no tunneling overhead)
+3. ✅ **kubePrism Load Balancing** - API requests balanced across 3 control planes
+4. ✅ **kube-proxy Replacement** - Cilium handling all service load balancing
+5. ✅ **Hubble Observability** - Network visibility and monitoring ready
+6. ✅ **Gateway API Support** - Ready for modern ingress/routing
+7. ✅ **L2 Announcements** - LoadBalancer services will get IPs from pool
+
+## Network Details
+
+- **Cluster Pod CIDR**: 10.14.0.0/16
+- **Service CIDR**: 10.15.0.0/16 (from cluster config)
+- **LoadBalancer IP Pool**: 192.168.10.50-192.168.10.99 (for services)
+- **Control Plane Access**: Via kubePrism at localhost:7445
+- **Routing Mode**: Native (same L2 network)
+
+## Next Steps
+
+### 1. Verify Gateway API CRDs
+
+```bash
+kubectl get crd | grep gateway
+```
+
+If not installed yet:
+```bash
+kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
+kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/experimental-install.yaml
+```
+
+### 2. Bootstrap ArgoCD
+
+Now that CNI is working and nodes are Ready, deploy the GitOps stack:
+
+```bash
+cd /Users/mitchross/Documents/Programming/k3s-argocd-proxmox  # adjust to your local clone of this repo
+
+# Bootstrap ArgoCD
+kustomize build infrastructure/controllers/argocd --enable-helm | kubectl apply -f -
+
+# Wait for CRDs
+kubectl wait --for condition=established --timeout=60s crd/applications.argoproj.io
+
+# Wait for ArgoCD server
+kubectl wait --for=condition=Available deployment/argocd-server -n argocd --timeout=300s
+
+# Apply root application (starts GitOps self-management)
+kubectl apply -f infrastructure/controllers/argocd/root.yaml
+
+# Watch applications sync
+kubectl get applications -n argocd -w
+```
+
+### 3. Test LoadBalancer IP Pool
+
+Create a test service to verify L2 announcements work:
+
+```bash
+# Create test deployment
+kubectl create deployment nginx --image=nginx --replicas=2
+
+# Expose as LoadBalancer
+kubectl expose deployment nginx --port=80 --type=LoadBalancer
+
+# Check if it gets an IP from pool (192.168.10.50-99)
+kubectl get svc nginx -w
+```
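To confirm the assigned address really comes from the pool, the external IP can be compared against the range bounds. This is a minimal sketch in POSIX shell. `ip_to_int` and `in_pool` are hypothetical helpers, and the range is the one listed for this cluster (192.168.10.50-192.168.10.99).

```shell
# Convert a dotted-quad IPv4 address to an integer. The body runs in a
# subshell so the IFS change does not leak into the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed when IP ($1) lies within [START ($2), END ($3)], inclusive.
in_pool() {
  [ "$(ip_to_int "$1")" -ge "$(ip_to_int "$2")" ] &&
    [ "$(ip_to_int "$1")" -le "$(ip_to_int "$3")" ]
}

# Usage (assumes the nginx Service from the test above has an external IP):
#   ip=$(kubectl get svc nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
#   in_pool "$ip" 192.168.10.50 192.168.10.99 && echo "IP is from the pool"
```

An address outside the range would suggest the Service was served by a different pool or load balancer implementation than the one defined in `ip-pool.yaml`.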
+
+### 4. Access Hubble UI (Optional)
+
+```bash
+# Port forward to Hubble UI
+kubectl port-forward -n kube-system svc/hubble-ui 8080:80
+
+# Open in browser: http://localhost:8080
+```
+
+## Monitoring
+
+### Check Cilium Health
+```bash
+kubectl exec -n kube-system ds/cilium -- cilium-dbg status --brief
+```
+
+### View Hubble Flows (Network Traffic)
+```bash
+kubectl exec -n kube-system ds/cilium -- hubble observe --follow
+```
+
+### Check LoadBalancer IP Pools
+```bash
+kubectl get ciliumloadbalancerippool  # cluster-scoped resource, no namespace needed
+```
+
+### Check L2 Announcement Policies
+```bash
+kubectl get ciliuml2announcementpolicy  # cluster-scoped resource, no namespace needed
+```
+
+## Configuration Files Used
+
+- ✅ `infrastructure/networking/cilium/values.yaml`
+  - Cluster: talos-proxmox-prod
+  - Routing: native
+  - API: localhost:7445 (kubePrism)
+  - Pod CIDR: 10.14.0.0/16
+
+- ✅ `infrastructure/networking/cilium/ip-pool.yaml`
+  - LoadBalancer IPs: 192.168.10.50-192.168.10.99
+
+- ✅ `infrastructure/networking/cilium/l2-policy.yaml`
+  - L2 announcements for services
+
+## Troubleshooting Commands
+
+If you encounter issues:
+
+```bash
+# Check Cilium logs
+kubectl logs -n kube-system ds/cilium --tail=50
+
+# Check Cilium operator logs
+kubectl logs -n kube-system deployment/cilium-operator --tail=50
+
+# Verify node connectivity
+kubectl exec -n kube-system ds/cilium -- cilium-dbg node list
+
+# Check BPF maps
+kubectl exec -n kube-system ds/cilium -- cilium-dbg bpf lb list
+
+# Verify routing
+kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i routing
+```
+
+## Success Metrics
+
+- ✅ **All 7 nodes**: Ready
+- ✅ **Cilium pods**: 7/7 Running
+- ✅ **Cilium status**: OK
+- ✅ **Routing mode**: Native ✨
+- ✅ **API connectivity**: kubePrism ✨
+- ✅ **Hubble**: Operational
+- ✅ **Controller health**: 29/29
+
+## What Made This Work
+
+1. **kubePrism** - Used localhost:7445 for API access (correct for Omni!)
+2. **Native routing** - Better performance on same L2 network
+3. **Correct Pod CIDR** - 10.14.0.0/16 specified for native mode
+4. **Clean config** - Removed unnecessary control plane VIP resources
+
+## Congratulations! 🎉
+
+Your Talos cluster with Omni management now has:
+- ✅ Full CNI functionality via Cilium
+- ✅ High-performance native routing
+- ✅ Control plane HA via kubePrism
+- ✅ Network observability via Hubble
+- ✅ Ready for production workloads
+
+**Time to deploy your applications!** 🚀
+
+---
+
+**Cluster Name**: talos-proxmox-prod  
+**Management**: Sidero Omni (192.168.10.15)  
+**CNI**: Cilium 1.18.2  
+**Status**: Production Ready ✅