| 1 | +# Network Topology |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The cluster uses two separate networks: |
| 6 | +1. **Main LAN (192.168.10.0/24)** - 2.5G over switch - all cluster traffic, API, etc. |
| 7 | +2. **Storage Network (172.31.250.0/24)** - 10G DAC point-to-point - fast NFS/iSCSI to TrueNAS |
| 8 | + |
| 9 | +## Physical Topology |
| 10 | + |
| 11 | +``` |
| 12 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 13 | +│ NETWORK TOPOLOGY │ |
| 14 | +├─────────────────────────────────────────────────────────────────────────────┤ |
| 15 | +│ │ |
| 16 | +│ ┌─────────────────┐ 10G DAC (Direct) ┌─────────────────┐ │ |
| 17 | +│ │ Proxmox │◄───────────────────────────────►│ TrueNAS │ │ |
| 18 | +│ │ hp-server-1 │ 172.31.250.2/24 │ 192.168.10.133│ │ |
| 19 | +│ │ │ ↕ │ │ │ |
| 20 | +│ │ vmbr1 (eno49) │ 172.31.250.1/24 │ enp67s0 (10G) │ │ |
| 21 | +│ │ │ (no switch!) │ │ │ |
| 22 | +│ └────────┬────────┘ └────────┬────────┘ │ |
| 23 | +│ │ │ │ |
| 24 | +│ vmbr0 │ 192.168.10.14/24 │ 192.168.10.133 |
| 25 | +│ (ens2) │ │ (2.5G) │ |
| 26 | +│ │ │ │ |
| 27 | +│ ▼ ▼ │ |
| 28 | +│ ┌────────────────────────────────────────────────────────────────────┐ │ |
| 29 | +│ │ 2.5G SWITCH (Main LAN) │ │ |
| 30 | +│ │ 192.168.10.0/24 │ │ |
| 31 | +│ └────────────────────────────────────────────────────────────────────┘ │ |
| 32 | +│ │ │ │ │ │ |
| 33 | +│ ▼ ▼ ▼ ▼ │ |
| 34 | +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ |
| 35 | +│ │ Control Plane│ │ Control Plane│ │ Control Plane│ │ Workers │ │ |
| 36 | +│ │ .237 │ │ .76 │ │ .140 │ │ .164/.219/.159│ │ |
| 37 | +│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │ |
| 38 | +│ │ |
| 39 | +│ ┌──────────────────────────────────────────────────────────────────┐ │ |
| 40 | +│ │ GPU Worker VM 100 │ │ |
| 41 | +│ │ ┌─────────────────┐ ┌─────────────────┐ │ │ |
| 42 | +│ │ │ net0 (ens18) │ │ net1 (ens19) │ │ │ |
| 43 | +│ │ │ vmbr0 → Main LAN│ │ vmbr1 → 10G DAC │ │ │ |
| 44 | +│ │ │ 192.168.10.x │ │ 172.31.250.10 │ │ │ |
| 45 | +│ │ │ (DHCP) │ │ (Static) │ │ │ |
| 46 | +│ │ │ *** PRIMARY *** │ │ Storage only! │ │ │ |
| 47 | +│ │ └─────────────────┘ └─────────────────┘ │ │ |
| 48 | +│ └──────────────────────────────────────────────────────────────────┘ │ |
| 49 | +│ │ |
| 50 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 51 | +``` |
| 52 | + |
| 53 | +## IP Assignments |
| 54 | + |
| 55 | +### Main LAN (192.168.10.0/24) |
| 56 | + |
| 57 | +| Device | IP | Purpose | |
| 58 | +|--------|-----|---------| |
| 59 | +| Router/Gateway | 192.168.10.1 | Default route | |
| 60 | +| Proxmox (hp-server-1) | 192.168.10.14 | Hypervisor | |
| 61 | +| TrueNAS | 192.168.10.133 | NAS (NFS/SMB/MinIO S3) | |
| 62 | +| Control Plane 1 | 192.168.10.237 | K8s master | |
| 63 | +| Control Plane 2 | 192.168.10.76 | K8s master | |
| 64 | +| Control Plane 3 | 192.168.10.140 | K8s master | |
| 65 | +| Worker 1 | 192.168.10.164 | K8s worker | |
| 66 | +| Worker 2 | 192.168.10.219 | K8s worker | |
| 67 | +| Worker 3 | 192.168.10.159 | K8s worker | |
| 68 | +| GPU Worker | 192.168.10.x (DHCP) | K8s GPU worker - **must use this for kubelet** | |
| 69 | +| Wyze Bridge | 192.168.10.46 | RTSP camera streams | |
| 70 | +| LoadBalancer Pool | 192.168.10.32-63 (/27) | Cilium L2 announcements | |
| 71 | + |
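The `/27` pool in the table above spans 32 addresses, 192.168.10.32 through 192.168.10.63. As a minimal sketch, the helper below checks whether a given address falls inside that range, which can be handy before handing out a static IP on the main LAN. The function name `in_lb_pool` is hypothetical and not part of the cluster tooling; the range is taken from the table.

```shell
#!/usr/bin/env bash
# in_lb_pool (hypothetical helper): exit 0 if the address lies inside
# 192.168.10.32-63, the /27 LoadBalancer pool listed above; non-zero otherwise.
in_lb_pool() {
  local ip=$1
  local last=${ip##*.}          # last octet of the address
  case "$ip" in
    192.168.10.*) [ "$last" -ge 32 ] && [ "$last" -le 63 ] ;;
    *) return 1 ;;              # not on the main LAN at all
  esac
}

# Usage: report whether an address overlaps the pool.
if in_lb_pool 192.168.10.14; then echo "inside pool"; else echo "outside pool"; fi
```

By this check, 192.168.10.46 (listed for the Wyze Bridge) also falls inside the range, so it may be worth confirming against the pool's actual configuration that the address is excluded.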
| 72 | +### Storage Network (172.31.250.0/24) |
| 73 | + |
| 74 | +**Point-to-point 10G DAC - NO SWITCH** |
| 75 | + |
| 76 | +| Device | IP | Interface | Purpose | |
| 77 | +|--------|-----|-----------|---------| |
| 78 | +| TrueNAS | 172.31.250.1 | enp67s0 (10G SFP+) | Storage server | |
| 79 | +| Proxmox | 172.31.250.2 | eno49 → vmbr1 | Hypervisor | |
| 80 | +| GPU Worker VM | 172.31.250.10 | ens19 (net1) | Fast storage access | |
| 81 | + |
| 82 | +## Critical Configuration Notes |
| 83 | + |
| 84 | +### GPU Worker Dual-NIC Setup |
| 85 | + |
| 86 | +The GPU worker VM has two NICs: |
| 87 | +- **net0 (ens18)** → vmbr0 → Main LAN (192.168.10.x) - **PRIMARY for Kubernetes** |
| 88 | +- **net1 (ens19)** → vmbr1 → 10G Storage (172.31.250.x) - **Storage traffic only** |
| 89 | + |
| 90 | +**IMPORTANT**: Kubernetes/kubelet MUST register with the 192.168.10.x address, NOT the 172.31.250.x address. The 10G network is isolated and only reaches TrueNAS. |
| 91 | + |
| 92 | +### Why This Matters |
| 93 | + |
| 94 | +If kubelet registers with 172.31.250.10: |
| 95 | +- ❌ Other nodes can't reach it (different subnet, no routing) |
| 96 | +- ❌ kubectl logs/exec fails (API server can't reach kubelet) |
| 97 | +- ❌ Pods scheduled there become unreachable |
| 98 | +- ❌ Services that route to pods on that node fail |
| 99 | + |
| 100 | +### Talos Configuration Requirements |
| 101 | + |
| 102 | +```yaml |
| 103 | +machine: |
| 104 | +  network: |
| 105 | +    interfaces: |
| 106 | +      - interface: ens18 # Main LAN - must be primary |
| 107 | +        dhcp: true |
| 108 | +        routes: |
| 109 | +          - network: 0.0.0.0/0 # Default route MUST go through main LAN |
| 110 | +            gateway: 192.168.10.1 |
| 111 | +      - interface: ens19 # 10G storage - secondary |
| 112 | +        dhcp: false |
| 113 | +        addresses: |
| 114 | +          - 172.31.250.10/24 |
| 115 | +        # NO default route here! |
| 116 | +  kubelet: |
| 117 | +    nodeIP: { validSubnets: [192.168.10.0/24] } # Restrict kubelet node IP to the main LAN subnet |
| 118 | +``` |
| 119 | + |
| 120 | +## Proxmox Bridge Configuration |
| 121 | + |
| 122 | +| Bridge | Physical NIC | CIDR | Purpose | |
| 123 | +|--------|--------------|------|---------| |
| 124 | +| vmbr0 | ens2 | 192.168.10.14/24 | Main LAN | |
| 125 | +| vmbr1 | eno49 | 172.31.250.2/24 | 10G DAC to TrueNAS | |
| 126 | + |
| 127 | +## TrueNAS Network Configuration |
| 128 | + |
| 129 | +| Interface | IP | Speed | Purpose | |
| 130 | +|-----------|-----|-------|---------| |
| 131 | +| enp67s0 | 172.31.250.1/24 | 10G SFP+ DAC | Fast storage (Proxmox direct) | |
| 132 | +| enp67s0d1 | - | 10G SFP+ | Unused (second port) | |
| 133 | +| enx04421a41f284 | 192.168.10.133/24 | 2.5G USB | Main LAN access | |
| 134 | + |
| 135 | +## Whitelisted Storage Access |
| 136 | + |
| 137 | +The Cilium network policy allows these storage connections: |
| 138 | + |
| 139 | +| Destination | Ports | Purpose | |
| 140 | +|-------------|-------|---------| |
| 141 | +| 192.168.10.133 | 2049, 111 | NFS | |
| 142 | +| 192.168.10.133 | 445 | SMB | |
| 143 | +| 192.168.10.133 | 9000 | MinIO S3 | |
| 144 | +| 172.31.250.1 | 2049, 445, 9000 | 10G storage (GPU worker only) | |
| 145 | + |
| 146 | +## Troubleshooting |
| 147 | + |
| 148 | +### GPU Worker Shows Wrong IP |
| 149 | + |
| 150 | +If `kubectl get nodes -o wide` shows 172.31.250.10 for the GPU worker: |
| 151 | + |
| 152 | +1. Check if DHCP is working on ens18 |
| 153 | +2. Verify default route goes through 192.168.10.1 |
| 154 | +3. Pin the kubelet node IP to the main LAN via `machine.kubelet.nodeIP.validSubnets` in the Talos config |
| 155 | +4. Reboot the node after config changes |
| 156 | + |
| 157 | +### Can't Reach GPU Worker |
| 158 | + |
| 159 | +```bash |
| 160 | +# From another node, test connectivity |
| 161 | +ping 192.168.10.x # Should work (main LAN) |
| 162 | +ping 172.31.250.10 # Will fail (different subnet, no routing) |
| 163 | +``` |
| 164 | + |
| 165 | +### Storage Performance Testing |
| 166 | + |
| 167 | +```bash |
| 168 | +# Test 10G link from GPU worker to TrueNAS |
| 169 | +kubectl exec -n <ns> <gpu-pod> -- dd if=/dev/zero of=/mnt/nfs/test bs=1M count=1024 conv=fsync |
| 170 | +# Expect roughly 1 GB/s on the 10G link (line rate is 1.25 GB/s); conv=fsync flushes data before dd reports |
| 171 | +``` |