Commit e6f786d

committed
up
1 parent fd13aba commit e6f786d

1 file changed

Lines changed: 171 additions & 0 deletions

docs/network-topology.md

@@ -0,0 +1,171 @@
# Network Topology

## Overview

The cluster uses two separate networks:

1. **Main LAN (192.168.10.0/24)** - 2.5G over switch - all cluster traffic, API, etc.
2. **Storage Network (172.31.250.0/24)** - 10G DAC point-to-point - fast NFS/iSCSI to TrueNAS

## Physical Topology

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              NETWORK TOPOLOGY                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐         10G DAC (Direct)          ┌─────────────────┐  │
│  │    Proxmox      │◄─────────────────────────────────►│    TrueNAS      │  │
│  │  hp-server-1    │          172.31.250.2/24          │  192.168.10.133 │  │
│  │                 │                 ↕                 │                 │  │
│  │  vmbr1 (eno49)  │          172.31.250.1/24          │  enp67s0 (10G)  │  │
│  │                 │           (no switch!)            │                 │  │
│  └────────┬────────┘                                   └────────┬────────┘  │
│           │                                                     │           │
│     vmbr0 │ 192.168.10.14/24                     192.168.10.133 │           │
│    (ens2) │                                                     │ (2.5G)    │
│           │                                                     │           │
│           ▼                                                     ▼           │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        2.5G SWITCH (Main LAN)                         │  │
│  │                            192.168.10.0/24                            │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│         │                  │                  │                  │          │
│         ▼                  ▼                  ▼                  ▼          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │ Control Plane│   │ Control Plane│   │ Control Plane│   │   Workers    │  │
│  │     .237     │   │     .76      │   │     .140     │   │.164/.219/.159│  │
│  └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘  │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                           GPU Worker VM 100                           │  │
│  │          ┌─────────────────┐             ┌─────────────────┐          │  │
│  │          │  net0 (ens18)   │             │  net1 (ens19)   │          │  │
│  │          │ vmbr0 → Main LAN│             │ vmbr1 → 10G DAC │          │  │
│  │          │  192.168.10.x   │             │  172.31.250.10  │          │  │
│  │          │     (DHCP)      │             │    (Static)     │          │  │
│  │          │ *** PRIMARY *** │             │  Storage only!  │          │  │
│  │          └─────────────────┘             └─────────────────┘          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## IP Assignments

### Main LAN (192.168.10.0/24)

| Device | IP | Purpose |
|--------|-----|---------|
| Router/Gateway | 192.168.10.1 | Default route |
| Proxmox (hp-server-1) | 192.168.10.14 | Hypervisor |
| TrueNAS | 192.168.10.133 | NAS (NFS/SMB/MinIO S3) |
| Control Plane 1 | 192.168.10.237 | K8s master |
| Control Plane 2 | 192.168.10.76 | K8s master |
| Control Plane 3 | 192.168.10.140 | K8s master |
| Worker 1 | 192.168.10.164 | K8s worker |
| Worker 2 | 192.168.10.219 | K8s worker |
| Worker 3 | 192.168.10.159 | K8s worker |
| GPU Worker | 192.168.10.x (DHCP) | K8s GPU worker - **must use this for kubelet** |
| Wyze Bridge | 192.168.10.46 | RTSP camera streams |
| LoadBalancer Pool | 192.168.10.32-63 (/27) | Cilium L2 announcements |
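
The table gives the LoadBalancer pool both as a range (192.168.10.32-63) and as a /27. The following is a minimal sketch for checking whether an address falls inside that pool, assuming the pool corresponds to the block 192.168.10.32/27. The helper names `ip_to_int` and `ip_in_cidr` are illustrative and not part of any cluster tooling.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body runs in a subshell so the IFS change does not leak.
ip_to_int() (
  IFS=.
  set -- $1   # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed (exit 0) if address $1 lies inside CIDR block $2.
ip_in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network part before the slash
  bits=${2#*/}                 # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example: check two addresses from the table against the assumed pool block.
ip_in_cidr 192.168.10.46 192.168.10.32/27 && echo "192.168.10.46 is inside the pool"
ip_in_cidr 192.168.10.14 192.168.10.32/27 || echo "192.168.10.14 is outside the pool"
```

By this check the Wyze Bridge address (192.168.10.46) lies inside the pool, which would be consistent with a LoadBalancer IP assigned by Cilium, although the table does not say so explicitly.
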

### Storage Network (172.31.250.0/24)

**Point-to-point 10G DAC - NO SWITCH**

| Device | IP | Interface | Purpose |
|--------|-----|-----------|---------|
| TrueNAS | 172.31.250.1 | enp67s0 (10G SFP+) | Storage server |
| Proxmox | 172.31.250.2 | eno49 → vmbr1 | Hypervisor |
| GPU Worker VM | 172.31.250.10 | ens19 (net1) | Fast storage access |

## Critical Configuration Notes

### GPU Worker Dual-NIC Setup

The GPU worker VM has two NICs:
- **net0 (ens18)** → vmbr0 → Main LAN (192.168.10.x) - **PRIMARY for Kubernetes**
- **net1 (ens19)** → vmbr1 → 10G Storage (172.31.250.x) - **Storage traffic only**

**IMPORTANT**: Kubernetes/kubelet MUST register with the 192.168.10.x address, NOT the 172.31.250.x address. The 10G network is isolated: it only connects the GPU worker and the Proxmox host to TrueNAS, and nothing routes between it and the main LAN.

### Why This Matters

If kubelet registers with 172.31.250.10:
- ❌ Other nodes can't reach it (different subnet, no routing)
- ❌ kubectl logs/exec fails (API server can't reach kubelet)
- ❌ Pods scheduled there become unreachable
- ❌ Services backed by pods on that node don't work

### Talos Configuration Requirements

```yaml
machine:
  network:
    interfaces:
      - interface: ens18 # Main LAN - must be primary
        dhcp: true
        routes:
          - network: 0.0.0.0/0 # Default route MUST go through main LAN
            gateway: 192.168.10.1
      - interface: ens19 # 10G storage - secondary
        dhcp: false
        addresses:
          - 172.31.250.10/24
        # NO default route here!
  kubelet:
    nodeIP:
      validSubnets:
        - 192.168.10.0/24 # Force kubelet to use its main LAN IP
```

## Proxmox Bridge Configuration

| Bridge | Physical NIC | CIDR | Purpose |
|--------|--------------|------|---------|
| vmbr0 | ens2 | 192.168.10.14/24 | Main LAN |
| vmbr1 | eno49 | 172.31.250.2/24 | 10G DAC to TrueNAS |

## TrueNAS Network Configuration

| Interface | IP | Speed | Purpose |
|-----------|-----|-------|---------|
| enp67s0 | 172.31.250.1/24 | 10G SFP+ DAC | Fast storage (Proxmox direct) |
| enp67s0d1 | - | 10G SFP+ | Unused (second port) |
| enx04421a41f284 | 192.168.10.133/24 | 2.5G USB | Main LAN access |

## Whitelisted Storage Access

The Cilium network policy allows these storage connections:

| Destination | Ports | Purpose |
|-------------|-------|---------|
| 192.168.10.133 | 2049, 111 | NFS |
| 192.168.10.133 | 445 | SMB |
| 192.168.10.133 | 9000 | MinIO S3 |
| 172.31.250.1 | 2049, 445, 9000 | 10G storage (GPU worker only) |

## Troubleshooting

### GPU Worker Shows Wrong IP

If `kubectl get nodes -o wide` shows 172.31.250.10 for the GPU worker:

1. Check that DHCP is working on ens18
2. Verify that the default route goes through 192.168.10.1
3. Force the kubelet nodeIP onto the main LAN in the Talos config
4. Reboot the node after config changes
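
The check behind these steps can be sketched as a small helper that classifies whatever address the node reports. `classify_node_ip` is a hypothetical name, and `<gpu-node>` in the usage comment is a placeholder for the real node name. Because both networks in this document are /24s, simple prefix matching is enough here.

```bash
# Classify a node's reported InternalIP against the two subnets in this document.
classify_node_ip() {
  case "$1" in
    192.168.10.*) echo "ok: main LAN" ;;
    172.31.250.*) echo "wrong: storage network" ;;
    *)            echo "unknown: not on either documented subnet" ;;
  esac
}

# Usage sketch (requires cluster access, so it is left commented out):
# classify_node_ip "$(kubectl get node <gpu-node> \
#   -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')"
```

A result of `wrong: storage network` would correspond to the failure mode described above, where kubelet registered on the 10G interface.
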

### Can't Reach GPU Worker

```bash
# From another node, test connectivity
ping 192.168.10.x # Should work (main LAN)
ping 172.31.250.10 # Will fail (different subnet, no routing)
```

### Storage Performance Testing

```bash
# Test 10G link from GPU worker to TrueNAS
# conv=fsync flushes the data to the NFS server before dd reports a rate
kubectl exec -n <ns> <gpu-pod> -- dd if=/dev/zero of=/mnt/nfs/test bs=1M count=1024 conv=fsync
# Should see ~1GB/s+ throughput on 10G link
```
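
For context on the ~1GB/s figure, the theoretical line rate of each link can be worked out from its nominal speed. This is a rough sketch that ignores Ethernet, TCP/IP, and NFS overhead, so real throughput will be somewhat lower; `line_rate_mb_s` is an illustrative helper name.

```bash
# Theoretical line rate in MB/s for a link speed given in Mbit/s (8 bits per byte).
line_rate_mb_s() {
  awk -v mbit="$1" 'BEGIN { printf "%.1f\n", mbit / 8 }'
}

line_rate_mb_s 10000   # 10G DAC storage link, prints 1250.0
line_rate_mb_s 2500    # 2.5G main LAN, prints 312.5
```

A measured rate close to 1 GB/s is in line with the 10G ceiling of 1250 MB/s. A rate at or below roughly 300 MB/s might suggest the mount is going over the 2.5G main LAN address (192.168.10.133) rather than 172.31.250.1.
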
