Commit 89ffa13

10gig switch to fix my woes
1 parent 490f482 commit 89ffa13

14 files changed

Lines changed: 86 additions & 169 deletions

.github/renovate.json5

Lines changed: 14 additions & 0 deletions
@@ -100,6 +100,20 @@
       ],
       versioning: 'regex:^server-cuda-b(?<major>\\d+)$',
     },
+    {
+      description: 'Ignore NVIDIA CUDA updates (breaking changes)',
+      matchPackagePatterns: [
+        '^nvidia/cuda',
+      ],
+      enabled: false,
+    },
+    {
+      description: 'Ignore Meilisearch updates (breaking changes)',
+      matchPackagePatterns: [
+        '^getmeili/meilisearch',
+      ],
+      enabled: false,
+    },
   ],
   ignorePaths: [
     '**/charts/**',
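
As a rough illustration of which dependency names the two new rules would switch off, the sketch below applies the same start-anchored patterns with Python's `re` module. It is not Renovate's own matching code: the helper `is_ignored` and the sample package names are hypothetical, and it assumes each pattern is tested as a regular expression against the package name.

```python
import re

# Patterns copied from the two packageRules added in this commit.
IGNORE_PATTERNS = [r"^nvidia/cuda", r"^getmeili/meilisearch"]


def is_ignored(package_name: str) -> bool:
    """Return True if any ignore pattern matches the package name."""
    return any(re.search(pattern, package_name) for pattern in IGNORE_PATTERNS)


if __name__ == "__main__":
    # Hypothetical package names, chosen only to exercise the anchoring.
    for name in ["nvidia/cuda", "getmeili/meilisearch",
                 "ghcr.io/nvidia/cuda", "library/redis"]:
        state = "ignored" if is_ignored(name) else "still updated"
        print(f"{name}: {state}")
```

Because the patterns are anchored only at the start, a name carrying a registry prefix would not match under this assumption, while any name that merely begins with `nvidia/cuda` would.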

docs/network-policy.md

Lines changed: 1 addition & 2 deletions
@@ -105,8 +105,7 @@ These specific IPs are allowed on specific ports only:
 
 | IP | Hostname | Allowed Ports | Purpose |
 |----|----------|---------------|---------|
-| 192.168.10.133 | TrueNAS | 2049 (NFS), 111 (RPC), 445 (SMB), 9000 (MinIO) | Storage backend |
-| 172.31.250.1 | TrueNAS SMB | 445 (SMB) | SMB shares for apps |
+| 192.168.10.133 | TrueNAS | 2049 (NFS), 111 (RPC), 445 (SMB), 9000 (MinIO), 30292-30293 (RustFS) | Storage backend (10G) |
 | 192.168.10.46 | Wyze Bridge | 8554 (RTSP) | Camera streams for Frigate |
 | 192.168.10.14 | Proxmox | 8006 (API) | Omni/Terraform integration |
 | 192.168.10.32/27 | LB Pool | All | Cilium L2 LoadBalancer IPs |

docs/network-topology.md

Lines changed: 43 additions & 92 deletions
@@ -2,9 +2,9 @@
 
 ## Overview
 
-The cluster uses two separate networks:
-1. **Main LAN (192.168.10.0/24)** - 2.5G over switch - all cluster traffic, API, etc.
-2. **Storage Network (172.31.250.0/24)** - 10G DAC point-to-point - fast NFS/iSCSI to TrueNAS
+The cluster uses a single network with 10G switch infrastructure:
+- **Main LAN (192.168.10.0/24)** - All cluster traffic via 10G switch
+- **TrueNAS Storage** - 192.168.10.133 (10G connected via switch)
 
 ## Physical Topology
 
@@ -13,38 +13,33 @@ The cluster uses two separate networks:
 │ NETWORK TOPOLOGY │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │ │
-│ ┌─────────────────┐ 10G DAC (Direct) ┌─────────────────┐ │
-│ │ Proxmox │◄───────────────────────────────►│ TrueNAS │ │
-│ │ hp-server-1 │ 172.31.250.2/24 │ 192.168.10.133│ │
-│ │ │ ↕ │ │ │
-│ │ vmbr1 (eno49) │ 172.31.250.1/24 │ enp67s0 (10G) │ │
-│ │ │ (no switch!) │ │ │
-│ └────────┬────────┘ └────────┬────────┘ │
-│ │ │ │
-│ vmbr0 │ 192.168.10.14/24 │ 192.168.10.133
-│ (ens2) │ │ (2.5G) │
-│ │ │ │
-│ ▼ ▼ │
+│ ┌─────────────────┐ ┌─────────────────┐ │
+│ │ Proxmox │ │ TrueNAS │ │
+│ │ hp-server-1 │ │ 192.168.10.133 │ │
+│ │ 192.168.10.14 │ │ │ │
+│ └────────┬────────┘ └────────┬────────┘ │
+│ │ 10G │ 10G │
+│ │ │ │
+│ ▼ ▼ │
 │ ┌────────────────────────────────────────────────────────────────────┐ │
-│ │ 2.5G SWITCH (Main LAN) │ │
-│ │ 192.168.10.0/24 │ │
+│ │ 10G SWITCH │ │
+│ │ 192.168.10.0/24 │ │
 │ └────────────────────────────────────────────────────────────────────┘ │
-│ │ │ │ │ │
-│ ▼ ▼ ▼ ▼ │
-│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
-│ │ Control Plane│ │ Control Plane│ │ Control Plane│ │ Workers │ │
-│ │ .237 │ │ .76 │ │ .140 │ │ .164/.219/.159│ │
-│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
+│ │ │ │ │
+│ ▼ ▼ ▼ ▼
+│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
+│ │ Control Plane│ │ Control Plane│ │ Control Plane│ │ Workers │
+│ │ .237 │ │ .76 │ │ .140 │ │ .164/.219/.159│
+│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
 │ │
 │ ┌──────────────────────────────────────────────────────────────────┐ │
 │ │ GPU Worker VM 100 │ │
-│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
-│ │ │ net0 (ens18) │ │ net1 (ens19) │ │ │
-│ │ │ vmbr0 → Main LAN│ │ vmbr1 → 10G DAC │ │ │
-│ │ │ 192.168.10.x │ │ 172.31.250.10 │ │ │
-│ │ │ (DHCP) │ │ (Static) │ │ │
-│ │ │ *** PRIMARY *** │ │ Storage only! │ │ │
-│ │ └─────────────────┘ └─────────────────┘ │ │
+│ │ ┌─────────────────┐ │ │
+│ │ │ net0 (ens18) │ │ │
+│ │ │ vmbr0 → 10G LAN │ │ │
+│ │ │ 192.168.10.x │ │ │
+│ │ │ (DHCP) │ │ │
+│ │ └─────────────────┘ │ │
 │ └──────────────────────────────────────────────────────────────────┘ │
 │ │
 └─────────────────────────────────────────────────────────────────────────────┘
@@ -58,79 +53,42 @@ The cluster uses two separate networks:
 |--------|-----|---------|
 | Router/Gateway | 192.168.10.1 | Default route |
 | Proxmox (hp-server-1) | 192.168.10.14 | Hypervisor |
-| TrueNAS | 192.168.10.133 | NAS (NFS/SMB/MinIO S3) |
+| TrueNAS | 192.168.10.133 | NAS (NFS/SMB/MinIO S3) - 10G |
 | Control Plane 1 | 192.168.10.237 | K8s master |
 | Control Plane 2 | 192.168.10.76 | K8s master |
 | Control Plane 3 | 192.168.10.140 | K8s master |
 | Worker 1 | 192.168.10.164 | K8s worker |
 | Worker 2 | 192.168.10.219 | K8s worker |
 | Worker 3 | 192.168.10.159 | K8s worker |
-| GPU Worker | 192.168.10.x (DHCP) | K8s GPU worker - **must use this for kubelet** |
+| GPU Worker | 192.168.10.x (DHCP) | K8s GPU worker |
 | Wyze Bridge | 192.168.10.46 | RTSP camera streams |
 | LoadBalancer Pool | 192.168.10.32-63 (/27) | Cilium L2 announcements |
 
-### Storage Network (172.31.250.0/24)
-
-**Point-to-point 10G DAC - NO SWITCH**
-
-| Device | IP | Interface | Purpose |
-|--------|-----|-----------|---------|
-| TrueNAS | 172.31.250.1 | enp67s0 (10G SFP+) | Storage server |
-| Proxmox | 172.31.250.2 | eno49 → vmbr1 | Hypervisor |
-| GPU Worker VM | 172.31.250.10 | ens19 (net1) | Fast storage access |
-
-## Critical Configuration Notes
-
-### GPU Worker Dual-NIC Setup
-
-The GPU worker VM has two NICs:
-- **net0 (ens18)** → vmbr0 → Main LAN (192.168.10.x) - **PRIMARY for Kubernetes**
-- **net1 (ens19)** → vmbr1 → 10G Storage (172.31.250.x) - **Storage traffic only**
-
-**IMPORTANT**: Kubernetes/kubelet MUST register with the 192.168.10.x address, NOT the 172.31.250.x address. The 10G network is isolated and only reaches TrueNAS.
-
-### Why This Matters
-
-If kubelet registers with 172.31.250.10:
-- ❌ Other nodes can't reach it (different subnet, no routing)
-- ❌ kubectl logs/exec fails (API server can't reach kubelet)
-- ❌ Pods scheduled there become unreachable
-- ❌ Services don't work
-
-### Talos Configuration Requirements
+## Talos Configuration
 
 ```yaml
 machine:
   network:
     interfaces:
-      - interface: ens18 # Main LAN - must be primary
+      - interface: ens18
         dhcp: true
-        routes:
-          - network: 0.0.0.0/0 # Default route MUST go through main LAN
-            gateway: 192.168.10.1
-      - interface: ens19 # 10G storage - secondary
-        dhcp: false
-        addresses:
-          - 172.31.250.10/24
-        # NO default route here!
   kubelet:
-    nodeIP: <192.168.10.x> # Force kubelet to use main LAN IP
+    nodeIP:
+      validSubnets:
+        - 192.168.10.0/24
 ```
 
 ## Proxmox Bridge Configuration
 
 | Bridge | Physical NIC | CIDR | Purpose |
 |--------|--------------|------|---------|
-| vmbr0 | ens2 | 192.168.10.14/24 | Main LAN |
-| vmbr1 | eno49 | 172.31.250.2/24 | 10G DAC to TrueNAS |
+| vmbr0 | ens2 | 192.168.10.14/24 | Main LAN (10G) |
 
 ## TrueNAS Network Configuration
 
 | Interface | IP | Speed | Purpose |
 |-----------|-----|-------|---------|
-| enp67s0 | 172.31.250.1/24 | 10G SFP+ DAC | Fast storage (Proxmox direct) |
-| enp67s0d1 | - | 10G SFP+ | Unused (second port) |
-| enx04421a41f284 | 192.168.10.133/24 | 2.5G USB | Main LAN access |
+| enp67s0 | 192.168.10.133/24 | 10G SFP+ | Main LAN (via 10G switch) |
 
 ## Whitelisted Storage Access
 
@@ -141,31 +99,24 @@ The Cilium network policy allows these storage connections:
 | 192.168.10.133 | 2049, 111 | NFS |
 | 192.168.10.133 | 445 | SMB |
 | 192.168.10.133 | 9000 | MinIO S3 |
-| 172.31.250.1 | 2049, 445, 9000 | 10G storage (GPU worker only) |
+| 192.168.10.133 | 30292, 30293 | RustFS |
 
 ## Troubleshooting
 
-### GPU Worker Shows Wrong IP
-
-If `kubectl get nodes -o wide` shows 172.31.250.10 for GPU worker:
-
-1. Check if DHCP is working on ens18
-2. Verify default route goes through 192.168.10.1
-3. Force kubelet nodeIP in Talos config
-4. Reboot the node after config changes
-
-### Can't Reach GPU Worker
+### Can't Reach Storage
 
 ```bash
-# From another node, test connectivity
-ping 192.168.10.x # Should work (main LAN)
-ping 172.31.250.10 # Will fail (different subnet, no routing)
+# Test connectivity to TrueNAS
+ping 192.168.10.133
+
+# Test NFS mount
+showmount -e 192.168.10.133
 ```
 
 ### Storage Performance Testing
 
 ```bash
-# Test 10G link from GPU worker to TrueNAS
-kubectl exec -n <ns> <gpu-pod> -- dd if=/dev/zero of=/mnt/nfs/test bs=1G count=1
+# Test 10G link to TrueNAS
+kubectl exec -n <ns> <pod> -- dd if=/dev/zero of=/mnt/nfs/test bs=1G count=1
 # Should see ~1GB/s+ throughput on 10G link
 ```
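
The "~1GB/s+" expectation in the performance test can be sanity-checked with simple arithmetic. The sketch below converts a nominal link speed to a byte-rate ceiling and estimates the minimum time to move the 1 GiB that `dd bs=1G count=1` writes. It ignores Ethernet, TCP and NFS overhead as well as page-cache effects, so real results should land somewhat below the ceiling. The function names are hypothetical.

```python
def line_rate_bytes_per_s(gbit_per_s: float) -> float:
    """Convert a nominal link speed in Gbit/s to bytes per second (decimal units)."""
    return gbit_per_s * 1e9 / 8


def transfer_seconds(size_bytes: int, gbit_per_s: float) -> float:
    """Lower bound on transfer time at the raw line rate, with no overhead."""
    return size_bytes / line_rate_bytes_per_s(gbit_per_s)


if __name__ == "__main__":
    one_gib = 1024 ** 3  # dd bs=1G count=1 writes 1 GiB
    for speed in (2.5, 10.0):  # old switch vs. new switch
        ceiling = line_rate_bytes_per_s(speed) / 1e9
        seconds = transfer_seconds(one_gib, speed)
        print(f"{speed} Gbit/s: ceiling {ceiling:.4f} GB/s, "
              f"1 GiB takes at least {seconds:.2f} s")
```

A 10 Gbit/s link tops out at 1.25 GB/s, so a measured rate around 1 GB/s is consistent with a healthy 10G path, whereas a 2.5 Gbit/s link cannot exceed roughly 0.31 GB/s.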

infrastructure/networking/cilium/policies/block-lan-access.yaml

Lines changed: 0 additions & 11 deletions
@@ -70,17 +70,6 @@ spec:
             - port: "9000"
               protocol: TCP
 
-    # ============================================
-    # ALLOW: TrueNAS - SMB Storage
-    # (Used by ComfyUI, Ollama, Frigate, etc.)
-    # ============================================
-    - toCIDR:
-        - 172.31.250.1/32
-      toPorts:
-        - ports:
-            - port: "445"
-              protocol: TCP
-
     # ============================================
     # ALLOW: Wyze Bridge RTSP (for Frigate)
     # ============================================

infrastructure/storage/csi-driver-nfs/kustomization.yaml

Lines changed: 1 addition & 2 deletions
@@ -2,8 +2,7 @@ apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 namespace: csi-driver-nfs
 helmCharts:
-  # Single NFS CSI deployment - controller on GPU node (has access to BOTH networks)
-  # GPU node can reach 172.31.250.1 (10G DAC) AND 192.168.10.133 (2.5G LAN)
+  # NFS CSI deployment - all nodes reach TrueNAS via 10G switch
   - name: csi-driver-nfs
     repo: https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
     version: 4.12.1

infrastructure/storage/csi-driver-nfs/storage-class.yaml

Lines changed: 5 additions & 6 deletions
@@ -1,16 +1,15 @@
 # =============================================================================
-# 10G DAC NFS Storage Classes (172.31.250.1) - GPU workloads only
-# For non-GPU workloads, use SMB storage classes on 2.5G network
+# 10G NFS Storage Classes (192.168.10.133) - via 10G switch
 # =============================================================================
 
-# ComfyUI NFS storage class (10G DAC)
+# ComfyUI NFS storage class (10G)
 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: nfs-comfyui-10g
 provisioner: nfs.csi.k8s.io
 parameters:
-  server: 172.31.250.1
+  server: 192.168.10.133
   share: /mnt/BigTank/k8s/comfyui
 reclaimPolicy: Retain
 volumeBindingMode: Immediate
@@ -19,14 +18,14 @@ mountOptions:
   - nolock
   - tcp
 ---
-# llama-cpp NFS storage class (10G DAC)
+# llama-cpp NFS storage class (10G)
 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: nfs-llama-cpp-10g
 provisioner: nfs.csi.k8s.io
 parameters:
-  server: 172.31.250.1
+  server: 192.168.10.133
   share: /mnt/BigTank/k8s/llama-cpp
 reclaimPolicy: Retain
 volumeBindingMode: Immediate
Lines changed: 2 additions & 11 deletions
@@ -1,18 +1,9 @@
-# Single NFS CSI deployment
-# Controller runs on GPU node which has access to BOTH networks:
-# - 172.31.250.1 (10G DAC) for GPU workloads
-# - 192.168.10.133 (2.5G LAN) for other workloads
+# NFS CSI deployment
+# All nodes can reach TrueNAS at 192.168.10.133 via 10G switch
 
 storageClass:
   create: false
 
-# Controller on GPU node (has dual network access)
 controller:
   runOnControlPlane: false
   runOnMaster: false
-  nodeSelector:
-    feature.node.kubernetes.io/pci-0300_10de.present: "true"
-  tolerations:
-    - key: "nvidia.com/gpu"
-      operator: "Exists"
-      effect: "NoSchedule"

infrastructure/storage/csi-driver-smb/storage-class.yaml

Lines changed: 4 additions & 4 deletions
@@ -5,7 +5,7 @@ metadata:
   name: comfyui-smb
 provisioner: smb.csi.k8s.io
 parameters:
-  source: //172.31.250.1/k8s/comfyui
+  source: //192.168.10.133/k8s/comfyui
   csi.storage.k8s.io/node-stage-secret-name: smbcreds
   # ✅ CORRECTED: Point to the namespace where the secret is actually created
   csi.storage.k8s.io/node-stage-secret-namespace: csi-driver-smb
@@ -26,7 +26,7 @@ metadata:
 provisioner: smb.csi.k8s.io
 parameters:
   # Point directly to the ollama subfolder on the SMB share
-  source: //172.31.250.1/k8s/ollama
+  source: //192.168.10.133/k8s/ollama
   csi.storage.k8s.io/node-stage-secret-name: smbcreds
   csi.storage.k8s.io/node-stage-secret-namespace: csi-driver-smb
 mountOptions:
@@ -46,7 +46,7 @@ metadata:
 provisioner: smb.csi.k8s.io
 parameters:
   # Point directly to the ollama subfolder on the SMB share
-  source: //172.31.250.1/k8s/frigate
+  source: //192.168.10.133/k8s/frigate
   csi.storage.k8s.io/node-stage-secret-name: smbcreds
   csi.storage.k8s.io/node-stage-secret-namespace: csi-driver-smb
 mountOptions:
@@ -66,7 +66,7 @@ metadata:
 provisioner: smb.csi.k8s.io
 parameters:
   # Point directly to the ollama subfolder on the SMB share
-  source: //172.31.250.1/k8s/llama-cpp
+  source: //192.168.10.133/k8s/llama-cpp
   csi.storage.k8s.io/node-stage-secret-name: smbcreds
   csi.storage.k8s.io/node-stage-secret-namespace: csi-driver-smb
 mountOptions:

my-apps/ai/comfyui/pvc.yaml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ spec:
   persistentVolumeReclaimPolicy: Retain
   storageClassName: nfs-comfyui-10g
   nfs:
-    server: 172.31.250.1
+    server: 192.168.10.133
     path: /mnt/BigTank/k8s/comfyui
   mountOptions:
     - nfsvers=4.1

my-apps/ai/llama-cpp/pvc.yaml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ spec:
   persistentVolumeReclaimPolicy: Retain
   storageClassName: nfs-llama-cpp-10g
   nfs:
-    server: 172.31.250.1
+    server: 192.168.10.133
     path: /mnt/BigTank/k8s/llama-cpp
   mountOptions:
     - nfsvers=4.1
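
Since this commit retires the `172.31.250.0/24` storage subnet across many manifests, a small scanner can help confirm that no reference was missed. The following is a minimal sketch, not part of the repository. The function `stale_references` and the sample text are hypothetical, and it only looks for dotted-quad IPv4 literals.

```python
import ipaddress
import re

# Subnet removed by this commit (the former point-to-point 10G DAC network).
RETIRED_SUBNET = ipaddress.ip_network("172.31.250.0/24")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def stale_references(text: str) -> list[tuple[int, str]]:
    """Return (line_number, ip) pairs for addresses inside the retired subnet."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in IPV4_RE.findall(line):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looks like an IP (e.g. "999.1.1.1") but is not valid
            if addr in RETIRED_SUBNET:
                hits.append((lineno, candidate))
    return hits


if __name__ == "__main__":
    sample = (
        "server: 172.31.250.1\n"
        "source: //192.168.10.133/k8s/ollama\n"
        "- 172.31.250.10/24\n"
    )
    for lineno, ip in stale_references(sample):
        print(f"line {lineno}: {ip}")
```

Running the function over each YAML and Markdown file in the repository (for example via `pathlib.Path.rglob`) would flag any manifest that still points at the old addresses.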
