Commit 02a3918

Feat: Add real topology data and Working ipv6.
1 parent 664a264 commit 02a3918

14 files changed

Lines changed: 1367 additions & 67 deletions

Lines changed: 295 additions & 0 deletions
@@ -0,0 +1,295 @@
1+
# OKE BM.GPU.GB200-v3.4 RoCEv2 dranet Demo
2+
3+
End-to-end demo of topologically-aware GPU + RoCEv2 NIC allocation using
4+
[Dynamic Resource Allocation (DRA)](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
5+
on Oracle Kubernetes Engine (OKE) with [BM.GPU.GB200-v3.4](https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu) shapes.
6+
7+
## Context
8+
9+
### Shape: BM.GPU.GB200-v3.4
10+
11+
Each node has:
12+
| Resource | Count | Detail |
|---|---|---|
| GPU | 4 x NVIDIA GB200 | 189 GB HBM3e, Blackwell architecture, NVLink-18 all-to-all |
| NIC | 8 x Mellanox ConnectX-8 | 400 Gb/s RoCEv2, 4x NDR per NIC |
| NUMA nodes | 2 | 2 GPUs + 4 NICs per NUMA node |
18+
19+
### GPU-NIC topology
20+
21+
On GB200, GPUs connect to the Grace CPU via **NVLink C2C** (chip-to-chip), while
22+
NICs connect via PCIe. Because GPUs and NICs are on fundamentally different
23+
interconnects, `nvidia-smi topo -m` reports **SYS** for every GPU-NIC pair:
24+
|      | GPU0 | GPU1 | GPU2 | GPU3 | NIC0 | NIC1 | NIC2 | NIC3 | NIC4 | NIC5 | NIC6 | NIC7 |
|------|------|------|------|------|------|------|------|------|------|------|------|------|
| GPU0 | X    | NV18 | NV18 | NV18 | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  |
| GPU1 | NV18 | X    | NV18 | NV18 | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  |
| GPU2 | NV18 | NV18 | X    | NV18 | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  |
| GPU3 | NV18 | NV18 | NV18 | X    | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  | SYS  |
31+
32+
NIC mapping: NIC0=mlx5_0/rdma0 (NUMA 0), ... NIC3=mlx5_3/rdma3 (NUMA 0), NIC4=mlx5_5/rdma4 (NUMA 1), ... NIC7=mlx5_8/rdma7 (NUMA 1)
33+
> **Key difference from Azure GB300:** On Azure, GPU-NIC pairs on the same NUMA
> node have **NODE** affinity. On OKE GB200, all pairs report **SYS** because the
> C2C link is not visible to the PCIe topology. Despite this, NCCL enables
> GPUDirect RDMA (GDR) for NUMA-local NICs when `NCCL_NET_GDR_C2C=1` is set,
> achieving comparable bandwidth. The practical performance difference is
> therefore NUMA-local vs cross-NUMA.
39+
40+
### DRA device attributes
41+
42+
**GPU** (driver: `gpu.nvidia.com`):
43+
| Device | pciBusID | pcieRoot | NUMA |
|---|---|---|---|
| gpu-0 | 0008:06:00.0 | pci0008:00 | 0 |
| gpu-1 | 0009:06:00.0 | pci0009:00 | 0 |
| gpu-2 | 0018:06:00.0 | pci0018:00 | 1 |
| gpu-3 | 0019:06:00.0 | pci0019:00 | 1 |
50+
51+
**NIC** (driver: `dra.net`):
52+
| Device | ifName | pciAddress | NUMA | pcieRoot |
|---|---|---|---|---|
| pci-0000-03-00-0 | rdma0 | 0000:03:00.0 | 0 | pci0000:00 |
| pci-0000-03-00-1 | rdma1 | 0000:03:00.1 | 0 | pci0000:00 |
| pci-0002-03-00-0 | rdma2 | 0002:03:00.0 | 0 | pci0002:00 |
| pci-0002-03-00-1 | rdma3 | 0002:03:00.1 | 0 | pci0002:00 |
| pci-0010-03-00-0 | rdma4 | 0010:03:00.0 | 1 | pci0010:00 |
| pci-0010-03-00-1 | rdma5 | 0010:03:00.1 | 1 | pci0010:00 |
| pci-0012-03-00-0 | rdma6 | 0012:03:00.0 | 1 | pci0012:00 |
| pci-0012-03-00-1 | rdma7 | 0012:03:00.1 | 1 | pci0012:00 |
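
Since `nvidia-smi topo -m` reports **SYS** for every pair, the NUMA column is the only usable alignment signal. The following is a minimal sketch, not part of dranet, that joins the two tables above on NUMA node. The names `GPUS`, `NICS`, and `numa_local_nics` are illustrative, and the data is copied by hand from the tables.

```python
# Illustrative only: device data copied by hand from the two tables above.
GPUS = {"gpu-0": 0, "gpu-1": 0, "gpu-2": 1, "gpu-3": 1}  # device -> NUMA node

NICS = {  # ifName -> (pciAddress, NUMA node)
    "rdma0": ("0000:03:00.0", 0),
    "rdma1": ("0000:03:00.1", 0),
    "rdma2": ("0002:03:00.0", 0),
    "rdma3": ("0002:03:00.1", 0),
    "rdma4": ("0010:03:00.0", 1),
    "rdma5": ("0010:03:00.1", 1),
    "rdma6": ("0012:03:00.0", 1),
    "rdma7": ("0012:03:00.1", 1),
}


def numa_local_nics(gpu: str) -> list[str]:
    """Return the ifNames of the NICs on the same NUMA node as `gpu`."""
    numa = GPUS[gpu]
    return sorted(name for name, (_pci, nic_numa) in NICS.items() if nic_numa == numa)


if __name__ == "__main__":
    for gpu in sorted(GPUS):
        print(gpu, numa_local_nics(gpu))
```

In a real cluster the same join would be expressed as a selector in the `ResourceClaimTemplate` rather than computed client-side; the sketch only shows which pairings count as aligned.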
63+
64+
### OKE topology attributes (oke.dra.net)
65+
66+
Each NIC device carries node-level RDMA topology attributes sourced from the
67+
OCI Instance Metadata Service (`GET /opc/v2/host/`):
68+
| Attribute | Description |
|---|---|
| `oke.dra.net/hpcIslandId` | HPC Island -- largest topology grouping (~2000 nodes) |
| `oke.dra.net/networkBlockId` | Network Block -- mid-level grouping (~64-128 nodes) |
| `oke.dra.net/localBlockId` | Local Block -- closest grouping (~8-32 nodes) |
| `oke.dra.net/rackId` | Physical rack identifier |
| `oke.dra.net/gpuMemoryFabricId` | GPU memory fabric ID (populated on GB200/GB300) |
76+
77+
> **Note:** Topology data must be enabled for your OCI tenancy. dranet logs
78+
> `"Please turn on TopologyData for your Tenancy"` at startup if the `/host/`
79+
> endpoint does not provide `rdmaTopologyData`.
80+
81+
### RoCEv2 and IPv6 on OKE
82+
83+
The ConnectX-8 NICs use **RoCEv2** (RDMA over Converged Ethernet v2). On OKE,
84+
each RDMA NIC receives a globally-routable IPv6 address via Router Advertisement.
85+
This address populates a routable GID in the NIC's GID table, which NCCL uses
86+
for inter-node communication (`NCCL_IB_GID_INDEX=3`).
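
A GID is a 128-bit value written like an IPv6 address, so the difference between a link-local and a routable entry can be checked with the standard library. This is a minimal sketch, assuming GID strings in the form found under `/sys/class/infiniband/<dev>/ports/<port>/gids/<index>`. The helper name `classify_gid` is hypothetical, and the example address uses the IPv6 documentation prefix rather than a real OKE address.

```python
import ipaddress


def classify_gid(gid: str) -> str:
    """Classify a GID string as 'empty', 'link-local', 'ipv4-mapped' or 'routable'."""
    addr = ipaddress.IPv6Address(gid)
    if int(addr) == 0:
        return "empty"          # unused GID table slot
    if addr.is_link_local:
        return "link-local"     # fe80::/10, not routable across the fabric
    if addr.ipv4_mapped is not None:
        return "ipv4-mapped"    # ::ffff:a.b.c.d style entry
    return "routable"           # anything else is treated as a routable IPv6 GID


if __name__ == "__main__":
    print(classify_gid("fe80:0000:0000:0000:0a00:27ff:fe00:0001"))
    print(classify_gid("2001:0db8:0000:0000:0000:0000:0000:0001"))
```

If every populated entry for a NIC classifies as `link-local`, the pod is in the state that the challenge below describes.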
87+
88+
**Challenge:** In single-stack IPv4 Kubernetes clusters, the container runtime
89+
sets `net.ipv6.conf.all.disable_ipv6=1` in pod namespaces. This prevents the
90+
RA-assigned IPv6 address from being applied to RDMA NICs in the pod, leaving
91+
only link-local GIDs (which are not routable on the OKE fabric).
92+
93+
**dranet fix:** The OKE cloud provider returns `EnableIPv6: true` for RDMA
94+
devices on GPU fabric shapes. When set, dranet:
95+
96+
1. Soft-fails the initial IPv6 address application (EACCES due to disabled IPv6)
97+
2. Enables IPv6 per-interface via `net.ipv6.conf.<ifname>.disable_ipv6=0`
98+
3. Re-applies the IPv6 address, populating the routable GID at index 3
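
The three steps above can be sketched as follows. This is not dranet's implementation (dranet is written in Go); it is a hedged Python illustration of the control flow. The function name, the `apply_address` callback, and the `sysctl_root` parameter are assumptions introduced here so that the logic can be exercised without touching a real network namespace.

```python
from pathlib import Path
from typing import Callable


def apply_ipv6_with_retry(
    ifname: str,
    apply_address: Callable[[], None],
    sysctl_root: Path = Path("/proc/sys"),
) -> bool:
    """Apply an IPv6 address, enabling IPv6 on the interface if the first try is refused.

    Returns True when the per-interface sysctl had to be flipped.
    """
    try:
        apply_address()  # step 1: expected to fail while IPv6 is disabled
        return False
    except PermissionError:
        pass  # soft-fail: EACCES is raised as PermissionError in Python

    # step 2: enable IPv6 for this interface only, leaving `all` untouched
    knob = sysctl_root / "net" / "ipv6" / "conf" / ifname / "disable_ipv6"
    knob.write_text("0\n")

    apply_address()  # step 3: re-apply; any error now propagates to the caller
    return True
```

Flipping only the per-interface knob keeps the rest of the pod namespace single-stack, which matches the per-interface sysctl named in step 2.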
99+
100+
## Files
101+
| File | Description |
|---|---|
| `resource-claim-template.yaml` | Three `ResourceClaimTemplate` objects for the three test cases |
| `mpi-job.yaml` | `MPIJob` that runs `nccl_tests/all_reduce_perf` across 2 workers |
| `resourceslice-gpu.yaml` | Live GPU `ResourceSlice` from a GB200 node (reference) |
| `resourceslice-dranet.yaml` | Live NIC `ResourceSlice` from a GB200 node (reference) |
108+
109+
## Installation
110+
111+
### 1. Uninstall the existing dranet
112+
```bash
helm uninstall dranet -n kube-system
kubectl wait --for=delete pod -l k8s-app=dranet -n kube-system --timeout=120s
```
117+
118+
### 2. Install your local dranet build
119+
120+
Build and push your image, then install from the local Helm chart:
121+
```bash
helm install dranet ./deployments/helm/dranet \
  --namespace kube-system \
  --set image.repository=<your-registry>/dranet \
  --set image.tag=<your-tag> \
  --set image.pullPolicy=Always
kubectl rollout status daemonset/dranet -n kube-system
```
130+
131+
## Usage
132+
```bash
# Install MPI Operator (if not already installed)
kubectl apply --server-side -k "https://github.com/kubeflow/mpi-operator/manifests/overlays/standalone?ref=v0.7.0"

# Apply ResourceClaimTemplates
kubectl apply -f resource-claim-template.yaml

# Select a test case: edit mpi-job.yaml resourceClaimTemplateName to one of:
#   1nic-aligned | 2nic-aligned | 1nic-unaligned
kubectl apply -f mpi-job.yaml

# Wait for workers then stream launcher logs
kubectl wait --for=condition=ready pod \
  -l training.kubeflow.org/job-name=nccl-test-dra,training.kubeflow.org/job-role=worker \
  --timeout=300s
launcher=$(kubectl get pods \
  -l training.kubeflow.org/job-name=nccl-test-dra,training.kubeflow.org/job-role=launcher \
  -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f "${launcher}"
```
153+
154+
## ResourceClaimTemplates
155+
156+
Three templates are defined, each allocating 1 GPU + N NICs per worker pod.
157+
Update `mpi-job.yaml` `resourceClaimTemplateName:` to switch between them.
158+
159+
### `1nic-aligned` -- 1 GPU + 1 NIC, same NUMA
160+
161+
gpu-0 (`0008:06:00.0`, NUMA 0) + rdma3 (`0002:03:00.1`, NUMA 0). NCCL enables
162+
GDR via C2C with `NCCL_NET_GDR_C2C=1`, transport: `NET/IB/0/GDRDMA(PCI)`.
163+
164+
### `2nic-aligned` -- 1 GPU + 2 NICs, same NUMA
165+
166+
gpu-0 (`0008:06:00.0`) + rdma2 + rdma3 (both NUMA 0, PCIe domain `0002`).
167+
Doubles available RoCEv2 bandwidth and NCCL channels (8 vs 4).
168+
169+
### `1nic-unaligned` -- 1 GPU + 1 NIC, cross-NUMA
170+
171+
gpu-0 (`0008:06:00.0`, NUMA 0) + rdma4 (`0010:03:00.0`, NUMA 1). GDR is
172+
disabled by NCCL; expect significantly lower bandwidth due to cross-NUMA memory
173+
traffic and fewer NCCL channels (2 vs 4).
174+
175+
## Running the full test suite
176+
177+
Each test requires deleting the previous MPIJob since the resource claims are
178+
immutable. Between tests, orphaned NICs may need PCI rebinding (see next section).
179+
```bash
# --- Test 1: 1nic-aligned ---
# Ensure resourceClaimTemplateName: 1nic-aligned in mpi-job.yaml
kubectl apply -f resource-claim-template.yaml
kubectl apply -f mpi-job.yaml
# Wait for results ...
kubectl delete mpijob nccl-test-dra

# --- Recover orphaned NICs before next test ---
# See "Recovering orphaned RDMA NICs" below

# --- Test 2: 1nic-unaligned ---
# Edit mpi-job.yaml: resourceClaimTemplateName: 1nic-unaligned
kubectl apply -f mpi-job.yaml
# Wait for results ...
kubectl delete mpijob nccl-test-dra

# --- Recover orphaned NICs before next test ---

# --- Test 3: 2nic-aligned ---
# Edit mpi-job.yaml: resourceClaimTemplateName: 2nic-aligned
kubectl apply -f mpi-job.yaml
# Wait for results ...
kubectl delete mpijob nccl-test-dra
```
205+
206+
## Recovering orphaned RDMA NICs
207+
208+
When a pod is deleted, dranet may not return the RDMA NIC from the pod namespace
209+
to the host namespace. The NIC disappears from both the host and the ResourceSlice
210+
(`ifName: null, rdma: false`). This is a pre-existing dranet bug, not
211+
OKE-specific.
212+
213+
**Symptoms:** Workers stuck in `Pending` with `cannot allocate all claims`.
214+
215+
**Check which NICs are missing:**
216+
```bash
kubectl get resourceslice -o json | python3 -c "
import json, sys
data = json.load(sys.stdin)
for rs in data['items']:
    if rs['spec'].get('driver') != 'dra.net': continue
    node = rs['spec']['nodeName']
    for d in rs['spec'].get('devices', []):
        attrs = d.get('attributes', {})
        if attrs.get('dra.net/rdma', {}).get('bool') and \
           not attrs.get('dra.net/virtual', {}).get('bool', True):
            print(f'{node}: {d[\"name\"]} ifName={attrs.get(\"dra.net/ifName\", {}).get(\"string\", \"?\")}')
"
```
231+
232+
**Recover via PCI rebind** (requires a privileged debug pod on each GPU node):
233+
```bash
# Start a debug pod (or use an existing one)
kubectl debug node/<node-ip> --image=busybox -it -- sh

# Inside the debug pod, rebind the orphaned NIC's PCI address:
chroot /host
echo "0002:03:00.1" > /sys/bus/pci/drivers/mlx5_core/unbind
sleep 2
echo "0002:03:00.1" > /sys/bus/pci/drivers/mlx5_core/bind
```
244+
245+
Common PCI addresses on BM.GPU.GB200-v3.4:
246+
| ifName | PCI Address | NUMA |
|---|---|---|
| rdma0 | 0000:03:00.0 | 0 |
| rdma1 | 0000:03:00.1 | 0 |
| rdma2 | 0002:03:00.0 | 0 |
| rdma3 | 0002:03:00.1 | 0 |
| rdma4 | 0010:03:00.0 | 1 |
| rdma5 | 0010:03:00.1 | 1 |
| rdma6 | 0012:03:00.0 | 1 |
| rdma7 | 0012:03:00.1 | 1 |
257+
258+
Repeat the unbind/bind for every orphaned NIC on **every GPU node**. Wait ~15
259+
seconds for dranet to rescan, then verify the NIC reappears in the ResourceSlice.
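
When several NICs are orphaned, typing the unbind/bind pair by hand for each address is error-prone. The following sketch generates the shell lines from the ifName-to-PCI table above. It is a convenience helper under stated assumptions, not part of dranet: `PCI_BY_IFNAME` and `rebind_commands` are illustrative names, and the output is meant to be pasted into the debug shell after `chroot /host`.

```python
# Illustrative only: mapping copied from the table above.
PCI_BY_IFNAME = {
    "rdma0": "0000:03:00.0",
    "rdma1": "0000:03:00.1",
    "rdma2": "0002:03:00.0",
    "rdma3": "0002:03:00.1",
    "rdma4": "0010:03:00.0",
    "rdma5": "0010:03:00.1",
    "rdma6": "0012:03:00.0",
    "rdma7": "0012:03:00.1",
}


def rebind_commands(ifnames, driver="mlx5_core", settle_seconds=2):
    """Return the unbind / sleep / bind shell lines for each orphaned NIC."""
    base = f"/sys/bus/pci/drivers/{driver}"
    cmds = []
    for ifname in ifnames:
        pci = PCI_BY_IFNAME[ifname]  # KeyError flags an unknown ifName early
        cmds.append(f'echo "{pci}" > {base}/unbind')
        cmds.append(f"sleep {settle_seconds}")
        cmds.append(f'echo "{pci}" > {base}/bind')
    return cmds


if __name__ == "__main__":
    print("\n".join(rebind_commands(["rdma2", "rdma3"])))
```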
260+
261+
## Benchmark Results
262+
263+
2-node `all_reduce_perf` (`-b 512M -e 8G -f 2 -g 1`), 1 GPU per worker.
264+
Transport: `NET/IB/GDRDMA(PCI)` for NUMA-aligned, `NET/IB` for cross-NUMA.
265+
| Template | GPU | NIC(s) | NUMA relation | Channels | GDR | Avg busbw |
|---|---|---|---|---|---|---|
| `1nic-aligned` | gpu-0 (NUMA 0) | rdma3 (NUMA 0) | same | 4 | yes | **~42 GB/s** |
| `2nic-aligned` | gpu-0 (NUMA 0) | rdma2 + rdma3 (NUMA 0) | same | 8 | yes | **~91 GB/s** |
| `1nic-unaligned` | gpu-0 (NUMA 0) | rdma4 (NUMA 1) | cross | 2 | no | **~25 GB/s** |
271+
272+
### Key observations
273+
274+
**NUMA alignment enables GDR (~1.7x):**
275+
Cross-NUMA placement degrades performance from ~42 GB/s to ~25 GB/s with the
276+
same NIC count. Two compounding penalties:
277+
1. **GDR disabled** -- NCCL falls back from `GDRDMA(PCI)` to staging through
   host memory when the NIC is on a different NUMA node from the GPU. On GB200,
   this is controlled by `NCCL_NET_GDR_C2C=1`, which only enables GDR when NCCL
   detects a viable C2C path (same NUMA node).
2. **Fewer channels** -- NCCL allocates 2 channels for cross-NUMA NICs vs 4
   for NUMA-local NICs.
284+
285+
**Two NICs roughly double bandwidth (~2.2x):**
286+
Adding a second NUMA-aligned NIC increases bandwidth from ~42 GB/s to ~91 GB/s.
287+
NCCL doubles the channel count (8 vs 4) and stripes data across both NICs.
288+
The `count: 2` + CEL-selector pattern in the `2nic-aligned` template is the
289+
idiomatic DRA approach for multi-device allocation.
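
The two ratios quoted in these observations follow directly from the approximate busbw figures given in the prose above (~42, ~91 and ~25 GB/s). A short worked check, with the dictionary and function names chosen here for illustration:

```python
# Approximate average busbw in GB/s, as quoted in the observations above.
BUSBW = {"1nic-aligned": 42.0, "2nic-aligned": 91.0, "1nic-unaligned": 25.0}


def speedup(template: str, baseline: str) -> float:
    """Return the bandwidth ratio of `template` over `baseline`."""
    return BUSBW[template] / BUSBW[baseline]


if __name__ == "__main__":
    # NUMA alignment: 42 / 25
    print(f"{speedup('1nic-aligned', '1nic-unaligned'):.2f}")  # 1.68
    # Second aligned NIC: 91 / 42
    print(f"{speedup('2nic-aligned', '1nic-aligned'):.2f}")    # 2.17
```

The second ratio is slightly above 2.0. The benchmark data alone does not explain why two NICs deliver more than twice the single-NIC figure, so treat the ~2.2x as an observed value rather than a guaranteed scaling factor.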
290+
291+
**Isolation confirmed:**
292+
In all cases, the pod sees only the allocated `/dev/infiniband/uverbs*` and
293+
`/dev/infiniband/umad*` devices -- without `privileged: true`. Isolation is
294+
enforced by the dranet NRI plugin injecting only the char devices that correspond
295+
to the DRA-allocated NIC(s).
Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-test-dra
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: nccl
            image: iad.ocir.io/idxzjcdglx2s/nccl-tests:cuda-13.1.1-ubuntu-24.04-nccl-2.29.3-020926.1
            command: ["/bin/bash", "-c"]
            args:
            - |
              NUM_GPUS=1
              NUM_HOSTS=$(sed -n '$=' /etc/mpi/hostfile)
              NP=$(($NUM_HOSTS*$NUM_GPUS))
              while ! (for host in $(awk '{print $1}' /etc/mpi/hostfile); do
                ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no $host exit 2>/dev/null || exit 1
              done); do
                echo "Waiting for workers to be ready..."
                sleep 5
              done
              echo "All workers ready, launching NCCL test across $NUM_HOSTS nodes ($NP ranks)"
              mpirun \
                --allow-run-as-root \
                --bind-to numa \
                --mca pml ucx \
                --mca coll ^hcoll \
                -x LD_LIBRARY_PATH \
                -x UCX_NET_DEVICES=eth0 \
                -x NCCL_DEBUG=INFO \
                -x NCCL_SOCKET_IFNAME=eth0 \
                -x NCCL_MNNVL_ENABLE=0 \
                -x NCCL_NET_GDR_C2C=1 \
                -x NCCL_IB_GID_INDEX=3 \
                -x NCCL_IB_TC=41 \
                -x NCCL_IB_SL=0 \
                -x NCCL_IB_TIMEOUT=22 \
                -x RX_QUEUE_LEN=8192 \
                -x IB_RX_QUEUE_LEN=8192 \
                -x NCCL_IB_QPS_PER_CONNECTION=4 \
                -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
                -x NCCL_BUFFSIZE=16777216 \
                -x NCCL_DMABUF_ENABLE=1 \
                -x NCCL_NET_PLUGIN=sys \
                -x NCCL_NVLS_ENABLE=0 \
                -x HCOLL_ENABLE_MCAST_ALL=0 \
                -x coll_hcoll_enable=0 \
                -np $NP \
                /workspace/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1 -c 0
    Worker:
      replicas: 2
      template:
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    training.kubeflow.org/job-name: nccl-test-dra
                    training.kubeflow.org/job-role: worker
                topologyKey: kubernetes.io/hostname
          automountServiceAccountToken: false
          volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
          resourceClaims:
          - name: gpu-nic
            resourceClaimTemplateName: 1nic-aligned
          containers:
          - name: nccl
            image: iad.ocir.io/idxzjcdglx2s/nccl-tests:cuda-13.1.1-ubuntu-24.04-nccl-2.29.3-020926.1
            volumeMounts:
            - mountPath: /dev/shm
              name: shm
            resources:
              claims:
              - name: gpu-nic
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
