| 1 | +# OKE BM.GPU.GB200-v3.4 RoCEv2 dranet Demo |
| 2 | + |
| 3 | +End-to-end demo of topology-aware GPU + RoCEv2 NIC allocation using |
| 4 | +[Dynamic Resource Allocation (DRA)](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) |
| 5 | +on Oracle Kubernetes Engine (OKE) with [BM.GPU.GB200-v3.4](https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu) shapes. |
| 6 | + |
| 7 | +## Context |
| 8 | + |
| 9 | +### Shape: BM.GPU.GB200-v3.4 |
| 10 | + |
| 11 | +Each node has: |
| 12 | + |
| 13 | +| Resource | Count | Detail | |
| 14 | +|---|---|---| |
| 15 | +| GPU | 4 x NVIDIA GB200 | 189 GB HBM3e, Blackwell architecture, NVLink-18 all-to-all | |
| 16 | +| NIC | 8 x Mellanox ConnectX-8 | 400 Gb/s RoCEv2, 4x NDR per NIC | |
| 17 | +| NUMA nodes | 2 | 2 GPUs + 4 NICs per NUMA node | |
| 18 | + |
| 19 | +### GPU-NIC topology |
| 20 | + |
| 21 | +On GB200, GPUs connect to the Grace CPU via **NVLink C2C** (chip-to-chip), while |
| 22 | +NICs connect via PCIe. Because GPUs and NICs are on fundamentally different |
| 23 | +interconnects, `nvidia-smi topo -m` reports **SYS** for every GPU-NIC pair: |
| 24 | + |
| 25 | +| | GPU0 | GPU1 | GPU2 | GPU3 | NIC0 | NIC1 | NIC2 | NIC3 | NIC4 | NIC5 | NIC6 | NIC7 | |
| 26 | +|------|------|------|------|------|------|------|------|------|------|------|------|------| |
| 27 | +| GPU0 | X | NV18 | NV18 | NV18 | SYS | SYS | SYS | SYS | SYS | SYS | SYS | SYS | |
| 28 | +| GPU1 | NV18 | X | NV18 | NV18 | SYS | SYS | SYS | SYS | SYS | SYS | SYS | SYS | |
| 29 | +| GPU2 | NV18 | NV18 | X | NV18 | SYS | SYS | SYS | SYS | SYS | SYS | SYS | SYS | |
| 30 | +| GPU3 | NV18 | NV18 | NV18 | X | SYS | SYS | SYS | SYS | SYS | SYS | SYS | SYS | |
| 31 | + |
| 32 | +NIC mapping: NIC0=mlx5_0/rdma0 (NUMA 0), ... NIC3=mlx5_3/rdma3 (NUMA 0), NIC4=mlx5_5/rdma4 (NUMA 1), ... NIC7=mlx5_8/rdma7 (NUMA 1) |
| 33 | + |
| 34 | +> **Key difference from Azure GB300:** On Azure, GPU-NIC pairs on the same NUMA |
| 35 | +> node have **NODE** affinity. On OKE GB200, all pairs report **SYS** because the |
| 36 | +> C2C link is not visible to the PCIe topology. Even so, setting `NCCL_NET_GDR_C2C=1` |
| 37 | +> lets NCCL enable GPUDirect RDMA (GDR) for NUMA-local NICs, achieving comparable |
| 38 | +> bandwidth. In practice, what matters for performance is NUMA-local vs cross-NUMA placement. |
| 39 | +
|
| 40 | +### DRA device attributes |
| 41 | + |
| 42 | +**GPU** (driver: `gpu.nvidia.com`): |
| 43 | + |
| 44 | +| Device | pciBusID | pcieRoot | NUMA | |
| 45 | +|---|---|---|---| |
| 46 | +| gpu-0 | 0008:06:00.0 | pci0008:00 | 0 | |
| 47 | +| gpu-1 | 0009:06:00.0 | pci0009:00 | 0 | |
| 48 | +| gpu-2 | 0018:06:00.0 | pci0018:00 | 1 | |
| 49 | +| gpu-3 | 0019:06:00.0 | pci0019:00 | 1 | |
| 50 | + |
| 51 | +**NIC** (driver: `dra.net`): |
| 52 | + |
| 53 | +| Device | ifName | pciAddress | NUMA | pcieRoot | |
| 54 | +|---|---|---|---|---| |
| 55 | +| pci-0000-03-00-0 | rdma0 | 0000:03:00.0 | 0 | pci0000:00 | |
| 56 | +| pci-0000-03-00-1 | rdma1 | 0000:03:00.1 | 0 | pci0000:00 | |
| 57 | +| pci-0002-03-00-0 | rdma2 | 0002:03:00.0 | 0 | pci0002:00 | |
| 58 | +| pci-0002-03-00-1 | rdma3 | 0002:03:00.1 | 0 | pci0002:00 | |
| 59 | +| pci-0010-03-00-0 | rdma4 | 0010:03:00.0 | 1 | pci0010:00 | |
| 60 | +| pci-0010-03-00-1 | rdma5 | 0010:03:00.1 | 1 | pci0010:00 | |
| 61 | +| pci-0012-03-00-0 | rdma6 | 0012:03:00.0 | 1 | pci0012:00 | |
| 62 | +| pci-0012-03-00-1 | rdma7 | 0012:03:00.1 | 1 | pci0012:00 | |
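Because `nvidia-smi topo -m` reports SYS for every pair, the NUMA column in the two tables above is the only alignment signal available. A minimal sketch of how those tables translate into "NUMA-local NICs for a given GPU" is shown below; the dictionaries are copied from the tables, and the helper name `numa_local_nics` is hypothetical, not part of dranet.

```python
# NUMA placement copied from the GPU and NIC attribute tables above.
GPU_NUMA = {"gpu-0": 0, "gpu-1": 0, "gpu-2": 1, "gpu-3": 1}
NIC_NUMA = {
    "rdma0": 0, "rdma1": 0, "rdma2": 0, "rdma3": 0,
    "rdma4": 1, "rdma5": 1, "rdma6": 1, "rdma7": 1,
}


def numa_local_nics(gpu: str) -> list[str]:
    """Return the NICs that share a NUMA node with the given GPU (hypothetical helper)."""
    node = GPU_NUMA[gpu]
    return sorted(nic for nic, numa in NIC_NUMA.items() if numa == node)


if __name__ == "__main__":
    for gpu in sorted(GPU_NUMA):
        print(gpu, numa_local_nics(gpu))
```

With this mapping, `rdma3` is a NUMA-local choice for `gpu-0` and `rdma4` is a cross-NUMA one, which matches the aligned and unaligned test cases used later in this README.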
| 63 | + |
| 64 | +### OKE topology attributes (oke.dra.net) |
| 65 | + |
| 66 | +Each NIC device carries node-level RDMA topology attributes sourced from the |
| 67 | +OCI Instance Metadata Service (`GET /opc/v2/host/`): |
| 68 | + |
| 69 | +| Attribute | Description | |
| 70 | +|---|---| |
| 71 | +| `oke.dra.net/hpcIslandId` | HPC Island -- largest topology grouping (~2000 nodes) | |
| 72 | +| `oke.dra.net/networkBlockId` | Network Block -- mid-level grouping (~64-128 nodes) | |
| 73 | +| `oke.dra.net/localBlockId` | Local Block -- closest grouping (~8-32 nodes) | |
| 74 | +| `oke.dra.net/rackId` | Physical rack identifier | |
| 75 | +| `oke.dra.net/gpuMemoryFabricId` | GPU memory fabric ID (populated on GB200/GB300) | |
| 76 | + |
| 77 | +> **Note:** Topology data must be enabled for your OCI tenancy. dranet logs |
| 78 | +> `"Please turn on TopologyData for your Tenancy"` at startup if the `/host/` |
| 79 | +> endpoint does not provide `rdmaTopologyData`. |
| 80 | + |
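One way to use these attributes is to group nodes by their closest topology level. The sketch below works on the parsed output of `kubectl get resourceslice -o json` and assumes that `oke.dra.net/localBlockId` is published as a `string`-typed attribute directly under each device, which is an assumption based on how `dra.net/ifName` is read elsewhere in this README. The function name `nodes_by_local_block` is hypothetical.

```python
from collections import defaultdict


def nodes_by_local_block(resource_slices: dict) -> dict[str, set[str]]:
    """Group node names by oke.dra.net/localBlockId (assumed to be a string attribute)."""
    groups: dict[str, set[str]] = defaultdict(set)
    for rs in resource_slices.get("items", []):
        spec = rs.get("spec", {})
        if spec.get("driver") != "dra.net":
            continue  # only dranet slices carry the OKE topology attributes
        node = spec.get("nodeName", "?")
        for dev in spec.get("devices", []):
            attrs = dev.get("attributes", {})
            block = (attrs.get("oke.dra.net/localBlockId") or {}).get("string")
            if block:
                groups[block].add(node)
    return dict(groups)
```

Nodes that land in the same group share a Local Block, so they are the closest candidates for placing the two workers of a test job. If the tenancy does not have topology data enabled, the attribute is missing and the result is simply empty.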
| 81 | +### RoCEv2 and IPv6 on OKE |
| 82 | + |
| 83 | +The ConnectX-8 NICs use **RoCEv2** (RDMA over Converged Ethernet v2). On OKE, |
| 84 | +each RDMA NIC receives a globally-routable IPv6 address via Router Advertisement. |
| 85 | +This address populates a routable GID in the NIC's GID table, which NCCL uses |
| 86 | +for inter-node communication (`NCCL_IB_GID_INDEX=3`). |
| 87 | + |
| 88 | +**Challenge:** In single-stack IPv4 Kubernetes clusters, the container runtime |
| 89 | +sets `net.ipv6.conf.all.disable_ipv6=1` in pod namespaces. This prevents the |
| 90 | +RA-assigned IPv6 address from being applied to RDMA NICs in the pod, leaving |
| 91 | +only link-local GIDs (which are not routable on the OKE fabric). |
| 92 | + |
| 93 | +**dranet fix:** The OKE cloud provider returns `EnableIPv6: true` for RDMA |
| 94 | +devices on GPU fabric shapes. When set, dranet: |
| 95 | + |
| 96 | +1. Soft-fails the initial IPv6 address application (EACCES due to disabled IPv6) |
| 97 | +2. Enables IPv6 per-interface via `net.ipv6.conf.<ifname>.disable_ipv6=0` |
| 98 | +3. Re-applies the IPv6 address, populating the routable GID at index 3 |
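The three steps above can be sketched as a small retry wrapper. This is an illustration of the control flow only, not dranet's actual implementation: the function names, the `apply_addr` callback, and the `sysctl_root` parameter are hypothetical. The one detail relied on is that the sysctl `net.ipv6.conf.<ifname>.disable_ipv6` corresponds to the file `/proc/sys/net/ipv6/conf/<ifname>/disable_ipv6`.

```python
import errno
from pathlib import Path
from typing import Callable


def enable_ipv6(ifname: str, sysctl_root: str = "/proc/sys") -> None:
    """Step 2: write 0 to net.ipv6.conf.<ifname>.disable_ipv6."""
    path = Path(sysctl_root) / "net" / "ipv6" / "conf" / ifname / "disable_ipv6"
    path.write_text("0\n")


def apply_with_ipv6_retry(
    ifname: str,
    apply_addr: Callable[[], None],
    sysctl_root: str = "/proc/sys",
) -> bool:
    """Apply an IPv6 address, enabling IPv6 on the interface if the first try fails.

    Returns True when the retry path was taken. `apply_addr` is a hypothetical
    callback standing in for whatever actually configures the address.
    """
    try:
        apply_addr()  # step 1: first attempt
        return False
    except OSError as exc:
        if exc.errno != errno.EACCES:
            raise  # only the "IPv6 disabled" failure is treated as soft
    enable_ipv6(ifname, sysctl_root)  # step 2: per-interface enable
    apply_addr()  # step 3: re-apply the address
    return True
```

Passing `sysctl_root` explicitly makes the sketch easy to exercise against a temporary directory instead of the real `/proc/sys`.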
| 99 | + |
| 100 | +## Files |
| 101 | + |
| 102 | +| File | Description | |
| 103 | +|---|---| |
| 104 | +| `resource-claim-template.yaml` | Three `ResourceClaimTemplate` objects for the three test cases | |
| 105 | +| `mpi-job.yaml` | `MPIJob` that runs `nccl_tests/all_reduce_perf` across 2 workers | |
| 106 | +| `resourceslice-gpu.yaml` | Live GPU `ResourceSlice` from a GB200 node (reference) | |
| 107 | +| `resourceslice-dranet.yaml` | Live NIC `ResourceSlice` from a GB200 node (reference) | |
| 108 | + |
| 109 | +## Installation |
| 110 | + |
| 111 | +### 1. Uninstall the existing dranet |
| 112 | + |
| 113 | +```bash |
| 114 | +helm uninstall dranet -n kube-system |
| 115 | +kubectl wait --for=delete pod -l k8s-app=dranet -n kube-system --timeout=120s |
| 116 | +``` |
| 117 | + |
| 118 | +### 2. Install your local dranet build |
| 119 | + |
| 120 | +Build and push your image, then install from the local Helm chart: |
| 121 | + |
| 122 | +```bash |
| 123 | +helm install dranet ./deployments/helm/dranet \ |
| 124 | + --namespace kube-system \ |
| 125 | + --set image.repository=<your-registry>/dranet \ |
| 126 | + --set image.tag=<your-tag> \ |
| 127 | + --set image.pullPolicy=Always |
| 128 | +kubectl rollout status daemonset/dranet -n kube-system |
| 129 | +``` |
| 130 | + |
| 131 | +## Usage |
| 132 | + |
| 133 | +```bash |
| 134 | +# Install MPI Operator (if not already installed) |
| 135 | +kubectl apply --server-side -k "https://github.com/kubeflow/mpi-operator/manifests/overlays/standalone?ref=v0.7.0" |
| 136 | + |
| 137 | +# Apply ResourceClaimTemplates |
| 138 | +kubectl apply -f resource-claim-template.yaml |
| 139 | + |
| 140 | +# Select a test case: edit mpi-job.yaml resourceClaimTemplateName to one of: |
| 141 | +# 1nic-aligned | 2nic-aligned | 1nic-unaligned |
| 142 | +kubectl apply -f mpi-job.yaml |
| 143 | + |
| 144 | +# Wait for workers then stream launcher logs |
| 145 | +kubectl wait --for=condition=ready pod \ |
| 146 | + -l training.kubeflow.org/job-name=nccl-test-dra,training.kubeflow.org/job-role=worker \ |
| 147 | + --timeout=300s |
| 148 | +launcher=$(kubectl get pods \ |
| 149 | + -l training.kubeflow.org/job-name=nccl-test-dra,training.kubeflow.org/job-role=launcher \ |
| 150 | + -o jsonpath='{.items[0].metadata.name}') |
| 151 | +kubectl logs -f "${launcher}" |
| 152 | +``` |
| 153 | + |
| 154 | +## ResourceClaimTemplates |
| 155 | + |
| 156 | +Three templates are defined, each allocating 1 GPU + N NICs per worker pod. |
| 157 | +Set `resourceClaimTemplateName:` in `mpi-job.yaml` to switch between them. |
| 158 | + |
| 159 | +### `1nic-aligned` -- 1 GPU + 1 NIC, same NUMA |
| 160 | + |
| 161 | +gpu-0 (`0008:06:00.0`, NUMA 0) + rdma3 (`0002:03:00.1`, NUMA 0). NCCL enables |
| 162 | +GDR via C2C with `NCCL_NET_GDR_C2C=1`, transport: `NET/IB/0/GDRDMA(PCI)`. |
| 163 | + |
| 164 | +### `2nic-aligned` -- 1 GPU + 2 NICs, same NUMA |
| 165 | + |
| 166 | +gpu-0 (`0008:06:00.0`) + rdma2 + rdma3 (both NUMA 0, PCIe domain `0002`). |
| 167 | +Doubles available RoCEv2 bandwidth and NCCL channels (8 vs 4). |
| 168 | + |
| 169 | +### `1nic-unaligned` -- 1 GPU + 1 NIC, cross-NUMA |
| 170 | + |
| 171 | +gpu-0 (`0008:06:00.0`, NUMA 0) + rdma4 (`0010:03:00.0`, NUMA 1). GDR is |
| 172 | +disabled by NCCL; expect significantly lower bandwidth due to cross-NUMA memory |
| 173 | +traffic and fewer NCCL channels (2 vs 4). |
| 174 | + |
| 175 | +## Running the full test suite |
| 176 | + |
| 177 | +Each test requires deleting the previous MPIJob since the resource claims are |
| 178 | +immutable. Between tests, orphaned NICs may need PCI rebinding (see next section). |
| 179 | + |
| 180 | +```bash |
| 181 | +# --- Test 1: 1nic-aligned --- |
| 182 | +# Ensure resourceClaimTemplateName: 1nic-aligned in mpi-job.yaml |
| 183 | +kubectl apply -f resource-claim-template.yaml |
| 184 | +kubectl apply -f mpi-job.yaml |
| 185 | +# Wait for results ... |
| 186 | +kubectl delete mpijob nccl-test-dra |
| 187 | + |
| 188 | +# --- Recover orphaned NICs before next test --- |
| 189 | +# See "Recovering orphaned RDMA NICs" below |
| 190 | + |
| 191 | +# --- Test 2: 1nic-unaligned --- |
| 192 | +# Edit mpi-job.yaml: resourceClaimTemplateName: 1nic-unaligned |
| 193 | +kubectl apply -f mpi-job.yaml |
| 194 | +# Wait for results ... |
| 195 | +kubectl delete mpijob nccl-test-dra |
| 196 | + |
| 197 | +# --- Recover orphaned NICs before next test --- |
| 198 | + |
| 199 | +# --- Test 3: 2nic-aligned --- |
| 200 | +# Edit mpi-job.yaml: resourceClaimTemplateName: 2nic-aligned |
| 201 | +kubectl apply -f mpi-job.yaml |
| 202 | +# Wait for results ... |
| 203 | +kubectl delete mpijob nccl-test-dra |
| 204 | +``` |
| 205 | + |
| 206 | +## Recovering orphaned RDMA NICs |
| 207 | + |
| 208 | +When a pod is deleted, dranet may not return the RDMA NIC from the pod namespace |
| 209 | +to the host namespace. The NIC disappears from both the host and the ResourceSlice |
| 210 | +(`ifName: null, rdma: false`). This is a pre-existing dranet bug, not |
| 211 | +OKE-specific. |
| 212 | + |
| 213 | +**Symptoms:** Workers stuck in `Pending` with `cannot allocate all claims`. |
| 214 | + |
| 215 | +**List the RDMA NICs currently advertised** (any of `rdma0`-`rdma7` absent from a node's output is orphaned): |
| 216 | + |
| 217 | +```bash |
| 218 | +kubectl get resourceslice -o json | python3 -c " |
| 219 | +import json, sys |
| 220 | +data = json.load(sys.stdin) |
| 221 | +for rs in data['items']: |
| 222 | + if rs['spec'].get('driver') != 'dra.net': continue |
| 223 | + node = rs['spec']['nodeName'] |
| 224 | + for d in rs['spec'].get('devices', []): |
| 225 | + attrs = d.get('attributes', {}) |
| 226 | + if attrs.get('dra.net/rdma', {}).get('bool') and \ |
| 227 | + not attrs.get('dra.net/virtual', {}).get('bool', True): |
| 228 | + print(f'{node}: {d[\"name\"]} ifName={attrs.get(\"dra.net/ifName\", {}).get(\"string\", \"?\")}') |
| 229 | +" |
| 230 | +``` |
| 231 | + |
| 232 | +**Recover via PCI rebind** (requires a privileged debug pod on each GPU node): |
| 233 | + |
| 234 | +```bash |
| 235 | +# Start a debug pod (or use an existing one) |
| 236 | +kubectl debug node/<node-ip> --image=busybox -it -- sh |
| 237 | + |
| 238 | +# Inside the debug pod, rebind the orphaned NIC's PCI address: |
| 239 | +chroot /host |
| 240 | +echo "0002:03:00.1" > /sys/bus/pci/drivers/mlx5_core/unbind |
| 241 | +sleep 2 |
| 242 | +echo "0002:03:00.1" > /sys/bus/pci/drivers/mlx5_core/bind |
| 243 | +``` |
| 244 | + |
| 245 | +Common PCI addresses on BM.GPU.GB200-v3.4: |
| 246 | + |
| 247 | +| ifName | PCI Address | NUMA | |
| 248 | +|---|---|---| |
| 249 | +| rdma0 | 0000:03:00.0 | 0 | |
| 250 | +| rdma1 | 0000:03:00.1 | 0 | |
| 251 | +| rdma2 | 0002:03:00.0 | 0 | |
| 252 | +| rdma3 | 0002:03:00.1 | 0 | |
| 253 | +| rdma4 | 0010:03:00.0 | 1 | |
| 254 | +| rdma5 | 0010:03:00.1 | 1 | |
| 255 | +| rdma6 | 0012:03:00.0 | 1 | |
| 256 | +| rdma7 | 0012:03:00.1 | 1 | |
| 257 | + |
| 258 | +Repeat the unbind/bind for every orphaned NIC on **every GPU node**. Wait ~15 |
| 259 | +seconds for dranet to rescan, then verify the NIC reappears in the ResourceSlice. |
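To confirm that recovery worked, the advertised NICs can be compared with the eight interface names expected on this shape. The sketch below is a hypothetical helper, not part of dranet. It takes the parsed output of `kubectl get resourceslice -o json` and reuses the `dra.net/ifName` and `dra.net/rdma` attributes read by the listing script above; the expected set `rdma0`..`rdma7` comes from the PCI address table.

```python
# Interface names expected on BM.GPU.GB200-v3.4, per the PCI address table.
EXPECTED_NICS = {f"rdma{i}" for i in range(8)}


def missing_nics(resource_slices: dict) -> dict[str, list[str]]:
    """Return {node: [missing ifNames]} for nodes that advertise fewer than 8 RDMA NICs."""
    seen: dict[str, set[str]] = {}
    for rs in resource_slices.get("items", []):
        spec = rs.get("spec", {})
        if spec.get("driver") != "dra.net":
            continue
        present = seen.setdefault(spec.get("nodeName", "?"), set())
        for dev in spec.get("devices", []):
            attrs = dev.get("attributes", {})
            name = (attrs.get("dra.net/ifName") or {}).get("string")
            is_rdma = (attrs.get("dra.net/rdma") or {}).get("bool")
            if name and is_rdma:
                present.add(name)
    return {
        node: sorted(EXPECTED_NICS - names)
        for node, names in seen.items()
        if EXPECTED_NICS - names
    }
```

An empty result means every node with a `dra.net` ResourceSlice advertises all eight NICs. A node whose slice has disappeared entirely will not show up here, so it is worth checking the node list separately.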
| 260 | + |
| 261 | +## Benchmark Results |
| 262 | + |
| 263 | +2-node `all_reduce_perf` (`-b 512M -e 8G -f 2 -g 1`), 1 GPU per worker. |
| 264 | +Transport: `NET/IB/GDRDMA(PCI)` for NUMA-aligned, `NET/IB` for cross-NUMA. |
| 265 | + |
| 266 | +| Template | GPU | NIC(s) | NUMA relation | Channels | GDR | Avg busbw | |
| 267 | +|---|---|---|---|---|---|---| |
| 268 | +| `1nic-aligned` | gpu-0 (NUMA 0) | rdma3 (NUMA 0) | same | 4 | yes | **~42 GB/s** | |
| 269 | +| `2nic-aligned` | gpu-0 (NUMA 0) | rdma2 + rdma3 (NUMA 0) | same | 8 | yes | **~91 GB/s** | |
| 270 | +| `1nic-unaligned` | gpu-0 (NUMA 0) | rdma4 (NUMA 1) | cross | 2 | no | **~25 GB/s** | |
| 271 | + |
| 272 | +### Key observations |
| 273 | + |
| 274 | +**NUMA alignment enables GDR (~1.7x):** |
| 275 | +Cross-NUMA placement degrades performance from ~42 GB/s to ~25 GB/s with the |
| 276 | +same NIC count. Two compounding penalties: |
| 277 | + |
| 278 | +1. **GDR disabled** -- NCCL falls back from `GDRDMA(PCI)` to staging through |
| 279 | + host memory when the NIC is on a different NUMA node from the GPU. On GB200 |
| 280 | + this is controlled by `NCCL_NET_GDR_C2C=1`, which enables GDR only when NCCL |
| 281 | + detects a viable C2C path (same NUMA node). |
| 282 | +2. **Fewer channels** -- NCCL allocates 2 channels for cross-NUMA NICs vs 4 |
| 283 | + for NUMA-local NICs. |
| 284 | + |
| 285 | +**Two NICs more than double bandwidth (~2.2x):** |
| 286 | +Adding a second NUMA-aligned NIC increases bandwidth from ~42 GB/s to ~91 GB/s. |
| 287 | +NCCL doubles the channel count (8 vs 4) and stripes data across both NICs. |
| 288 | +The `count: 2` + CEL-selector pattern in the `2nic-aligned` template is the |
| 289 | +idiomatic DRA approach for multi-device allocation. |
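The ratios quoted in the two observations above follow directly from the approximate bus bandwidth figures given there (~42, ~91 and ~25 GB/s). The short sketch below reproduces that arithmetic and, as a rough guide only, compares each result with the raw 400 Gb/s (50 GB/s) line rate per NIC from the shape table. The helper names are hypothetical, and comparing `busbw` with line rate ignores protocol overhead.

```python
LINE_RATE_GBIT = 400  # per NIC, from the shape table (Gb/s)

# Approximate average busbw in GB/s, as quoted in the observations above.
BUSBW = {"1nic-aligned": 42, "2nic-aligned": 91, "1nic-unaligned": 25}
NIC_COUNT = {"1nic-aligned": 1, "2nic-aligned": 2, "1nic-unaligned": 1}


def speedup(template: str, baseline: str) -> float:
    """Ratio of busbw between two templates."""
    return BUSBW[template] / BUSBW[baseline]


def line_rate_fraction(template: str) -> float:
    """busbw as a fraction of the aggregate NIC line rate (Gb/s divided by 8 gives GB/s)."""
    aggregate_gbyte = NIC_COUNT[template] * LINE_RATE_GBIT / 8
    return BUSBW[template] / aggregate_gbyte


if __name__ == "__main__":
    print(f"aligned vs unaligned: {speedup('1nic-aligned', '1nic-unaligned'):.2f}x")
    print(f"2 NICs vs 1 NIC:      {speedup('2nic-aligned', '1nic-aligned'):.2f}x")
    for name in BUSBW:
        print(f"{name}: {line_rate_fraction(name):.0%} of line rate")
```

On these figures the aligned single-NIC case reaches roughly 84% of one NIC's line rate, while the cross-NUMA case reaches about half.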
| 290 | + |
| 291 | +**Isolation confirmed:** |
| 292 | +In all cases, the pod sees only the allocated `/dev/infiniband/uverbs*` and |
| 293 | +`/dev/infiniband/umad*` devices -- without `privileged: true`. Isolation is |
| 294 | +enforced by the dranet NRI plugin injecting only the char devices that correspond |
| 295 | +to the DRA-allocated NIC(s). |