|
| 1 | +# NIXL KV Transfer + dranet Example |
| 2 | + |
| 3 | +End-to-end example of topology-aware GPU + RDMA NIC allocation with |
| 4 | +Kubernetes Dynamic Resource Allocation (DRA). The workload uses NIXL over |
| 5 | +UCX/RDMA to copy a GPU-resident buffer between two pods on two GPU nodes. The |
| 6 | +buffer is sized like an inference KV-cache handoff, so the result isolates the |
| 7 | +transfer path used by disaggregated prefill/decode serving without requiring a |
| 8 | +full vLLM router/model stack. |
| 9 | + |
| 10 | +Both GPUs and NICs are allocated via DRA (`gpu.nvidia.com` + `dra.net`). The |
| 11 | +same 4-GPU set is used in both runs; only the NIC NUMA placement changes. |
| 12 | + |
| 13 | +The manifests are intentionally cloud-provider agnostic. The checked-in |
| 14 | +`ResourceClaimTemplate` values match the tested 8-GPU H100 topology; adapt the |
| 15 | +GPU `pciBusID` and NIC `numaNode` selectors to your hardware before running on |
| 16 | +another SKU. |
| 17 | + |
| 18 | +## Tested Topology |
| 19 | + |
| 20 | +The included templates were tested on two GPU nodes with: |
| 21 | + |
| 22 | +| Resource | Count | Detail | |
| 23 | +|---|---:|---| |
| 24 | +| GPU | 8 x NVIDIA H100 | 80 GB HBM3 each | |
| 25 | +| NIC | 8 x Mellanox ConnectX VF | RDMA-capable | |
| 26 | +| NUMA nodes | 2 | 4 GPU + 4 NIC per NUMA node | |
| 27 | + |
| 28 | +Default mapping used by the templates: |
| 29 | + |
| 30 | +| NUMA | GPUs | NICs | |
| 31 | +|---:|---|---| |
| 32 | +| 0 | `0001:00:00.0`, `0002:00:00.0`, `0003:00:00.0`, `0008:00:00.0` | `mlx5_0`..`mlx5_3` | |
| 33 | +| 1 | `0009:00:00.0`, `000a:00:00.0`, `000b:00:00.0`, `000c:00:00.0` | `mlx5_4`..`mlx5_7` | |
| 34 | + |
| 35 | +Verify the mapping on your cluster before running. Some GPU DRA drivers publish |
| 36 | +GPU `pciBusID` but not GPU `numaNode`, so this example selects GPUs by |
| 37 | +`pciBusID`; dranet publishes NIC `numaNode` directly. |
| 38 | + |
| 39 | +## Prerequisites |
| 40 | + |
| 41 | +- A Kubernetes cluster with at least two GPU nodes connected by RDMA. |
| 42 | +- dranet running on those nodes and publishing NIC devices through a DRA |
| 43 | + `DeviceClass`, typically `dranet.net` or your cluster's equivalent. |
| 44 | +- A GPU DRA driver publishing GPU devices through `gpu.nvidia.com` or your |
| 45 | + cluster's equivalent GPU `DeviceClass`. |
| 46 | +- The NRI device-injection path enabled so only the DRA-allocated RDMA devices |
| 47 | + are visible inside the benchmark pods. |
| 48 | + |
| 49 | +Inspect published devices before editing the templates: |
| 50 | + |
| 51 | +```bash |
| 52 | +kubectl get resourceslices |
| 53 | +kubectl get deviceclasses |
| 54 | +``` |
| 55 | + |
| 56 | +## Files |
| 57 | + |
| 58 | +| File | Description | |
| 59 | +|---|---| |
| 60 | +| `resource-claim-template-aligned.yaml` | `ResourceClaimTemplate` selecting NUMA-aligned GPUs + NICs | |
| 61 | +| `resource-claim-template-unaligned.yaml` | `ResourceClaimTemplate` selecting cross-NUMA GPUs + NICs | |
| 62 | +| `nixl-kv-service.yaml` | Headless Service used for the NIXL side-channel handshake | |
| 63 | +| `nixl-kv-target.yaml` | Target Pod; registers GPU memory and waits for the initiator | |
| 64 | +| `nixl-kv-initiator.yaml` | Initiator Pod; posts the NIXL transfers and prints `RESULT` | |
| 65 | +| `nixl_benchmark.py` | Python NIXL benchmark mounted into the pods through a ConfigMap | |
| 66 | +| `run_bench.sh` | Pod entrypoint: installs `nixl` and execs the benchmark | |
| 67 | +| `kustomization.yaml` | Bundles the templates, pods, Service, and ConfigMap | |
| 68 | + |
| 69 | +## ResourceClaimTemplates |
| 70 | + |
| 71 | +| Template | GPU selection | NIC selection | Purpose | |
| 72 | +|---|---|---|---| |
| 73 | +| `h100-4gpu-4nic-numa-aligned` | 4 GPUs on NUMA 0 | 4 NICs on NUMA 0 | Same-NUMA GPU/NIC path | |
| 74 | +| `h100-4gpu-4nic-numa-unaligned` | Same 4 GPUs on NUMA 0 | 4 NICs on NUMA 1 | Cross-NUMA GPU/NIC path | |
| 75 | + |
| 76 | +The two templates keep compute fixed and keep aggregate NIC count fixed. The |
| 77 | +only intended difference is whether each visible GPU reaches a same-NUMA or |
| 78 | +remote-NUMA NIC. |
| 79 | + |
| 80 | +If your cluster uses a different NIC `DeviceClass` name, update |
| 81 | +`deviceClassName: dranet.net` in both `resource-claim-template-aligned.yaml` |
| 82 | +and `resource-claim-template-unaligned.yaml`. If your GPU DRA driver publishes |
| 83 | +a reliable GPU NUMA attribute, you can replace the `pciBusID` selector with a |
| 84 | +NUMA selector. |
| 85 | + |
| 86 | +## Run |
| 87 | + |
| 88 | +Apply everything via kustomize. This creates both `ResourceClaimTemplate`s, the |
| 89 | +`nixl-benchmark` ConfigMap (generated from `nixl_benchmark.py` + `run_bench.sh`), |
| 90 | +the headless Service, and both pods: |
| 91 | + |
| 92 | +```bash |
| 93 | +kubectl apply -k . |
| 94 | +``` |
| 95 | + |
| 96 | +`ResourceClaimTemplate.spec` is immutable. If you already created templates with |
| 97 | +the same names and need to change their selectors, delete the old templates only |
| 98 | +after confirming no active pods are using claims derived from them. |
| 99 | + |
| 100 | +The pods default to the `h100-4gpu-4nic-numa-aligned` template. To run each |
| 101 | +placement case three times, swap the template name in the pod manifests with |
| 102 | +`sed` between runs. The manifest defaults to a 1 GiB transfer, 20 warmup |
| 103 | +iterations, and 100 timed iterations per run. |
| 104 | + |
| 105 | +```bash |
| 106 | +for run in 1 2 3; do |
| 107 | + for tpl in h100-4gpu-4nic-numa-aligned h100-4gpu-4nic-numa-unaligned; do |
| 108 | + echo "=== run ${run}: ${tpl} ===" |
| 109 | + |
| 110 | + for pod in nixl-kv-target.yaml nixl-kv-initiator.yaml; do |
| 111 | + sed "s/resourceClaimTemplateName:.*/resourceClaimTemplateName: ${tpl}/" \ |
| 112 | + "${pod}" | kubectl apply -f - |
| 113 | + done |
| 114 | + kubectl apply -f nixl-kv-service.yaml |
| 115 | + |
| 116 | + kubectl wait --for=condition=Ready \ |
| 117 | + pod/nixl-kv-target pod/nixl-kv-initiator \ |
| 118 | + --timeout=15m |
| 119 | + |
| 120 | + kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \ |
| 121 | + pod/nixl-kv-initiator --timeout=15m |
| 122 | + kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \ |
| 123 | + pod/nixl-kv-target --timeout=15m |
| 124 | + |
| 125 | + kubectl logs pod/nixl-kv-initiator | tee "results-run${run}-${tpl}.txt" |
| 126 | + kubectl logs pod/nixl-kv-target >> "results-run${run}-${tpl}.txt" |
| 127 | + |
| 128 | + kubectl delete pod/nixl-kv-target pod/nixl-kv-initiator svc/nixl-kv-target \ |
| 129 | + --ignore-not-found --wait=true |
| 130 | + done |
| 131 | +done |
| 132 | +``` |
| 133 | + |
| 134 | +The initiator log contains one `RESULT` JSON object. The key fields are |
| 135 | +`avg_GBps`, `avg_seconds`, `p50_seconds`, `p95_seconds`, and `p99_seconds`. |
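To compare runs programmatically, the `RESULT` object can be pulled out of a saved log. The sketch below is a minimal example, not part of the example's files: `parse_result` is a hypothetical helper, and it assumes the JSON object sits on a single line that also contains the word `RESULT`. The sample values are the run 1 aligned figures reported under Benchmark Results.

```python
import json


def parse_result(log_text):
    """Return the dict from the first log line containing RESULT, or None.

    Assumption: the benchmark prints the JSON object on one line.
    """
    for line in log_text.splitlines():
        if "RESULT" in line and "{" in line:
            # Parse everything from the first opening brace onward.
            return json.loads(line[line.index("{"):])
    return None


# Assumed line layout; only the key names come from the README.
sample = 'RESULT {"avg_GBps": 39.07, "avg_seconds": 0.02748, "p99_seconds": 0.0275}'
result = parse_result(sample)
print(f"{result['avg_GBps']} GB/s, p99 {result['p99_seconds'] * 1e3:.2f} ms")
```

If the actual log line is formatted differently, only the matching condition inside the loop should need to change.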
| 136 | + |
| 137 | +## Verify Allocation |
| 138 | + |
| 139 | +Inspect the resolved claims: |
| 140 | + |
| 141 | +```bash |
| 142 | +kubectl get resourceclaims -o yaml | grep -E 'name:|device:|driver:|request:' |
| 143 | +``` |
| 144 | + |
| 145 | +Confirm that only the allocated RDMA devices are visible inside each pod: |
| 146 | + |
| 147 | +```bash |
| 148 | +kubectl exec nixl-kv-initiator -- ls /dev/infiniband |
| 149 | +kubectl exec nixl-kv-target -- ls /dev/infiniband |
| 150 | +``` |
| 151 | + |
| 152 | +The pod logs also print `nvidia-smi topo -m`. In the aligned case, the visible |
| 153 | +NICs should show `NODE` against the selected GPUs (same NUMA node). In the |
| 154 | +unaligned case, they should show `SYS` (the path crosses NUMA nodes). |
| 155 | + |
| 156 | +The Service in `nixl-kv-service.yaml` is headless (`clusterIP: None`) so the |
| 157 | +initiator resolves `nixl-kv-target` to the target pod IP. NIXL's listener should |
| 158 | +not use a normal ClusterIP service for this side-channel connection. |
| 159 | + |
| 160 | +## Benchmark Results |
| 161 | + |
| 162 | +Observed on the tested 8 x H100 node topology with 1 GiB NIXL `WRITE` |
| 163 | +transfers, 20 warmup iterations, and 100 timed iterations per run: |
| 164 | + |
| 165 | +| Template | Run | Avg bandwidth | Avg latency | p50 | p95 | p99 | |
| 166 | +|---|---:|---:|---:|---:|---:|---:| |
| 167 | +| `h100-4gpu-4nic-numa-aligned` | 1 | 39.07 GB/s | 27.48 ms | 27.48 ms | 27.50 ms | 27.50 ms | |
| 168 | +| `h100-4gpu-4nic-numa-aligned` | 2 | 39.07 GB/s | 27.48 ms | 27.48 ms | 27.49 ms | 27.49 ms | |
| 169 | +| `h100-4gpu-4nic-numa-aligned` | 3 | 39.05 GB/s | 27.49 ms | 27.49 ms | 27.51 ms | 27.51 ms | |
| 170 | +| `h100-4gpu-4nic-numa-unaligned` | 1 | 27.54 GB/s | 38.99 ms | 38.98 ms | 39.08 ms | 39.11 ms | |
| 171 | +| `h100-4gpu-4nic-numa-unaligned` | 2 | 27.54 GB/s | 39.00 ms | 38.99 ms | 39.08 ms | 39.19 ms | |
| 172 | +| `h100-4gpu-4nic-numa-unaligned` | 3 | 27.54 GB/s | 38.99 ms | 38.99 ms | 39.06 ms | 39.10 ms | |
| 173 | + |
| 174 | +Three-run mean: |
| 175 | + |
| 176 | +| Template | NICs selected | Avg bandwidth | Avg latency | p50 | p95 | p99 | |
| 177 | +|---|---|---:|---:|---:|---:|---:| |
| 178 | +| `h100-4gpu-4nic-numa-aligned` | `mlx5_0`..`mlx5_3` | 39.07 GB/s | 27.49 ms | 27.49 ms | 27.50 ms | 27.50 ms | |
| 179 | +| `h100-4gpu-4nic-numa-unaligned` | `mlx5_4`..`mlx5_7` | 27.54 GB/s | 38.99 ms | 38.99 ms | 39.08 ms | 39.13 ms | |
| 180 | + |
| 181 | +Same GPUs, same NIC count, same transfer size: same-NUMA GPU/NIC allocation |
| 182 | +delivers about 1.42x the bandwidth and about 29.5% lower latency for this transfer. |
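The two headline ratios can be reproduced from the three-run means in the table above. This is a small arithmetic sketch; the dictionary names are illustrative only.

```python
# Three-run means taken from the table above.
aligned = {"gbps": 39.07, "latency_ms": 27.49}
unaligned = {"gbps": 27.54, "latency_ms": 38.99}

# Bandwidth ratio: aligned throughput relative to unaligned.
speedup = aligned["gbps"] / unaligned["gbps"]

# Latency reduction: fraction of the unaligned latency that is saved.
latency_drop = 1 - aligned["latency_ms"] / unaligned["latency_ms"]

print(f"bandwidth ratio: {speedup:.2f}x")
print(f"latency reduction: {latency_drop:.1%}")
```

This prints `1.42x` and `29.5%`, matching the figures quoted above.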
| 183 | + |
| 184 | +## Inference Relevance |
| 185 | + |
| 186 | +Disaggregated inference transfers KV cache from prefill workers to decode |
| 187 | +workers. This benchmark does not run the model; it directly measures the NIXL |
| 188 | +VRAM transfer that sits on that critical path. |
| 189 | + |
| 190 | +Approximate KV payload size: |
| 191 | + |
| 192 | +```text |
| 193 | +KV bytes = 2 * layers * prompt_tokens * kv_heads * head_dim * dtype_bytes |
| 194 | +``` |
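As a worked instance of the formula, the sketch below plugs in a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 2-byte dtype, 8192 prompt tokens). These values are chosen only for illustration and are not taken from the benchmark.

```python
def kv_cache_bytes(layers, prompt_tokens, kv_heads, head_dim, dtype_bytes):
    """KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * prompt_tokens * kv_heads * head_dim * dtype_bytes


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(layers=32, prompt_tokens=8192, kv_heads=8,
                      head_dim=128, dtype_bytes=2)
print(f"{size} bytes = {size / 2**30:.2f} GiB")
```

With these inputs the payload is exactly 1 GiB (1073741824 bytes), the same size as the default `--block-size` used by the benchmark.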
| 195 | + |
| 196 | +Scaling the observed 1 GiB result linearly, a 4 GiB KV handoff would take roughly: |
| 197 | + |
| 198 | +| Placement | Estimated transfer time | |
| 199 | +|---|---:| |
| 200 | +| Same-NUMA GPU/NIC | 110 ms | |
| 201 | +| Cross-NUMA GPU/NIC | 156 ms | |
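The estimates above follow from dividing the payload by the measured bandwidth. The sketch below assumes the reported GB/s are decimal (10^9 bytes per second), which is consistent with the measured 1 GiB latencies, and that bandwidth stays constant as the payload grows. `transfer_ms` is an illustrative helper, not part of the benchmark.

```python
GIB = 2**30  # bytes per GiB


def transfer_ms(payload_bytes, bandwidth_gb_per_s):
    """Estimated transfer time in ms at a sustained bandwidth in decimal GB/s."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Three-run mean bandwidths from the Benchmark Results section.
for placement, gbps in (("same-NUMA", 39.07), ("cross-NUMA", 27.54)):
    print(f"{placement}: {transfer_ms(4 * GIB, gbps):.0f} ms")
```

This prints 110 ms and 156 ms for the two placements.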
| 202 | + |
| 203 | +Under concurrent load, the slower transfer path starts queueing sooner, which |
| 204 | +is where a microbenchmark bandwidth gap shows up as inference tail latency. |
| 205 | + |
| 206 | +## Notes |
| 207 | + |
| 208 | +- The example uses `pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime` and installs |
| 209 | + `nixl` at pod start. This avoids UCX library mixing seen with some larger |
| 210 | + CUDA framework images. |
| 211 | +- To test a larger simulated KV handoff, edit `--block-size` in |
| 212 | + `nixl-kv-target.yaml` and `nixl-kv-initiator.yaml`, for example `4294967296` |
| 213 | + for 4 GiB if GPU memory headroom allows it. |