
Commit c43ca7e

Add NIXL kv cache transfer example (#203)
* add inference example
* refactoring
1 parent 336274f commit c43ca7e

9 files changed

Lines changed: 760 additions & 0 deletions
Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# NIXL KV Transfer + dranet Example

End-to-end example of topologically-aware GPU + RDMA NIC allocation with
Kubernetes Dynamic Resource Allocation (DRA). The workload uses NIXL over
UCX/RDMA to copy a GPU-resident buffer between two pods on two GPU nodes. The
buffer is sized like an inference KV-cache handoff, so the result isolates the
transfer path used by disaggregated prefill/decode serving without requiring a
full vLLM router/model stack.

Both GPUs and NICs are allocated via DRA (`gpu.nvidia.com` + `dra.net`). The
same 4-GPU set is used in both runs; only the NIC NUMA placement changes.

The manifests are intentionally cloud-provider agnostic. The checked-in
`ResourceClaimTemplate` values match the tested 8-GPU H100 topology; adapt the
GPU `pciBusID` and NIC `numaNode` selectors to your hardware before running on
another SKU.

## Tested Topology

The included templates were tested on two GPU nodes, each with:

| Resource | Count | Detail |
|---|---:|---|
| GPU | 8 x NVIDIA H100 | 80 GB HBM3 each |
| NIC | 8 x Mellanox ConnectX VF | RDMA-capable |
| NUMA nodes | 2 | 4 GPU + 4 NIC per NUMA node |

Default mapping used by the templates:

| NUMA | GPUs | NICs |
|---:|---|---|
| 0 | `0001:00:00.0`, `0002:00:00.0`, `0003:00:00.0`, `0008:00:00.0` | `mlx5_0`..`mlx5_3` |
| 1 | `0009:00:00.0`, `000a:00:00.0`, `000b:00:00.0`, `000c:00:00.0` | `mlx5_4`..`mlx5_7` |

Verify the mapping on your cluster before running. Some GPU DRA drivers publish
GPU `pciBusID` but not GPU `numaNode`, so this example selects GPUs by
`pciBusID`; dranet publishes NIC `numaNode` directly.

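One way to check the GPU-to-NUMA mapping on a node is to read the kernel's sysfs entry for each PCI address. The sketch below is not part of the example: the helper name `pci_numa_node` and its `sysfs_root` parameter are illustrative, and it assumes a Linux host where `/sys/bus/pci/devices/<address>/numa_node` is populated (the kernel reports `-1` when it has no NUMA information for a device).

```python
from pathlib import Path


def pci_numa_node(pci_address: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Return the NUMA node the kernel reports for a PCI device (-1 if unknown)."""
    # sysfs names PCI devices with lowercase hex digits.
    numa_file = Path(sysfs_root) / pci_address.lower() / "numa_node"
    return int(numa_file.read_text().strip())


if __name__ == "__main__":
    # Addresses taken from the default mapping table; replace with your own.
    for address in ("0001:00:00.0", "0009:00:00.0"):
        try:
            print(address, "-> NUMA", pci_numa_node(address))
        except OSError:
            print(address, "-> not present on this host")
```

Run it on the GPU node itself (or in a privileged debug pod that can see the host's `/sys`), then compare the output with the `pciBusID` values in the templates.
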
## Prerequisites

- A Kubernetes cluster with at least two GPU nodes connected by RDMA.
- dranet running on those nodes and publishing NIC devices through a DRA
  `DeviceClass`, typically `dranet.net` or your cluster's equivalent.
- A GPU DRA driver publishing GPU devices through `gpu.nvidia.com` or your
  cluster's equivalent GPU `DeviceClass`.
- The NRI device-injection path enabled so only the DRA-allocated RDMA devices
  are visible inside the benchmark pods.

Inspect published devices before editing the templates:

```bash
kubectl get resourceslices
kubectl get deviceclasses
```

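The raw `ResourceSlice` output can be long, so it may help to flatten it into one row per device. The sketch below is an illustration rather than part of the example: the function name `list_devices` is hypothetical, and the field names it reads (`spec.driver`, `spec.nodeName`, `spec.devices[].name`, and per-device `attributes`, which some `resource.k8s.io` API versions nest under `basic`) are assumptions about the schema that should be checked against the API version your cluster serves.

```python
import json


def list_devices(resource_slices: dict) -> list:
    """Flatten a ResourceSlice list into (driver, node, device, attributes) rows."""
    rows = []
    for item in resource_slices.get("items", []):
        spec = item.get("spec", {})
        for device in spec.get("devices", []):
            # Assumption: older API versions nest attributes under "basic".
            body = device.get("basic", device)
            # Each attribute value is an object with a single typed field,
            # e.g. {"int": 0} or {"string": "..."}; keep only the value.
            attrs = {
                key: next(iter(value.values()), None)
                for key, value in body.get("attributes", {}).items()
            }
            rows.append((spec.get("driver"), spec.get("nodeName"), device.get("name"), attrs))
    return rows
```

To use it, save the output of `kubectl get resourceslices -o json` to a file, load it with `json.load`, and pass the resulting dictionary to `list_devices`.
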
## Files

| File | Description |
|---|---|
| `resource-claim-template-aligned.yaml` | `ResourceClaimTemplate` selecting NUMA-aligned GPUs + NICs |
| `resource-claim-template-unaligned.yaml` | `ResourceClaimTemplate` selecting cross-NUMA GPUs + NICs |
| `nixl-kv-service.yaml` | Headless Service used for the NIXL side-channel handshake |
| `nixl-kv-target.yaml` | Target Pod; registers GPU memory and waits for the initiator |
| `nixl-kv-initiator.yaml` | Initiator Pod; posts the NIXL transfers and prints `RESULT` |
| `nixl_benchmark.py` | Python NIXL benchmark mounted into the pods through a ConfigMap |
| `run_bench.sh` | Pod entrypoint: installs `nixl` and execs the benchmark |
| `kustomization.yaml` | Bundles the templates, pods, Service, and ConfigMap |

## ResourceClaimTemplates

| Template | GPU selection | NIC selection | Purpose |
|---|---|---|---|
| `h100-4gpu-4nic-numa-aligned` | 4 GPUs on NUMA 0 | 4 NICs on NUMA 0 | Same-NUMA GPU/NIC path |
| `h100-4gpu-4nic-numa-unaligned` | Same 4 GPUs on NUMA 0 | 4 NICs on NUMA 1 | Cross-NUMA GPU/NIC path |

The two templates keep compute fixed and keep aggregate NIC count fixed. The
only intended difference is whether each visible GPU reaches a same-NUMA or
remote-NUMA NIC.

If your cluster uses a different NIC `DeviceClass` name, update
`deviceClassName: dranet.net` in both `resource-claim-template-aligned.yaml`
and `resource-claim-template-unaligned.yaml`. If your GPU DRA driver publishes
a reliable GPU NUMA attribute, you can replace the `pciBusID` selector with a
NUMA selector.

## Run

Apply everything via kustomize. This creates both `ResourceClaimTemplate`s, the
`nixl-benchmark` ConfigMap (generated from `nixl_benchmark.py` + `run_bench.sh`),
the headless Service, and both pods:

```bash
kubectl apply -k .
```

`ResourceClaimTemplate.spec` is immutable. If you already created templates with
the same names and need to change their selectors, delete the old templates only
after confirming no active pods are using claims derived from them.

The pods default to the `h100-4gpu-4nic-numa-aligned` template. To run each
placement case three times, swap the template name in the pod manifests with
`sed` between runs. A pod's `resourceClaims` cannot be changed in place, so the
loop deletes both pods after every run. The manifests default to a 1 GiB
transfer, 20 warmup iterations, and 100 timed iterations per run.

```bash
for run in 1 2 3; do
  for tpl in h100-4gpu-4nic-numa-aligned h100-4gpu-4nic-numa-unaligned; do
    echo "=== run ${run}: ${tpl} ==="

    # Point both pods at the template under test, then (re)create them.
    for pod in nixl-kv-target.yaml nixl-kv-initiator.yaml; do
      sed "s/resourceClaimTemplateName:.*/resourceClaimTemplateName: ${tpl}/" \
        "${pod}" | kubectl apply -f -
    done
    kubectl apply -f nixl-kv-service.yaml

    # Wait for both pods to start, then for both to finish.
    kubectl wait --for=condition=Ready \
      pod/nixl-kv-target pod/nixl-kv-initiator \
      --timeout=15m

    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
      pod/nixl-kv-initiator --timeout=15m
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
      pod/nixl-kv-target --timeout=15m

    # Save the initiator log (contains RESULT) followed by the target log.
    kubectl logs pod/nixl-kv-initiator | tee "results-run${run}-${tpl}.txt"
    kubectl logs pod/nixl-kv-target >> "results-run${run}-${tpl}.txt"

    # Clean up so the next iteration can create pods with a different template.
    kubectl delete pod/nixl-kv-target pod/nixl-kv-initiator svc/nixl-kv-target \
      --ignore-not-found --wait=true
  done
done
```

The initiator log contains one `RESULT` JSON object. The key fields are
`avg_GBps`, `avg_seconds`, `p50_seconds`, `p95_seconds`, and `p99_seconds`.

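To compare runs, the `RESULT` object can be pulled out of each saved log. The sketch below assumes the log line contains the literal token `RESULT` followed by a single-line JSON object; the exact layout printed by `nixl_benchmark.py` is not shown here, so adjust the parsing if it differs. The function name `extract_result` is illustrative.

```python
import json
from typing import Optional


def extract_result(log_text: str) -> Optional[dict]:
    """Return the first JSON object found on a line containing 'RESULT', else None."""
    for line in log_text.splitlines():
        if "RESULT" not in line:
            continue
        start = line.find("{")
        if start == -1:
            continue
        try:
            return json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a complete single-line JSON object; keep scanning
    return None


if __name__ == "__main__":
    # Made-up log text for illustration only; the values are not measurements.
    sample = 'setup done\nRESULT {"avg_GBps": 1.0, "avg_seconds": 0.5}\n'
    print(extract_result(sample))
```

Applied to each `results-run<N>-<template>.txt` file, this yields one dictionary per run that can be averaged or tabulated.
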
## Verify Allocation

Inspect the resolved claims:

```bash
kubectl get resourceclaims -o yaml | grep -E 'name:|device:|driver:|request:'
```

Confirm that only the allocated RDMA devices are visible inside each pod:

```bash
kubectl exec nixl-kv-initiator -- ls /dev/infiniband
kubectl exec nixl-kv-target -- ls /dev/infiniband
```

The pod logs also print the output of `nvidia-smi topo -m`. In the aligned case,
the visible NICs should show `NODE` (same NUMA node) against the selected GPUs.
In the unaligned case, they should show `SYS` (cross-NUMA) against the selected
GPUs.

The Service in `nixl-kv-service.yaml` is headless (`clusterIP: None`) so the
initiator resolves `nixl-kv-target` directly to the target pod IP. Do not put a
normal ClusterIP Service in front of the NIXL listener: the side-channel
connection needs the pod's own address, not a virtual service IP.

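A quick way to confirm the headless behaviour is to resolve the Service name from inside a pod and check that the answer is the target pod's IP. This is a minimal sketch using only the Python standard library; the helper name `resolve_ipv4` is illustrative and it only looks at IPv4 addresses.

```python
import socket
from typing import List


def resolve_ipv4(host: str, port: int) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses that `host` resolves to."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})
```

From the initiator pod, `resolve_ipv4("nixl-kv-target", 5555)` should return a single address that matches the `IP` column of `kubectl get pod nixl-kv-target -o wide`.
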
## Benchmark Results

Observed on the tested 8 x H100 node topology with 1 GiB NIXL `WRITE`
transfers, 20 warmup iterations, and 100 timed iterations per run:

| Template | Run | Avg bandwidth | Avg latency | p50 | p95 | p99 |
|---|---:|---:|---:|---:|---:|---:|
| `h100-4gpu-4nic-numa-aligned` | 1 | 39.07 GB/s | 27.48 ms | 27.48 ms | 27.50 ms | 27.50 ms |
| `h100-4gpu-4nic-numa-aligned` | 2 | 39.07 GB/s | 27.48 ms | 27.48 ms | 27.49 ms | 27.49 ms |
| `h100-4gpu-4nic-numa-aligned` | 3 | 39.05 GB/s | 27.49 ms | 27.49 ms | 27.51 ms | 27.51 ms |
| `h100-4gpu-4nic-numa-unaligned` | 1 | 27.54 GB/s | 38.99 ms | 38.98 ms | 39.08 ms | 39.11 ms |
| `h100-4gpu-4nic-numa-unaligned` | 2 | 27.54 GB/s | 39.00 ms | 38.99 ms | 39.08 ms | 39.19 ms |
| `h100-4gpu-4nic-numa-unaligned` | 3 | 27.54 GB/s | 38.99 ms | 38.99 ms | 39.06 ms | 39.10 ms |

Three-run mean:

| Template | NICs selected | Avg bandwidth | Avg latency | p50 | p95 | p99 |
|---|---|---:|---:|---:|---:|---:|
| `h100-4gpu-4nic-numa-aligned` | `mlx5_0`..`mlx5_3` | 39.07 GB/s | 27.49 ms | 27.49 ms | 27.50 ms | 27.50 ms |
| `h100-4gpu-4nic-numa-unaligned` | `mlx5_4`..`mlx5_7` | 27.54 GB/s | 38.99 ms | 38.99 ms | 39.08 ms | 39.13 ms |

Same GPUs, same NIC count, same transfer size: same-NUMA GPU/NIC allocation
delivers about 1.42x the bandwidth and about 29.5% lower latency for this
transfer.

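The summary figures follow directly from the three-run means. The short calculation below reproduces them and also cross-checks bandwidth against latency; it assumes the reported GB/s are decimal gigabytes (10^9 bytes), which is what the 1 GiB payload and the measured latency imply.

```python
GIB = 2**30  # bytes in 1 GiB

# Three-run means from the results table.
aligned_gbps, aligned_ms = 39.07, 27.49
unaligned_gbps, unaligned_ms = 27.54, 38.99

bandwidth_ratio = aligned_gbps / unaligned_gbps
latency_reduction = (unaligned_ms - aligned_ms) / unaligned_ms

# Cross-check: 1 GiB moved in the average latency, in decimal GB/s.
implied_gbps = GIB / (aligned_ms / 1000) / 1e9

print(f"bandwidth ratio:   {bandwidth_ratio:.2f}x")
print(f"latency reduction: {latency_reduction:.1%}")
print(f"implied bandwidth: {implied_gbps:.2f} GB/s")
```

The implied bandwidth lands within rounding error of the reported average, which suggests the bandwidth and latency columns describe the same measurement.
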
## Inference Relevance

Disaggregated inference transfers KV cache from prefill workers to decode
workers. This benchmark does not run the model; it directly measures the NIXL
VRAM transfer that sits on that critical path.

Approximate KV payload size, where the factor of 2 covers keys and values:

```text
KV bytes = 2 * layers * prompt_tokens * kv_heads * head_dim * dtype_bytes
```

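As a worked instance of the formula, the sketch below computes the payload for a hypothetical model shape. The numbers (80 layers, 8 KV heads, head dimension 128, 2-byte dtype) are chosen only for illustration and do not describe any model used in this example.

```python
def kv_cache_bytes(layers: int, prompt_tokens: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """KV payload size: keys and values (factor 2) for every layer and token."""
    return 2 * layers * prompt_tokens * kv_heads * head_dim * dtype_bytes


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2)
    per_token = kv_cache_bytes(prompt_tokens=1, **shape)
    print(f"bytes per token: {per_token}")
    print(f"tokens in 1 GiB: {2**30 // per_token}")
```

With that shape each prompt token contributes 327,680 bytes, so the 1 GiB benchmark buffer corresponds to a prompt of roughly 3,300 tokens.
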
Scaling the observed 1 GiB result linearly, a 4 GiB KV handoff would take
roughly:

| Placement | Estimated transfer time |
|---|---:|
| Same-NUMA GPU/NIC | 110 ms |
| Cross-NUMA GPU/NIC | 156 ms |

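The estimate is payload size divided by sustained bandwidth. The sketch below reproduces it under the assumption that the bandwidth measured at 1 GiB also holds at 4 GiB and that fixed per-transfer overhead is negligible; the helper name `transfer_ms` is illustrative.

```python
def transfer_ms(payload_bytes: int, bandwidth_gbps: float) -> float:
    """Estimated transfer time in milliseconds at a sustained decimal-GB/s rate."""
    return payload_bytes / (bandwidth_gbps * 1e9) * 1000


if __name__ == "__main__":
    payload = 4 * 2**30  # 4 GiB
    # Three-run mean bandwidths from the benchmark results.
    for placement, gbps in (("Same-NUMA", 39.07), ("Cross-NUMA", 27.54)):
        print(f"{placement}: {transfer_ms(payload, gbps):.0f} ms")
```

The same helper can be fed the output of the KV size formula to estimate handoff time for a specific prompt length.
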
Under concurrent load, the slower transfer path also starts queuing at a lower
request rate, which is where a microbenchmark bandwidth gap becomes visible as
inference tail latency.

## Notes

- The example uses `pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime` and installs
  `nixl` at pod start. This avoids UCX library mixing seen with some larger
  CUDA framework images.
- To test a larger simulated KV handoff, edit `--block-size` in
  `nixl-kv-target.yaml` and `nixl-kv-initiator.yaml`, for example `4294967296`
  for 4 GiB if GPU memory headroom allows it.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - resource-claim-template-aligned.yaml
  - resource-claim-template-unaligned.yaml
  - nixl-kv-service.yaml
  - nixl-kv-target.yaml
  - nixl-kv-initiator.yaml

configMapGenerator:
  - name: nixl-benchmark
    files:
      - nixl_benchmark.py
      - run_bench.sh
    options:
      disableNameSuffixHash: true
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
apiVersion: v1
kind: Pod
metadata:
  name: nixl-kv-initiator
  labels:
    app.kubernetes.io/name: nixl-kv-transfer
    app.kubernetes.io/role: initiator
spec:
  restartPolicy: Never
  automountServiceAccountToken: false
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: nixl-kv-transfer
          topologyKey: kubernetes.io/hostname
  resourceClaims:
    - name: gpu-nic
      resourceClaimTemplateName: h100-4gpu-4nic-numa-aligned
  containers:
    - name: bench
      image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
      env:
        - name: PYTHONUNBUFFERED
          value: "1"
        - name: UCX_LOG_LEVEL
          value: warn
        - name: UCX_PROTO_INFO
          value: "y"
      resources:
        claims:
          - name: gpu-nic
        requests:
          cpu: "8"
          memory: 64Gi
      volumeMounts:
        - name: bench-script
          mountPath: /bench
        - name: shm
          mountPath: /dev/shm
      command:
        - /bench/run_bench.sh
      args:
        - --mode=initiator
        - --target-host=nixl-kv-target
        - --port=5555
        - --backend=UCX
        - --op=WRITE
        - --block-size=1073741824
        - --batch-size=1
        - --iters=100
        - --warmup-iters=20
  volumes:
    - name: bench-script
      configMap:
        name: nixl-benchmark
        defaultMode: 0755
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 64Gi
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
  name: nixl-kv-target
  labels:
    app.kubernetes.io/name: nixl-kv-transfer
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: nixl-kv-transfer
    app.kubernetes.io/role: target
  ports:
    - name: nixl
      port: 5555
      targetPort: 5555
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
apiVersion: v1
kind: Pod
metadata:
  name: nixl-kv-target
  labels:
    app.kubernetes.io/name: nixl-kv-transfer
    app.kubernetes.io/role: target
spec:
  restartPolicy: Never
  automountServiceAccountToken: false
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: nixl-kv-transfer
          topologyKey: kubernetes.io/hostname
  resourceClaims:
    - name: gpu-nic
      resourceClaimTemplateName: h100-4gpu-4nic-numa-aligned
  containers:
    - name: bench
      image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
      env:
        - name: PYTHONUNBUFFERED
          value: "1"
        - name: UCX_LOG_LEVEL
          value: warn
        - name: UCX_PROTO_INFO
          value: "y"
      resources:
        claims:
          - name: gpu-nic
        requests:
          cpu: "8"
          memory: 64Gi
      volumeMounts:
        - name: bench-script
          mountPath: /bench
        - name: shm
          mountPath: /dev/shm
      command:
        - /bench/run_bench.sh
      args:
        - --mode=target
        - --port=5555
        - --backend=UCX
        - --op=WRITE
        - --block-size=1073741824
        - --batch-size=1
        - --iters=100
        - --warmup-iters=20
  volumes:
    - name: bench-script
      configMap:
        name: nixl-benchmark
        defaultMode: 0755
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 64Gi
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
