feat(nvml-mock): RDMA verbs data path for perftest on Kind without IB hardware by giuliocalzo · Pull Request #403 · NVIDIA/k8s-test-infra

giuliocalzo · 2026-06-15T16:01:34Z

Summary

Adds an RDMA verbs data path to the mockib subsystem so stock perftest 4.5 (ib_write_bw, ib_read_bw) reports non-zero bandwidth across nodes on Kind — no GPU or InfiniBand hardware required. Closes #374.

- In-process libibverbs provider shim (pkg/network/mockib/c/rdma_shim.c): real verbs_context/abi_compat, query_port/create_qp_ex, and the extended ibv_wr_* op table so stock perftest works without flags. MR tracking with no-wrap bounds checks; per-port LID/GID surfaced from the mock sysfs tree so the daemon can resolve routes. Gated by MOCK_IB_RDMA=1.
- Daemon relay (protocol/verbs.go, daemon/verbs_fabric.go, server.go, fabric.go): QP create/connect/destroy, attach, and chunked verbs_op framing under the 1 MiB frame size, reusing the existing length-prefixed JSON TCP fabric (port 18515). Egress is fire-and-forget. Real WRITE/READ/SEND bytes are delivered into the responder MR.
- Packaging: Dockerfile ships libibmockrdma.so; Helm wires infiniband.rdma.enabled (on by default) to MOCK_IB_RDMA + LD_PRELOAD.

The reported bandwidth is a functional artifact of a JSON relay, not an InfiniBand measurement.

Test plan

- `go test ./pkg/network/mockib/...`: green (wire protocol + relay unit tests)
- `helm unittest deployments/nvml-mock/helm/nvml-mock`: 112/112
- `go vet ./pkg/network/mockib/...`: clean
- Hardware-free loopback harness (pkg/network/mockib/c/loopback): builds the production shim, moves real bytes through the relay, runs stock ib_write_bw, asserts non-zero BW and verbs_op traversal. Now run as the rdma-loopback CI job.
- Gated cross-pod E2E (tests/e2e/validate-rdma.sh): asserts SERVER_IP != CLIENT_IP, non-zero bandwidth, and real fabric traversal in the daemon logs.

…t IB hardware
Adds an in-process libibverbs provider shim (libibmockrdma.so) and a daemon relay that move real RDMA WRITE/READ/SEND bytes between pods over the existing length-prefixed JSON TCP fabric. This lets stock perftest 4.5 (ib_write_bw, ib_read_bw) report non-zero bandwidth across nodes on Kind, with no GPU or InfiniBand hardware.
Highlights:
- C shim (pkg/network/mockib/c/rdma_shim.c): real verbs_context/abi_compat, query_port/create_qp_ex, and the extended ibv_wr_* op table so stock perftest works without flags. MR tracking with no-wrap bounds checks; per-port LID/GID surfaced from the mock sysfs tree so the daemon can resolve routes. Gated by MOCK_IB_RDMA=1.
- Go wire contract (protocol/verbs.go) and daemon relay (daemon/verbs_fabric.go, server.go, fabric.go): QP create/connect/destroy, attach, and chunked verbs_op framing under the 1 MiB frame size, with fire-and-forget egress.
- Packaging: Dockerfile ships the shim; Helm wires infiniband.rdma.enabled (on by default) to MOCK_IB_RDMA + LD_PRELOAD. Helm unittests updated.
- Tests: Go unit tests for the wire protocol and relay; a hardware-free loopback harness (c/loopback) now run as a CI job; a gated cross-pod E2E (tests/e2e/validate-rdma.sh) that asserts non-zero bandwidth and real fabric traversal.
The reported bandwidth is a functional artifact of a JSON relay, not an InfiniBand measurement.
Closes NVIDIA#374
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

Removes the hardware-free loopback test harness and the rdma-loopback CI job. Coverage for the RDMA verbs data path is provided by the Go unit tests and the gated cross-pod E2E (tests/e2e/validate-rdma.sh).
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

Introduce decodeAnd[T] to absorb the repeated decode-then-handle boilerplate so the dispatch switch reads as a flat one-line-per-message-type table. No behavior change.
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

The Docker builder runs `make` in pkg/network/mockib, which now compiles libibmockrdma.so from c/rdma_shim.c. That source #includes <infiniband/verbs.h>, absent from the golang:bookworm builder, breaking the image build. Install libibverbs-dev (build-only) so the shim compiles; the runtime image already ships rdma-core/ibverbs-providers for it to load against.
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

The container had no readiness probe, so the pod reported Ready the instant its process started -- while setup.sh was still installing the driver tree. `helm --wait` / `kubectl rollout status` then returned early, letting e2e validations (e.g. validate-nvidia-smi.sh) race the install and fail with "couldn't find libnvidia-ml.so". Enabling the RDMA data path by default adds an extra LD_PRELOAD lib to every setup.sh subprocess, widening the window enough to lose the race in CI.
setup.sh now writes /tmp/nvml-mock-setup-complete as its final step and the DaemonSet gains a readinessProbe that gates on it, so readiness reflects a finished install for every profile.
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

The ibping-multinode e2e job's "Validate cross-node RDMA bandwidth" step runs stock ib_write_bw inside the pod (tests/e2e/validate-rdma.sh), but the runtime image never installed perftest, so the step failed with "ib_write_bw: not found" after exhausting all retries. Add perftest to the runtime apt install (alongside the existing rdma-core/ibverbs packages) so ib_write_bw/ib_read_bw are on PATH and the verbs data path can be exercised against the libibmockrdma shim.
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

ib_write_bw blocks indefinitely when the OOB/QP handshake or a completion never arrives, and the validate-rdma.sh kubectl exec calls had no timeout, so a wedged verbs data path stalled the ibping-multinode job until GitHub's 6h wall clock instead of failing with diagnostics.
Wrap both the server and client ib_write_bw invocations in `timeout` (RDMA_E2E_TIMEOUT, default 30s) so a stuck run is killed and the retry loop reports failure with the captured server/client logs. Bounding the server too prevents a blocked run from holding the OOB port into the next attempt. Add a 20m timeout-minutes backstop on the job as a hard safety net in case the binary itself wedges outside the wrapped calls.
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

The RDMA ib_write_bw e2e hung because stock perftest does its out-of-band TCP handshake on port 18516 BEFORE creating the QP, and the ibping NetworkPolicy admitted ingress only on the fabric port (18515). kindnet enforces NetworkPolicy by default since kind v0.24.0, so the OOB connect was silently dropped and the run timed out before any verbs frame reached the daemon.
Admit rdma.oobPort (default 18516) in the policy when the RDMA data path is enabled, add a configurable value, cover it with helm unittests, and correct the now-stale comments claiming kindnet ignores NetworkPolicy.
Verified end to end on a 2-node kind v0.31 cluster (kindnet enforcing): validate-rdma.sh passes with the port admitted and fails without it.
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

ArangoGutierrez

Approving. The shim's bounds logic holds up — mr_check's no-wrap test is correct, and apply_write/apply_read_resp/apply_send all re-bound writes against the process's own registered MRs, so a peer can only touch memory explicitly registered. Two non-blocking notes inline. The one thing I'd like back at some point is fast unit coverage of the shim bounds: right now that rests on review plus the single gated E2E, since the loopback harness was dropped in 16e6122.
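For readers unfamiliar with the "no-wrap" property mentioned here, the sketch below contrasts a naive bounds check with an overflow-safe one. It is written in Go with `uint64` arithmetic purely for illustration; the shim itself is C, and the function names and parameters are hypothetical and do not reproduce mr_check.

```go
package main

// naiveInBounds shows the pitfall: addr+n can wrap past 2^64, so a huge
// addr with a small n may appear to end inside the region.
func naiveInBounds(base, length, addr, n uint64) bool {
	return addr >= base && addr+n <= base+length
}

// inBounds is the no-wrap form. It only ever subtracts values already
// proven not to underflow, so no intermediate result can wrap:
//   - addr >= base   makes addr-base safe
//   - n <= length    makes length-n safe
func inBounds(base, length, addr, n uint64) bool {
	if addr < base || n > length {
		return false
	}
	return addr-base <= length-n
}
```

With a region at base 0x1000 of length 0x100, an access at address 2^64-2 of 16 bytes passes the naive check because the sum wraps to 14, while the no-wrap form rejects it.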

ArangoGutierrez · 2026-06-22T15:50:23Z

+    m->lkey = __sync_fetch_and_add(&g_key, 1);
+    m->rkey = __sync_fetch_and_add(&g_key, 1);
+    pthread_mutex_lock(&mr_mu);
+    for (int i = 0; i < MR_MAX; i++) {


When the 256-slot MR table is full this drops the MR but still returns a valid ibv_mr*, so mr_check rejects later ops against it with no diagnostic. Fine for perftest's handful of MRs, but a debug log on table-full would save a confusing session.

Addresses review feedback on NVIDIA#403: ibv_reg_mr_iova2 still returns a valid ibv_mr* when the 256-slot MR table is full, but the MR is untracked, so mr_check() silently rejects every inbound op against it. That otherwise looks like a data-path bug. Emit a DBG line (gated on MOCK_IB_DEBUG_VERBS) on table-full so the rem_access failures are diagnosable. No behavior change; bounded at MR_MAX=256, far beyond perftest's handful of MRs.
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

giuliocalzo requested a review from ArangoGutierrez as a code owner June 15, 2026 16:01

giuliocalzo force-pushed the feat/issue-374-rdma-verbs branch from bda1f2b to 3277a2b Compare June 15, 2026 16:21

giuliocalzo marked this pull request as draft June 15, 2026 16:46

giuliocalzo added 5 commits June 17, 2026 17:52

giuliocalzo force-pushed the feat/issue-374-rdma-verbs branch from 7b35ddd to c015d8c Compare June 17, 2026 15:59

giuliocalzo marked this pull request as ready for review June 17, 2026 16:16

giuliocalzo marked this pull request as draft June 17, 2026 17:35

giuliocalzo marked this pull request as ready for review June 18, 2026 07:52

ArangoGutierrez previously approved these changes Jun 22, 2026


giuliocalzo dismissed ArangoGutierrez’s stale review via 4fe904b June 22, 2026 17:05

giuliocalzo wants to merge 9 commits into NVIDIA:main from giuliocalzo:feat/issue-374-rdma-verbs


2 participants