feat(nvml-mock): RDMA verbs data path for perftest on Kind without IB hardware #403
giuliocalzo wants to merge 9 commits into
bda1f2b to 3277a2b
…t IB hardware

Adds an in-process libibverbs provider shim (libibmockrdma.so) and a daemon relay that move real RDMA WRITE/READ/SEND bytes between pods over the existing length-prefixed JSON TCP fabric. This lets stock perftest 4.5 (ib_write_bw, ib_read_bw) report non-zero bandwidth across nodes on Kind, with no GPU or InfiniBand hardware.

Highlights:
- C shim (pkg/network/mockib/c/rdma_shim.c): real verbs_context/abi_compat, query_port/create_qp_ex, and the extended ibv_wr_* op table so stock perftest works without flags. MR tracking with no-wrap bounds checks; per-port LID/GID surfaced from the mock sysfs tree so the daemon can resolve routes. Gated by MOCK_IB_RDMA=1.
- Go wire contract (protocol/verbs.go) and daemon relay (daemon/verbs_fabric.go, server.go, fabric.go): QP create/connect/destroy, attach, and chunked verbs_op framing under the 1 MiB frame size, with fire-and-forget egress.
- Packaging: Dockerfile ships the shim; Helm wires infiniband.rdma.enabled (on by default) to MOCK_IB_RDMA + LD_PRELOAD. Helm unittests updated.
- Tests: Go unit tests for the wire protocol and relay; a hardware-free loopback harness (c/loopback) now run as a CI job; a gated cross-pod E2E (tests/e2e/validate-rdma.sh) that asserts non-zero bandwidth and real fabric traversal.

The reported bandwidth is a functional artifact of a JSON relay, not an InfiniBand measurement.

Closes NVIDIA#374

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Removes the hardware-free loopback test harness and the rdma-loopback CI job. Coverage for the RDMA verbs data path is provided by the Go unit tests and the gated cross-pod E2E (tests/e2e/validate-rdma.sh). Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Introduce decodeAnd[T] to absorb the repeated decode-then-handle boilerplate so the dispatch switch reads as a flat one-line-per-message-type table. No behavior change. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
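A rough sketch of what such a helper could look like, for reviewers who want the shape without opening the diff. The signature of `decodeAnd`, the message type strings, and the `qpCreate`/`qpConnect` structs and handlers below are hypothetical; only the name decodeAnd[T] and the one-line-per-message-type intent come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical message payloads, for illustration only.
type qpCreate struct {
	QPN uint32 `json:"qpn"`
}

type qpConnect struct {
	QPN     uint32 `json:"qpn"`
	PeerQPN uint32 `json:"peer_qpn"`
}

// decodeAnd unmarshals raw into a T and hands it to handle, so each
// dispatch case does not repeat the decode-and-check-error boilerplate.
func decodeAnd[T any](raw []byte, handle func(T) error) error {
	var msg T
	if err := json.Unmarshal(raw, &msg); err != nil {
		return fmt.Errorf("decode %T: %w", msg, err)
	}
	return handle(msg)
}

func handleQPCreate(m qpCreate) error {
	fmt.Println("create qp", m.QPN)
	return nil
}

func handleQPConnect(m qpConnect) error {
	fmt.Println("connect qp", m.QPN, "to", m.PeerQPN)
	return nil
}

// dispatch reads as a flat table: T is inferred from each handler's parameter.
func dispatch(msgType string, raw []byte) error {
	switch msgType {
	case "qp_create":
		return decodeAnd(raw, handleQPCreate)
	case "qp_connect":
		return decodeAnd(raw, handleQPConnect)
	default:
		return fmt.Errorf("unknown message type %q", msgType)
	}
}

func main() {
	if err := dispatch("qp_create", []byte("{\"qpn\":5}")); err != nil {
		fmt.Println("error:", err)
	}
	if err := dispatch("qp_connect", []byte("{\"qpn\":5,\"peer_qpn\":9}")); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because the type parameter is inferred from the handler, adding a message type in this style costs one case line plus the handler itself.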
The Docker builder runs `make` in pkg/network/mockib, which now compiles libibmockrdma.so from c/rdma_shim.c. That source #includes <infiniband/verbs.h>, absent from the golang:bookworm builder, breaking the image build. Install libibverbs-dev (build-only) so the shim compiles; the runtime image already ships rdma-core/ibverbs-providers for it to load against. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
The container had no readiness probe, so the pod reported Ready the instant its process started -- while setup.sh was still installing the driver tree. `helm --wait` / `kubectl rollout status` then returned early, letting e2e validations (e.g. validate-nvidia-smi.sh) race the install and fail with "couldn't find libnvidia-ml.so". Enabling the RDMA data path by default adds an extra LD_PRELOAD lib to every setup.sh subprocess, widening the window enough to lose the race in CI. setup.sh now writes /tmp/nvml-mock-setup-complete as its final step and the DaemonSet gains a readinessProbe that gates on it, so readiness reflects a finished install for every profile. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
7b35ddd to c015d8c
The ibping-multinode e2e job's "Validate cross-node RDMA bandwidth" step runs stock ib_write_bw inside the pod (tests/e2e/validate-rdma.sh), but the runtime image never installed perftest, so the step failed with "ib_write_bw: not found" after exhausting all retries. Add perftest to the runtime apt install (alongside the existing rdma-core/ibverbs packages) so ib_write_bw/ib_read_bw are on PATH and the verbs data path can be exercised against the libibmockrdma shim. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
ib_write_bw blocks indefinitely when the OOB/QP handshake or a completion never arrives, and the validate-rdma.sh kubectl exec calls had no timeout, so a wedged verbs data path stalled the ibping-multinode job until GitHub's 6h wall clock instead of failing with diagnostics. Wrap both the server and client ib_write_bw invocations in `timeout` (RDMA_E2E_TIMEOUT, default 30s) so a stuck run is killed and the retry loop reports failure with the captured server/client logs. Bounding the server too prevents a blocked run from holding the OOB port into the next attempt. Add a 20m timeout-minutes backstop on the job as a hard safety net in case the binary itself wedges outside the wrapped calls. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
The RDMA ib_write_bw e2e hung because stock perftest does its out-of-band TCP handshake on port 18516 BEFORE creating the QP, and the ibping NetworkPolicy admitted ingress only on the fabric port (18515). kindnet enforces NetworkPolicy by default since kind v0.24.0, so the OOB connect was silently dropped and the run timed out before any verbs frame reached the daemon. Admit rdma.oobPort (default 18516) in the policy when the RDMA data path is enabled, add a configurable value, cover it with helm unittests, and correct the now-stale comments claiming kindnet ignores NetworkPolicy. Verified end to end on a 2-node kind v0.31 cluster (kindnet enforcing): validate-rdma.sh passes with the port admitted and fails without it. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
ArangoGutierrez
left a comment
Approving. The shim's bounds logic holds up — mr_check's no-wrap test is correct, and apply_write/apply_read_resp/apply_send all re-bound writes against the process's own registered MRs, so a peer can only touch memory explicitly registered. Two non-blocking notes inline. The one thing I'd like back at some point is fast unit coverage of the shim bounds: right now that rests on review plus the single gated E2E, since the loopback harness was dropped in 16e6122.
m->lkey = __sync_fetch_and_add(&g_key, 1);
m->rkey = __sync_fetch_and_add(&g_key, 1);
pthread_mutex_lock(&mr_mu);
for (int i = 0; i < MR_MAX; i++) {
When the 256-slot MR table is full this drops the MR but still returns a valid ibv_mr*, so mr_check rejects later ops against it with no diagnostic. Fine for perftest's handful of MRs, but a debug log on table-full would save a confusing session.
Addresses review feedback on NVIDIA#403: ibv_reg_mr_iova2 still returns a valid ibv_mr* when the 256-slot MR table is full, but the MR is untracked, so mr_check() silently rejects every inbound op against it. That otherwise looks like a data-path bug. Emit a DBG line (gated on MOCK_IB_DEBUG_VERBS) on table-full so the rem_access failures are diagnosable. No behavior change; bounded at MR_MAX=256, far beyond perftest's handful of MRs. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
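To make the failure mode concrete, the following Go sketch models a fixed 256-slot MR table with a no-wrap bounds check. The shim itself is C and its mr_check is not reproduced here; the `mrTable` type, its methods, and the log text are hypothetical. The point is twofold: the check compares against the remaining length instead of computing addr+len, which could wrap around, and a region that never made it into the table fails every check, which is why a table-full diagnostic helps.

```go
package main

import (
	"fmt"
	"math"
	"os"
)

const mrMax = 256 // mirrors MR_MAX=256 from the commit message

// region is a hypothetical tracked memory region.
type region struct {
	rkey   uint32
	addr   uint64
	length uint64
	used   bool
}

type mrTable struct{ slots [mrMax]region }

// track stores the region in the first free slot. It returns false when the
// table is full, in which case the region stays untracked and a debug line is
// emitted so later check failures can be traced back to this point.
func (t *mrTable) track(rkey uint32, addr, length uint64) bool {
	for i := range t.slots {
		if !t.slots[i].used {
			t.slots[i] = region{rkey: rkey, addr: addr, length: length, used: true}
			return true
		}
	}
	fmt.Fprintf(os.Stderr, "DBG: MR table full (%d slots); rkey=%d is untracked\n", mrMax, rkey)
	return false
}

// check reports whether [addr, addr+n) lies inside the region for rkey.
// It never computes addr+n, so a huge addr or n cannot wrap past zero.
func (t *mrTable) check(rkey uint32, addr, n uint64) bool {
	for i := range t.slots {
		r := &t.slots[i]
		if !r.used || r.rkey != rkey {
			continue
		}
		if addr < r.addr {
			return false
		}
		off := addr - r.addr
		return off <= r.length && n <= r.length-off
	}
	return false // unknown or untracked rkey
}

func main() {
	var tbl mrTable
	tbl.track(1, 0x1000, 0x100)
	fmt.Println("in range:", tbl.check(1, 0x1010, 0x10))
	fmt.Println("past end:", tbl.check(1, 0x10f8, 0x10))
	// A naive addr+n <= end test would wrap here and could wrongly pass.
	fmt.Println("wrapping:", tbl.check(1, math.MaxUint64-3, 8))
}
```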
Summary
Adds an RDMA verbs data path to the `mockib` subsystem so stock `perftest` 4.5 (`ib_write_bw`, `ib_read_bw`) reports non-zero bandwidth across nodes on Kind, with no GPU or InfiniBand hardware required. Closes #374.

- C shim (`pkg/network/mockib/c/rdma_shim.c`): real `verbs_context`/`abi_compat`, `query_port`/`create_qp_ex`, and the extended `ibv_wr_*` op table so stock `perftest` works without flags. MR tracking with no-wrap bounds checks; per-port LID/GID surfaced from the mock sysfs tree so the daemon can resolve routes. Gated by `MOCK_IB_RDMA=1`.
- Go wire contract and daemon relay (`protocol/verbs.go`, `daemon/verbs_fabric.go`, `server.go`, `fabric.go`): QP create/connect/destroy, attach, and chunked `verbs_op` framing under the 1 MiB frame size, reusing the existing length-prefixed JSON TCP fabric (port 18515). Egress is fire-and-forget. Real WRITE/READ/SEND bytes are delivered into the responder MR.
- Packaging: Dockerfile ships `libibmockrdma.so`; Helm wires `infiniband.rdma.enabled` (on by default) to `MOCK_IB_RDMA` + `LD_PRELOAD`.

Test plan

- `go test ./pkg/network/mockib/...`: green (wire protocol + relay unit tests)
- `helm unittest deployments/nvml-mock/helm/nvml-mock`: 112/112
- `go vet ./pkg/network/mockib/...`: clean
- Hardware-free loopback harness (`pkg/network/mockib/c/loopback`): builds the production shim, moves real bytes through the relay, runs stock `ib_write_bw`, asserts non-zero BW and `verbs_op` traversal. Now run as the `rdma-loopback` CI job.
- Gated cross-pod E2E (`tests/e2e/validate-rdma.sh`): asserts `SERVER_IP != CLIENT_IP`, non-zero bandwidth, and real fabric traversal in the daemon logs.