
feat(nvml-mock): RDMA verbs data path for perftest on Kind without IB hardware #403

Open
giuliocalzo wants to merge 9 commits into NVIDIA:main from giuliocalzo:feat/issue-374-rdma-verbs


@giuliocalzo

Contributor

Summary

Adds an RDMA verbs data path to the mockib subsystem so stock perftest 4.5 (ib_write_bw, ib_read_bw) reports non-zero bandwidth across nodes on Kind — no GPU or InfiniBand hardware required. Closes #374.

  • In-process libibverbs provider shim (pkg/network/mockib/c/rdma_shim.c): real verbs_context/abi_compat, query_port/create_qp_ex, and the extended ibv_wr_* op table so stock perftest works without flags. MR tracking with no-wrap bounds checks; per-port LID/GID surfaced from the mock sysfs tree so the daemon can resolve routes. Gated by MOCK_IB_RDMA=1.
  • Daemon relay (protocol/verbs.go, daemon/verbs_fabric.go, server.go, fabric.go): QP create/connect/destroy, attach, and chunked verbs_op framing under the 1 MiB frame size, reusing the existing length-prefixed JSON TCP fabric (port 18515). Egress is fire-and-forget. Real WRITE/READ/SEND bytes are delivered into the responder MR.
  • Packaging: Dockerfile ships libibmockrdma.so; Helm wires infiniband.rdma.enabled (on by default) to MOCK_IB_RDMA + LD_PRELOAD.

The reported bandwidth is a functional artifact of a JSON relay, not an InfiniBand measurement.

Test plan

  • go test ./pkg/network/mockib/... — green (wire protocol + relay unit tests)
  • helm unittest deployments/nvml-mock/helm/nvml-mock — 112/112
  • go vet ./pkg/network/mockib/... — clean
  • Hardware-free loopback harness (pkg/network/mockib/c/loopback) — builds the production shim, moves real bytes through the relay, runs stock ib_write_bw, asserts non-zero BW and verbs_op traversal. Now run as the rdma-loopback CI job.
  • Gated cross-pod E2E (tests/e2e/validate-rdma.sh) — asserts SERVER_IP != CLIENT_IP, non-zero bandwidth, and real fabric traversal in the daemon logs.

@giuliocalzo giuliocalzo force-pushed the feat/issue-374-rdma-verbs branch from bda1f2b to 3277a2b on June 15, 2026 16:21
@giuliocalzo giuliocalzo marked this pull request as draft June 15, 2026 16:46
…t IB hardware

Adds an in-process libibverbs provider shim (libibmockrdma.so) and a daemon
relay that move real RDMA WRITE/READ/SEND bytes between pods over the existing
length-prefixed JSON TCP fabric. This lets stock perftest 4.5 (ib_write_bw,
ib_read_bw) report non-zero bandwidth across nodes on Kind, with no GPU or
InfiniBand hardware.

Highlights:
- C shim (pkg/network/mockib/c/rdma_shim.c): real verbs_context/abi_compat,
  query_port/create_qp_ex, and the extended ibv_wr_* op table so stock perftest
  works without flags. MR tracking with no-wrap bounds checks; per-port LID/GID
  surfaced from the mock sysfs tree so the daemon can resolve routes. Gated by
  MOCK_IB_RDMA=1.
- Go wire contract (protocol/verbs.go) and daemon relay (daemon/verbs_fabric.go,
  server.go, fabric.go): QP create/connect/destroy, attach, and chunked verbs_op
  framing under the 1 MiB frame size, with fire-and-forget egress.
- Packaging: Dockerfile ships the shim; Helm wires infiniband.rdma.enabled
  (on by default) to MOCK_IB_RDMA + LD_PRELOAD. Helm unittests updated.
- Tests: Go unit tests for the wire protocol and relay; a hardware-free loopback
  harness (c/loopback) now run as a CI job; a gated cross-pod E2E
  (tests/e2e/validate-rdma.sh) that asserts non-zero bandwidth and real fabric
  traversal.

The reported bandwidth is a functional artifact of a JSON relay, not an
InfiniBand measurement.

Closes NVIDIA#374

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Removes the hardware-free loopback test harness and the rdma-loopback CI job.
Coverage for the RDMA verbs data path is provided by the Go unit tests and the
gated cross-pod E2E (tests/e2e/validate-rdma.sh).

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Introduce decodeAnd[T] to absorb the repeated decode-then-handle boilerplate
so the dispatch switch reads as a flat one-line-per-message-type table.
No behavior change.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
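
A generic helper along the lines of decodeAnd[T] might look like the sketch below. Only the name decodeAnd and its purpose come from the commit message; the signature, the `envelope` type, the message structs, and the `server` methods are hypothetical and exist only to make the example runnable.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope, qpCreate and qpDestroy are hypothetical message shapes.
type envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

type qpCreate struct {
	QPN uint32 `json:"qpn"`
}

type qpDestroy struct {
	QPN uint32 `json:"qpn"`
}

// decodeAnd unmarshals raw into a fresh T and hands it to handle.
// The signature is an assumption; the PR may shape it differently.
func decodeAnd[T any](raw json.RawMessage, handle func(T) error) error {
	var msg T
	if err := json.Unmarshal(raw, &msg); err != nil {
		return fmt.Errorf("decode %T: %w", msg, err)
	}
	return handle(msg)
}

// server is a toy stand-in for the daemon's connection handler.
type server struct{ qps map[uint32]bool }

func (s *server) onCreate(m qpCreate) error   { s.qps[m.QPN] = true; return nil }
func (s *server) onDestroy(m qpDestroy) error { delete(s.qps, m.QPN); return nil }

// dispatch reads as one line per message type; T is inferred from the
// handler's parameter type, so no explicit instantiation is needed.
func (s *server) dispatch(frame []byte) error {
	var env envelope
	if err := json.Unmarshal(frame, &env); err != nil {
		return err
	}
	switch env.Type {
	case "qp_create":
		return decodeAnd(env.Payload, s.onCreate)
	case "qp_destroy":
		return decodeAnd(env.Payload, s.onDestroy)
	default:
		return fmt.Errorf("unknown message type %q", env.Type)
	}
}

func main() {
	s := &server{qps: map[uint32]bool{}}
	err := s.dispatch([]byte(`{"type":"qp_create","payload":{"qpn":5}}`))
	fmt.Println(s.qps[5], err)
}
```

Because the handler's parameter type drives inference, adding a message type costs one case line plus one handler method, which is what keeps the switch flat.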
The Docker builder runs `make` in pkg/network/mockib, which now compiles
libibmockrdma.so from c/rdma_shim.c. That source #includes
<infiniband/verbs.h>, absent from the golang:bookworm builder, breaking the
image build. Install libibverbs-dev (build-only) so the shim compiles; the
runtime image already ships rdma-core/ibverbs-providers for it to load against.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
The container had no readiness probe, so the pod reported Ready the instant
its process started -- while setup.sh was still installing the driver tree.
`helm --wait` / `kubectl rollout status` then returned early, letting e2e
validations (e.g. validate-nvidia-smi.sh) race the install and fail with
"couldn't find libnvidia-ml.so". Enabling the RDMA data path by default
adds an extra LD_PRELOAD lib to every setup.sh subprocess, widening the
window enough to lose the race in CI.

setup.sh now writes /tmp/nvml-mock-setup-complete as its final step and the
DaemonSet gains a readinessProbe that gates on it, so readiness reflects a
finished install for every profile.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
@giuliocalzo giuliocalzo force-pushed the feat/issue-374-rdma-verbs branch from 7b35ddd to c015d8c on June 17, 2026 15:59
The ibping-multinode e2e job's "Validate cross-node RDMA bandwidth"
step runs stock ib_write_bw inside the pod (tests/e2e/validate-rdma.sh),
but the runtime image never installed perftest, so the step failed with
"ib_write_bw: not found" after exhausting all retries.

Add perftest to the runtime apt install (alongside the existing
rdma-core/ibverbs packages) so ib_write_bw/ib_read_bw are on PATH and
the verbs data path can be exercised against the libibmockrdma shim.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
@giuliocalzo giuliocalzo marked this pull request as ready for review June 17, 2026 16:16
ib_write_bw blocks indefinitely when the OOB/QP handshake or a
completion never arrives, and the validate-rdma.sh kubectl exec calls
had no timeout, so a wedged verbs data path stalled the ibping-multinode
job until GitHub's 6h wall clock instead of failing with diagnostics.

Wrap both the server and client ib_write_bw invocations in `timeout`
(RDMA_E2E_TIMEOUT, default 30s) so a stuck run is killed and the retry
loop reports failure with the captured server/client logs. Bounding the
server too prevents a blocked run from holding the OOB port into the next
attempt. Add a 20m timeout-minutes backstop on the job as a hard safety
net in case the binary itself wedges outside the wrapped calls.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
@giuliocalzo giuliocalzo marked this pull request as draft June 17, 2026 17:35
The RDMA ib_write_bw e2e hung because stock perftest does its out-of-band
TCP handshake on port 18516 BEFORE creating the QP, and the ibping
NetworkPolicy admitted ingress only on the fabric port (18515). kindnet
enforces NetworkPolicy by default since kind v0.24.0, so the OOB connect
was silently dropped and the run timed out before any verbs frame reached
the daemon.

Admit rdma.oobPort (default 18516) in the policy when the RDMA data path
is enabled, add a configurable value, cover it with helm unittests, and
correct the now-stale comments claiming kindnet ignores NetworkPolicy.

Verified end to end on a 2-node kind v0.31 cluster (kindnet enforcing):
validate-rdma.sh passes with the port admitted and fails without it.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
@giuliocalzo giuliocalzo marked this pull request as ready for review June 18, 2026 07:52

@ArangoGutierrez ArangoGutierrez left a comment

Collaborator


Approving. The shim's bounds logic holds up — mr_check's no-wrap test is correct, and apply_write/apply_read_resp/apply_send all re-bound writes against the process's own registered MRs, so a peer can only touch memory explicitly registered. Two non-blocking notes inline. The one thing I'd like back at some point is fast unit coverage of the shim bounds: right now that rests on review plus the single gated E2E, since the loopback harness was dropped in 16e6122.
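
For readers unfamiliar with the no-wrap idiom, the following Go sketch shows the general technique. It is not a transcription of mr_check, which lives in the C shim; the `region` type and both function names are hypothetical. The point is that the check never computes `addr + n`, which could wrap around for a hostile length.

```go
package main

import "fmt"

// region is a hypothetical stand-in for a registered MR: [Addr, Addr+Len).
type region struct {
	Addr, Len uint64
}

// contains reports whether [addr, addr+n) lies inside r. It only ever
// subtracts values already known to be ordered, so nothing can overflow.
func (r region) contains(addr, n uint64) bool {
	if addr < r.Addr {
		return false
	}
	off := addr - r.Addr // safe: addr >= r.Addr
	if off > r.Len {
		return false
	}
	return n <= r.Len-off // safe: off <= r.Len
}

// naiveContains is the wrap-prone version, shown only for contrast.
func (r region) naiveContains(addr, n uint64) bool {
	return addr >= r.Addr && addr+n <= r.Addr+r.Len
}

func main() {
	r := region{Addr: 0x1000, Len: 0x100}
	huge := ^uint64(0) // a peer-supplied length chosen to wrap addr+n

	fmt.Println(r.contains(0x1000, 0x100))   // whole region
	fmt.Println(r.contains(0x10f0, 0x11))    // one byte past the end
	fmt.Println(r.contains(0x1080, huge))    // rejected by the no-wrap check
	fmt.Println(r.naiveContains(0x1080, huge)) // wrongly accepted
}
```

Under this formulation a zero-length access at the very end of the region is accepted; whether the shim treats that case the same way is not stated in the review.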

m->lkey = __sync_fetch_and_add(&g_key, 1);
m->rkey = __sync_fetch_and_add(&g_key, 1);
pthread_mutex_lock(&mr_mu);
for (int i = 0; i < MR_MAX; i++) {

Collaborator


When the 256-slot MR table is full this drops the MR but still returns a valid ibv_mr*, so mr_check rejects later ops against it with no diagnostic. Fine for perftest's handful of MRs, but a debug log on table-full would save a confusing session.

Addresses review feedback on NVIDIA#403: ibv_reg_mr_iova2 still returns a valid
ibv_mr* when the 256-slot MR table is full, but the MR is untracked, so
mr_check() silently rejects every inbound op against it. That otherwise
looks like a data-path bug. Emit a DBG line (gated on MOCK_IB_DEBUG_VERBS)
on table-full so the rem_access failures are diagnosable. No behavior
change; bounded at MR_MAX=256, far beyond perftest's handful of MRs.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

Development

Successfully merging this pull request may close these issues.

feat(mockib): RDMA verbs data path (perftest / GPUDirect RDMA)

2 participants