RDMA transporter (RoCE / IB Verbs) for 24.10 by elemberger-peak · Pull Request #927 · logicalclocks/rondb

elemberger-peak · 2026-05-19T17:07:11Z

Summary

Adds an opt-in native RDMA transporter (RoCE / IB Verbs) alongside TCP_Transporter / SHM_Transporter, plus a 12-phase performance/correctness series, an API↔DB gate (off by default), and the tuning/CI fixes we hit on real hardware. Sharing for feedback against 24.10-main.

What's included (22 commits, +5623 / -47, 21 files)

RDMA_Transporter — new transporter class with SEND/RECV-based wire protocol, credit-based flow control, and an out-of-band exchange of QP attributes (QPN, PSN, GID/LID) over the existing control socket. Framing: 24-byte header + payload, MAX_*_MESSAGE_BYTESIZE = 32768.
Config schema — new [RDMA] connection section: RdmaSendBufferMemory, RdmaRecvBufferMemory, RdmaQueueDepth, RdmaInlineThreshold, RdmaCompletionPollBudget, RdmaSpintime, RdmaDevice, RdmaPort, RdmaGidIndex, RdmaTrafficClass, RdmaRetryCount, RdmaRnrRetryCount.
Build/feature gating — WITH_NDB_RDMA CMake option and NDB_RDMA_TRANSPORTER_SUPPORTED runtime guard. Without the flag, all sources compile to no-ops and TCP/SHM behaviour is bit-for-bit unchanged. With the flag set but a binary built without libibverbs symbols (e.g. mgmd), [RDMA] sections are rejected with a clear error message.
12-phase performance/correctness series: P1-2 buffer pool / wire-state cacheline grouping, P3 recv ring, P4 batched doorbells, P5 NUMA-aware completion vectors, P6 MR cache, P7 RDMA WRITE negotiation, P8 remote MR geometry, P9 recv via remote WRITE, P10 write probe, P11 recv-immediate guard, P12 per-WR slot probe.
API↔DB gate — AllowApiToDbRdma (off by default). DB↔DB is always allowed; API↔DB requires the explicit gate.
Test plumbing — RDMA_Transporter-t TAP test with a libibverbs kernel-symbol stub so unit tests link without pulling in Packer<SegmentedSectionArg>'s dependency on copy(). MgmConfig-t / testConfigValues-t extended with [RDMA] and API↔DB scenarios.
Default-tuning fix (top of stack) — Rdma{Send,Recv}BufferMemory defaults raised from 2M → 4M. With 2M / RdmaQueueDepth=64 = 32768 the per-slot size equalled MAX_*_MESSAGE_BYTESIZE and left zero room for the 24-byte framing header. The runtime per-slot validator therefore rejected the default config and connect_server_impl() looped on reconnect. 4M gives 65536 bytes/slot at QD=64 and headroom up to QD=128. Schema minimums unchanged.

Verification

MgmConfig-t → ok 1 - MgmConfig
testConfigValues-t → exit 0 (RDMA roundtrip incl. API↔DB scenarios)
RDMA_Transporter-t → ok 1 - RDMA_Transporter (TAP) with cmake -DCMAKE_BUILD_TYPE=Release -DWITH_NDB_RDMA=ON -DWITHOUT_SERVER=OFF -DWITH_NDBCLUSTER_STORAGE_ENGINE=ON
Live two-node DB↔DB and gated API↔DB tested on Mellanox CX-5 (RoCEv2, mlx5_0), including mdtest under sustained load.

Notes / non-goals

Purely additive: no existing transporter, signal flow, ABI, ConfigValues layout, or wire format is touched outside the new [RDMA] section.
rnr_retry is currently pinned to 7 at RTS regardless of the configured value; lower values produced transport-retry-exhausted under mdtest. The user-visible setting is preserved so a future patch can re-enable user control without an ABI change.
Mainline upgrades / downgrades that don't touch [RDMA] are unaffected.
Sister PRs target 25.10-main and 26.02-main with the same scope — those branches additionally include earlier experimental API↔DB work that is now consolidated under the gate.

Happy to split into smaller per-phase PRs if maintainers prefer that flow.

Adds an opt-in, default-OFF native RDMA transporter for inter-replica data-node traffic, targeting lower commit-path latency on NoOfReplicas=2 deployments with RoCE-capable hosts. Gate-1 safety invariant ----------------------- connect_server_impl() and connect_client_impl() always return false. They allocate the full verbs lifecycle (PD/CQ/QP/MR), run the endpoint handshake, drive the QP through INIT/RTR/RTS, log the negotiated attributes, then release everything and refuse the link. This exercises every setup path end-to-end without putting an RDMA link into the cluster data path. Default build (WITH_NDB_RDMA=OFF) is byte-equivalent to the previous tree: RDMA_Transporter.cpp is excluded from compilation, no libibverbs is linked, and every new code site is guarded by NDB_RDMA_TRANSPORTER_SUPPORTED. What's included (Milestones 1-11) --------------------------------- 1. Build option WITH_NDB_RDMA (off by default), libibverbs detection, NDB_RDMA_TRANSPORTER_SUPPORTED compile define. CMake gating in storage/ndb/src/common/transporter/CMakeLists.txt. 2. Public definitions: tt_RDMA_TRANSPORTER, TransporterConfiguration:: rdma union, TE_RDMA_* error codes in TransporterDefinitions.hpp. 3. Config surface: [RDMA] section in ConfigInfo.cpp, CFG_RDMA_* parameter ids (520-531) in mgmapi_config_parameters.h, CONNECTION_TYPE_RDMA parsing in IPCConfig.cpp. 4. Verbs resource lifecycle: device open, PD/CQ/QP/MR allocation, page-aligned staging buffers, RAII teardown via release_verbs_resources(). Idempotent and crash-safe. 5. Endpoint exchange: rdma_endpoint_v1 (64B, network byte order) swapped over the existing authenticated control socket; QP driven through INIT -> post_initial_receives -> RTR -> RTS with configured retry / RNR-retry counters. 6. Wire framing: rdma_msg_header_v1 (24B, network byte order) with magic/version/header_len/payload_len/send_seq/ack_seq/credit_delta/ flags. Public static encode_msg_header / validate_msg_header. 7. Send path: doSend posts IBV_WR_SEND with optional IBV_SEND_INLINE when within the negotiated cap. iovec_data_sent is called only on completion, never on post. Send-slot bookkeeping + reap_send_ completions. 8. Receive path: per-slot ready queue, reap_recv_completions enforces strict in-order send_seq, folds credit_delta into the peer credit pool, surfaces CREDIT_ONLY/HEARTBEAT control messages distinctly from data messages. Wired into TransporterRegistry via friend-access poll_RDMA() helper and dispatch arms in pollReceive() + performReceive(). 10. Observability counters (Uint64 each, default-zero, accumulating across reconnects): send_posted, send_completions_ok/err, send_inline, send_credit_stalls, copied_send_bytes, recv_posted, recv_completions_ok/err, recv_credit_only_in, copied_recv_bytes, cq_polls_send/recv, cq_budget_hits_send/recv, rnr_events, retry_exceeded_events, qp_fatal_events, reconnect_attempts. Base m_bytes_sent / m_bytes_received are also updated. log_stats() emits a single info line on link teardown. 11. Unit tests for the wire format: TAPTEST(RDMA_Transporter) in RDMA_Transporter.cpp under TEST_RDMA_TRANSPORTER. Covers round-trip with five field combinations, network byte-order layout on every field, single-byte corruption rejections, CREDIT_ONLY / HEARTBEAT semantic rejections, payload-overflow / short-buffer / exact-fit boundary cases. NDB_ADD_TEST entry in CMakeLists.txt plus a reproducible build_and_test_rdma_transporter.sh wrapper that works around upstream gunit/server-deps when WITHOUT_SERVER. Milestone 9 (Qmgr / multi-transporter) is intentionally skipped for Gate 1: single QP per transporter, no live switch to multi-transporter mode. Milestone 12 (benchmark / hardening) requires Gate 2 (i.e. the connect_*_impl change to return true) and is out of scope here. Verified -------- - Default build (WITH_NDB_RDMA=OFF) of ndbtransport compiles unchanged. - WITH_NDB_RDMA=ON build of ndbtransport compiles clean under -Werror. - TAPTEST(RDMA_Transporter) passes (1..1 ok 1 - RDMA_Transporter). Co-Authored-By: Oz <oz-agent@warp.dev>

When ConfigObject::createSection() returns false (for example for a CONNECTION_TYPE_RDMA payload, since ConfigObject/ConfigSection only wire up TCP and SHM under CommSection so far), the caller saveInConfigValues() in ConfigInfo.cpp was discarding the return value and then invoking ConfigObject::put(), which unconditionally dereferenced m_curr_cfg_section. After the prior closeSection() set m_curr_cfg_section to nullptr, this produced a SIGSEGV in ConfigSection::check_magic() during ndb_mgmd config-file parsing (stack: ConfigSection::set -> ConfigObject::put -> saveInConfigValues -> InitConfigFileParser::run_config_rules). Two purely additive changes: 1. saveInConfigValues() now checks the return of createSection(). On failure it reports the error code via ndbout_c and breaks out of the section save loop, so the next put() is never reached. 2. ConfigObject::put(Uint32,Uint32), put64() and put(Uint32,const char*) now NULL-check m_curr_cfg_section. If it is nullptr they record SET_NOT_REAL_SECTION_ERROR and return false instead of dereferencing. Both paths preserve the happy-path behavior bit-for-bit for TCP, SHM and node/system sections. The Gate-1 RDMA invariant (no RDMA transporters actually instantiated) is unaffected. Co-Authored-By: Oz <oz-agent@warp.dev>

… and add defensive checks Gate-2 (full RDMA support in the binary config layer) ----------------------------------------------------- ConfigSection now recognises an RdmaTypeId comm-section type and the matching RDMA_TYPE wire value: storage/ndb/include/util/ConfigSection.hpp + SectionType::RdmaTypeId = 7 + static const Uint32 RDMA_TYPE = 2 ConfigSection.cpp grows RDMA branches in every switch that previously listed TcpTypeId/ShmTypeId: set_section_type(Entry&), set_config_section_type(), get_section_type_value(), get_default_section() (returns nullptr for RDMA, since the v2 wire format still ships 5 default sections only), verify_section(), set_node_id_from_keys(), unpack_comm_section(). get_v1_length() and create_v1_section() are taught to merge against a possibly-NULL default section so an RDMA section carries all its keys inline in v1 without dereferencing the missing m_rdma_default_section. ConfigObject.cpp: - createSection(): map RDMA_TYPE -> RdmaTypeId / CommSection. - copy_current(): RdmaTypeId case pushes the section onto m_comm_sections without a default-copy (none exists yet). - create_default_sections(): two no-op RdmaTypeId cases — RDMA sections are not optimised against a per-type default in this milestone, preserving the 5-default-section v2 wire format. - build_arrays(): RdmaTypeId joins TcpTypeId/ShmTypeId in the comm-sections classification. - get(): NULL-check the result of get_default_section() before dereferencing, so lookups on an RDMA section return false rather than SIGSEGV when the key is absent. The result: an [rdma] block in config.ini is now stored, packed (v1 and v2) and unpacked end-to-end. The Gate-1 saveInConfigValues skip log ("skipping section RDMA: createSection() failed") is replaced by normal commit. No on-wire format change — v2 binary configs still declare 5 default sections. Gate-1.5 (defensive checks against missing transporters) -------------------------------------------------------- Two purely additive guards so the data node cannot SEGV when a configuration omits a transporter for some peer (for any reason, not just RDMA): storage/ndb/src/common/transporter/TransporterRegistry.cpp start_connecting(trp_id): if trp_id == 0, trp_id >= maxTransporters or allTransporters[trp_id] == nullptr, log a warning and return without touching the Transporter pointer. storage/ndb/src/kernel/blocks/trpman.cpp execOPEN_COMORD (both single-node and broadcast paths): when get_the_only_base_trp(nodeId) returns 0, jam() and skip start_connecting() entirely. The peer simply stays unreachable rather than crashing the local data node. Together these patches replace the prior SIGSEGV in Trpman::execOPEN_COMORD -> TransporterRegistry::start_connecting on a configuration that omitted a 1<->2 transporter with a clean warning log and a cluster that fails to form (the expected outcome when no transporter exists between two data nodes). Co-Authored-By: Oz <oz-agent@warp.dev>

In Gate-2 I picked RDMA_TYPE = 2 for the wire value, but CONNECTION_TYPE_RDMA in mgmapi_config_parameters.h is 4 (values 2 and 3 are reserved for the deprecated SCI transporter, intentionally skipped over). ConfigInfo's CI_SECTION entry for "RDMA" carries CONNECTION_TYPE_RDMA (= 4) in its _default field and that value is what arrives in ConfigObject::createSection() as the second argument. With RDMA_TYPE = 2, the type == RDMA_TYPE branch was never taken; createSection() fell through to the else clause, set m_error_code = WRONG_COMM_TYPE and returned false. The Gate-1 defensive skip path then logged skipping section RDMA: createSection(3000, 4) failed, error_code=25 defeating the purpose of Gate-2. This is identical to the convention already established for TCP and SHM: TCP_TYPE == CONNECTION_TYPE_TCP == 0 SHM_TYPE == CONNECTION_TYPE_SHM == 1 RDMA_TYPE == CONNECTION_TYPE_RDMA == 4 <-- fix Bumping the value flips the createSection() dispatch onto the RDMA branch added in the Gate-2 commit; the rest of the wiring (set_section_type, copy_current, create_default_sections, build_arrays, get_default_section, unpack_comm_section) needs no change because they all key off the SectionType enum (RdmaTypeId = 7), which is independent of the wire value. Co-Authored-By: Oz <oz-agent@warp.dev>

connect_server_impl / connect_client_impl now return true on a successful handshake instead of refusing the link. Verbs resources (PD/CQ/QP/MR + send & recv slot tables) are kept alive across the return; the registry sees t->isConnected() == true and report_connect promotes the transporter to CONNECTED, after which the milestone-7/8/9 data path (doSend(), poll_RDMA(), the performReceive() RDMA arm, reap_recv_completions() etc.) starts carrying signals over the QP. To satisfy the base Transporter contract that report_connect() -> recvdata.epoll_add() requires a valid file descriptor, the control socket used for the endpoint exchange is moved into theSocket (set non-blocking, with SO_KEEPALIVE) rather than closed. No signal traffic flows over the socket after RTS; it remains as a stable fd for the registry's per-transporter bookkeeping and as a fast-fail EPOLLHUP signal if the peer process dies. On any handshake failure the path still tears everything down and returns false; the start_clients_thread reconnect loop retries from scratch. This is the patch that turns the RDMA transporter from "verbs + handshake exerciser" into a fully live link between data nodes. Co-Authored-By: Oz <oz-agent@warp.dev>

createMultiTransporter() only had TCP and SHM branches and fell through to require(false) for RDMA. The Gate-1 RDMA_Transporter clone constructor also ended in require(false) as a safety net. Both fire during phase-2 nodegroup multi-transporter setup (Qmgr::create_multi_transporter -> TransporterRegistry::createMultiTransporter) on an RDMA-only cluster, which crashes ndbmtd right after 'Start NDB start phase 2'. This change: * Removes the require(false) at the tail of the clone constructor in RDMA_Transporter.cpp; the clone copies configuration from the source transporter and lets the normal connect path allocate fresh verbs resources (its own QP, MRs, etc.). * Adds an RDMA branch to TransporterRegistry::createMultiTransporter, parallelling the TCP/SHM clones, gated by NDB_RDMA_TRANSPORTER_SUPPORTED. No behavior change for non-RDMA builds; the new branch sits inside the existing NDB_RDMA_TRANSPORTER_SUPPORTED guard and TCP/SHM paths are untouched. Co-Authored-By: Oz <oz-agent@warp.dev>

doSend() previously copied raw bytes out of the upper-layer iovec into each RDMA slot up to max_payload, which could land the tail of a Protocol6 message in slot N and the rest in slot N+1. The receiver calls TransporterRegistry::unpack() once per slot with the slot's contiguous payload; unpack_one() can only consume complete signals from a contiguous buffer and cannot reassemble a partial signal across the boundary between two RDMA receive slots (the buffers are non-contiguous memory regions). The result was that after a split the head slot stayed parked in the receive ready queue with a partial signal, and slot N+1 (with the continuation) queued behind it, freezing the receive pipeline for that link. PK ops are tiny and never tripped this; scans drive multi-KB batches that do. This patch makes doSend() walk the iovec stream signal by signal, using Protocol6::getMessageLength(word1) to discover each next message's length, and only commits a message into the current slot if it fits entirely. If the next message would overflow the slot's remaining max_payload room, we post what we have and start a fresh slot. The upper-layer SendBufferPage / thr_send_page invariant that a single Protocol6 message never spans a page makes the peek inexpensive: word1 is always within the current iovec entry. Defences added: - Reject and disconnect on invalid Protocol6 lengths (corrupt buffer / bug detection) rather than silently sending garbage. - Bail out cleanly when the iovec batch truncates a message; the next doSend() pulls more iovecs. No behavior change for non-RDMA builds; this is contained to RDMA_Transporter::doSend(). Co-Authored-By: Oz <oz-agent@warp.dev>

After the signal-aligned packing fix, the data path carried scan and PK traffic correctly, but a real benchmark surfaced GCP_COMMIT lag growing to 50s (with SignalCounter m_count=1 against the peer) and resulting 266/274 timeouts at the API/SQL layer. There were no qp_fatal, RNR or retry_exceeded markers on either side — the RDMA QPs stayed healthy but a multi-transporter clone with one-directional traffic could no longer get a SEND posted, because m_peer_recv_credits had drained to zero on that clone. Root cause: receive credits were only ever returned to the peer by piggy-backing the credit_delta on an outgoing data SEND. When a clone mostly carries inbound traffic (e.g. GCP_COMMIT from node 1 to node 2), the receiving side accumulates m_pending_credit_grant forever and the peer's m_peer_recv_credits eventually reaches zero. The peer's doSend then enters the credit-stall branch and the in-order GCP signal sits in its send buffer, waiting for credits that will never arrive on the back channel. Fix: at the end of doSend(), after the main draining loop, if the accumulated grant has crossed queue_depth/2, or if our own credit pool against the peer is already low (m_peer_recv_credits < queue_depth/4) and we have any grants to send, post an explicit zero-payload SEND with RDMA_MSG_FLAG_CREDIT_ONLY and credit_delta set. This is exactly the path the original Gate-1 plan described but had not implemented. The receive side already handles CREDIT_ONLY messages correctly: reap_recv_completions() applies the credit_delta, increments recv_credit_only_in, and re-posts the slot without enqueueing for the unpacker. Defences: a fatal ibv_post_send error on the CREDIT_ONLY path also disconnects via TE_RDMA_QP_ERROR. The CREDIT_ONLY slot is committed with bytes_consumed=0 so reap_send_completions does not call iovec_data_sent() (there is no upper-layer payload tied to it). Co-Authored-By: Oz <oz-agent@warp.dev>

The existing CREDIT_ONLY refill in doSend() never fires on multi- transporter clones that mostly carry inbound traffic, because doSend() is rarely (or never) invoked on those clones. As a result, the peer's outbound credit pool against us drains to zero and the inbound direction stalls under benchmark load. Visible symptoms include GCP_COMMIT lag, NDB-274 scan timeouts, and NDB-286 node-fencing. Refactor the existing CREDIT_ONLY emission block out of doSend() into post_credit_only_locked(), then add recv_thread_emit_credit_only(): * Fast-path lock-free check on m_pending_credit_grant; no-op when the half-queue-depth threshold is not met. * Acquires the per-transporter send lock (lock_send_transporter), reaps pending send completions, and posts one CREDIT_ONLY WR via post_credit_only_locked(). The send lock is per-transporter and the send thread never takes the global receive lock, so this cannot deadlock against doSend() on any transporter. Call the new method at the tail of the RDMA branch in TransporterRegistry::performReceive() so every recv batch on every RDMA transporter gets a chance to flush its accumulated grants. Add a send_credit_only_recv_path counter and surface it in the per-link stats line so operators can verify the new path is firing. Co-Authored-By: Oz <oz-agent@warp.dev>

…poll Without a wakeup mechanism the receive thread only notices incoming RDMA data at the next epoll_wait timeout in check_TCP(), because poll_RDMA() is non-blocking and is only called after check_TCP returns. For a pure-RDMA deployment that bounds RDMA receive latency below by the epoll timeout regardless of HCA speed -- the most critical gap in the Gate-3 implementation, as flagged by the RonDB team. Bind each transporter's recv CQ to a dedicated ibv_comp_channel, make the channel's fd non-blocking, and register that fd in the receive thread's epoll set alongside the existing control-socket fd. Tag the event data with RDMA_COMP_CHANNEL_BIT (high bit of u32) so check_TCP() can route comp-channel events to a separate path: handle_recv_comp_event() drains and acks the events and marks m_read_transporters so performReceive()'s RDMA branch will call reap_recv_completions() and process the CQEs. The standard arm-vs-fire race is handled by arming the CQ via ibv_req_notify_cq() at the tail of every reap_recv_completions() pass, and once at the end of run_endpoint_exchange() so the first inbound CQE on a fresh link generates a notification. We accept the tiny race window between the last poll returning 0 and the arm because the existing poll_RDMA() cadence will still pick up such CQEs. EPOLLHUP on the comp channel fd is ignored: the channel is destroyed in release_verbs_resources() during disconnect, and the kernel auto-removes the closed fd from epoll. Disconnect is driven by the control socket's HUP on the same trp_id, which is processed in the existing branch. release_verbs_resources() drains and acks any residual CQ events before destroying the recv CQ, because ibv_destroy_cq blocks while there are unacked events. Failure modes: - ibv_create_comp_channel() failure is fatal at connect time. - O_NONBLOCK on the channel fd is fatal at connect time (otherwise the recv thread would block inside ibv_get_cq_event()). - ibv_req_notify_cq() failure is logged but non-fatal; the existing pollReceive() cadence keeps the link working without the wakeup optimization. - epoll_ctl(ADD, channel_fd) failure is logged but non-fatal for the same reason. Co-Authored-By: Oz <oz-agent@warp.dev>

After Gate-3 sender-side packing (1691a1f / d873737), doSend() legitimately packs MULTIPLE complete Protocol6 messages into a single RDMA payload up to (slot_size - RDMA_MSG_HEADER_BYTES). On a 4 MB recv buffer with queue depth 64 that is 65512 bytes per slot. validate_msg_header still enforced a single-Protocol6-message upper bound of payload_len <= MAX_RECV_MESSAGE_BYTESIZE (32 KB), which was correct before Gate-3 but now falsely rejects legitimate bulk-copy bursts. Symptom observed in production: cli-142 ndbmtd flapped in a restart loop driven by start-phase-5 traffic ("Copying of dictionary information"): RDMA: msg payload_len=64740 exceeds MAX_RECV_MESSAGE_BYTESIZE=32768 Transporter 51 to node 2 disconnected in recv with errnum: 74 ALERT -- Node 1: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2308 The per-Protocol6-message size limit is still enforced inside Packer::unpack_one() where it has the actual signal boundary, so we do not lose any protection by removing the duplicate check here. The remaining (payload_len + header > available) check still bounds payload_len to slot_size - header because the HCA cannot write more than one full receive slot into the WC. Updates the corresponding TAP test in RDMA_Transporter.cpp: the "exceeds protocol-level maximum" rejection becomes an acceptance test that documents the new behavior (payload_len = MAX_RECV_MESSAGE_BYTESIZE+1 with a matching `available` must succeed and round-trip through the output pointer). Test: build_scripts/build_and_test_rdma_transporter.sh -> ok 1.

recv_thread_emit_credit_only() (introduced in d643983/bf7cd2f9103 to refill credits on inbound-only multi-trp clones) reaped pending send completions before posting CREDIT_ONLY. reap_send_completions() invokes iovec_data_sent() -> trp_callback::bytes_sent() on every freed data WR, and trp_callback::bytes_sent() in mt.cpp releases the now-completed thr_send_pages into a thread_local_pool, either the block thread own pool or a dedicated send thread pool obtained via g_send_threads->get_send_buffer_pool(). thread_local_pool is, by design, safe only when mutated from its owner thread: its internal free-list is protected by thread affinity, not by the per-transporter m_send_lock that recv_thread_emit_credit_only() holds. Touching it from the recv thread races with the owner thread own pool operations on other transporters sharing that pool and segfaults inside trp_callback::bytes_sent -> ::bytes_sent -> release_list. The crash was hidden by the validate_msg_header upper-bound check until commit 0a5514b let real start-phase-5 SR-copy traffic through. As soon as the dictionary-copy burst reached the upper layer on cli-142, the recv threads pending-credit-grant crossed the queue_depth/2 threshold for the first time and the recv-thread reap fired, faulting in trp_callback::bytes_sent. Symbolized stack at the crash: logicalclocks#2 trp_callback::bytes_sent logicalclocks#3 RDMA_Transporter::reap_send_completions [clone .part.0] logicalclocks#4 RDMA_Transporter::recv_thread_emit_credit_only logicalclocks#5 TransporterRegistry::performReceive logicalclocks#6 mt_receiver_thread_main Fix: drop the reap_send_completions() call from recv_thread_emit_credit_only(). The send thread iterates every transporter every cycle and calls reap_send_completions() inside doSend() in the thread-correct context, so completed CREDIT_ONLY WRs are still recycled in a bounded number of send-thread passes. If no slot is free on a given recv-thread pass, post_credit_only_locked() already returns false via its (m_send_slots_in_flight >= m_queue_depth) early return; the missed CREDIT_ONLY is recovered on the next receive batch with no additional state. No wire-format change. No new state. Pure removal of a cross-thread callback.

Commit 036ea2a fixed a SIGSEGV crash by removing reap_send_completions() from recv_thread_emit_credit_only(), but it re-introduced the original credit-starvation deadlock on inbound-only multi-transporter clones. In ndbmtd the send thread (mt.cpp do_send) iterates only m_pending_send_trps[], populated by register_pending_send() when a block thread enqueues outbound data. A clone with no block-thread outbound traffic is therefore never visited by the send thread; its CREDIT_ONLY WRs posted by the recv thread accumulate, post_credit_only_locked() starts returning false, m_pending_credit_grant cannot drain, peer m_peer_recv_credits drains to zero, inbound stalls. The symptom returned after the cli-143 rolling restart as renewed GCP_COMMIT lag (10-152s on the mgmd GCP Monitor) and sustained NDB-274/NDB-286 in pnfs-mds. Reintroducing reap_send_completions() unconditionally is unsafe: it calls iovec_data_sent() -> trp_callback::bytes_sent() (mt.cpp line 5504), which asserts sb->m_send_thread != NO_SEND_THREAD and then mutates a thread_local_pool<thr_send_page> via release_local() that has no lock and is bound by thread affinity (release_local touches m_free and m_freelist directly). Calling it from the recv thread races the owning send/block thread on other transporters using the same pool, and crashes inside release_list. This commit restores the reap, but gated on m_bytes_in_flight == 0. The class invariant m_bytes_in_flight == sum over slots of bytes_consumed is maintained because: - doSend() posts data WRs with bytes_consumed = payload_len and m_bytes_in_flight += payload_len. - post_credit_only_locked() posts CREDIT_ONLY WRs with bytes_consumed = 0 and leaves m_bytes_in_flight unchanged. - reap_send_completions() does m_bytes_in_flight -= bytes when freeing each slot. Therefore m_bytes_in_flight == 0 implies every in-flight slot has bytes_consumed == 0. For those slots reap_send_completions() never enters its if (bytes > 0) iovec_data_sent((int)bytes); branch, so trp_callback::bytes_sent() is not called and the unsafe thread_local_pool race cannot occur. The read of m_bytes_in_flight is safe under the per-transporter send lock we already hold: no other thread can post a data WR or reap while we hold the lock, so the value is stable for the duration of the guarded reap. When m_bytes_in_flight > 0, the block thread that posted those data bytes already added the trp to its pending-send mask via register_pending_send(), so the send thread will visit doSend() for this trp in a bounded number of cycles and reap on the thread-correct context. Skipping the recv-thread reap in that case is therefore not a stall source. Local build clean. TAP test (RDMA_Transporter-t) passes (1/1).

Reserve a receive slot for control credit traffic, tighten RDMA endpoint validation, and keep RDMA wire byte stats thread-safe. Co-Authored-By: Oz <oz-agent@warp.dev>

Poll RDMA transporters during bounded receive-side spinning, include RDMA spintime in the receive spin budget, and ensure RDMA-only receive paths spin before blocking. Propagate runtime hostname activation updates to RDMA transporters and add RDMA config round-trip coverage for inactive-node hostnames and RDMA-specific parameters. Co-Authored-By: Oz <oz-agent@warp.dev>

mronstro · 2026-05-20T14:16:41Z

I have looked at the PR and it looks good, there might be a few minor issues I would like to comment on, but mostly it is good. It would be good if you could sign either an Individual Contributor Agreement or a Corporate Contributor Agreement with Hopsworks, we have such an agreement that I can send to you. It is also interesting to discuss to which version of RonDB the contribution should be made. RonDB 24.10 is in maintenance mode, so definitely a possibility, however most of our customers are already working on 26.02 which is based on MySQL 8.4.8 compared to 8.4.4 for 24.10. 26.02 also supports more than 2000 nodes compared to 255 in 24.10. The patch is safe to include in production versions of RonDB where RDMA is disabled. We are working on RonDB 26.04 (will likely be something like 26.10 when it is production ready). In this version I am thinking that RDMA support can also be integrated in the production builds that we do. We will maintain the RDMA support after your contribution and any changes that you propose for RDMA will be taken in after a review if deemed ok, these contributions would be under the same agreement, so agreement is only required for the first contribution as long as the contribution refers to the agreement.

Allow explicit RDMA connections between API and DB nodes while keeping RDMA endpoint validation limited to DB-DB and API-DB pairs. Update config coverage and transporter diagnostics to document and verify the API RDMA path. Co-Authored-By: Oz <oz-agent@warp.dev>

Co-Authored-By: Oz <oz-agent@warp.dev>

DB-side rejected SEND from API node 113 with TE_RDMA_INVALID_HEADER (0x8032) during a remove burst, logging: RDMA: msg payload_len=96 + header=24 exceeds available=24 On the receiver the SEND arrived with sge.length=24 (a CREDIT_ONLY WR) but the header in those 24 bytes carried payload_len=96 from a data WR. Two concurrent send paths -- doSend() driven by a client thread and post_credit_only_locked() driven by recv_thread_emit_ credit_only() -- selected the same slot via find_free_send_slot() because m_send_slots[i].in_flight was a plain bool with no mutual exclusion on the API side. Each path encoded its own header into the same slot_buf, but the SGE captured at ibv_post_send() time came from the CREDIT_ONLY WR while the slot bytes were last written by the data WR. DB-DB works because ndbmtd serializes doSend per transporter via the send thread; the API has multiple send invocation paths and the assumption that the caller holds lock_send_transporter() across the full doSend() body did not hold. Fix: make rdma_send_slot::in_flight a std::atomic<bool>. Move slot ownership commit into find_free_send_slot() via compare_exchange_ strong with acq_rel ordering. doSend() and post_credit_only_locked() no longer set in_flight=true (the CAS in find_free_send_slot() did it). On ibv_post_send() failure both paths now roll the flag back to false before disconnecting so the slot can be reused after reconnect. Stale-WC defense in reap_send_completions() reads the flag with memory_order_acquire; success path stores false with memory_order_release. Co-Authored-By: Oz <oz-agent@warp.dev>

`RDMA_Transporter-t` is registered via `NDB_ADD_TEST` in `storage/ndb/src/common/transporter/CMakeLists.txt`. The target returns silently from the macro when `WITH_UNIT_TESTS=OFF`, but the option defaults to ON whenever `WITHOUT_SERVER=OFF` (see `CMakeLists.txt:843-849`), which is the standard configure RonDB CI uses for a release build. Under that combination the test executable is created, and the link step has been failing since the target was first added: storage/ndb/src/common/transporter/CMakeFiles/ RDMA_Transporter-t.dir/build.make:130: undefined reference to `copy(unsigned int*&, SectionSegmentPool&, SegmentedSectionPtr const&)' Root cause is independent of the RDMA-side code. `Packer.cpp` (part of the `ndbtransport` convenience library that the test binary links against) emits an unconditional explicit instantiation: template void Packer::pack<Packer::SegmentedSectionArg>( Uint32 *theData, SegmentedSectionArg section) const; That instantiation references the free function bool copy(unsigned int *&, SectionSegmentPool &, SegmentedSectionPtr const &); whose definition lives on the kernel side of the build, alongside the `SegmentSubPool` / `SegmentedSectionPtr` implementations. None of the libraries the test depends on (`ndbtransport`, `ndbmgmapi`, `ndbgeneral`, `ndbportlib`, `${NDB_RDMA_LIBRARIES}`, `mytap`) define the symbol, so the linker cannot resolve it. Whether or not the call path is reachable at runtime is irrelevant -- the explicit template instantiation forces emission of the reference at compile time. The standalone harness `build_scripts/build_and_test_rdma_ transporter.sh` already works around the same issue by generating a tiny stub source on the fly and linking it into a hand-built test binary. This change bakes the same stub into the regular CMake build so the `WITH_UNIT_TESTS=ON` CI path can link `RDMA_Transporter-t` without dragging in any kernel-side libraries. Changes ------- `storage/ndb/src/common/transporter/RDMA_Transporter_test_stubs.cpp` (new file) - Defines `bool copy(unsigned int *&, SectionSegmentPool &, SegmentedSectionPtr const &)` as an abort-stub. The wire-format TAP cases exercised by the test executable touch only the pure helpers `encode_msg_header` / `validate_msg_header` and small in-anonymous-namespace routines; none of those paths invoke `Packer::pack<SegmentedSectionArg>`, so the stub is never reached at runtime. If a future test addition does hit it the abort produces a clear diagnostic identifying the stub by name. Opaque forward declarations of `SectionSegmentPool` / `SegmentedSectionPtr` keep the file free of kernel-side header dependencies. `storage/ndb/src/common/transporter/CMakeLists.txt` - Adds `RDMA_Transporter_test_stubs.cpp` as a second source file to the existing `NDB_ADD_TEST(RDMA_Transporter-t ...)` block. The stub is compiled per-target only -- it does not enter the `ndbtransport` convenience library and so cannot leak into any non-test code path. A comment block above the `NDB_ADD_TEST` call documents why the companion source exists. Verification ------------ Built with the exact configure command from the original bug report: cmake -S . -B build-lab \ -DCMAKE_BUILD_TYPE=Release \ -DWITH_NDB_RDMA=ON \ -DWITHOUT_SERVER=OFF \ -DWITH_NDBCLUSTER_STORAGE_ENGINE=ON \ -DCMAKE_INSTALL_PREFIX=/opt/rondb-24.10-rdma-port cmake --build build-lab --target RDMA_Transporter-t Result: `RDMA_Transporter-t` builds and links cleanly. Running the binary: 1..1 ok 1 - RDMA_Transporter The standalone harness `build_scripts/build_and_test_rdma_transporter.sh` is unchanged; it still works because its hand-built link line already includes its own runtime-generated stub, and the new in-tree companion source is gated on the regular CMake `NDB_ADD_TEST` target. No runtime behaviour changes. The production `ndbtransport` library is byte-identical; only the standalone test binary gains the additional translation unit.

The validator in RDMA_Transporter::send_slot_size_or_zero() and recv_slot_size_or_zero() requires buffer / RdmaQueueDepth >= RDMA_MSG_HEADER_BYTES (24) + MAX_*_MESSAGE_BYTESIZE (32768) = 32792 bytes/slot The shipped defaults of 2 MiB buffers with RdmaQueueDepth=64 gave exactly 2*1024*1024 / 64 = 32768 bytes per slot -- equal to the max signal size and leaving zero room for the 24-byte RDMA framing header. allocate_send_slot_state() therefore returned 0, allocate_verbs_resources() failed, and connect_server_impl() fell into an infinite reconnect loop on out-of-the-box clusters. Bump the defaults to 4 MiB on both directions. At the default QueueDepth=64 each slot is now 65536 bytes -- ~2x the worst-case signal -- and there is enough headroom to raise QueueDepth to 128 without retuning the buffer size. The lower schema bound is left at 256K so operators that intentionally pick a low RdmaQueueDepth (down to the schema minimum of 16) can still satisfy the validator with the existing minimum buffer (256*1024 / 16 = 16384 -- still below 32792, but that case must be hand-tuned and the validator now produces a clear error message pointing the operator at QueueDepth/Buffer).

peakaio-ai and others added 15 commits May 18, 2026 08:44

Fix RDMA transporter credit handling

7e33141

Reserve a receive slot for control credit traffic, tighten RDMA endpoint validation, and keep RDMA wire byte stats thread-safe. Co-Authored-By: Oz <oz-agent@warp.dev>

elemberger-peak and others added 6 commits May 21, 2026 07:09

Address API-to-DB RDMA QA follow-ups

8a2479d

Co-Authored-By: Oz <oz-agent@warp.dev>

Address API-to-DB RDMA review nits

2ed9cd2

Co-Authored-By: Oz <oz-agent@warp.dev>

Clarify RDMA config test expectations

a0ae050

Co-Authored-By: Oz <oz-agent@warp.dev>

elemberger-peak force-pushed the 24.10-rdma-port branch from 0f607b9 to 92248c1 Compare May 23, 2026 11:51

elemberger-peak changed the title ~~Fix RDMA transporter credit handling~~ RDMA transporter (RoCE / IB Verbs) for 24.10 May 23, 2026

This was referenced May 23, 2026

RDMA transporter (RoCE / IB Verbs) for 25.10 #933

Open

RDMA transporter (RoCE / IB Verbs) for 26.02 #934

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RDMA transporter (RoCE / IB Verbs) for 24.10#927

RDMA transporter (RoCE / IB Verbs) for 24.10#927
elemberger-peak wants to merge 22 commits into
logicalclocks:24.10-mainfrom
elemberger-peak:24.10-rdma-port

elemberger-peak commented May 19, 2026 •

edited

Loading

Uh oh!

mronstro commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

elemberger-peak commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included (22 commits, +5623 / -47, 21 files)

Verification

Notes / non-goals

Uh oh!

mronstro commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

elemberger-peak commented May 19, 2026 •

edited

Loading