Skip to content

RDMA transporter (RoCE / IB Verbs) for 24.10#927

Open
elemberger-peak wants to merge 22 commits into
logicalclocks:24.10-mainfrom
elemberger-peak:24.10-rdma-port
Open

RDMA transporter (RoCE / IB Verbs) for 24.10#927
elemberger-peak wants to merge 22 commits into
logicalclocks:24.10-mainfrom
elemberger-peak:24.10-rdma-port

Conversation

@elemberger-peak
Copy link
Copy Markdown

@elemberger-peak elemberger-peak commented May 19, 2026

Summary

Adds an opt-in native RDMA transporter (RoCE / IB Verbs) alongside TCP_Transporter / SHM_Transporter, plus a 12-phase performance/correctness series, an API↔DB gate (off by default), and the tuning/CI fixes we hit on real hardware. Sharing for feedback against 24.10-main.

What's included (22 commits, +5623 / -47, 21 files)

  • RDMA_Transporter — new transporter class with SEND/RECV-based wire protocol, credit-based flow control, and an out-of-band exchange of QP attributes (QPN, PSN, GID/LID) over the existing control socket. Framing: 24-byte header + payload, MAX_*_MESSAGE_BYTESIZE = 32768.
  • Config schema — new [RDMA] connection section: RdmaSendBufferMemory, RdmaRecvBufferMemory, RdmaQueueDepth, RdmaInlineThreshold, RdmaCompletionPollBudget, RdmaSpintime, RdmaDevice, RdmaPort, RdmaGidIndex, RdmaTrafficClass, RdmaRetryCount, RdmaRnrRetryCount.
  • Build/feature gatingWITH_NDB_RDMA CMake option and NDB_RDMA_TRANSPORTER_SUPPORTED runtime guard. Without the flag, all sources compile to no-ops and TCP/SHM behaviour is bit-for-bit unchanged. With the flag set but a binary built without libibverbs symbols (e.g. mgmd), [RDMA] sections are rejected with a clear error message.
  • 12-phase performance/correctness series: P1-2 buffer pool / wire-state cacheline grouping, P3 recv ring, P4 batched doorbells, P5 NUMA-aware completion vectors, P6 MR cache, P7 RDMA WRITE negotiation, P8 remote MR geometry, P9 recv via remote WRITE, P10 write probe, P11 recv-immediate guard, P12 per-WR slot probe.
  • API↔DB gateAllowApiToDbRdma (off by default). DB↔DB is always allowed; API↔DB requires the explicit gate.
  • Test plumbingRDMA_Transporter-t TAP test with a libibverbs kernel-symbol stub so unit tests link without pulling in Packer<SegmentedSectionArg>'s dependency on copy(). MgmConfig-t / testConfigValues-t extended with [RDMA] and API↔DB scenarios.
  • Default-tuning fix (top of stack) — Rdma{Send,Recv}BufferMemory defaults raised from 2M4M. With 2M / RdmaQueueDepth=64 = 32768 the per-slot size equalled MAX_*_MESSAGE_BYTESIZE and left zero room for the 24-byte framing header. The runtime per-slot validator therefore rejected the default config and connect_server_impl() looped on reconnect. 4M gives 65536 bytes/slot at QD=64 and headroom up to QD=128. Schema minimums unchanged.

Verification

  • MgmConfig-tok 1 - MgmConfig
  • testConfigValues-t → exit 0 (RDMA roundtrip incl. API↔DB scenarios)
  • RDMA_Transporter-tok 1 - RDMA_Transporter (TAP) with cmake -DCMAKE_BUILD_TYPE=Release -DWITH_NDB_RDMA=ON -DWITHOUT_SERVER=OFF -DWITH_NDBCLUSTER_STORAGE_ENGINE=ON
  • Live two-node DB↔DB and gated API↔DB tested on Mellanox CX-5 (RoCEv2, mlx5_0), including mdtest under sustained load.

Notes / non-goals

  • Purely additive: no existing transporter, signal flow, ABI, ConfigValues layout, or wire format is touched outside the new [RDMA] section.
  • rnr_retry is currently pinned to 7 at RTS regardless of the configured value; lower values produced transport-retry-exhausted under mdtest. The user-visible setting is preserved so a future patch can re-enable user control without an ABI change.
  • Mainline upgrades / downgrades that don't touch [RDMA] are unaffected.
  • Sister PRs target 25.10-main and 26.02-main with the same scope — those branches additionally include earlier experimental API↔DB work that is now consolidated under the gate.

Happy to split into smaller per-phase PRs if maintainers prefer that flow.

peakaio-ai and others added 15 commits May 18, 2026 08:44
Adds an opt-in, default-OFF native RDMA transporter for inter-replica
data-node traffic, targeting lower commit-path latency on
NoOfReplicas=2 deployments with RoCE-capable hosts.

Gate-1 safety invariant
-----------------------
connect_server_impl() and connect_client_impl() always return false.
They allocate the full verbs lifecycle (PD/CQ/QP/MR), run the endpoint
handshake, drive the QP through INIT/RTR/RTS, log the negotiated
attributes, then release everything and refuse the link. This exercises
every setup path end-to-end without putting an RDMA link into the
cluster data path.

Default build (WITH_NDB_RDMA=OFF) is byte-equivalent to the previous
tree: RDMA_Transporter.cpp is excluded from compilation, no libibverbs
is linked, and every new code site is guarded by
NDB_RDMA_TRANSPORTER_SUPPORTED.

What's included (Milestones 1-11)
---------------------------------
1.  Build option WITH_NDB_RDMA (off by default), libibverbs detection,
    NDB_RDMA_TRANSPORTER_SUPPORTED compile define. CMake gating in
    storage/ndb/src/common/transporter/CMakeLists.txt.
2.  Public definitions: tt_RDMA_TRANSPORTER, TransporterConfiguration::
    rdma union, TE_RDMA_* error codes in TransporterDefinitions.hpp.
3.  Config surface: [RDMA] section in ConfigInfo.cpp, CFG_RDMA_*
    parameter ids (520-531) in mgmapi_config_parameters.h,
    CONNECTION_TYPE_RDMA parsing in IPCConfig.cpp.
4.  Verbs resource lifecycle: device open, PD/CQ/QP/MR allocation,
    page-aligned staging buffers, RAII teardown via
    release_verbs_resources(). Idempotent and crash-safe.
5.  Endpoint exchange: rdma_endpoint_v1 (64B, network byte order)
    swapped over the existing authenticated control socket; QP driven
    through INIT -> post_initial_receives -> RTR -> RTS with configured
    retry / RNR-retry counters.
6.  Wire framing: rdma_msg_header_v1 (24B, network byte order) with
    magic/version/header_len/payload_len/send_seq/ack_seq/credit_delta/
    flags. Public static encode_msg_header / validate_msg_header.
7.  Send path: doSend posts IBV_WR_SEND with optional IBV_SEND_INLINE
    when within the negotiated cap. iovec_data_sent is called only on
    completion, never on post. Send-slot bookkeeping + reap_send_
    completions.
8.  Receive path: per-slot ready queue, reap_recv_completions enforces
    strict in-order send_seq, folds credit_delta into the peer credit
    pool, surfaces CREDIT_ONLY/HEARTBEAT control messages distinctly
    from data messages. Wired into TransporterRegistry via
    friend-access poll_RDMA() helper and dispatch arms in pollReceive()
    + performReceive().
10. Observability counters (Uint64 each, default-zero, accumulating
    across reconnects): send_posted, send_completions_ok/err,
    send_inline, send_credit_stalls, copied_send_bytes, recv_posted,
    recv_completions_ok/err, recv_credit_only_in, copied_recv_bytes,
    cq_polls_send/recv, cq_budget_hits_send/recv, rnr_events,
    retry_exceeded_events, qp_fatal_events, reconnect_attempts. Base
    m_bytes_sent / m_bytes_received are also updated. log_stats()
    emits a single info line on link teardown.
11. Unit tests for the wire format: TAPTEST(RDMA_Transporter) in
    RDMA_Transporter.cpp under TEST_RDMA_TRANSPORTER. Covers
    round-trip with five field combinations, network byte-order layout
    on every field, single-byte corruption rejections, CREDIT_ONLY /
    HEARTBEAT semantic rejections, payload-overflow / short-buffer /
    exact-fit boundary cases. NDB_ADD_TEST entry in CMakeLists.txt
    plus a reproducible build_and_test_rdma_transporter.sh wrapper
    that works around upstream gunit/server-deps when WITHOUT_SERVER.

Milestone 9 (Qmgr / multi-transporter) is intentionally skipped for
Gate 1: single QP per transporter, no live switch to multi-transporter
mode. Milestone 12 (benchmark / hardening) requires Gate 2 (i.e. the
connect_*_impl change to return true) and is out of scope here.

Verified
--------
- Default build (WITH_NDB_RDMA=OFF) of ndbtransport compiles unchanged.
- WITH_NDB_RDMA=ON build of ndbtransport compiles clean under -Werror.
- TAPTEST(RDMA_Transporter) passes (1..1 ok 1 - RDMA_Transporter).

Co-Authored-By: Oz <oz-agent@warp.dev>
When ConfigObject::createSection() returns false (for example for a
CONNECTION_TYPE_RDMA payload, since ConfigObject/ConfigSection only
wire up TCP and SHM under CommSection so far), the caller
saveInConfigValues() in ConfigInfo.cpp was discarding the return
value and then invoking ConfigObject::put(), which unconditionally
dereferenced m_curr_cfg_section. After the prior closeSection() set
m_curr_cfg_section to nullptr, this produced a SIGSEGV in
ConfigSection::check_magic() during ndb_mgmd config-file parsing
(stack: ConfigSection::set -> ConfigObject::put ->
saveInConfigValues -> InitConfigFileParser::run_config_rules).

Two purely additive changes:

1. saveInConfigValues() now checks the return of createSection().
   On failure it reports the error code via ndbout_c and breaks out
   of the section save loop, so the next put() is never reached.

2. ConfigObject::put(Uint32,Uint32), put64() and put(Uint32,const
   char*) now NULL-check m_curr_cfg_section. If it is nullptr they
   record SET_NOT_REAL_SECTION_ERROR and return false instead of
   dereferencing.

Both paths preserve the happy-path behavior bit-for-bit for TCP, SHM
and node/system sections. The Gate-1 RDMA invariant (no RDMA
transporters actually instantiated) is unaffected.

Co-Authored-By: Oz <oz-agent@warp.dev>
… and add defensive checks

Gate-2 (full RDMA support in the binary config layer)
-----------------------------------------------------

ConfigSection now recognises an RdmaTypeId comm-section type and the
matching RDMA_TYPE wire value:

  storage/ndb/include/util/ConfigSection.hpp
    + SectionType::RdmaTypeId = 7
    + static const Uint32 RDMA_TYPE = 2

ConfigSection.cpp grows RDMA branches in every switch that previously
listed TcpTypeId/ShmTypeId: set_section_type(Entry&),
set_config_section_type(), get_section_type_value(),
get_default_section() (returns nullptr for RDMA, since the v2 wire
format still ships 5 default sections only), verify_section(),
set_node_id_from_keys(), unpack_comm_section(). get_v1_length() and
create_v1_section() are taught to merge against a possibly-NULL
default section so an RDMA section carries all its keys inline in v1
without dereferencing the missing m_rdma_default_section.

ConfigObject.cpp:
  - createSection(): map RDMA_TYPE -> RdmaTypeId / CommSection.
  - copy_current(): RdmaTypeId case pushes the section onto
    m_comm_sections without a default-copy (none exists yet).
  - create_default_sections(): two no-op RdmaTypeId cases — RDMA
    sections are not optimised against a per-type default in this
    milestone, preserving the 5-default-section v2 wire format.
  - build_arrays(): RdmaTypeId joins TcpTypeId/ShmTypeId in the
    comm-sections classification.
  - get(): NULL-check the result of get_default_section() before
    dereferencing, so lookups on an RDMA section return false rather
    than SIGSEGV when the key is absent.

The result: an [rdma] block in config.ini is now stored, packed (v1
and v2) and unpacked end-to-end. The Gate-1 saveInConfigValues skip
log ("skipping section RDMA: createSection() failed") is replaced by
normal commit. No on-wire format change — v2 binary configs still
declare 5 default sections.

Gate-1.5 (defensive checks against missing transporters)
--------------------------------------------------------

Two purely additive guards so the data node cannot SEGV when a
configuration omits a transporter for some peer (for any reason,
not just RDMA):

  storage/ndb/src/common/transporter/TransporterRegistry.cpp
    start_connecting(trp_id): if trp_id == 0, trp_id >= maxTransporters
    or allTransporters[trp_id] == nullptr, log a warning and return
    without touching the Transporter pointer.

  storage/ndb/src/kernel/blocks/trpman.cpp
    execOPEN_COMORD (both single-node and broadcast paths): when
    get_the_only_base_trp(nodeId) returns 0, jam() and skip
    start_connecting() entirely. The peer simply stays unreachable
    rather than crashing the local data node.

Together these patches replace the prior SIGSEGV in
Trpman::execOPEN_COMORD -> TransporterRegistry::start_connecting on a
configuration that omitted a 1<->2 transporter with a clean warning
log and a cluster that fails to form (the expected outcome when no
transporter exists between two data nodes).

Co-Authored-By: Oz <oz-agent@warp.dev>
In Gate-2 I picked RDMA_TYPE = 2 for the wire value, but
CONNECTION_TYPE_RDMA in mgmapi_config_parameters.h is 4 (values 2 and
3 are reserved for the deprecated SCI transporter, intentionally
skipped over). ConfigInfo's CI_SECTION entry for "RDMA" carries
CONNECTION_TYPE_RDMA (= 4) in its _default field and that value is
what arrives in ConfigObject::createSection() as the second
argument. With RDMA_TYPE = 2, the type == RDMA_TYPE branch was never
taken; createSection() fell through to the else clause, set
m_error_code = WRONG_COMM_TYPE and returned false. The Gate-1
defensive skip path then logged

  skipping section RDMA: createSection(3000, 4) failed, error_code=25

defeating the purpose of Gate-2.

This is identical to the convention already established for TCP and
SHM:

  TCP_TYPE == CONNECTION_TYPE_TCP == 0
  SHM_TYPE == CONNECTION_TYPE_SHM == 1
  RDMA_TYPE == CONNECTION_TYPE_RDMA == 4   <-- fix

Bumping the value flips the createSection() dispatch onto the RDMA
branch added in the Gate-2 commit; the rest of the wiring
(set_section_type, copy_current, create_default_sections,
build_arrays, get_default_section, unpack_comm_section) needs no
change because they all key off the SectionType enum
(RdmaTypeId = 7), which is independent of the wire value.

Co-Authored-By: Oz <oz-agent@warp.dev>
connect_server_impl / connect_client_impl now return true on a
successful handshake instead of refusing the link. Verbs resources
(PD/CQ/QP/MR + send & recv slot tables) are kept alive across the
return; the registry sees t->isConnected() == true and report_connect
promotes the transporter to CONNECTED, after which the milestone-7/8/9
data path (doSend(), poll_RDMA(), the performReceive() RDMA arm,
reap_recv_completions() etc.) starts carrying signals over the QP.

To satisfy the base Transporter contract that report_connect() ->
recvdata.epoll_add() requires a valid file descriptor, the control
socket used for the endpoint exchange is moved into theSocket
(set non-blocking, with SO_KEEPALIVE) rather than closed. No signal
traffic flows over the socket after RTS; it remains as a stable fd
for the registry's per-transporter bookkeeping and as a fast-fail
EPOLLHUP signal if the peer process dies.

On any handshake failure the path still tears everything down and
returns false; the start_clients_thread reconnect loop retries from
scratch.

This is the patch that turns the RDMA transporter from "verbs +
handshake exerciser" into a fully live link between data nodes.

Co-Authored-By: Oz <oz-agent@warp.dev>
createMultiTransporter() only had TCP and SHM branches and fell through
to require(false) for RDMA. The Gate-1 RDMA_Transporter clone constructor
also ended in require(false) as a safety net. Both fire during phase-2
nodegroup multi-transporter setup (Qmgr::create_multi_transporter ->
TransporterRegistry::createMultiTransporter) on an RDMA-only cluster,
which crashes ndbmtd right after 'Start NDB start phase 2'.

This change:
  * Removes the require(false) at the tail of the clone constructor in
    RDMA_Transporter.cpp; the clone copies configuration from the source
    transporter and lets the normal connect path allocate fresh verbs
    resources (its own QP, MRs, etc.).
  * Adds an RDMA branch to TransporterRegistry::createMultiTransporter,
    parallelling the TCP/SHM clones, gated by
    NDB_RDMA_TRANSPORTER_SUPPORTED.

No behavior change for non-RDMA builds; the new branch sits inside the
existing NDB_RDMA_TRANSPORTER_SUPPORTED guard and TCP/SHM paths are
untouched.

Co-Authored-By: Oz <oz-agent@warp.dev>
doSend() previously copied raw bytes out of the upper-layer iovec
into each RDMA slot up to max_payload, which could land the tail of
a Protocol6 message in slot N and the rest in slot N+1. The receiver
calls TransporterRegistry::unpack() once per slot with the slot's
contiguous payload; unpack_one() can only consume complete signals
from a contiguous buffer and cannot reassemble a partial signal
across the boundary between two RDMA receive slots (the buffers are
non-contiguous memory regions). The result was that after a split
the head slot stayed parked in the receive ready queue with a
partial signal, and slot N+1 (with the continuation) queued behind
it, freezing the receive pipeline for that link. PK ops are tiny
and never tripped this; scans drive multi-KB batches that do.

This patch makes doSend() walk the iovec stream signal by signal,
using Protocol6::getMessageLength(word1) to discover each next
message's length, and only commits a message into the current slot
if it fits entirely. If the next message would overflow the slot's
remaining max_payload room, we post what we have and start a fresh
slot. The upper-layer SendBufferPage / thr_send_page invariant that
a single Protocol6 message never spans a page makes the peek
inexpensive: word1 is always within the current iovec entry.

Defences added:
  - Reject and disconnect on invalid Protocol6 lengths (corrupt
    buffer / bug detection) rather than silently sending garbage.
  - Bail out cleanly when the iovec batch truncates a message;
    the next doSend() pulls more iovecs.

No behavior change for non-RDMA builds; this is contained to
RDMA_Transporter::doSend().

Co-Authored-By: Oz <oz-agent@warp.dev>
After the signal-aligned packing fix, the data path carried scan and PK
traffic correctly, but a real benchmark surfaced GCP_COMMIT lag growing
to 50s (with SignalCounter m_count=1 against the peer) and resulting
266/274 timeouts at the API/SQL layer. There were no qp_fatal, RNR or
retry_exceeded markers on either side — the RDMA QPs stayed healthy
but a multi-transporter clone with one-directional traffic could no
longer get a SEND posted, because m_peer_recv_credits had drained to
zero on that clone.

Root cause: receive credits were only ever returned to the peer by
piggy-backing the credit_delta on an outgoing data SEND. When a clone
mostly carries inbound traffic (e.g. GCP_COMMIT from node 1 to node 2),
the receiving side accumulates m_pending_credit_grant forever and the
peer's m_peer_recv_credits eventually reaches zero. The peer's doSend
then enters the credit-stall branch and the in-order GCP signal sits
in its send buffer, waiting for credits that will never arrive on the
back channel.

Fix: at the end of doSend(), after the main draining loop, if the
accumulated grant has crossed queue_depth/2, or if our own credit pool
against the peer is already low (m_peer_recv_credits < queue_depth/4)
and we have any grants to send, post an explicit zero-payload SEND
with RDMA_MSG_FLAG_CREDIT_ONLY and credit_delta set. This is exactly
the path the original Gate-1 plan described but had not implemented.

The receive side already handles CREDIT_ONLY messages correctly:
reap_recv_completions() applies the credit_delta, increments
recv_credit_only_in, and re-posts the slot without enqueueing for
the unpacker.

Defences: a fatal ibv_post_send error on the CREDIT_ONLY path also
disconnects via TE_RDMA_QP_ERROR. The CREDIT_ONLY slot is committed
with bytes_consumed=0 so reap_send_completions does not call
iovec_data_sent() (there is no upper-layer payload tied to it).

Co-Authored-By: Oz <oz-agent@warp.dev>
The existing CREDIT_ONLY refill in doSend() never fires on multi-
transporter clones that mostly carry inbound traffic, because doSend()
is rarely (or never) invoked on those clones. As a result, the peer's
outbound credit pool against us drains to zero and the inbound
direction stalls under benchmark load. Visible symptoms include
GCP_COMMIT lag, NDB-274 scan timeouts, and NDB-286 node-fencing.

Refactor the existing CREDIT_ONLY emission block out of doSend() into
post_credit_only_locked(), then add recv_thread_emit_credit_only():

* Fast-path lock-free check on m_pending_credit_grant; no-op when the
  half-queue-depth threshold is not met.
* Acquires the per-transporter send lock (lock_send_transporter),
  reaps pending send completions, and posts one CREDIT_ONLY WR via
  post_credit_only_locked(). The send lock is per-transporter and the
  send thread never takes the global receive lock, so this cannot
  deadlock against doSend() on any transporter.

Call the new method at the tail of the RDMA branch in
TransporterRegistry::performReceive() so every recv batch on every
RDMA transporter gets a chance to flush its accumulated grants.

Add a send_credit_only_recv_path counter and surface it in the
per-link stats line so operators can verify the new path is firing.

Co-Authored-By: Oz <oz-agent@warp.dev>
…poll

Without a wakeup mechanism the receive thread only notices incoming
RDMA data at the next epoll_wait timeout in check_TCP(), because
poll_RDMA() is non-blocking and is only called after check_TCP returns.
For a pure-RDMA deployment that bounds RDMA receive latency below by
the epoll timeout regardless of HCA speed -- the most critical gap in
the Gate-3 implementation, as flagged by the RonDB team.

Bind each transporter's recv CQ to a dedicated ibv_comp_channel,
make the channel's fd non-blocking, and register that fd in the
receive thread's epoll set alongside the existing control-socket fd.
Tag the event data with RDMA_COMP_CHANNEL_BIT (high bit of u32) so
check_TCP() can route comp-channel events to a separate path:
handle_recv_comp_event() drains and acks the events and marks
m_read_transporters so performReceive()'s RDMA branch will call
reap_recv_completions() and process the CQEs.

The standard arm-vs-fire race is handled by arming the CQ via
ibv_req_notify_cq() at the tail of every reap_recv_completions() pass,
and once at the end of run_endpoint_exchange() so the first inbound
CQE on a fresh link generates a notification. We accept the tiny race
window between the last poll returning 0 and the arm because the
existing poll_RDMA() cadence will still pick up such CQEs.

EPOLLHUP on the comp channel fd is ignored: the channel is destroyed
in release_verbs_resources() during disconnect, and the kernel
auto-removes the closed fd from epoll. Disconnect is driven by the
control socket's HUP on the same trp_id, which is processed in the
existing branch.

release_verbs_resources() drains and acks any residual CQ events
before destroying the recv CQ, because ibv_destroy_cq blocks while
there are unacked events.

Failure modes:
  - ibv_create_comp_channel() failure is fatal at connect time.
  - O_NONBLOCK on the channel fd is fatal at connect time (otherwise
    the recv thread would block inside ibv_get_cq_event()).
  - ibv_req_notify_cq() failure is logged but non-fatal; the existing
    pollReceive() cadence keeps the link working without the wakeup
    optimization.
  - epoll_ctl(ADD, channel_fd) failure is logged but non-fatal for
    the same reason.

Co-Authored-By: Oz <oz-agent@warp.dev>
After Gate-3 sender-side packing (1691a1f / d873737), doSend()
legitimately packs MULTIPLE complete Protocol6 messages into a single
RDMA payload up to (slot_size - RDMA_MSG_HEADER_BYTES). On a 4 MB
recv buffer with queue depth 64 that is 65512 bytes per slot.

validate_msg_header still enforced a single-Protocol6-message upper
bound of payload_len <= MAX_RECV_MESSAGE_BYTESIZE (32 KB), which was
correct before Gate-3 but now falsely rejects legitimate bulk-copy
bursts. Symptom observed in production: cli-142 ndbmtd flapped in a
restart loop driven by start-phase-5 traffic ("Copying of dictionary
information"):

  RDMA: msg payload_len=64740 exceeds MAX_RECV_MESSAGE_BYTESIZE=32768
  Transporter 51 to node 2 disconnected in recv with errnum: 74
  ALERT -- Node 1: Forced node shutdown completed. Occurred during
          startphase 5. Caused by error 2308

The per-Protocol6-message size limit is still enforced inside
Packer::unpack_one() where it has the actual signal boundary, so we
do not lose any protection by removing the duplicate check here. The
remaining (payload_len + header > available) check still bounds
payload_len to slot_size - header because the HCA cannot write more
than one full receive slot into the WC.

Updates the corresponding TAP test in RDMA_Transporter.cpp: the
"exceeds protocol-level maximum" rejection becomes an acceptance test
that documents the new behavior (payload_len = MAX_RECV_MESSAGE_BYTESIZE+1
with a matching `available` must succeed and round-trip through the
output pointer).

Test: build_scripts/build_and_test_rdma_transporter.sh -> ok 1.
recv_thread_emit_credit_only() (introduced in d643983/bf7cd2f9103
to refill credits on inbound-only multi-trp clones) reaped pending
send completions before posting CREDIT_ONLY. reap_send_completions()
invokes iovec_data_sent() -> trp_callback::bytes_sent() on every
freed data WR, and trp_callback::bytes_sent() in mt.cpp releases
the now-completed thr_send_pages into a thread_local_pool, either
the block thread own pool or a dedicated send thread pool obtained
via g_send_threads->get_send_buffer_pool().

thread_local_pool is, by design, safe only when mutated from its
owner thread: its internal free-list is protected by thread
affinity, not by the per-transporter m_send_lock that
recv_thread_emit_credit_only() holds. Touching it from the recv
thread races with the owner thread own pool operations on other
transporters sharing that pool and segfaults inside
trp_callback::bytes_sent -> ::bytes_sent -> release_list.

The crash was hidden by the validate_msg_header upper-bound check
until commit 0a5514b let real start-phase-5 SR-copy traffic
through. As soon as the dictionary-copy burst reached the upper
layer on cli-142, the recv threads pending-credit-grant crossed
the queue_depth/2 threshold for the first time and the recv-thread
reap fired, faulting in trp_callback::bytes_sent. Symbolized
stack at the crash:

  logicalclocks#2  trp_callback::bytes_sent
  logicalclocks#3  RDMA_Transporter::reap_send_completions [clone .part.0]
  logicalclocks#4  RDMA_Transporter::recv_thread_emit_credit_only
  logicalclocks#5  TransporterRegistry::performReceive
  logicalclocks#6  mt_receiver_thread_main

Fix: drop the reap_send_completions() call from
recv_thread_emit_credit_only(). The send thread iterates every
transporter every cycle and calls reap_send_completions() inside
doSend() in the thread-correct context, so completed CREDIT_ONLY
WRs are still recycled in a bounded number of send-thread passes.
If no slot is free on a given recv-thread pass,
post_credit_only_locked() already returns false via its
(m_send_slots_in_flight >= m_queue_depth) early return; the
missed CREDIT_ONLY is recovered on the next receive batch with
no additional state.

No wire-format change. No new state. Pure removal of a
cross-thread callback.
Commit 036ea2a fixed a SIGSEGV crash by removing
reap_send_completions() from recv_thread_emit_credit_only(), but it
re-introduced the original credit-starvation deadlock on inbound-only
multi-transporter clones. In ndbmtd the send thread (mt.cpp do_send)
iterates only m_pending_send_trps[], populated by
register_pending_send() when a block thread enqueues outbound data.
A clone with no block-thread outbound traffic is therefore never
visited by the send thread; its CREDIT_ONLY WRs posted by the recv
thread accumulate, post_credit_only_locked() starts returning false,
m_pending_credit_grant cannot drain, peer m_peer_recv_credits drains
to zero, inbound stalls. The symptom returned after the cli-143
rolling restart as renewed GCP_COMMIT lag (10-152s on the mgmd GCP
Monitor) and sustained NDB-274/NDB-286 in pnfs-mds.

Reintroducing reap_send_completions() unconditionally is unsafe: it
calls iovec_data_sent() -> trp_callback::bytes_sent() (mt.cpp line
5504), which asserts sb->m_send_thread != NO_SEND_THREAD and then
mutates a thread_local_pool<thr_send_page> via release_local() that
has no lock and is bound by thread affinity (release_local touches
m_free and m_freelist directly). Calling it from the recv thread
races the owning send/block thread on other transporters using the
same pool, and crashes inside release_list.

This commit restores the reap, but gated on m_bytes_in_flight == 0.
The class invariant
    m_bytes_in_flight == sum over slots of bytes_consumed
is maintained because:
  - doSend() posts data WRs with bytes_consumed = payload_len and
    m_bytes_in_flight += payload_len.
  - post_credit_only_locked() posts CREDIT_ONLY WRs with
    bytes_consumed = 0 and leaves m_bytes_in_flight unchanged.
  - reap_send_completions() does m_bytes_in_flight -= bytes when
    freeing each slot.
Therefore m_bytes_in_flight == 0 implies every in-flight slot has
bytes_consumed == 0. For those slots reap_send_completions() never
enters its
    if (bytes > 0) iovec_data_sent((int)bytes);
branch, so trp_callback::bytes_sent() is not called and the unsafe
thread_local_pool race cannot occur.

The read of m_bytes_in_flight is safe under the per-transporter
send lock we already hold: no other thread can post a data WR or
reap while we hold the lock, so the value is stable for the
duration of the guarded reap.

When m_bytes_in_flight > 0, the block thread that posted those
data bytes already added the trp to its pending-send mask via
register_pending_send(), so the send thread will visit doSend()
for this trp in a bounded number of cycles and reap on the
thread-correct context. Skipping the recv-thread reap in that
case is therefore not a stall source.

Local build clean. TAP test (RDMA_Transporter-t) passes (1/1).
Reserve a receive slot for control credit traffic, tighten RDMA endpoint validation, and keep RDMA wire byte stats thread-safe.

Co-Authored-By: Oz <oz-agent@warp.dev>
Poll RDMA transporters during bounded receive-side spinning, include RDMA spintime in the receive spin budget, and ensure RDMA-only receive paths spin before blocking.

Propagate runtime hostname activation updates to RDMA transporters and add RDMA config round-trip coverage for inactive-node hostnames and RDMA-specific parameters.

Co-Authored-By: Oz <oz-agent@warp.dev>
@mronstro
Copy link
Copy Markdown
Collaborator

I have looked at the PR and it looks good, there might be a few minor issues I would like to comment on, but mostly it is good. It would be good if you could sign either an Individual Contributor Agreement or a Corporate Contributor Agreement with Hopsworks, we have such an agreement that I can send to you. It is also interesting to discuss to which version of RonDB the contribution should be made. RonDB 24.10 is in maintenance mode, so definitely a possibility, however most of our customers are already working on 26.02 which is based on MySQL 8.4.8 compared to 8.4.4 for 24.10. 26.02 also supports more than 2000 nodes compared to 255 in 24.10. The patch is safe to include in production versions of RonDB where RDMA is disabled. We are working on RonDB 26.04 (will likely be something like 26.10 when it is production ready). In this version I am thinking that RDMA support can also be integrated in the production builds that we do. We will maintain the RDMA support after your contribution and any changes that you propose for RDMA will be taken in after a review if deemed ok, these contributions would be under the same agreement, so agreement is only required for the first contribution as long as the contribution refers to the agreement.

elemberger-peak and others added 6 commits May 21, 2026 07:09
Allow explicit RDMA connections between API and DB nodes while keeping RDMA endpoint validation limited to DB-DB and API-DB pairs. Update config coverage and transporter diagnostics to document and verify the API RDMA path.

Co-Authored-By: Oz <oz-agent@warp.dev>
Co-Authored-By: Oz <oz-agent@warp.dev>
Co-Authored-By: Oz <oz-agent@warp.dev>
Co-Authored-By: Oz <oz-agent@warp.dev>
DB-side rejected SEND from API node 113 with TE_RDMA_INVALID_HEADER
(0x8032) during a remove burst, logging:

  RDMA: msg payload_len=96 + header=24 exceeds available=24

On the receiver the SEND arrived with sge.length=24 (a CREDIT_ONLY
WR) but the header in those 24 bytes carried payload_len=96 from a
data WR. Two concurrent send paths -- doSend() driven by a client
thread and post_credit_only_locked() driven by recv_thread_emit_
credit_only() -- selected the same slot via find_free_send_slot()
because m_send_slots[i].in_flight was a plain bool with no mutual
exclusion on the API side. Each path encoded its own header into
the same slot_buf, but the SGE captured at ibv_post_send() time
came from the CREDIT_ONLY WR while the slot bytes were last written
by the data WR.

DB-DB works because ndbmtd serializes doSend per transporter via
the send thread; the API has multiple send invocation paths and
the assumption that the caller holds lock_send_transporter() across
the full doSend() body did not hold.

Fix: make rdma_send_slot::in_flight a std::atomic<bool>. Move slot
ownership commit into find_free_send_slot() via compare_exchange_
strong with acq_rel ordering. doSend() and post_credit_only_locked()
no longer set in_flight=true (the CAS in find_free_send_slot() did
it). On ibv_post_send() failure both paths now roll the flag back
to false before disconnecting so the slot can be reused after
reconnect. Stale-WC defense in reap_send_completions() reads the
flag with memory_order_acquire; success path stores false with
memory_order_release.

Co-Authored-By: Oz <oz-agent@warp.dev>
`RDMA_Transporter-t` is registered via `NDB_ADD_TEST` in
`storage/ndb/src/common/transporter/CMakeLists.txt`. The target
returns silently from the macro when `WITH_UNIT_TESTS=OFF`, but
the option defaults to ON whenever `WITHOUT_SERVER=OFF` (see
`CMakeLists.txt:843-849`), which is the standard configure
RonDB CI uses for a release build. Under that combination the
test executable is created, and the link step has been failing
since the target was first added:

    storage/ndb/src/common/transporter/CMakeFiles/
        RDMA_Transporter-t.dir/build.make:130:
    undefined reference to
      `copy(unsigned int*&, SectionSegmentPool&,
            SegmentedSectionPtr const&)'

Root cause is independent of the RDMA-side code. `Packer.cpp`
(part of the `ndbtransport` convenience library that the test
binary links against) emits an unconditional explicit
instantiation:

    template void Packer::pack<Packer::SegmentedSectionArg>(
        Uint32 *theData, SegmentedSectionArg section) const;

That instantiation references the free function

    bool copy(unsigned int *&, SectionSegmentPool &,
              SegmentedSectionPtr const &);

whose definition lives on the kernel side of the build,
alongside the `SegmentSubPool` / `SegmentedSectionPtr`
implementations. None of the libraries the test depends on
(`ndbtransport`, `ndbmgmapi`, `ndbgeneral`, `ndbportlib`,
`${NDB_RDMA_LIBRARIES}`, `mytap`) define the symbol, so the
linker cannot resolve it. Whether or not the call path is
reachable at runtime is irrelevant -- the explicit template
instantiation forces emission of the reference at compile
time.

The standalone harness `build_scripts/build_and_test_rdma_
transporter.sh` already works around the same issue by
generating a tiny stub source on the fly and linking it
into a hand-built test binary. This change bakes the same
stub into the regular CMake build so the
`WITH_UNIT_TESTS=ON` CI path can link `RDMA_Transporter-t`
without dragging in any kernel-side libraries.

Changes
-------

`storage/ndb/src/common/transporter/RDMA_Transporter_test_stubs.cpp`
(new file)
  - Defines `bool copy(unsigned int *&, SectionSegmentPool &,
    SegmentedSectionPtr const &)` as an abort-stub. The wire-format
    TAP cases exercised by the test executable touch only the
    pure helpers `encode_msg_header` /
    `validate_msg_header` and small in-anonymous-namespace
    routines; none of those paths invoke
    `Packer::pack<SegmentedSectionArg>`, so the stub is never
    reached at runtime. If a future test addition does hit it
    the abort produces a clear diagnostic identifying the stub
    by name. Opaque forward declarations of
    `SectionSegmentPool` / `SegmentedSectionPtr` keep the file
    free of kernel-side header dependencies.

`storage/ndb/src/common/transporter/CMakeLists.txt`
  - Adds `RDMA_Transporter_test_stubs.cpp` as a second source
    file to the existing `NDB_ADD_TEST(RDMA_Transporter-t ...)`
    block. The stub is compiled per-target only -- it does not
    enter the `ndbtransport` convenience library and so cannot
    leak into any non-test code path. A comment block above the
    `NDB_ADD_TEST` call documents why the companion source
    exists.

Verification
------------

Built with the exact configure command from the original bug
report:

    cmake -S . -B build-lab \
      -DCMAKE_BUILD_TYPE=Release \
      -DWITH_NDB_RDMA=ON \
      -DWITHOUT_SERVER=OFF \
      -DWITH_NDBCLUSTER_STORAGE_ENGINE=ON \
      -DCMAKE_INSTALL_PREFIX=/opt/rondb-24.10-rdma-port
    cmake --build build-lab --target RDMA_Transporter-t

Result: `RDMA_Transporter-t` builds and links cleanly.

Running the binary:

    1..1
    ok 1 - RDMA_Transporter

The standalone harness
`build_scripts/build_and_test_rdma_transporter.sh` is
unchanged; it still works because its hand-built link line
already includes its own runtime-generated stub, and the new
in-tree companion source is gated on the regular CMake
`NDB_ADD_TEST` target.

No runtime behaviour changes. The production `ndbtransport`
library is byte-identical; only the standalone test binary
gains the additional translation unit.
The validator in RDMA_Transporter::send_slot_size_or_zero() and
recv_slot_size_or_zero() requires

    buffer / RdmaQueueDepth >= RDMA_MSG_HEADER_BYTES (24) +
                               MAX_*_MESSAGE_BYTESIZE (32768)
                             = 32792 bytes/slot

The shipped defaults of 2 MiB buffers with RdmaQueueDepth=64 gave
exactly 2*1024*1024 / 64 = 32768 bytes per slot -- equal to the
max signal size and leaving zero room for the 24-byte RDMA
framing header. allocate_send_slot_state() therefore returned 0,
allocate_verbs_resources() failed, and connect_server_impl() fell
into an infinite reconnect loop on out-of-the-box clusters.

Bump the defaults to 4 MiB on both directions. At the default
QueueDepth=64 each slot is now 65536 bytes -- ~2x the worst-case
signal -- and there is enough headroom to raise QueueDepth to
128 without retuning the buffer size. The lower schema bound is
left at 256K so operators that intentionally pick a low
RdmaQueueDepth (down to the schema minimum of 16) can still
satisfy the validator with the existing minimum buffer
(256*1024 / 16 = 16384 -- still below 32792, but that case must
be hand-tuned and the validator now produces a clear error
message pointing the operator at QueueDepth/Buffer).
@elemberger-peak elemberger-peak changed the title Fix RDMA transporter credit handling RDMA transporter (RoCE / IB Verbs) for 24.10 May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants