RDMA transporter (RoCE / IB Verbs) for 24.10#927
Conversation
Adds an opt-in, default-OFF native RDMA transporter for inter-replica
data-node traffic, targeting lower commit-path latency on
NoOfReplicas=2 deployments with RoCE-capable hosts.
Gate-1 safety invariant
-----------------------
connect_server_impl() and connect_client_impl() always return false.
They allocate the full verbs lifecycle (PD/CQ/QP/MR), run the endpoint
handshake, drive the QP through INIT/RTR/RTS, log the negotiated
attributes, then release everything and refuse the link. This exercises
every setup path end-to-end without putting an RDMA link into the
cluster data path.
Default build (WITH_NDB_RDMA=OFF) is byte-equivalent to the previous
tree: RDMA_Transporter.cpp is excluded from compilation, no libibverbs
is linked, and every new code site is guarded by
NDB_RDMA_TRANSPORTER_SUPPORTED.
What's included (Milestones 1-11)
---------------------------------
1. Build option WITH_NDB_RDMA (off by default), libibverbs detection,
NDB_RDMA_TRANSPORTER_SUPPORTED compile define. CMake gating in
storage/ndb/src/common/transporter/CMakeLists.txt.
2. Public definitions: tt_RDMA_TRANSPORTER, TransporterConfiguration::
rdma union, TE_RDMA_* error codes in TransporterDefinitions.hpp.
3. Config surface: [RDMA] section in ConfigInfo.cpp, CFG_RDMA_*
parameter ids (520-531) in mgmapi_config_parameters.h,
CONNECTION_TYPE_RDMA parsing in IPCConfig.cpp.
4. Verbs resource lifecycle: device open, PD/CQ/QP/MR allocation,
page-aligned staging buffers, RAII teardown via
release_verbs_resources(). Idempotent and crash-safe.
5. Endpoint exchange: rdma_endpoint_v1 (64B, network byte order)
swapped over the existing authenticated control socket; QP driven
through INIT -> post_initial_receives -> RTR -> RTS with configured
retry / RNR-retry counters.
6. Wire framing: rdma_msg_header_v1 (24B, network byte order) with
magic/version/header_len/payload_len/send_seq/ack_seq/credit_delta/
flags. Public static encode_msg_header / validate_msg_header.
7. Send path: doSend posts IBV_WR_SEND with optional IBV_SEND_INLINE
when within the negotiated cap. iovec_data_sent is called only on
completion, never on post. Send-slot bookkeeping + reap_send_
completions.
8. Receive path: per-slot ready queue, reap_recv_completions enforces
strict in-order send_seq, folds credit_delta into the peer credit
pool, surfaces CREDIT_ONLY/HEARTBEAT control messages distinctly
from data messages. Wired into TransporterRegistry via
friend-access poll_RDMA() helper and dispatch arms in pollReceive()
+ performReceive().
10. Observability counters (Uint64 each, default-zero, accumulating
across reconnects): send_posted, send_completions_ok/err,
send_inline, send_credit_stalls, copied_send_bytes, recv_posted,
recv_completions_ok/err, recv_credit_only_in, copied_recv_bytes,
cq_polls_send/recv, cq_budget_hits_send/recv, rnr_events,
retry_exceeded_events, qp_fatal_events, reconnect_attempts. Base
m_bytes_sent / m_bytes_received are also updated. log_stats()
emits a single info line on link teardown.
11. Unit tests for the wire format: TAPTEST(RDMA_Transporter) in
RDMA_Transporter.cpp under TEST_RDMA_TRANSPORTER. Covers
round-trip with five field combinations, network byte-order layout
on every field, single-byte corruption rejections, CREDIT_ONLY /
HEARTBEAT semantic rejections, payload-overflow / short-buffer /
exact-fit boundary cases. NDB_ADD_TEST entry in CMakeLists.txt
plus a reproducible build_and_test_rdma_transporter.sh wrapper
that works around upstream gunit/server-deps when WITHOUT_SERVER.
Milestone 9 (Qmgr / multi-transporter) is intentionally skipped for
Gate 1: single QP per transporter, no live switch to multi-transporter
mode. Milestone 12 (benchmark / hardening) requires Gate 2 (i.e. the
connect_*_impl change to return true) and is out of scope here.
Verified
--------
- Default build (WITH_NDB_RDMA=OFF) of ndbtransport compiles unchanged.
- WITH_NDB_RDMA=ON build of ndbtransport compiles clean under -Werror.
- TAPTEST(RDMA_Transporter) passes (1..1 ok 1 - RDMA_Transporter).
Co-Authored-By: Oz <oz-agent@warp.dev>
When ConfigObject::createSection() returns false (for example for a CONNECTION_TYPE_RDMA payload, since ConfigObject/ConfigSection only wire up TCP and SHM under CommSection so far), the caller saveInConfigValues() in ConfigInfo.cpp was discarding the return value and then invoking ConfigObject::put(), which unconditionally dereferenced m_curr_cfg_section. After the prior closeSection() set m_curr_cfg_section to nullptr, this produced a SIGSEGV in ConfigSection::check_magic() during ndb_mgmd config-file parsing (stack: ConfigSection::set -> ConfigObject::put -> saveInConfigValues -> InitConfigFileParser::run_config_rules). Two purely additive changes: 1. saveInConfigValues() now checks the return of createSection(). On failure it reports the error code via ndbout_c and breaks out of the section save loop, so the next put() is never reached. 2. ConfigObject::put(Uint32,Uint32), put64() and put(Uint32,const char*) now NULL-check m_curr_cfg_section. If it is nullptr they record SET_NOT_REAL_SECTION_ERROR and return false instead of dereferencing. Both paths preserve the happy-path behavior bit-for-bit for TCP, SHM and node/system sections. The Gate-1 RDMA invariant (no RDMA transporters actually instantiated) is unaffected. Co-Authored-By: Oz <oz-agent@warp.dev>
… and add defensive checks
Gate-2 (full RDMA support in the binary config layer)
-----------------------------------------------------
ConfigSection now recognises an RdmaTypeId comm-section type and the
matching RDMA_TYPE wire value:
storage/ndb/include/util/ConfigSection.hpp
+ SectionType::RdmaTypeId = 7
+ static const Uint32 RDMA_TYPE = 2
ConfigSection.cpp grows RDMA branches in every switch that previously
listed TcpTypeId/ShmTypeId: set_section_type(Entry&),
set_config_section_type(), get_section_type_value(),
get_default_section() (returns nullptr for RDMA, since the v2 wire
format still ships 5 default sections only), verify_section(),
set_node_id_from_keys(), unpack_comm_section(). get_v1_length() and
create_v1_section() are taught to merge against a possibly-NULL
default section so an RDMA section carries all its keys inline in v1
without dereferencing the missing m_rdma_default_section.
ConfigObject.cpp:
- createSection(): map RDMA_TYPE -> RdmaTypeId / CommSection.
- copy_current(): RdmaTypeId case pushes the section onto
m_comm_sections without a default-copy (none exists yet).
- create_default_sections(): two no-op RdmaTypeId cases — RDMA
sections are not optimised against a per-type default in this
milestone, preserving the 5-default-section v2 wire format.
- build_arrays(): RdmaTypeId joins TcpTypeId/ShmTypeId in the
comm-sections classification.
- get(): NULL-check the result of get_default_section() before
dereferencing, so lookups on an RDMA section return false rather
than SIGSEGV when the key is absent.
The result: an [rdma] block in config.ini is now stored, packed (v1
and v2) and unpacked end-to-end. The Gate-1 saveInConfigValues skip
log ("skipping section RDMA: createSection() failed") is replaced by
normal commit. No on-wire format change — v2 binary configs still
declare 5 default sections.
Gate-1.5 (defensive checks against missing transporters)
--------------------------------------------------------
Two purely additive guards so the data node cannot SEGV when a
configuration omits a transporter for some peer (for any reason,
not just RDMA):
storage/ndb/src/common/transporter/TransporterRegistry.cpp
start_connecting(trp_id): if trp_id == 0, trp_id >= maxTransporters
or allTransporters[trp_id] == nullptr, log a warning and return
without touching the Transporter pointer.
storage/ndb/src/kernel/blocks/trpman.cpp
execOPEN_COMORD (both single-node and broadcast paths): when
get_the_only_base_trp(nodeId) returns 0, jam() and skip
start_connecting() entirely. The peer simply stays unreachable
rather than crashing the local data node.
Together these patches replace the prior SIGSEGV in
Trpman::execOPEN_COMORD -> TransporterRegistry::start_connecting on a
configuration that omitted a 1<->2 transporter with a clean warning
log and a cluster that fails to form (the expected outcome when no
transporter exists between two data nodes).
Co-Authored-By: Oz <oz-agent@warp.dev>
In Gate-2 I picked RDMA_TYPE = 2 for the wire value, but CONNECTION_TYPE_RDMA in mgmapi_config_parameters.h is 4 (values 2 and 3 are reserved for the deprecated SCI transporter, intentionally skipped over). ConfigInfo's CI_SECTION entry for "RDMA" carries CONNECTION_TYPE_RDMA (= 4) in its _default field and that value is what arrives in ConfigObject::createSection() as the second argument. With RDMA_TYPE = 2, the type == RDMA_TYPE branch was never taken; createSection() fell through to the else clause, set m_error_code = WRONG_COMM_TYPE and returned false. The Gate-1 defensive skip path then logged skipping section RDMA: createSection(3000, 4) failed, error_code=25 defeating the purpose of Gate-2. This is identical to the convention already established for TCP and SHM: TCP_TYPE == CONNECTION_TYPE_TCP == 0 SHM_TYPE == CONNECTION_TYPE_SHM == 1 RDMA_TYPE == CONNECTION_TYPE_RDMA == 4 <-- fix Bumping the value flips the createSection() dispatch onto the RDMA branch added in the Gate-2 commit; the rest of the wiring (set_section_type, copy_current, create_default_sections, build_arrays, get_default_section, unpack_comm_section) needs no change because they all key off the SectionType enum (RdmaTypeId = 7), which is independent of the wire value. Co-Authored-By: Oz <oz-agent@warp.dev>
connect_server_impl / connect_client_impl now return true on a successful handshake instead of refusing the link. Verbs resources (PD/CQ/QP/MR + send & recv slot tables) are kept alive across the return; the registry sees t->isConnected() == true and report_connect promotes the transporter to CONNECTED, after which the milestone-7/8/9 data path (doSend(), poll_RDMA(), the performReceive() RDMA arm, reap_recv_completions() etc.) starts carrying signals over the QP. To satisfy the base Transporter contract that report_connect() -> recvdata.epoll_add() requires a valid file descriptor, the control socket used for the endpoint exchange is moved into theSocket (set non-blocking, with SO_KEEPALIVE) rather than closed. No signal traffic flows over the socket after RTS; it remains as a stable fd for the registry's per-transporter bookkeeping and as a fast-fail EPOLLHUP signal if the peer process dies. On any handshake failure the path still tears everything down and returns false; the start_clients_thread reconnect loop retries from scratch. This is the patch that turns the RDMA transporter from "verbs + handshake exerciser" into a fully live link between data nodes. Co-Authored-By: Oz <oz-agent@warp.dev>
createMultiTransporter() only had TCP and SHM branches and fell through
to require(false) for RDMA. The Gate-1 RDMA_Transporter clone constructor
also ended in require(false) as a safety net. Both fire during phase-2
nodegroup multi-transporter setup (Qmgr::create_multi_transporter ->
TransporterRegistry::createMultiTransporter) on an RDMA-only cluster,
which crashes ndbmtd right after 'Start NDB start phase 2'.
This change:
* Removes the require(false) at the tail of the clone constructor in
RDMA_Transporter.cpp; the clone copies configuration from the source
transporter and lets the normal connect path allocate fresh verbs
resources (its own QP, MRs, etc.).
* Adds an RDMA branch to TransporterRegistry::createMultiTransporter,
parallelling the TCP/SHM clones, gated by
NDB_RDMA_TRANSPORTER_SUPPORTED.
No behavior change for non-RDMA builds; the new branch sits inside the
existing NDB_RDMA_TRANSPORTER_SUPPORTED guard and TCP/SHM paths are
untouched.
Co-Authored-By: Oz <oz-agent@warp.dev>
doSend() previously copied raw bytes out of the upper-layer iovec
into each RDMA slot up to max_payload, which could land the tail of
a Protocol6 message in slot N and the rest in slot N+1. The receiver
calls TransporterRegistry::unpack() once per slot with the slot's
contiguous payload; unpack_one() can only consume complete signals
from a contiguous buffer and cannot reassemble a partial signal
across the boundary between two RDMA receive slots (the buffers are
non-contiguous memory regions). The result was that after a split
the head slot stayed parked in the receive ready queue with a
partial signal, and slot N+1 (with the continuation) queued behind
it, freezing the receive pipeline for that link. PK ops are tiny
and never tripped this; scans drive multi-KB batches that do.
This patch makes doSend() walk the iovec stream signal by signal,
using Protocol6::getMessageLength(word1) to discover each next
message's length, and only commits a message into the current slot
if it fits entirely. If the next message would overflow the slot's
remaining max_payload room, we post what we have and start a fresh
slot. The upper-layer SendBufferPage / thr_send_page invariant that
a single Protocol6 message never spans a page makes the peek
inexpensive: word1 is always within the current iovec entry.
Defences added:
- Reject and disconnect on invalid Protocol6 lengths (corrupt
buffer / bug detection) rather than silently sending garbage.
- Bail out cleanly when the iovec batch truncates a message;
the next doSend() pulls more iovecs.
No behavior change for non-RDMA builds; this is contained to
RDMA_Transporter::doSend().
Co-Authored-By: Oz <oz-agent@warp.dev>
After the signal-aligned packing fix, the data path carried scan and PK traffic correctly, but a real benchmark surfaced GCP_COMMIT lag growing to 50s (with SignalCounter m_count=1 against the peer) and resulting 266/274 timeouts at the API/SQL layer. There were no qp_fatal, RNR or retry_exceeded markers on either side — the RDMA QPs stayed healthy but a multi-transporter clone with one-directional traffic could no longer get a SEND posted, because m_peer_recv_credits had drained to zero on that clone. Root cause: receive credits were only ever returned to the peer by piggy-backing the credit_delta on an outgoing data SEND. When a clone mostly carries inbound traffic (e.g. GCP_COMMIT from node 1 to node 2), the receiving side accumulates m_pending_credit_grant forever and the peer's m_peer_recv_credits eventually reaches zero. The peer's doSend then enters the credit-stall branch and the in-order GCP signal sits in its send buffer, waiting for credits that will never arrive on the back channel. Fix: at the end of doSend(), after the main draining loop, if the accumulated grant has crossed queue_depth/2, or if our own credit pool against the peer is already low (m_peer_recv_credits < queue_depth/4) and we have any grants to send, post an explicit zero-payload SEND with RDMA_MSG_FLAG_CREDIT_ONLY and credit_delta set. This is exactly the path the original Gate-1 plan described but had not implemented. The receive side already handles CREDIT_ONLY messages correctly: reap_recv_completions() applies the credit_delta, increments recv_credit_only_in, and re-posts the slot without enqueueing for the unpacker. Defences: a fatal ibv_post_send error on the CREDIT_ONLY path also disconnects via TE_RDMA_QP_ERROR. The CREDIT_ONLY slot is committed with bytes_consumed=0 so reap_send_completions does not call iovec_data_sent() (there is no upper-layer payload tied to it). Co-Authored-By: Oz <oz-agent@warp.dev>
The existing CREDIT_ONLY refill in doSend() never fires on multi- transporter clones that mostly carry inbound traffic, because doSend() is rarely (or never) invoked on those clones. As a result, the peer's outbound credit pool against us drains to zero and the inbound direction stalls under benchmark load. Visible symptoms include GCP_COMMIT lag, NDB-274 scan timeouts, and NDB-286 node-fencing. Refactor the existing CREDIT_ONLY emission block out of doSend() into post_credit_only_locked(), then add recv_thread_emit_credit_only(): * Fast-path lock-free check on m_pending_credit_grant; no-op when the half-queue-depth threshold is not met. * Acquires the per-transporter send lock (lock_send_transporter), reaps pending send completions, and posts one CREDIT_ONLY WR via post_credit_only_locked(). The send lock is per-transporter and the send thread never takes the global receive lock, so this cannot deadlock against doSend() on any transporter. Call the new method at the tail of the RDMA branch in TransporterRegistry::performReceive() so every recv batch on every RDMA transporter gets a chance to flush its accumulated grants. Add a send_credit_only_recv_path counter and surface it in the per-link stats line so operators can verify the new path is firing. Co-Authored-By: Oz <oz-agent@warp.dev>
…poll
Without a wakeup mechanism the receive thread only notices incoming
RDMA data at the next epoll_wait timeout in check_TCP(), because
poll_RDMA() is non-blocking and is only called after check_TCP returns.
For a pure-RDMA deployment that bounds RDMA receive latency below by
the epoll timeout regardless of HCA speed -- the most critical gap in
the Gate-3 implementation, as flagged by the RonDB team.
Bind each transporter's recv CQ to a dedicated ibv_comp_channel,
make the channel's fd non-blocking, and register that fd in the
receive thread's epoll set alongside the existing control-socket fd.
Tag the event data with RDMA_COMP_CHANNEL_BIT (high bit of u32) so
check_TCP() can route comp-channel events to a separate path:
handle_recv_comp_event() drains and acks the events and marks
m_read_transporters so performReceive()'s RDMA branch will call
reap_recv_completions() and process the CQEs.
The standard arm-vs-fire race is handled by arming the CQ via
ibv_req_notify_cq() at the tail of every reap_recv_completions() pass,
and once at the end of run_endpoint_exchange() so the first inbound
CQE on a fresh link generates a notification. We accept the tiny race
window between the last poll returning 0 and the arm because the
existing poll_RDMA() cadence will still pick up such CQEs.
EPOLLHUP on the comp channel fd is ignored: the channel is destroyed
in release_verbs_resources() during disconnect, and the kernel
auto-removes the closed fd from epoll. Disconnect is driven by the
control socket's HUP on the same trp_id, which is processed in the
existing branch.
release_verbs_resources() drains and acks any residual CQ events
before destroying the recv CQ, because ibv_destroy_cq blocks while
there are unacked events.
Failure modes:
- ibv_create_comp_channel() failure is fatal at connect time.
- O_NONBLOCK on the channel fd is fatal at connect time (otherwise
the recv thread would block inside ibv_get_cq_event()).
- ibv_req_notify_cq() failure is logged but non-fatal; the existing
pollReceive() cadence keeps the link working without the wakeup
optimization.
- epoll_ctl(ADD, channel_fd) failure is logged but non-fatal for
the same reason.
Co-Authored-By: Oz <oz-agent@warp.dev>
After Gate-3 sender-side packing (1691a1f / d873737), doSend() legitimately packs MULTIPLE complete Protocol6 messages into a single RDMA payload up to (slot_size - RDMA_MSG_HEADER_BYTES). On a 4 MB recv buffer with queue depth 64 that is 65512 bytes per slot. validate_msg_header still enforced a single-Protocol6-message upper bound of payload_len <= MAX_RECV_MESSAGE_BYTESIZE (32 KB), which was correct before Gate-3 but now falsely rejects legitimate bulk-copy bursts. Symptom observed in production: cli-142 ndbmtd flapped in a restart loop driven by start-phase-5 traffic ("Copying of dictionary information"): RDMA: msg payload_len=64740 exceeds MAX_RECV_MESSAGE_BYTESIZE=32768 Transporter 51 to node 2 disconnected in recv with errnum: 74 ALERT -- Node 1: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2308 The per-Protocol6-message size limit is still enforced inside Packer::unpack_one() where it has the actual signal boundary, so we do not lose any protection by removing the duplicate check here. The remaining (payload_len + header > available) check still bounds payload_len to slot_size - header because the HCA cannot write more than one full receive slot into the WC. Updates the corresponding TAP test in RDMA_Transporter.cpp: the "exceeds protocol-level maximum" rejection becomes an acceptance test that documents the new behavior (payload_len = MAX_RECV_MESSAGE_BYTESIZE+1 with a matching `available` must succeed and round-trip through the output pointer). Test: build_scripts/build_and_test_rdma_transporter.sh -> ok 1.
recv_thread_emit_credit_only() (introduced in d643983/bf7cd2f9103 to refill credits on inbound-only multi-trp clones) reaped pending send completions before posting CREDIT_ONLY. reap_send_completions() invokes iovec_data_sent() -> trp_callback::bytes_sent() on every freed data WR, and trp_callback::bytes_sent() in mt.cpp releases the now-completed thr_send_pages into a thread_local_pool, either the block thread own pool or a dedicated send thread pool obtained via g_send_threads->get_send_buffer_pool(). thread_local_pool is, by design, safe only when mutated from its owner thread: its internal free-list is protected by thread affinity, not by the per-transporter m_send_lock that recv_thread_emit_credit_only() holds. Touching it from the recv thread races with the owner thread own pool operations on other transporters sharing that pool and segfaults inside trp_callback::bytes_sent -> ::bytes_sent -> release_list. The crash was hidden by the validate_msg_header upper-bound check until commit 0a5514b let real start-phase-5 SR-copy traffic through. As soon as the dictionary-copy burst reached the upper layer on cli-142, the recv threads pending-credit-grant crossed the queue_depth/2 threshold for the first time and the recv-thread reap fired, faulting in trp_callback::bytes_sent. Symbolized stack at the crash: logicalclocks#2 trp_callback::bytes_sent logicalclocks#3 RDMA_Transporter::reap_send_completions [clone .part.0] logicalclocks#4 RDMA_Transporter::recv_thread_emit_credit_only logicalclocks#5 TransporterRegistry::performReceive logicalclocks#6 mt_receiver_thread_main Fix: drop the reap_send_completions() call from recv_thread_emit_credit_only(). The send thread iterates every transporter every cycle and calls reap_send_completions() inside doSend() in the thread-correct context, so completed CREDIT_ONLY WRs are still recycled in a bounded number of send-thread passes. If no slot is free on a given recv-thread pass, post_credit_only_locked() already returns false via its (m_send_slots_in_flight >= m_queue_depth) early return; the missed CREDIT_ONLY is recovered on the next receive batch with no additional state. No wire-format change. No new state. Pure removal of a cross-thread callback.
Commit 036ea2a fixed a SIGSEGV crash by removing reap_send_completions() from recv_thread_emit_credit_only(), but it re-introduced the original credit-starvation deadlock on inbound-only multi-transporter clones. In ndbmtd the send thread (mt.cpp do_send) iterates only m_pending_send_trps[], populated by register_pending_send() when a block thread enqueues outbound data. A clone with no block-thread outbound traffic is therefore never visited by the send thread; its CREDIT_ONLY WRs posted by the recv thread accumulate, post_credit_only_locked() starts returning false, m_pending_credit_grant cannot drain, peer m_peer_recv_credits drains to zero, inbound stalls. The symptom returned after the cli-143 rolling restart as renewed GCP_COMMIT lag (10-152s on the mgmd GCP Monitor) and sustained NDB-274/NDB-286 in pnfs-mds. Reintroducing reap_send_completions() unconditionally is unsafe: it calls iovec_data_sent() -> trp_callback::bytes_sent() (mt.cpp line 5504), which asserts sb->m_send_thread != NO_SEND_THREAD and then mutates a thread_local_pool<thr_send_page> via release_local() that has no lock and is bound by thread affinity (release_local touches m_free and m_freelist directly). Calling it from the recv thread races the owning send/block thread on other transporters using the same pool, and crashes inside release_list. This commit restores the reap, but gated on m_bytes_in_flight == 0. The class invariant m_bytes_in_flight == sum over slots of bytes_consumed is maintained because: - doSend() posts data WRs with bytes_consumed = payload_len and m_bytes_in_flight += payload_len. - post_credit_only_locked() posts CREDIT_ONLY WRs with bytes_consumed = 0 and leaves m_bytes_in_flight unchanged. - reap_send_completions() does m_bytes_in_flight -= bytes when freeing each slot. Therefore m_bytes_in_flight == 0 implies every in-flight slot has bytes_consumed == 0. For those slots reap_send_completions() never enters its if (bytes > 0) iovec_data_sent((int)bytes); branch, so trp_callback::bytes_sent() is not called and the unsafe thread_local_pool race cannot occur. The read of m_bytes_in_flight is safe under the per-transporter send lock we already hold: no other thread can post a data WR or reap while we hold the lock, so the value is stable for the duration of the guarded reap. When m_bytes_in_flight > 0, the block thread that posted those data bytes already added the trp to its pending-send mask via register_pending_send(), so the send thread will visit doSend() for this trp in a bounded number of cycles and reap on the thread-correct context. Skipping the recv-thread reap in that case is therefore not a stall source. Local build clean. TAP test (RDMA_Transporter-t) passes (1/1).
Reserve a receive slot for control credit traffic, tighten RDMA endpoint validation, and keep RDMA wire byte stats thread-safe. Co-Authored-By: Oz <oz-agent@warp.dev>
Poll RDMA transporters during bounded receive-side spinning, include RDMA spintime in the receive spin budget, and ensure RDMA-only receive paths spin before blocking. Propagate runtime hostname activation updates to RDMA transporters and add RDMA config round-trip coverage for inactive-node hostnames and RDMA-specific parameters. Co-Authored-By: Oz <oz-agent@warp.dev>
|
I have looked at the PR and it looks good, there might be a few minor issues I would like to comment on, but mostly it is good. It would be good if you could sign either an Individual Contributor Agreement or a Corporate Contributor Agreement with Hopsworks, we have such an agreement that I can send to you. It is also interesting to discuss to which version of RonDB the contribution should be made. RonDB 24.10 is in maintenance mode, so definitely a possibility, however most of our customers are already working on 26.02 which is based on MySQL 8.4.8 compared to 8.4.4 for 24.10. 26.02 also supports more than 2000 nodes compared to 255 in 24.10. The patch is safe to include in production versions of RonDB where RDMA is disabled. We are working on RonDB 26.04 (will likely be something like 26.10 when it is production ready). In this version I am thinking that RDMA support can also be integrated in the production builds that we do. We will maintain the RDMA support after your contribution and any changes that you propose for RDMA will be taken in after a review if deemed ok, these contributions would be under the same agreement, so agreement is only required for the first contribution as long as the contribution refers to the agreement. |
Allow explicit RDMA connections between API and DB nodes while keeping RDMA endpoint validation limited to DB-DB and API-DB pairs. Update config coverage and transporter diagnostics to document and verify the API RDMA path. Co-Authored-By: Oz <oz-agent@warp.dev>
Co-Authored-By: Oz <oz-agent@warp.dev>
Co-Authored-By: Oz <oz-agent@warp.dev>
Co-Authored-By: Oz <oz-agent@warp.dev>
DB-side rejected SEND from API node 113 with TE_RDMA_INVALID_HEADER (0x8032) during a remove burst, logging: RDMA: msg payload_len=96 + header=24 exceeds available=24 On the receiver the SEND arrived with sge.length=24 (a CREDIT_ONLY WR) but the header in those 24 bytes carried payload_len=96 from a data WR. Two concurrent send paths -- doSend() driven by a client thread and post_credit_only_locked() driven by recv_thread_emit_ credit_only() -- selected the same slot via find_free_send_slot() because m_send_slots[i].in_flight was a plain bool with no mutual exclusion on the API side. Each path encoded its own header into the same slot_buf, but the SGE captured at ibv_post_send() time came from the CREDIT_ONLY WR while the slot bytes were last written by the data WR. DB-DB works because ndbmtd serializes doSend per transporter via the send thread; the API has multiple send invocation paths and the assumption that the caller holds lock_send_transporter() across the full doSend() body did not hold. Fix: make rdma_send_slot::in_flight a std::atomic<bool>. Move slot ownership commit into find_free_send_slot() via compare_exchange_ strong with acq_rel ordering. doSend() and post_credit_only_locked() no longer set in_flight=true (the CAS in find_free_send_slot() did it). On ibv_post_send() failure both paths now roll the flag back to false before disconnecting so the slot can be reused after reconnect. Stale-WC defense in reap_send_completions() reads the flag with memory_order_acquire; success path stores false with memory_order_release. Co-Authored-By: Oz <oz-agent@warp.dev>
`RDMA_Transporter-t` is registered via `NDB_ADD_TEST` in
`storage/ndb/src/common/transporter/CMakeLists.txt`. The target
returns silently from the macro when `WITH_UNIT_TESTS=OFF`, but
the option defaults to ON whenever `WITHOUT_SERVER=OFF` (see
`CMakeLists.txt:843-849`), which is the standard configure
RonDB CI uses for a release build. Under that combination the
test executable is created, and the link step has been failing
since the target was first added:
storage/ndb/src/common/transporter/CMakeFiles/
RDMA_Transporter-t.dir/build.make:130:
undefined reference to
`copy(unsigned int*&, SectionSegmentPool&,
SegmentedSectionPtr const&)'
Root cause is independent of the RDMA-side code. `Packer.cpp`
(part of the `ndbtransport` convenience library that the test
binary links against) emits an unconditional explicit
instantiation:
template void Packer::pack<Packer::SegmentedSectionArg>(
Uint32 *theData, SegmentedSectionArg section) const;
That instantiation references the free function
bool copy(unsigned int *&, SectionSegmentPool &,
SegmentedSectionPtr const &);
whose definition lives on the kernel side of the build,
alongside the `SegmentSubPool` / `SegmentedSectionPtr`
implementations. None of the libraries the test depends on
(`ndbtransport`, `ndbmgmapi`, `ndbgeneral`, `ndbportlib`,
`${NDB_RDMA_LIBRARIES}`, `mytap`) define the symbol, so the
linker cannot resolve it. Whether or not the call path is
reachable at runtime is irrelevant -- the explicit template
instantiation forces emission of the reference at compile
time.
The standalone harness `build_scripts/build_and_test_rdma_
transporter.sh` already works around the same issue by
generating a tiny stub source on the fly and linking it
into a hand-built test binary. This change bakes the same
stub into the regular CMake build so the
`WITH_UNIT_TESTS=ON` CI path can link `RDMA_Transporter-t`
without dragging in any kernel-side libraries.
Changes
-------
`storage/ndb/src/common/transporter/RDMA_Transporter_test_stubs.cpp`
(new file)
- Defines `bool copy(unsigned int *&, SectionSegmentPool &,
SegmentedSectionPtr const &)` as an abort-stub. The wire-format
TAP cases exercised by the test executable touch only the
pure helpers `encode_msg_header` /
`validate_msg_header` and small in-anonymous-namespace
routines; none of those paths invoke
`Packer::pack<SegmentedSectionArg>`, so the stub is never
reached at runtime. If a future test addition does hit it
the abort produces a clear diagnostic identifying the stub
by name. Opaque forward declarations of
`SectionSegmentPool` / `SegmentedSectionPtr` keep the file
free of kernel-side header dependencies.
`storage/ndb/src/common/transporter/CMakeLists.txt`
- Adds `RDMA_Transporter_test_stubs.cpp` as a second source
file to the existing `NDB_ADD_TEST(RDMA_Transporter-t ...)`
block. The stub is compiled per-target only -- it does not
enter the `ndbtransport` convenience library and so cannot
leak into any non-test code path. A comment block above the
`NDB_ADD_TEST` call documents why the companion source
exists.
Verification
------------
Built with the exact configure command from the original bug
report:
cmake -S . -B build-lab \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_NDB_RDMA=ON \
-DWITHOUT_SERVER=OFF \
-DWITH_NDBCLUSTER_STORAGE_ENGINE=ON \
-DCMAKE_INSTALL_PREFIX=/opt/rondb-24.10-rdma-port
cmake --build build-lab --target RDMA_Transporter-t
Result: `RDMA_Transporter-t` builds and links cleanly.
Running the binary:
1..1
ok 1 - RDMA_Transporter
The standalone harness
`build_scripts/build_and_test_rdma_transporter.sh` is
unchanged; it still works because its hand-built link line
already includes its own runtime-generated stub, and the new
in-tree companion source is gated on the regular CMake
`NDB_ADD_TEST` target.
No runtime behaviour changes. The production `ndbtransport`
library is byte-identical; only the standalone test binary
gains the additional translation unit.
0f607b9 to
92248c1
Compare
The validator in RDMA_Transporter::send_slot_size_or_zero() and
recv_slot_size_or_zero() requires
buffer / RdmaQueueDepth >= RDMA_MSG_HEADER_BYTES (24) +
MAX_*_MESSAGE_BYTESIZE (32768)
= 32792 bytes/slot
The shipped defaults of 2 MiB buffers with RdmaQueueDepth=64 gave
exactly 2*1024*1024 / 64 = 32768 bytes per slot -- equal to the
max signal size and leaving zero room for the 24-byte RDMA
framing header. allocate_send_slot_state() therefore returned 0,
allocate_verbs_resources() failed, and connect_server_impl() fell
into an infinite reconnect loop on out-of-the-box clusters.
Bump the defaults to 4 MiB on both directions. At the default
QueueDepth=64 each slot is now 65536 bytes -- ~2x the worst-case
signal -- and there is enough headroom to raise QueueDepth to
128 without retuning the buffer size. The lower schema bound is
left at 256K so operators that intentionally pick a low
RdmaQueueDepth (down to the schema minimum of 16) can still
satisfy the validator with the existing minimum buffer
(256*1024 / 16 = 16384 -- still below 32792, but that case must
be hand-tuned and the validator now produces a clear error
message pointing the operator at QueueDepth/Buffer).
Summary
Adds an opt-in native RDMA transporter (RoCE / IB Verbs) alongside
TCP_Transporter/SHM_Transporter, plus a 12-phase performance/correctness series, an API↔DB gate (off by default), and the tuning/CI fixes we hit on real hardware. Sharing for feedback against24.10-main.What's included (22 commits, +5623 / -47, 21 files)
RDMA_Transporter— new transporter class with SEND/RECV-based wire protocol, credit-based flow control, and an out-of-band exchange of QP attributes (QPN, PSN, GID/LID) over the existing control socket. Framing: 24-byte header + payload,MAX_*_MESSAGE_BYTESIZE = 32768.[RDMA]connection section:RdmaSendBufferMemory,RdmaRecvBufferMemory,RdmaQueueDepth,RdmaInlineThreshold,RdmaCompletionPollBudget,RdmaSpintime,RdmaDevice,RdmaPort,RdmaGidIndex,RdmaTrafficClass,RdmaRetryCount,RdmaRnrRetryCount.WITH_NDB_RDMACMake option andNDB_RDMA_TRANSPORTER_SUPPORTEDruntime guard. Without the flag, all sources compile to no-ops and TCP/SHM behaviour is bit-for-bit unchanged. With the flag set but a binary built without libibverbs symbols (e.g. mgmd),[RDMA]sections are rejected with a clear error message.AllowApiToDbRdma(off by default). DB↔DB is always allowed; API↔DB requires the explicit gate.RDMA_Transporter-tTAP test with a libibverbs kernel-symbol stub so unit tests link without pulling inPacker<SegmentedSectionArg>'s dependency oncopy().MgmConfig-t/testConfigValues-textended with[RDMA]and API↔DB scenarios.Rdma{Send,Recv}BufferMemorydefaults raised from2M→4M. With2M / RdmaQueueDepth=64 = 32768the per-slot size equalledMAX_*_MESSAGE_BYTESIZEand left zero room for the 24-byte framing header. The runtime per-slot validator therefore rejected the default config andconnect_server_impl()looped on reconnect.4Mgives 65536 bytes/slot at QD=64 and headroom up to QD=128. Schema minimums unchanged.Verification
MgmConfig-t→ok 1 - MgmConfigtestConfigValues-t→ exit 0 (RDMA roundtrip incl. API↔DB scenarios)RDMA_Transporter-t→ok 1 - RDMA_Transporter(TAP) withcmake -DCMAKE_BUILD_TYPE=Release -DWITH_NDB_RDMA=ON -DWITHOUT_SERVER=OFF -DWITH_NDBCLUSTER_STORAGE_ENGINE=ONmlx5_0), including mdtest under sustained load.Notes / non-goals
[RDMA]section.rnr_retryis currently pinned to 7 at RTS regardless of the configured value; lower values produced transport-retry-exhausted under mdtest. The user-visible setting is preserved so a future patch can re-enable user control without an ABI change.[RDMA]are unaffected.25.10-mainand26.02-mainwith the same scope — those branches additionally include earlier experimental API↔DB work that is now consolidated under the gate.Happy to split into smaller per-phase PRs if maintainers prefer that flow.