[transfer_engine] fix: drain endpoint waiting list via periodic reclaim by ccs1112 · Pull Request #1952 · kvcache-ai/Mooncake

ccs1112 · 2026-04-22T08:19:31Z

Description

This closes #1845. Under SGLang PD-disaggregated prefill, ibv_create_qp eventually fails with Cannot allocate memory and rdma resource show reports >20K QPs per NIC, preceded by 1118 endpoint-evicted events.

SIEVEEndpointStore::reclaimEndpoint drains waiting_list_ of quiescent (a.k.a. evicted and deleted) endpoints so their shared_ptrs drop, and destructors run ibv_destroy_qp. The only production caller was RdmaContext::endpoint(). The reclaim piggybacked on new endpoint insertions. Under healthy load, insertions and evictions are coupled, so the list drains in step. Under failure load, such as peers dying or when retries are exhausted, evictions and deletions continue while insertions stop. This causes reclaims to stop firing and waiting_list_ grows unboundedly. Each accumulated endpoint pins num_qp_per_ep QPs against the NIC's pool (1118 x 2..4 ~= enough to exhaust). The reporter's 4x QP asymmetry across NIC groups (22k on mlx5_00-03 vs. 6k on mlx5_04-07) is consistent with this. Eviction load concentrated on the NICs routing to the failing peer.

To fix this, I added a context_.reclaimEndpoints() call inside WorkerPool::monitorWorker. This is included on the existing last_reset_ts heartbeat. The result is that reclaim no longer depends on insertion traffic. The change is minimal. There is one line of behavior change in worker_pool.cpp and a thin wrapper method RdmaContext::reclaimEndpoints() to delegate to the store.

I also tried adding eager disconnect() to deleteEndpoint/evictEndpoint initially, but it crashes on rxe within ~5s with malloc(): unaligned tcache chunk detected. I think the slice->rdma.qp_depth raw pointer into wr_depth_list_ becomes dangling when reclaim destroys the endpoint while an in-flight slice still holds it. Fixing that properly requires a shared-ownership refactor of wr_depth_list_, which I could do as a follow-up so eager disconnect can be re-added safely.

Module

Type of Change

How Has This Been Tested?

There are 5 new unit tests:

ReclaimDrainsQuiescentEntries is the base method contract.
ReclaimLeavesActiveEntries tests that the hasOutstandingSlice gate is preserved.
ReclaimIsIdempotentWhenEmpty tests cadence safety.
LeakManifestsWithoutReclaimCall reproduces the reporter's 1118-eviction shape and asserts a single reclaim call drains the full backlog.
ReclaimDoesNotRequireActiveMap guards against a regression that conflates reclaim with insert.

There is 1 new integration test:

endpoint_store_integration_test constructs a real RdmaContext, spawns monitorWorker, injects quiescent entries into waiting_list_, sleeps, and asserts drainage. I used this bisection to validate the fix. If you remove the new reclaimEndpoints() call then this test will fail with the right error message, if you add it back it will pass again.

I ran the full ctest suite. There is one existing failure: hot_standby_service_test. I think this is simply due to etcd not being available on my system. It's also failing on main, and the code doesn't seem related at all, so I'm flagging this as environmental. Please let me know if you'd like me to take a closer look.

I don't have the hardware to match the original SGLang mlx5 cluster conditions, unfortunately. If you are aware of on-demand spot instances for mlx5, please let me know and I can run this patch there.

This is the test script I used to manually reproduce:

sudo apt install -y linux-modules-extra-$(uname -r) rdma-core
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev <your_iface>

mkdir build && cd build
cmake .. -DBUILD_UNIT_TESTS=ON
make -j8
ctest -R endpoint_store_test --output-on-failure
./mooncake-transfer-engine/tests/endpoint_store_integration_test

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.

reclaimEndpoint() is currently invoked only from RdmaContext::endpoint() after a new insertion. Under healthy load, insertions and evictions are 1:1 so this works. Under failure load -- many error completions trigger deleteEndpoint(), but new-insertion traffic stalls because the dead peer isn't generating new connection paths -- waiting_list_ grows without bound and QPs never get destroyed. Add a 1Hz reclaimEndpoints() call from monitorWorker on the existing 1-second context heartbeat. This decouples reclaim cadence from insertion traffic. See issue kvcache-ai#1845.

…i#1845 Adds unit + integration coverage for the periodic reclaim fix. endpoint_store_test (5 tests, no RDMA device, runs under ctest): - reclaim drains quiescent entries on its own - reclaim leaves active entries alone (gate preserved) - reclaim is idempotent when empty - leak manifests without reclaim call (1118-eviction mirror of reporter) - reclaim works without active map (guard against insert/reclaim coupling) endpoint_store_integration_test (requires RDMA device, not auto-registered): - Verifies WorkerPool::monitorWorker actually calls reclaimEndpoints at ~1 Hz by constructing a real RdmaContext and waiting for the tick to drain injected entries. Confirms the end-to-end fix wiring. Supporting changes: - EndpointStore::waitingListSize() accessor (diagnostics + tests) - SIEVEEndpointStore::testOnlyInsertWaiting() for test injection - RdmaContext::endpointStore() accessor (diagnostics + tests)

- design/transfer-engine: add a sentence to Endpoint Management explaining that waiting_list_ drains both on insertion and on the monitorWorker heartbeat, so accumulated reclaim does not stall under failure load. - troubleshooting: extend the "Failed to create QP: Cannot allocate memory" entry with a bullet pointing at issue kvcache-ai#1845 so operators seeing the symptom find the cause and the fix.

gemini-code-assist

Code Review

This pull request addresses a resource leak (issue #1845) where RDMA endpoints and Queue Pairs (QPs) accumulated during peer failures because reclamation was only triggered by new insertions. The fix introduces a periodic 1 Hz reclamation tick within the monitorWorker thread to ensure the waiting_list_ is drained even when insertions stall. The changes include updates to documentation, the addition of a waitingListSize metric for observability, and new unit and integration tests. Review feedback highlighted a thread-safety concern in FIFOEndpointStore::waitingListSize(), where accessing the size of the waiting_list_ set without synchronization could lead to undefined behavior.

gemini-code-assist · 2026-04-22T08:21:43Z

    int disconnectQPs() override;

    size_t getTotalQPNumber() override;
+    size_t waitingListSize() const override { return waiting_list_.size(); }


The implementation of waitingListSize() in FIFOEndpointStore is not thread-safe. Accessing waiting_list_.size() on a std::unordered_set without a lock while other threads might be concurrently modifying the set (e.g., in deleteEndpoint, evictEndpoint, or reclaimEndpoint) leads to undefined behavior.

To resolve this, consider adding an atomic counter waiting_list_len_ to FIFOEndpointStore, similar to the implementation in SIEVEEndpointStore. This would allow for a lock-free and thread-safe size check. Alternatively, you could use a ReadGuard with the endpoint_map_lock_, though this would require making the lock mutable to be used within this const method.

…atomic counter Per PR kvcache-ai#1952 review: FIFO variant returned waiting_list_.size() on std::unordered_set without holding endpoint_map_lock_, racing concurrent modification. Mirror the SIEVE pattern with an atomic waiting_list_len_ incremented in delete/evict, decremented in reclaim.

… LSAN CI build (3.10/3.12) runs with -DENABLE_ASAN=ON and LSAN flagged the 5 × 288 byte allocation the test fixture intentionally leaks (~RdmaTransport dereferences a null metadata_ unless install() ran). Gate on __SANITIZE_ADDRESS__ / __has_feature and mark the pointer with __lsan_ignore_object so real leaks are still caught.

codecov-commenter · 2026-04-22T10:04:32Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 72.26891% with 33 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...r-engine/tests/endpoint_store_integration_test.cpp	64.86%	13 Missing ⚠️
...ne/src/transport/rdma_transport/endpoint_store.cpp	40.00%	9 Missing ⚠️
...gine/src/transport/rdma_transport/rdma_context.cpp	0.00%	5 Missing ⚠️
...ngine/src/transport/rdma_transport/worker_pool.cpp	0.00%	4 Missing ⚠️
.../include/transport/rdma_transport/endpoint_store.h	0.00%	1 Missing ⚠️
...ne/include/transport/rdma_transport/rdma_context.h	0.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

staryxchen · 2026-04-23T02:57:53Z

+    size_t waitingListSize() const override {
+        return waiting_list_len_.load(std::memory_order_relaxed);
+    }


waiting_list_len_ is declared as atomic but waitingListSize() returns size_t.

Good catch, I promoted the counter to atomic<size_t in both FIFO and SIEVE. This is now a clean pass through getter.

staryxchen · 2026-04-23T03:01:44Z

+            // insertions are happening. Without this, reclaim only runs
+            // from RdmaContext::endpoint() and the waiting list grows
+            // unboundedly under failure load. See issue #1845.
+            context_.reclaimEndpoints();


Previously reclaimEndpoint() was only called from within insertEndpoint(), which held endpoint_map_lock_. Now it would be called without the lock, please confirm reclaimEndpoint()'s lock contract hasn't changed in behavior.

reclaimEndpoint() lock contract has not changed behavior:

reclaimEndpoint() still self-locks using WriteGuard(endpoint_map_lock_).

The existing caller rdma_context.cpp:355-356 already had the insertEndpoint lock freed by its destructor.

The new caller in monitorWorker() observes the same contract as it holds no EndpointStore locks.

insertEndpoint() definitions in FIFO, SIEVE, and UB self-lock and only call evictEndpoint(), which is caller-locked. None call reclaimEndpoint().

I added a comment to make this contract clear to future contributors.

staryxchen · 2026-04-23T03:07:46Z

+
+    // Raw accessor for the endpoint cache. Null before construct() runs.
+    // Used by integration tests that need to observe waiting_list_ directly.
+    EndpointStore *endpointStore() const { return endpoint_store_.get(); }


This func would expose a raw point to EndpointStore*. If this were to be called for other purposes in the future (not just for testing), would it be safe?

Yes, I had some concerns about this as well. I decided to remove this pointer and only add the necessary methods to RdmaContext: waitingListSize() and testOnlyInsertWaiting(std::shared_ptr<RdmaEndPoint> ep).

waitingListSize() returns by value, so there is no pointer to stash.

testOnlyInsertWaiting() takes a shared_ptr by value. This removes the raw pointer access across the API boundary.

testOnlyInsertWaiting() is lifted as a virtual function on the EndpointStore base interface. This has the added benefit of making the integration test no longer need dynamic_cast to SIEVEEndpointStore.

waitingListSize() returns size_t but the underlying counter was atomic<int>, which quietly narrowed on load. Promote to atomic<size_t> in both FIFO and SIEVE so the getter is a clean pass-through with no implicit conversion.

…erface monitorWorker now calls reclaimEndpoint() via RdmaContext; it already acquired endpoint_map_lock_ internally, but nothing declared that. Codify the precondition on the base so future callers know not to hold the lock. RWSpinlock is non-reentrant, so recursive acquisition would deadlock.

…rface Previously exposed a raw EndpointStore* via RdmaContext::endpointStore() for the integration test. A raw pointer is easy to misuse outside of tests and couples the caller to the concrete store via dynamic_cast. Replace with two narrow methods on RdmaContext: waitingListSize() (value return) and testOnlyInsertWaiting(shared_ptr<RdmaEndPoint>). The latter is lifted onto the EndpointStore base interface and implemented on both FIFO and SIEVE, so the integration test no longer downcasts.

… ctest Integration test was previously unregistered and invoked manually. Now self-skips via GTEST_SKIP when no RDMA device is present, so it runs cleanly on CI runners without RDMA (skips) and on rxe/mlx5 hosts (executes). Labeled "rdma" for ctest -L filtering.

…is empty monitorWorker now drives reclaim at ~1 Hz regardless of activity. On FIFO this grabbed endpoint_map_lock_ as WriteGuard every tick even in the common steady-state case where waiting_list_ is empty. Add the same counter-check short-circuit SIEVE already has.

…ruct fails GHA ubuntu-22.04 runners enumerate a phantom mlx5_0 via ibv_get_device_list without a working port/GID, so pickRdmaDevice() returns a non-empty name and the earlier GTEST_SKIP on empty device list doesn't fire. Then construct() fails with ERR_CONTEXT and the hard ASSERT_EQ fails the test. Convert the assertion to a GTEST_SKIP on construct failure. Matches the "attempt setup, skip on failure" convention used elsewhere in the repo (e.g., client_local_hot_cache_test.cpp:794-799).

staryxchen

LGTM if CI pass

ccs1112 added 3 commits April 22, 2026 14:44

ccs1112 requested review from ShangmingCai, alogfans, chestnut-Q, doujiang24, stmatengss and ykwd as code owners April 22, 2026 08:19

github-actions Bot added run-ci Transfer Engine labels Apr 22, 2026

ccs1112 mentioned this pull request Apr 22, 2026

[Bug]: Potential QP leak on transfer failure for PD disaggregation scenario #1845

Closed

gemini-code-assist Bot reviewed Apr 22, 2026

View reviewed changes

ccs1112 added 2 commits April 22, 2026 16:51

ccs1112 force-pushed the fix/qp-leak-1845 branch from d3ec123 to c314fca Compare April 22, 2026 09:39

staryxchen reviewed Apr 23, 2026

View reviewed changes

ccs1112 added 6 commits April 24, 2026 12:11

staryxchen approved these changes Apr 24, 2026

View reviewed changes

alogfans merged commit 489c020 into kvcache-ai:main Apr 27, 2026
35 of 37 checks passed

dtcccc mentioned this pull request May 6, 2026

[TE] rdma: fix use-after-free crash in ibv_post_send caused by concurrent QP destruction #1903

Merged

21 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[transfer_engine] fix: drain endpoint waiting list via periodic reclaim#1952

[transfer_engine] fix: drain endpoint waiting list via periodic reclaim#1952
alogfans merged 11 commits into
kvcache-ai:mainfrom
ccs1112:fix/qp-leak-1845

ccs1112 commented Apr 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Uh oh!

codecov-commenter commented Apr 22, 2026 •

edited

Loading

Uh oh!

staryxchen Apr 23, 2026

Uh oh!

ccs1112 Apr 24, 2026

Uh oh!

staryxchen Apr 23, 2026

Uh oh!

ccs1112 Apr 24, 2026

Uh oh!

staryxchen Apr 23, 2026

Uh oh!

ccs1112 Apr 24, 2026 •

edited

Loading

Uh oh!

staryxchen left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ccs1112 commented Apr 22, 2026

Description

Module

Type of Change

How Has This Been Tested?

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

staryxchen Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ccs1112 Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

staryxchen Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ccs1112 Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

staryxchen Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ccs1112 Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

staryxchen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Apr 22, 2026 •

edited

Loading

ccs1112 Apr 24, 2026 •

edited

Loading