Skip to content

[transfer_engine] fix: drain endpoint waiting list via periodic reclaim#1952

Merged
alogfans merged 11 commits into
kvcache-ai:mainfrom
ccs1112:fix/qp-leak-1845
Apr 27, 2026
Merged

[transfer_engine] fix: drain endpoint waiting list via periodic reclaim#1952
alogfans merged 11 commits into
kvcache-ai:mainfrom
ccs1112:fix/qp-leak-1845

Conversation

@ccs1112

@ccs1112 ccs1112 commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Description

This closes #1845. Under SGLang PD-disaggregated prefill, ibv_create_qp eventually fails with Cannot allocate memory and rdma resource show reports >20K QPs per NIC, preceded by 1118 endpoint-evicted events.

SIEVEEndpointStore::reclaimEndpoint drains waiting_list_ of quiescent (a.k.a. evicted and deleted) endpoints so their shared_ptrs drop, and destructors run ibv_destroy_qp. The only production caller was RdmaContext::endpoint(). The reclaim piggybacked on new endpoint insertions. Under healthy load, insertions and evictions are coupled, so the list drains in step. Under failure load, such as peers dying or when retries are exhausted, evictions and deletions continue while insertions stop. This causes reclaims to stop firing and waiting_list_ grows unboundedly. Each accumulated endpoint pins num_qp_per_ep QPs against the NIC's pool (1118 x 2..4 ~= enough to exhaust). The reporter's 4x QP asymmetry across NIC groups (22k on mlx5_00-03 vs. 6k on mlx5_04-07) is consistent with this. Eviction load concentrated on the NICs routing to the failing peer.

To fix this, I added a context_.reclaimEndpoints() call inside WorkerPool::monitorWorker. This is included on the existing last_reset_ts heartbeat. The result is that reclaim no longer depends on insertion traffic. The change is minimal. There is one line of behavior change in worker_pool.cpp and a thin wrapper method RdmaContext::reclaimEndpoints() to delegate to the store.

I also tried adding eager disconnect() to deleteEndpoint/evictEndpoint initially, but it crashes on rxe within ~5s with malloc(): unaligned tcache chunk detected. I think the slice->rdma.qp_depth raw pointer into wr_depth_list_ becomes dangling when reclaim destroys the endpoint while an in-flight slice still holds it. Fixing that properly requires a shared-ownership refactor of wr_depth_list_, which I could do as a follow-up so eager disconnect can be re-added safely.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

There are 5 new unit tests:

  1. ReclaimDrainsQuiescentEntries is the base method contract.
  2. ReclaimLeavesActiveEntries tests that the hasOutstandingSlice gate is preserved.
  3. ReclaimIsIdempotentWhenEmpty tests cadence safety.
  4. LeakManifestsWithoutReclaimCall reproduces the reporter's 1118-eviction shape and asserts a single reclaim call drains the full backlog.
  5. ReclaimDoesNotRequireActiveMap guards against a regression that conflates reclaim with insert.

There is 1 new integration test:

  1. endpoint_store_integration_test constructs a real RdmaContext, spawns monitorWorker, injects quiescent entries into waiting_list_, sleeps, and asserts drainage. I used this bisection to validate the fix. If you remove the new reclaimEndpoints() call then this test will fail with the right error message, if you add it back it will pass again.

I ran the full ctest suite. There is one existing failure: hot_standby_service_test. I think this is simply due to etcd not being available on my system. It's also failing on main, and the code doesn't seem related at all, so I'm flagging this as environmental. Please let me know if you'd like me to take a closer look.

I don't have the hardware to match the original SGLang mlx5 cluster conditions, unfortunately. If you are aware of on-demand spot instances for mlx5, please let me know and I can run this patch there.

This is the test script I used to manually reproduce:

sudo apt install -y linux-modules-extra-$(uname -r) rdma-core
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev <your_iface>

mkdir build && cd build
cmake .. -DBUILD_UNIT_TESTS=ON
make -j8
ctest -R endpoint_store_test --output-on-failure
./mooncake-transfer-engine/tests/endpoint_store_integration_test

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

ccs1112 added 3 commits April 22, 2026 14:44
reclaimEndpoint() is currently invoked only from RdmaContext::endpoint()
after a new insertion. Under healthy load, insertions and evictions are
1:1 so this works. Under failure load -- many error completions trigger
deleteEndpoint(), but new-insertion traffic stalls because the dead peer
isn't generating new connection paths -- waiting_list_ grows without
bound and QPs never get destroyed.

Add a 1Hz reclaimEndpoints() call from monitorWorker on the existing
1-second context heartbeat. This decouples reclaim cadence from
insertion traffic.

See issue kvcache-ai#1845.
…i#1845

Adds unit + integration coverage for the periodic reclaim fix.

endpoint_store_test (5 tests, no RDMA device, runs under ctest):
  - reclaim drains quiescent entries on its own
  - reclaim leaves active entries alone (gate preserved)
  - reclaim is idempotent when empty
  - leak manifests without reclaim call (1118-eviction mirror of reporter)
  - reclaim works without active map (guard against insert/reclaim coupling)

endpoint_store_integration_test (requires RDMA device, not auto-registered):
  - Verifies WorkerPool::monitorWorker actually calls reclaimEndpoints at
    ~1 Hz by constructing a real RdmaContext and waiting for the tick to
    drain injected entries. Confirms the end-to-end fix wiring.

Supporting changes:
  - EndpointStore::waitingListSize() accessor (diagnostics + tests)
  - SIEVEEndpointStore::testOnlyInsertWaiting() for test injection
  - RdmaContext::endpointStore() accessor (diagnostics + tests)
- design/transfer-engine: add a sentence to Endpoint Management explaining
  that waiting_list_ drains both on insertion and on the monitorWorker
  heartbeat, so accumulated reclaim does not stall under failure load.
- troubleshooting: extend the "Failed to create QP: Cannot allocate
  memory" entry with a bullet pointing at issue kvcache-ai#1845 so operators
  seeing the symptom find the cause and the fix.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a resource leak (issue #1845) where RDMA endpoints and Queue Pairs (QPs) accumulated during peer failures because reclamation was only triggered by new insertions. The fix introduces a periodic 1 Hz reclamation tick within the monitorWorker thread to ensure the waiting_list_ is drained even when insertions stall. The changes include updates to documentation, the addition of a waitingListSize metric for observability, and new unit and integration tests. Review feedback highlighted a thread-safety concern in FIFOEndpointStore::waitingListSize(), where accessing the size of the waiting_list_ set without synchronization could lead to undefined behavior.

int disconnectQPs() override;

size_t getTotalQPNumber() override;
size_t waitingListSize() const override { return waiting_list_.size(); }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation of waitingListSize() in FIFOEndpointStore is not thread-safe. Accessing waiting_list_.size() on a std::unordered_set without a lock while other threads might be concurrently modifying the set (e.g., in deleteEndpoint, evictEndpoint, or reclaimEndpoint) leads to undefined behavior.

To resolve this, consider adding an atomic counter waiting_list_len_ to FIFOEndpointStore, similar to the implementation in SIEVEEndpointStore. This would allow for a lock-free and thread-safe size check. Alternatively, you could use a ReadGuard with the endpoint_map_lock_, though this would require making the lock mutable to be used within this const method.

ccs1112 added 2 commits April 22, 2026 16:51
…atomic counter

Per PR kvcache-ai#1952 review: FIFO variant returned waiting_list_.size() on
std::unordered_set without holding endpoint_map_lock_, racing
concurrent modification. Mirror the SIEVE pattern with an atomic
waiting_list_len_ incremented in delete/evict, decremented in reclaim.
… LSAN

CI build (3.10/3.12) runs with -DENABLE_ASAN=ON and LSAN flagged the
5 × 288 byte allocation the test fixture intentionally leaks
(~RdmaTransport dereferences a null metadata_ unless install() ran).
Gate on __SANITIZE_ADDRESS__ / __has_feature and mark the pointer with
__lsan_ignore_object so real leaks are still caught.
@codecov-commenter

codecov-commenter commented Apr 22, 2026

Copy link
Copy Markdown

Comment on lines +77 to +79
size_t waitingListSize() const override {
return waiting_list_len_.load(std::memory_order_relaxed);
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waiting_list_len_ is declared as atomic but waitingListSize() returns size_t.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I promoted the counter to atomic<size_t in both FIFO and SIEVE. This is now a clean pass through getter.

// insertions are happening. Without this, reclaim only runs
// from RdmaContext::endpoint() and the waiting list grows
// unboundedly under failure load. See issue #1845.
context_.reclaimEndpoints();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously reclaimEndpoint() was only called from within insertEndpoint(), which held endpoint_map_lock_. Now it would be called without the lock, please confirm reclaimEndpoint()'s lock contract hasn't changed in behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reclaimEndpoint() lock contract has not changed behavior:

  • reclaimEndpoint() still self-locks using WriteGuard(endpoint_map_lock_).
  • The existing caller rdma_context.cpp:355-356 already had the insertEndpoint lock freed by its destructor.
  • The new caller in monitorWorker() observes the same contract as it holds no EndpointStore locks.
  • insertEndpoint() definitions in FIFO, SIEVE, and UB self-lock and only call evictEndpoint(), which is caller-locked. None call reclaimEndpoint().

I added a comment to make this contract clear to future contributors.


// Raw accessor for the endpoint cache. Null before construct() runs.
// Used by integration tests that need to observe waiting_list_ directly.
EndpointStore *endpointStore() const { return endpoint_store_.get(); }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This func would expose a raw point to EndpointStore*. If this were to be called for other purposes in the future (not just for testing), would it be safe?

@ccs1112 ccs1112 Apr 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had some concerns about this as well. I decided to remove this pointer and only add the necessary methods to RdmaContext: waitingListSize() and testOnlyInsertWaiting(std::shared_ptr<RdmaEndPoint> ep).

  • waitingListSize() returns by value, so there is no pointer to stash.
  • testOnlyInsertWaiting() takes a shared_ptr by value. This removes the raw pointer access across the API boundary.
  • testOnlyInsertWaiting() is lifted as a virtual function on the EndpointStore base interface. This has the added benefit of making the integration test no longer need dynamic_cast to SIEVEEndpointStore.

ccs1112 added 6 commits April 24, 2026 12:11
waitingListSize() returns size_t but the underlying counter was atomic<int>,
which quietly narrowed on load. Promote to atomic<size_t> in both FIFO and
SIEVE so the getter is a clean pass-through with no implicit conversion.
…erface

monitorWorker now calls reclaimEndpoint() via RdmaContext; it already
acquired endpoint_map_lock_ internally, but nothing declared that. Codify
the precondition on the base so future callers know not to hold the lock.
RWSpinlock is non-reentrant, so recursive acquisition would deadlock.
…rface

Previously exposed a raw EndpointStore* via RdmaContext::endpointStore()
for the integration test. A raw pointer is easy to misuse outside of
tests and couples the caller to the concrete store via dynamic_cast.

Replace with two narrow methods on RdmaContext: waitingListSize() (value
return) and testOnlyInsertWaiting(shared_ptr<RdmaEndPoint>). The latter
is lifted onto the EndpointStore base interface and implemented on both
FIFO and SIEVE, so the integration test no longer downcasts.
… ctest

Integration test was previously unregistered and invoked manually. Now
self-skips via GTEST_SKIP when no RDMA device is present, so it runs
cleanly on CI runners without RDMA (skips) and on rxe/mlx5 hosts
(executes). Labeled "rdma" for ctest -L filtering.
…is empty

monitorWorker now drives reclaim at ~1 Hz regardless of activity. On FIFO
this grabbed endpoint_map_lock_ as WriteGuard every tick even in the
common steady-state case where waiting_list_ is empty. Add the same
counter-check short-circuit SIEVE already has.
…ruct fails

GHA ubuntu-22.04 runners enumerate a phantom mlx5_0 via ibv_get_device_list
without a working port/GID, so pickRdmaDevice() returns a non-empty name
and the earlier GTEST_SKIP on empty device list doesn't fire. Then
construct() fails with ERR_CONTEXT and the hard ASSERT_EQ fails the test.

Convert the assertion to a GTEST_SKIP on construct failure. Matches the
"attempt setup, skip on failure" convention used elsewhere in the repo
(e.g., client_local_hot_cache_test.cpp:794-799).

@staryxchen staryxchen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if CI pass

@alogfans alogfans merged commit 489c020 into kvcache-ai:main Apr 27, 2026
35 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Potential QP leak on transfer failure for PD disaggregation scenario

4 participants