[PG] Add rank fault tolerance and degraded recovery#2182

Open
Dayuxiaoshui wants to merge 7 commits into kvcache-ai:main from Dayuxiaoshui:main

Dayuxiaoshui commented May 22, 2026

Description

Fixes #2157

This PR adds rank-level fault tolerance for Mooncake PG. Previously, failed peers were mostly detected through data-plane collective failures. That made clean exits, hard kills, and degraded survivor continuation unreliable because healthy ranks could keep waiting for a peer that was no longer reachable.

The new flow moves failure detection into the PG connection layer:

  • ConnectionPoller periodically probes connected active peers out-of-band.
  • Peers are only marked failed after multiple consecutive liveness probe failures.
  • Failed peers are centrally disconnected by clearing peerConnected, deactivating their activeRanks entry, deleting stale store metadata, resetting P2P state, and returning the peer state machine to wait for replacement metadata.
  • Collective workers continue to use activeRanks to skip inactive peers, so survivors can keep running degraded collectives without submitting transfers or sync operations to failed ranks.
  • Replacement ranks are reintroduced through the existing recover_ranks() / join_group() path and become active again through the synchronized active-rank mask.
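The "multiple consecutive failures" rule above can be sketched as a small per-peer counter. This is a minimal illustration, not the PR's ConnectionPoller: the class name `LivenessTracker`, its methods, and the threshold parameter are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's ConnectionPoller): counts consecutive
// probe failures per peer and deactivates a peer only at the threshold.
class LivenessTracker {
public:
    LivenessTracker(std::size_t numRanks, int failureThreshold)
        : failures_(numRanks, 0),
          active_(numRanks, true),
          threshold_(failureThreshold) {}

    // Record one probe result. Returns true only on the probe that moves
    // the peer from active to failed.
    bool recordProbe(std::size_t peer, bool alive) {
        if (!active_[peer]) return false;  // already deactivated
        if (alive) {
            failures_[peer] = 0;  // any success resets the streak
            return false;
        }
        if (++failures_[peer] < threshold_) return false;
        active_[peer] = false;
        return true;
    }

    // Called once a replacement rank has rejoined.
    void reactivate(std::size_t peer) {
        failures_[peer] = 0;
        active_[peer] = true;
    }

    bool isActive(std::size_t peer) const { return active_[peer]; }

private:
    std::vector<int> failures_;
    std::vector<bool> active_;
    int threshold_;
};
```

With a threshold of 3, two failures followed by a success leave the peer active; only three failures in a row deactivate it, which is the property that keeps a single dropped probe from evicting a healthy rank.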

The PR also strengthens PG elasticity and recovery coverage, including manual active-rank masking, clean-exit fault detection, gated SIGKILL fault detection, degraded survivor collectives, and replacement recovery on both CPU and CUDA/RDMA paths.
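As a reference model for what a degraded survivor collective is expected to compute, a masked sum is enough: inactive ranks simply contribute nothing. The function below is a hypothetical sketch for reasoning about expected test values, not Mooncake's collective implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference model of a degraded all-reduce (sum): ranks whose
// mask entry is false are skipped. Not Mooncake's collective code.
int maskedSum(const std::vector<int>& perRankValues,
              const std::vector<bool>& activeRanks) {
    assert(perRankValues.size() == activeRanks.size());
    int total = 0;
    for (std::size_t rank = 0; rank < perRankValues.size(); ++rank) {
        if (!activeRanks[rank]) continue;  // skip failed or masked peers
        total += perRankValues[rank];
    }
    return total;
}
```

For example, with per-rank values 1, 2, 3, 4 and rank 2 masked out, the survivors should agree on 7 rather than 10.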

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Tested in the zhouyuhan-gpu-rdma container with CUDA/RDMA enabled.

CUDA/RDMA environment:

unset MC_FORCE_TCP
export WITH_NVIDIA_PEERMEM=1
export MC_ENABLE_DEST_DEVICE_AFFINITY=1
export LD_PRELOAD=/tmp/mooncake_abi0_deps/lib/libgflags.so:/tmp/mooncake_abi0_deps/lib/libglog.so:/tmp/mooncake_abi0_deps/lib/libjsoncpp.so
export LD_LIBRARY_PATH=/tmp/mooncake_abi0_deps/lib:/home/zhouyuhan01/Mooncake/mooncake-wheel/mooncake:$LD_LIBRARY_PATH
export PYTHONPATH=/home/zhouyuhan01/Mooncake/mooncake-wheel:/home/zhouyuhan01/Mooncake/mooncake-pg/tests

Focused CUDA/RDMA smoke test:

python3 -m unittest test_pg_init_functional.TestMooncakePGInitFunctionalCUDA.test_basic_init -v

Result: passed.

CUDA init functional regression:

python3 -m unittest test_pg_init_functional.TestMooncakePGInitFunctionalCUDA -v

Result:

Ran 7 tests
OK

CUDA elastic and fault-tolerance regression:

python3 -m unittest test_pg_elastic.TestMooncakePGElasticCUDA -v

Result:

Ran 9 tests
OK (skipped=1)

The skipped test is the gated SIGKILL fault-injection test, which was run separately.

CUDA/RDMA gated SIGKILL fault detection:

MOONCAKE_PG_ENABLE_KILL9_TESTS=1 \
python3 -m unittest test_pg_elastic.TestMooncakePGElasticCUDA.test_kill9_fault_detection -v

Result: passed.

CPU gated SIGKILL fault detection:

MOONCAKE_PG_ENABLE_KILL9_TESTS=1 \
python3 -m unittest test_pg_elastic.TestMooncakePGElasticCPU.test_kill9_fault_detection -v

Result: passed.

Combined CPU + CUDA PG regression:

python3 -m unittest test_pg_init_functional test_pg_elastic -v

Result:

Ran 32 tests in 175.151s
OK (skipped=2)

The two skipped tests are the default-disabled CPU/CUDA SIGKILL tests. Both were explicitly run with MOONCAKE_PG_ENABLE_KILL9_TESTS=1 and passed.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before subm

Detect failed active peers with out-of-band liveness probes, deactivate them for survivor collectives, and strengthen elastic recovery tests for CPU and CUDA/RDMA paths.

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a lazy initialization mechanism for the Mooncake backend, an out-of-band liveness probe for the connection poller, and a runtime interface to update active ranks. The changes include refactoring the MooncakeBackend to support deferred resource allocation, adding health check logic to ConnectionContext, and implementing retry logic for failed transfers in the MooncakeWorker. Feedback highlights several concurrency and robustness issues: the use of non-atomic booleans in a double-checked locking pattern for initialization, a data race on the static backendIndex_, and thread-safety concerns regarding direct at::Tensor modifications from the worker thread. Additionally, the reviewer noted that the 100-microsecond ping timeout is likely too aggressive and that the new retry logic for sync operations lacks a maximum retry limit or backoff mechanism.

Comment thread mooncake-pg/src/mooncake_backend.cpp Outdated
Comment on lines +221 to +223
if (resourcesInitialized_) {
    return;
}

medium

The double-checked locking pattern used here for resourcesInitialized_ is not thread-safe because resourcesInitialized_ is a plain bool. In C++, reading a non-atomic variable while another thread writes it without synchronization is a data race, which is undefined behavior and can surface as instruction reordering or memory visibility bugs. Please change resourcesInitialized_, resourcesStarted_, initializationFailed_, and isShutdown_ to std::atomic<bool> to ensure correct synchronization.
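A minimal sketch of the suggested fix, assuming a mutex guards the slow path. Only the flag name `resourcesInitialized_` comes from the reviewed code; the class `LazyResources`, the mutex, and the counter standing in for the real allocation work are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the backend's lazy initialization. Only the
// flag name mirrors the review comment; the rest is illustrative.
class LazyResources {
public:
    void ensureInitialized() {
        // Fast path: the acquire load pairs with the release store below,
        // so a thread that sees `true` also sees the initialized state.
        if (resourcesInitialized_.load(std::memory_order_acquire)) return;

        std::lock_guard<std::mutex> lock(initMutex_);
        // Second check under the lock: another thread may have won.
        if (resourcesInitialized_.load(std::memory_order_relaxed)) return;

        ++initCount_;  // stands in for the real allocation work
        resourcesInitialized_.store(true, std::memory_order_release);
    }

    int initCount() const { return initCount_; }

private:
    std::atomic<bool> resourcesInitialized_{false};
    std::mutex initMutex_;
    int initCount_ = 0;  // written only while holding initMutex_
};
```

Calling `ensureInitialized()` repeatedly runs the guarded work once; the acquire/release pair is what makes the unlocked fast-path read well defined.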


globalRanksInGroup_.assign(distBackendOpts.global_ranks_in_group.begin(),
                           distBackendOpts.global_ranks_in_group.end());
instanceBackendIndex_ = backendIndex_++;

medium

The static member backendIndex_ is incremented without any synchronization. If multiple MooncakeBackend instances are created concurrently from different threads, this will result in a data race. Please use std::atomic<int> for backendIndex_ or protect the increment with a mutex.
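The atomic variant of that increment could look like the following sketch. The class `IndexedBackend` is hypothetical; only the member names `backendIndex_` and `instanceBackendIndex_` are taken from the reviewed snippet.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical miniature of the pattern; only the two member names come
// from the reviewed code.
class IndexedBackend {
public:
    // fetch_add returns the previous value atomically, so concurrent
    // constructors can never observe the same index.
    IndexedBackend() : instanceBackendIndex_(backendIndex_.fetch_add(1)) {}

    int index() const { return instanceBackendIndex_; }

private:
    static std::atomic<int> backendIndex_;
    int instanceBackendIndex_;
};

std::atomic<int> IndexedBackend::backendIndex_{0};
```

Each instance receives a distinct, monotonically increasing index without needing a mutex around construction.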


group->peerConnected[peerRank] = false;
group->activeRanks[peerRank] = false;
group->activeRanksTensor[peerRank] = 0;

medium

Modifying activeRanksTensor directly from the worker thread is not thread-safe. at::Tensor operations are not guaranteed to be atomic or thread-safe against concurrent reads or writes from other threads (like the main thread or the connection poller). Additionally, if the tensor is on CUDA, this operation might have side effects on the current CUDA context or stream of the worker thread. It is safer to update the bool* activeRanks array and signal the main thread or poller to update the tensor consistently.
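One way to read that suggestion: the worker flips atomic flags and raises a dirty bit, and the thread that owns the tensor copies a snapshot later. The sketch below is hypothetical throughout (`ActiveRankState`, `kMaxRanksSketch`, and the `std::vector` standing in for activeRanksTensor are assumptions), and it uses atomic flags rather than the plain bool array the comment mentions.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMaxRanksSketch = 8;  // hypothetical bound

// Hypothetical structure: the worker only touches atomics; the owner of
// the tensor refreshes its copy when the dirty bit is set.
struct ActiveRankState {
    std::array<std::atomic<bool>, kMaxRanksSketch> active;
    std::atomic<bool> dirty{false};

    ActiveRankState() {
        for (auto& flag : active) flag.store(true);
    }

    // Worker or poller side: no tensor access happens here.
    void deactivate(std::size_t peer) {
        active[peer].store(false, std::memory_order_release);
        dirty.store(true, std::memory_order_release);
    }

    // Owner side: returns true if `mirror` was refreshed. `mirror` stands
    // in for activeRanksTensor.
    bool syncInto(std::vector<std::int32_t>& mirror) {
        if (!dirty.exchange(false, std::memory_order_acq_rel)) return false;
        mirror.resize(kMaxRanksSketch);
        for (std::size_t r = 0; r < kMaxRanksSketch; ++r) {
            mirror[r] = active[r].load(std::memory_order_acquire) ? 1 : 0;
        }
        return true;
    }
};
```

The design choice is that only one thread ever writes the tensor-like mirror, so its contents never need to be atomic themselves.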

PeerLiveness::Alive)) {
LOG(ERROR)
if ((status.s == TransferStatusEnum::FAILED ||
diff.count() > kPingTimeoutMicroseconds_) &&

medium

The timeout kPingTimeoutMicroseconds_ is set to 100 microseconds, which is extremely aggressive for network transfers. This will likely cause frequent and unnecessary calls to probePeerAliveByID in the high-performance worker loop, potentially impacting throughput. Consider increasing this threshold or implementing a backoff mechanism for liveness probes in the worker thread.
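A backoff for worker-side probes might be sketched as a threshold that doubles each time a probe is issued and resets on success. The class `ProbeBackoff` and every value used with it are hypothetical; nothing here is taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical probe gate; none of these names or values come from the PR.
class ProbeBackoff {
public:
    using Micros = std::chrono::microseconds;

    ProbeBackoff(Micros initial, Micros cap)
        : initial_(initial), cap_(cap), interval_(initial) {}

    // True if a transfer that has been pending for `pending` is overdue
    // enough to justify a probe. Each granted probe doubles the next
    // threshold, up to the cap.
    bool shouldProbe(Micros pending) {
        if (pending < interval_) return false;
        interval_ = std::min<Micros>(interval_ * 2, cap_);
        return true;
    }

    // Call when the transfer completes or the peer answers a probe.
    void reset() { interval_ = initial_; }

    Micros currentInterval() const { return interval_; }

private:
    Micros initial_;
    Micros cap_;
    Micros interval_;
};
```

Starting from a threshold well above typical transfer latency keeps the hot loop from probing on every slow completion, while the cap bounds how long a dead peer can go unnoticed.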

Comment on lines 315 to 356
} else if (status.s == TransferStatusEnum::FAILED) {
    LOG(WARNING)
        << "Rank " << group->rank
        << " retrying sync to alive peer " << j
        << " during op " << (int)task.opType;
    freeBatchID(group, task.batchID);

    for (size_t peer = 0; peer < kMaxNumRanks;
         ++peer) {
        rankToTaskId[i][peer] = kInvalidTaskId;
    }
    auto retry_source_ptr =
        (int32_t*)group->segmentInfos[group->rank]
            .send_sync[task.bufferOffset];
    std::vector<TransferRequest> entries;
    for (int peer = 0; peer < group->size; ++peer) {
        if (!group->activeRanks[peer]) {
            continue;
        }
        *retry_source_ptr = 1;
        rankToTaskId[i][peer] = entries.size();
        entries.push_back(TransferRequest{
            .opcode = TransferRequest::WRITE,
            .source = (void*)retry_source_ptr,
            .target_id = group->segmentIDs[peer],
            .target_offset =
                group->segmentInfos[peer]
                    .recv_sync[task.bufferOffset] +
                group->rank * sizeof(int32_t),
            .length = sizeof(int32_t),
        });
    }
    task.batchID = group->engine->allocateBatchID(
        entries.size());
    group->engine->submitTransfer(task.batchID,
                                  entries);
    activeTime[i] = clock::now();
    task_status[i].store(SIGNALED_1,
                         std::memory_order_release);
    task_done = false;
    break;
} else {

medium

The retry logic for failed sync operations lacks a retry limit or exponential backoff. If a peer is reported as 'Alive' by the engine but consistently fails transfers (e.g., due to a persistent configuration mismatch), the worker thread will enter a tight loop of re-submissions, consuming significant CPU and potentially masking the underlying issue. Consider adding a maximum retry count before marking the peer as broken.
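A bounded retry with exponential backoff can be kept non-blocking, which matters in a polling worker loop: instead of sleeping, the worker asks a small budget object whether it may resubmit yet. The class `RetryBudget` and its `Decision` values are hypothetical names for illustration, not part of the PR.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical non-blocking retry budget for a polling loop. The caller
// passes the current time, so the loop never sleeps.
class RetryBudget {
public:
    using Clock = std::chrono::steady_clock;
    enum class Decision { RetryNow, Wait, GiveUp };

    RetryBudget(int maxRetries, std::chrono::microseconds initialDelay)
        : maxRetries_(maxRetries), delay_(initialDelay) {}

    // Ask on every failed poll. GiveUp means the caller should mark the
    // peer as broken instead of resubmitting again.
    Decision next(Clock::time_point now) {
        if (retries_ >= maxRetries_) return Decision::GiveUp;
        if (now < notBefore_) return Decision::Wait;
        ++retries_;
        notBefore_ = now + delay_;  // earliest time for the next retry
        delay_ *= 2;                // exponential backoff
        return Decision::RetryNow;
    }

private:
    int maxRetries_;
    int retries_ = 0;
    std::chrono::microseconds delay_;
    Clock::time_point notBefore_ = Clock::time_point::min();
};
```

With a budget of two retries and an initial delay of 100 microseconds, the first failure retries immediately, a failure 50 microseconds later waits, and a third failure after both retries are spent returns `GiveUp`.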

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

caozhanhao left a comment

Thanks for the effort on this! The overall goal makes sense, but I have several concerns about the current approach. I've left some comments inline, and I think a few parts will need to be adjusted to make sure we do not introduce side effects.

Comment thread mooncake-pg/include/mooncake_backend.h Outdated
Comment thread mooncake-pg/include/mooncake_backend.h Outdated
Comment thread mooncake-pg/include/mooncake_backend.h Outdated
Comment thread mooncake-pg/src/mooncake_backend.cpp Outdated
Comment thread mooncake-pg/src/mooncake_worker_thread.cpp Outdated
Comment thread mooncake-pg/src/connection_poller.cpp Outdated
Comment thread mooncake-pg/tests/test_pg_elastic.py Outdated
Comment thread mooncake-pg/tests/test_pg_elastic.py Outdated
Comment thread mooncake-pg/tests/test_pg_elastic.py Outdated
Comment thread mooncake-pg/tests/test_pg_elastic.py Outdated
Dayuxiaoshui force-pushed the main branch 2 times, most recently from 633d54f to 68e9a5d, May 23, 2026 01:59
@Dayuxiaoshui

cc @caozhanhao

Dayuxiaoshui requested a review from caozhanhao May 23, 2026 02:21
@Dayuxiaoshui

cc @caozhanhao @yuechen-sys The CI failure in the Rust test step is not caused by our code changes. All 8 Rust unit tests pass successfully; the failure is a known LeakSanitizer (LSan) compatibility issue in the GitHub Actions container environment (Tracer caught signal 11: LeakSanitizer has encountered a fatal error). LSan does not work correctly under ptrace, which is a CI infrastructure limitation rather than a code defect. The minimal_smoke tests themselves are green.

caozhanhao left a comment

Thanks for the updates! PG did have some bugs with TcpTransport previously, but it seems these changes also broke the RDMA path. In my environment, TestMooncakePGElasticCUDA.test_failed_rank is hanging.

By the way, since this PR's base lies between #2066 and #2192, setting the environment variable WITH_NVIDIA_PEERMEM=1 may be necessary on some machines to rule out external factors.

Comment thread mooncake-pg/src/connection_poller.cpp Outdated
Comment thread mooncake-pg/src/connection_poller.cpp Outdated
Comment thread mooncake-pg/src/connection_poller.cpp Outdated
Comment thread mooncake-pg/src/mooncake_backend.cpp
Comment thread mooncake-pg/tests/test_pg_elastic.py Outdated
Comment thread mooncake-pg/tests/test_pg_elastic.py
Comment thread mooncake-pg/tests/test_pg_elastic.py Outdated
Comment thread mooncake-pg/tests/test_pg_elastic.py Outdated

caozhanhao left a comment

Thanks for the follow-up! The implementation looks much cleaner now, with only a few minor issues left to address.

Comment thread mooncake-pg/include/connection_poller.h Outdated
Comment thread mooncake-pg/src/connection_poller.cpp Outdated
Comment thread mooncake-pg/src/connection_poller.cpp
Comment thread mooncake-pg/src/connection_poller.cpp Outdated
Comment thread mooncake-pg/src/mooncake_worker_thread.cpp Outdated
Comment thread mooncake-pg/src/mooncake_worker_thread.cpp Outdated
Comment thread mooncake-pg/src/mooncake_worker_thread.cpp Outdated
Comment thread mooncake-pg/tests/test_pg_elastic.py Outdated
Dayuxiaoshui requested a review from caozhanhao May 24, 2026 13:06
@caozhanhao

LGTM. Thanks for your contribution!
@yuechen-sys PTAL.

@Dayuxiaoshui

cc @yuechen-sys

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Mooncake PG Fault-Tolerance: Diagnosis & Improvement Plan

4 participants