[PG] Add rank fault tolerance and degraded recovery by Dayuxiaoshui · Pull Request #2182 · kvcache-ai/Mooncake

Dayuxiaoshui · 2026-05-22T05:23:08Z

Description

This PR adds rank-level fault tolerance for Mooncake PG. Previously, failed peers were mostly detected through data-plane collective failures. That made clean exits, hard kills, and degraded survivor continuation unreliable because healthy ranks could keep waiting for a peer that was no longer reachable.

The new flow moves failure detection into the PG connection layer:

ConnectionPoller periodically probes connected active peers out-of-band.
Peers are only marked failed after multiple consecutive liveness probe failures.
Failed peers are centrally disconnected by clearing peerConnected, deactivating their activeRanks entry, deleting stale store metadata, resetting P2P state, and returning the peer state machine to wait for replacement metadata.
Collective workers continue to use activeRanks to skip inactive peers, so survivors can keep running degraded collectives without submitting transfers or sync operations to failed ranks.
Replacement ranks are reintroduced through the existing recover_ranks() / join_group() path and become active again through the synchronized active-rank mask.
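
The "multiple consecutive liveness probe failures" rule above can be sketched as a small per-peer counter. This is only an illustration under assumed names: `ConsecutiveFailureDetector`, `recordProbe`, and `reactivate` are hypothetical and are not taken from the PR's ConnectionPoller, and the threshold value is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's ConnectionPoller): counts consecutive
// liveness-probe failures per peer and deactivates a peer at a threshold.
class ConsecutiveFailureDetector {
public:
    ConsecutiveFailureDetector(std::size_t numPeers, int threshold)
        : failures_(numPeers, 0), active_(numPeers, true), threshold_(threshold) {
        assert(threshold > 0);
    }

    // Record one probe result. Returns true only on the probe that moves
    // the peer from active to failed.
    bool recordProbe(std::size_t peer, bool alive) {
        if (!active_[peer]) return false;  // already deactivated
        if (alive) {
            failures_[peer] = 0;           // any success resets the streak
            return false;
        }
        if (++failures_[peer] < threshold_) return false;
        active_[peer] = false;             // survivors now skip this peer
        return true;
    }

    // Called once a replacement rank has rejoined the group.
    void reactivate(std::size_t peer) {
        failures_[peer] = 0;
        active_[peer] = true;
    }

    bool isActive(std::size_t peer) const { return active_[peer]; }

private:
    std::vector<int> failures_;
    std::vector<bool> active_;
    int threshold_;
};
```

Resetting the streak on any successful probe is what keeps a single dropped probe from deactivating a healthy peer.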

The PR also strengthens PG elasticity and recovery coverage, including manual active-rank masking, clean-exit fault detection, gated SIGKILL fault detection, degraded survivor collectives, and replacement recovery on both CPU and CUDA/RDMA paths.

Module

Type of Change

How Has This Been Tested?

Tested in the zhouyuhan-gpu-rdma container with CUDA/RDMA enabled.

CUDA/RDMA environment:

unset MC_FORCE_TCP
export WITH_NVIDIA_PEERMEM=1
export MC_ENABLE_DEST_DEVICE_AFFINITY=1
export LD_PRELOAD=/tmp/mooncake_abi0_deps/lib/libgflags.so:/tmp/mooncake_abi0_deps/lib/libglog.so:/tmp/mooncake_abi0_deps/lib/libjsoncpp.so
export LD_LIBRARY_PATH=/tmp/mooncake_abi0_deps/lib:/home/zhouyuhan01/Mooncake/mooncake-wheel/mooncake:$LD_LIBRARY_PATH
export PYTHONPATH=/home/zhouyuhan01/Mooncake/mooncake-wheel:/home/zhouyuhan01/Mooncake/mooncake-pg/tests

Focused CUDA/RDMA smoke test:

python3 -m unittest test_pg_init_functional.TestMooncakePGInitFunctionalCUDA.test_basic_init -v

Result: passed.

CUDA init functional regression:

python3 -m unittest test_pg_init_functional.TestMooncakePGInitFunctionalCUDA -v

Result:

Ran 7 tests
OK

CUDA elastic and fault-tolerance regression:

python3 -m unittest test_pg_elastic.TestMooncakePGElasticCUDA -v

Result:

Ran 9 tests
OK (skipped=1)

The skipped test is the gated SIGKILL fault-injection test, which was run separately.

CUDA/RDMA gated SIGKILL fault detection:

MOONCAKE_PG_ENABLE_KILL9_TESTS=1 \
python3 -m unittest test_pg_elastic.TestMooncakePGElasticCUDA.test_kill9_fault_detection -v

Result: passed.

CPU gated SIGKILL fault detection:

MOONCAKE_PG_ENABLE_KILL9_TESTS=1 \
python3 -m unittest test_pg_elastic.TestMooncakePGElasticCPU.test_kill9_fault_detection -v

Result: passed.

Combined CPU + CUDA PG regression:

python3 -m unittest test_pg_init_functional test_pg_elastic -v

Result:

Ran 32 tests in 175.151s
OK (skipped=2)

The two skipped tests are the default-disabled CPU/CUDA SIGKILL tests. Both were explicitly run with MOONCAKE_PG_ENABLE_KILL9_TESTS=1 and passed.

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before subm

Detect failed active peers with out-of-band liveness probes, deactivate them for survivor collectives, and strengthen elastic recovery tests for CPU and CUDA/RDMA paths.

gemini-code-assist

Code Review

This pull request introduces a lazy initialization mechanism for the Mooncake backend, an out-of-band liveness probe for the connection poller, and a runtime interface to update active ranks. The changes include refactoring the MooncakeBackend to support deferred resource allocation, adding health check logic to ConnectionContext, and implementing retry logic for failed transfers in the MooncakeWorker. Feedback highlights several concurrency and robustness issues: the use of non-atomic booleans in a double-checked locking pattern for initialization, a data race on the static backendIndex_, and thread-safety concerns regarding direct at::Tensor modifications from the worker thread. Additionally, the reviewer noted that the 100-microsecond ping timeout is likely too aggressive and that the new retry logic for sync operations lacks a maximum retry limit or backoff mechanism.

gemini-code-assist · 2026-05-22T05:29:37Z

+    if (resourcesInitialized_) {
+        return;
+    }


The double-checked locking pattern used here for resourcesInitialized_ is not thread-safe because resourcesInitialized_ is a plain bool. Reading it outside the lock while another thread writes it is a data race, which is undefined behavior in C++ and can surface as instruction reordering or memory visibility issues. Please change resourcesInitialized_, resourcesStarted_, initializationFailed_, and isShutdown_ to std::atomic<bool> to ensure correct synchronization.
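
A minimal sketch of the suggested fix, assuming a stand-in class rather than the real MooncakeBackend: the flag becomes a `std::atomic<bool>` read with acquire ordering on the fast path and written with release ordering after initialization. `LazyResources`, `ensureInitialized`, and `initCount` are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a backend with lazily initialized resources.
class LazyResources {
public:
    // Safe to call from several threads; the slow path runs once.
    void ensureInitialized() {
        // Fast path: this acquire load pairs with the release store below.
        if (initialized_.load(std::memory_order_acquire)) return;
        std::lock_guard<std::mutex> lock(mutex_);
        // Second check under the lock: another thread may have finished first.
        if (initialized_.load(std::memory_order_relaxed)) return;
        ++initCount_;  // placeholder for the real resource allocation
        assert(initCount_ == 1);
        initialized_.store(true, std::memory_order_release);
    }

    bool initialized() const {
        return initialized_.load(std::memory_order_acquire);
    }

    int initCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return initCount_;
    }

private:
    mutable std::mutex mutex_;
    std::atomic<bool> initialized_{false};
    int initCount_ = 0;  // guarded by mutex_
};
```

Where no failure state has to be tracked, `std::call_once` with a `std::once_flag` gives the same once-only guarantee with less code.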

gemini-code-assist · 2026-05-22T05:29:37Z

+
+    globalRanksInGroup_.assign(distBackendOpts.global_ranks_in_group.begin(),
+                               distBackendOpts.global_ranks_in_group.end());
+    instanceBackendIndex_ = backendIndex_++;


The static member backendIndex_ is incremented without any synchronization. If multiple MooncakeBackend instances are created concurrently from different threads, this will result in a data race. Please use std::atomic<int> for backendIndex_ or protect the increment with a mutex.
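
One way to apply the `std::atomic<int>` suggestion, sketched with a hypothetical class in place of MooncakeBackend: `fetch_add` hands each instance a distinct index even when constructors run concurrently. The function-local static is an illustrative choice, not something the PR is known to use.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a backend class with a per-instance index.
class HypotheticalBackend {
public:
    HypotheticalBackend()
        : instanceIndex_(nextIndex().fetch_add(1, std::memory_order_relaxed)) {
        // Atomic increments wrap on overflow, so guard against a negative index.
        assert(instanceIndex_ >= 0);
    }

    int instanceIndex() const { return instanceIndex_; }

private:
    // Function-local static: initialized once, in a thread-safe way (C++11).
    static std::atomic<int>& nextIndex() {
        static std::atomic<int> counter{0};
        return counter;
    }

    const int instanceIndex_;
};
```

Relaxed ordering is enough here because the counter only has to produce unique values and does not publish any other data.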

gemini-code-assist · 2026-05-22T05:29:37Z

+
+    group->peerConnected[peerRank] = false;
+    group->activeRanks[peerRank] = false;
+    group->activeRanksTensor[peerRank] = 0;


Modifying activeRanksTensor directly from the worker thread is not thread-safe. at::Tensor operations are not guaranteed to be atomic or thread-safe against concurrent reads or writes from other threads (like the main thread or the connection poller). Additionally, if the tensor is on CUDA, this operation might have side effects on the current CUDA context or stream of the worker thread. It is safer to update the bool* activeRanks array and signal the main thread or poller to update the tensor consistently.
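
The "update the flags, then let one thread refresh the tensor" idea can be sketched without any tensor library. In this illustration a `std::vector<int>` stands in for activeRanksTensor, and `ActiveRankState`, `deactivate`, `syncMirror`, and `kMaxRanks` are hypothetical names rather than identifiers from the PR.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxRanks = 8;  // hypothetical upper bound

// Workers flip atomic flags; a single owner thread rebuilds the mirror.
struct ActiveRankState {
    std::array<std::atomic<bool>, kMaxRanks> active;
    std::atomic<bool> dirty{false};

    ActiveRankState() {
        for (auto& flag : active) flag.store(true, std::memory_order_relaxed);
    }

    // Worker or poller thread: mark a rank inactive and signal the owner.
    void deactivate(std::size_t rank) {
        assert(rank < kMaxRanks);
        active[rank].store(false, std::memory_order_release);
        dirty.store(true, std::memory_order_release);
    }

    // Owner thread only: refresh the mirror if anything changed.
    // Returns true when the mirror was rewritten.
    bool syncMirror(std::vector<int>& mirror) {
        // Clear the dirty bit before reading, so a concurrent deactivate()
        // sets it again and triggers another sync instead of being lost.
        if (!dirty.exchange(false, std::memory_order_acq_rel)) return false;
        mirror.resize(kMaxRanks);
        for (std::size_t i = 0; i < kMaxRanks; ++i) {
            mirror[i] = active[i].load(std::memory_order_acquire) ? 1 : 0;
        }
        return true;
    }
};
```

With this split, only the owner thread ever touches the tensor-like object, so any device or stream side effects stay on that thread.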

gemini-code-assist · 2026-05-22T05:29:37Z

-                                         PeerLiveness::Alive)) {
-                                    LOG(ERROR)
+                                if ((status.s == TransferStatusEnum::FAILED ||
+                                     diff.count() > kPingTimeoutMicroseconds_) &&


The timeout kPingTimeoutMicroseconds_ is set to 100 microseconds, which is extremely aggressive for network transfers. This will likely cause frequent and unnecessary calls to probePeerAliveByID in the high-performance worker loop, potentially impacting throughput. Consider increasing this threshold or implementing a backoff mechanism for liveness probes in the worker thread.
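
A backoff for worker-side liveness probes might look like the sketch below, which doubles the minimum gap between probes up to a cap. `ProbeThrottle` and its methods are hypothetical, and the intervals are caller-supplied rather than values from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical per-peer throttle: allows a probe only when the current
// backoff interval has elapsed since the previous probe.
class ProbeThrottle {
public:
    using Clock = std::chrono::steady_clock;
    using Duration = std::chrono::microseconds;

    ProbeThrottle(Duration initial, Duration max)
        : initial_(initial), max_(max), interval_(initial) {
        assert(initial > Duration::zero() && max >= initial);
    }

    // Returns true if a probe may be issued at `now`. Every allowed probe
    // after the first doubles the required gap, capped at max_.
    bool shouldProbe(Clock::time_point now) {
        if (probedBefore_) {
            if (now - lastProbe_ < interval_) return false;
            interval_ = std::min<Duration>(interval_ * 2, max_);
        }
        lastProbe_ = now;
        probedBefore_ = true;
        return true;
    }

    // Call when the transfer makes progress again.
    void reset() {
        interval_ = initial_;
        probedBefore_ = false;
    }

    Duration interval() const { return interval_; }

private:
    Duration initial_;
    Duration max_;
    Duration interval_;
    Clock::time_point lastProbe_{};
    bool probedBefore_ = false;
};
```

Passing the timestamp in as an argument, instead of reading the clock inside the class, keeps the hot loop free to reuse a timestamp it already has and makes the logic easy to test.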

gemini-code-assist · 2026-05-22T05:29:37Z

+                            } else if (status.s == TransferStatusEnum::FAILED) {
+                                LOG(WARNING)
+                                    << "Rank " << group->rank
+                                    << " retrying sync to alive peer " << j
+                                    << " during op " << (int)task.opType;
+                                freeBatchID(group, task.batchID);
+
+                                for (size_t peer = 0; peer < kMaxNumRanks;
+                                     ++peer) {
+                                    rankToTaskId[i][peer] = kInvalidTaskId;
+                                }
+                                auto retry_source_ptr =
+                                    (int32_t*)group->segmentInfos[group->rank]
+                                        .send_sync[task.bufferOffset];
+                                std::vector<TransferRequest> entries;
+                                for (int peer = 0; peer < group->size; ++peer) {
+                                    if (!group->activeRanks[peer]) {
+                                        continue;
+                                    }
+                                    *retry_source_ptr = 1;
+                                    rankToTaskId[i][peer] = entries.size();
+                                    entries.push_back(TransferRequest{
+                                        .opcode = TransferRequest::WRITE,
+                                        .source = (void*)retry_source_ptr,
+                                        .target_id = group->segmentIDs[peer],
+                                        .target_offset =
+                                            group->segmentInfos[peer]
+                                                .recv_sync[task.bufferOffset] +
+                                            group->rank * sizeof(int32_t),
+                                        .length = sizeof(int32_t),
+                                    });
+                                }
+                                task.batchID = group->engine->allocateBatchID(
+                                    entries.size());
+                                group->engine->submitTransfer(task.batchID,
+                                                              entries);
+                                activeTime[i] = clock::now();
+                                task_status[i].store(SIGNALED_1,
+                                                     std::memory_order_release);
+                                task_done = false;
+                                break;
                            } else {


The retry logic for failed sync operations lacks a retry limit or exponential backoff. If a peer is reported as 'Alive' by the engine but consistently fails transfers (e.g., due to a persistent configuration mismatch), the worker thread will enter a tight loop of re-submissions, consuming significant CPU and potentially masking the underlying issue. Consider adding a maximum retry count before marking the peer as broken.
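
The maximum retry count suggested above can be expressed as a small budget object consulted before each re-submission. `SyncRetryBudget` and `RetryDecision` are hypothetical names, and what happens on `GiveUp` (for example, marking the peer as broken) is left to the caller.

```cpp
#include <cassert>

enum class RetryDecision { Retry, GiveUp };

// Hypothetical per-task retry budget for failed sync transfers.
class SyncRetryBudget {
public:
    explicit SyncRetryBudget(int maxRetries) : maxRetries_(maxRetries) {
        assert(maxRetries >= 0);
    }

    // Call when a sync transfer fails while the peer is reported alive.
    RetryDecision onFailure() {
        if (attempts_ >= maxRetries_) return RetryDecision::GiveUp;
        ++attempts_;
        return RetryDecision::Retry;
    }

    // Call when the sync completes, so the next task starts fresh.
    void onSuccess() { attempts_ = 0; }

    int attempts() const { return attempts_; }

private:
    int maxRetries_;
    int attempts_ = 0;
};
```

A worker loop would re-submit only while `onFailure()` returns `RetryDecision::Retry`, which bounds the number of re-submissions per task instead of spinning indefinitely.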

codecov-commenter · 2026-05-22T06:19:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

caozhanhao

Thanks for the effort on this! The overall goal makes sense, but I have several concerns about the current approach. I've left some comments inline, and I think a few parts will need to be adjusted to make sure we do not introduce side effects.

Dayuxiaoshui · 2026-05-23T02:20:58Z

cc @caozhanhao

Dayuxiaoshui · 2026-05-23T02:30:46Z

cc @caozhanhao @yuechen-sys The CI failure in the Rust test step is not caused by our code changes. All 8 Rust unit tests pass successfully; the failure is a known LeakSanitizer (LSan) compatibility issue in the GitHub Actions container environment (Tracer caught signal 11: LeakSanitizer has encountered a fatal error). LSan does not work correctly under ptrace, which is a CI infrastructure limitation rather than a code defect. The minimal_smoke tests themselves are green.

caozhanhao

Thanks for the updates! PG did have some bugs with TcpTransport previously, but it seems these changes also broke the RDMA path. In my environment, TestMooncakePGElasticCUDA.test_failed_rank is hanging.

By the way, since this PR's base lies between #2066 and #2192, setting the environment variable WITH_NVIDIA_PEERMEM=1 may be necessary on some machines to rule out external factors.

caozhanhao

Thanks for the follow-up! The implementation looks much cleaner now, with only a few minor issues left to address.

caozhanhao · 2026-05-24T14:53:51Z

LGTM. Thanks for your contribution!
@yuechen-sys PTAL.

Dayuxiaoshui · 2026-05-26T07:42:11Z

cc @yuechen-sys

[PG] Add rank fault tolerance and degraded recovery

f0ec790

Detect failed active peers with out-of-band liveness probes, deactivate them for survivor collectives, and strengthen elastic recovery tests for CPU and CUDA/RDMA paths.

Dayuxiaoshui requested review from UNIDY2002, ympcMark and yuechen-sys as code owners May 22, 2026 05:23

github-actions Bot added run-ci PyTorch Backend labels May 22, 2026

gemini-code-assist Bot reviewed May 22, 2026

Dayuxiaoshui added 2 commits May 22, 2026 05:36

[PG] Fix clang-format

84ed200

[PG] Fix remaining clang-format

2964d3d

yuechen-sys assigned yuechen-sys and unassigned yuechen-sys May 22, 2026

caozhanhao reviewed May 22, 2026

Dayuxiaoshui force-pushed the main branch 2 times, most recently from 633d54f to 68e9a5d Compare May 23, 2026 01:59

Dayuxiaoshui requested a review from caozhanhao May 23, 2026 02:21

caozhanhao reviewed May 23, 2026

Dayuxiaoshui force-pushed the main branch from 68e9a5d to 1da0414 Compare May 24, 2026 03:42

Dayuxiaoshui requested a review from caozhanhao May 24, 2026 07:46

[PG]fix test

2d0f4dd

Dayuxiaoshui force-pushed the main branch from 1da0414 to 2d0f4dd Compare May 24, 2026 08:39

Dayuxiaoshui requested review from alogfans, chestnut-Q and doujiang24 as code owners May 24, 2026 08:39

github-actions Bot added the Transfer Engine label May 24, 2026

Dayuxiaoshui added 2 commits May 24, 2026 08:50

[PG] Fix clang-format

257871f

[PG] Revert transfer-engine changes; remove .cpu() sync in test step 3

0d0bc02

caozhanhao reviewed May 24, 2026

[PG] Address code review comments: remove dead code, fix markPeerDisconnected, simplify worker thread

5237277

Dayuxiaoshui requested a review from caozhanhao May 24, 2026 13:06

stmatengss mentioned this pull request May 25, 2026

[RoadMap][Call For Contribution] Mooncake Project Overall Roadmap #1883

Open

68 tasks

Dayuxiaoshui commented May 22, 2026 • edited by UNIDY2002
