Merged
Commits
26 commits
eb3f5db
feat: merge the logic of _update_per_bucket_p2p into _update_per_bucket
abcdea Sep 23, 2025
2898e16
fix: refine rank group handling in update
abcdea Sep 28, 2025
0065a87
feat: GPU-RDMA-device-topology-unawared receiver assignment implemented
abcdea Sep 28, 2025
4ada8b6
feat: GPU-RDMA-device-topology-awared receiver assignment implemented
abcdea Sep 30, 2025
c840c34
style: ruff formatting
abcdea Oct 9, 2025
59b8b38
fix: resolve PR comments
specture724 Oct 14, 2025
745e4c2
Merge branch 'main' into feat/optimize_p2p
specture724 Oct 14, 2025
e2e98d5
fix: assert p2p_store_addr is not None in loading checkpoint
specture724 Oct 14, 2025
d8dc4be
misc: logging removed
specture724 Oct 14, 2025
17271e5
Merge branch 'main' into feat/optimize_p2p
specture724 Oct 14, 2025
c6ec7dd
fix: return logic in update
specture724 Oct 14, 2025
5916eb9
Merge branch 'feat/optimize_p2p' of github.com:specture724/checkpoint…
specture724 Oct 14, 2025
bf73f42
fix: resolve pr comment issues
specture724 Oct 14, 2025
e79ef26
misc: format commit message
specture724 Oct 14, 2025
e126a89
feat: test_assign_receiver_ranks.py: add unit tests for _assign_recei…
specture724 Oct 14, 2025
8440ffa
refactor: refactor the test
specture724 Oct 17, 2025
5fd38b1
misc: resolve pr issues
specture724 Oct 17, 2025
0d74d7f
fix: handle corner case when senders' buckets lays unbanlancedly
specture724 Oct 20, 2025
f06b0bf
fix: debug test and add more cases
specture724 Oct 20, 2025
32e687d
Merge branch 'MoonshotAI:main' into feat/optimize_p2p
specture724 Oct 20, 2025
12455f5
misc: fix pr issues
specture724 Oct 21, 2025
e4c253d
doc: fix benchmark results in README
specture724 Oct 23, 2025
4bd5da3
fix: gather meta to generate local topo fixed
specture724 Oct 25, 2025
86437ad
misc: fix pr issues
specture724 Oct 25, 2025
415c320
doc
specture724 Oct 28, 2025
ed6c8a0
docs: add numa binding
weixiao-huang Oct 29, 2025
25 changes: 15 additions & 10 deletions README.md
@@ -14,8 +14,8 @@ updating our [Kimi-K2](https://github.com/MoonshotAI/Kimi-K2) model (1 Trillion

The core weight update logic is in `ParameterServer` class, a service colocated with inference engines. It provides two implementations of weight update: Broadcast and P2P.

- **Broadcast**: Used when a large number of inference instances need to update weights synchronously. This is the fastest implementation and should be used as the default update method. See `_update_per_bucket` with `ranks == None or []`.
- **P2P**: Used when new inference instances are dynamically added (due to restarts or dynamic availability) while the existing instances are already serving requests. In this scenario, to avoid affecting the workloads on existing instances, we use the [`mooncake-transfer-engine`](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#use-python-package) to send weights P2P from CPUs in existing instances to GPUs in new instances. See `_update_per_bucket` with `ranks` specified.
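The `ranks` argument is what selects between the two paths. A minimal sketch of the dispatch, assuming a simplified signature (everything here except the `_update_per_bucket` name is hypothetical, not the project's actual API):

```python
from typing import Optional


class ParameterServer:
    """Sketch of the update dispatch; only `_update_per_bucket` is from the source."""

    def update(self, checkpoint_name: str, ranks: Optional[list[int]] = None) -> str:
        # Broadcast path: no target ranks given -> update every instance in sync.
        if not ranks:  # covers both None and []
            return self._update_per_bucket(checkpoint_name, ranks=None)
        # P2P path: only the newly added ranks receive weights, so the
        # existing instances keep serving without interruption.
        return self._update_per_bucket(checkpoint_name, ranks=ranks)

    def _update_per_bucket(self, checkpoint_name: str, ranks: Optional[list[int]]) -> str:
        # Placeholder body: report which path was taken.
        mode = "broadcast" if not ranks else "p2p"
        return f"{mode}:{checkpoint_name}"
```

This mirrors the PR's merge of `_update_per_bucket_p2p` into `_update_per_bucket`: one entry point, with the rank list deciding the transfer strategy.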

### Optimized Weight Broadcast
In the *Broadcast* implementation, the checkpoint-engine holds references to sharded weights in CPU memory, and needs to efficiently broadcast them to a cluster of inference instances, often under a different sharding pattern.
@@ -36,16 +36,22 @@ It then executes the transfer, where it controls the inference engine through a

Pipelining naturally requires more GPU memory. When memory is insufficient, checkpoint-engine falls back to serial execution.
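The pipeline can be sketched as double buffering: while bucket *i* is being broadcast, bucket *i+1* is copied host-to-device, and the pipeline degrades to serial execution when a second staging buffer does not fit. All names below are illustrative, not the project's actual API:

```python
def transfer(buckets, gpu_free_bytes, h2d, broadcast):
    """Overlap H2D copy of the next bucket with broadcast of the current one.

    h2d(bucket) stages a bucket into a device buffer; broadcast(buf) sends it
    to the inference ranks. Pipelining needs room for two staging buffers at
    once; otherwise fall back to processing one bucket at a time.
    """
    need = 2 * max(len(b) for b in buckets)
    if gpu_free_bytes < need:
        # Serial fallback: only one bucket resident on the GPU at a time.
        for b in buckets:
            broadcast(h2d(b))
        return "serial"
    staged = h2d(buckets[0])
    for nxt in buckets[1:]:
        nxt_staged = h2d(nxt)  # in practice issued on a separate CUDA stream
        broadcast(staged)      # overlaps with the copy above
        staged = nxt_staged
    broadcast(staged)
    return "pipelined"
```

In the real implementation the overlap comes from separate CUDA streams rather than Python statement ordering, but the bucket hand-off and the memory-driven fallback follow the same shape.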

### Optimized P2P Bucket Assignment
In the *P2P* implementation, checkpoint-engine needs to send weights from existing instances to new instances.
To minimize the overall transfer time, checkpoint-engine optimizes the bucket assignment for each sender-receiver pair.
The optimization goal is to make full use of the available network bandwidth for each sender and receiver.
See [issue #25](https://github.com/MoonshotAI/checkpoint-engine/issues/25) for details.
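A greedy sketch of the balancing idea (this is an illustration of the goal, not the project's actual `_assign_receiver_ranks` implementation): hand each bucket, largest first, to the currently least-loaded receiver so that no single link becomes the bottleneck:

```python
import heapq


def assign_receivers(bucket_sizes, receiver_ranks):
    """Map each bucket index to a receiver rank, balancing bytes per receiver.

    Placing the largest buckets first onto the least-loaded receiver keeps
    every receiver's link busy for roughly the same amount of time.
    """
    # Heap of (bytes assigned so far, receiver rank); pops the least loaded.
    load = [(0, r) for r in receiver_ranks]
    heapq.heapify(load)
    assignment = {}
    for i, size in sorted(enumerate(bucket_sizes), key=lambda x: -x[1]):
        assigned_bytes, rank = heapq.heappop(load)
        assignment[i] = rank
        heapq.heappush(load, (assigned_bytes + size, rank))
    return assignment
```

The real assignment additionally accounts for GPU-RDMA device topology (see the commits above), which a pure byte-balancing heuristic like this ignores.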

## Benchmark

| Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
| :----------------------------------- | :----------- | :---------- |:-------------------| :---------------------- |
| GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (1.42GiB) | 4.12s (4.77GiB) |
| Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.69GiB) | 7.10s (4.05GiB) |
| DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.46s (2.38GiB) | 14.63s (3.61GiB) |
| Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.51s (2.93GiB) | 20.24s (4.46GiB) |
| DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.94s | 10.20s (2.54GiB) | 13.82s (3.86GiB) |
| Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.24s | 15.34s (2.99GiB) | 19.69s (4.57GiB) |

All results above were obtained with [`examples/update.py`](./examples/update.py), using [vLLM v0.10.2rc1](https://github.com/vllm-project/vllm/tree/v0.10.2rc1) as the inference engine. Some notes:

@@ -68,7 +74,7 @@ Use the flexible P2P implementation; note that this will install `mooncake-transfer…
pip install 'checkpoint-engine[p2p]'
```

If the `NCCL_IB_HCA` environment variable is set, checkpoint-engine uses it to auto-select network devices for different ranks. Available patterns can be found in the [NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8). If it is not set, checkpoint-engine reads all RDMA devices and tries to divide them among the ranks.
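For example (the `mlx5` device names are illustrative; match the pattern to your own HCA names, e.g. from `ibv_devices`):

```shell
# Prefix match: use all HCAs whose name starts with "mlx5".
export NCCL_IB_HCA="mlx5"
# Or exact match ("=") on specific device:port pairs:
export NCCL_IB_HCA="=mlx5_0:1,mlx5_1:1"
```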

## Getting Started

@@ -145,7 +151,6 @@ torchrun --nproc-per-node 8 tests/test_update.py

- This project is currently only tested with vLLM, but it should be easy to integrate with other frameworks such as SGLang.
- The perfect three-stage pipeline mentioned in our paper is not yet implemented. It could be useful for architectures where H2D and broadcast do not conflict on PCIe.

## Acknowledgments
