
Commit 8dc26f1

author kip-cxj committed: modify readme
1 parent 2699df4

File tree

2 files changed: +14 −7 lines


verl/checkpoint_engine/README.md

Lines changed: 12 additions & 5 deletions
@@ -20,18 +20,25 @@ Checkpoint Engine is a unified abstraction layer to synchronize weights between va
 |nixl|NIXL|all_gather+ring p2p|Various transport backends (D2D, H2H, H2D, etc)<br>- UCX<br>- UCCL<br>- Mooncake|Medium/High|High: dynamically adjusts ring topology|Off-policy training<br>- Trainer/rollout disaggregated<br>- Elastic rollout<br>- Rollout fault tolerance<br>- Heterogeneous hardware rollout
 |kimi_ckpt_engine|MOONCAKE+NCCL/HCCL|p2p+broadcast|NVIDIA/Ascend|High|Low: rebuild communication group|Off-policy training<br>- Trainer/rollout disaggregated<br>- Save checkpoint each time
 
-PS: kimi_ckpt_engine first offloads all weights to the CPU. Then, using the Mooncake transfer engine, these weights are transmitted via P2P to a specific worker in the rollout, followed by a broadcast to all other rollout workers.
+##### kimi_ckpt_engine details
+
+In the kimi_ckpt_engine workflow, the trainer first offloads the weights to the CPU, and the rollout side creates a sub communication group that includes all rollout devices. Then, using the Mooncake transfer engine, the weights are transmitted via P2P to a specific rollout worker and broadcast from there to all other rollout workers.
+
+<img src="https://github.com/kip-cxj/verl/blob/cxj/doc_imgs/docs/_static/kimi_ckpt_engine.png?raw=true" alt="kimi-ckpt-engine" width="50%">
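The offload/P2P/broadcast flow described above can be sketched with `torch.distributed`. This is a minimal illustration under stated assumptions, not verl's actual API: `sync_weights` and `ROLLOUT_SRC_RANK` are hypothetical names, a process group spanning the rollout workers is assumed to already exist, and the Mooncake P2P leg is only indicated in comments.

```python
import torch
import torch.distributed as dist

# Hypothetical: the one rollout rank that receives weights via Mooncake P2P.
ROLLOUT_SRC_RANK = 0

def sync_weights(named_tensors: dict, group=None) -> dict:
    """Fan weights out from the designated rollout worker to all others.

    Assumes the trainer has already offloaded weights to CPU and sent them
    to ROLLOUT_SRC_RANK via the Mooncake transfer engine (P2P leg, not shown).
    """
    for name, tensor in named_tensors.items():
        # One broadcast per tensor inside the rollout sub communication group;
        # non-source ranks receive the data in place.
        dist.broadcast(tensor, src=ROLLOUT_SRC_RANK, group=group)
    return named_tensors
```

In practice a per-tensor broadcast would be slow; the engine groups tensors into buckets first (see `ckpt_get_named_tensor_buckets` in the Python diff below this README).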
+
+This mode requires the P2P feature of checkpoint_engine. Please ensure you have installed it via `pip install 'checkpoint-engine[p2p]'` and that your version is 0.4.0 or higher.
+
+In addition, installing checkpoint-engine[p2p] also installs the transfer engine. This library has no prebuilt packages for Ascend devices and must be compiled from source. For detailed compilation instructions, see: [transfer-engine: ascend direct](https://github.com/kvcache-ai/Mooncake/blob/main/docs/source/design/transfer-engine/ascend_direct_transport.md)
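The 0.4.0 floor can also be checked programmatically before enabling the P2P path. Below is a small sketch using `importlib.metadata`; the helper names are illustrative, not part of verl or checkpoint-engine, and the comparison deliberately parses the dotted version numerically rather than comparing strings.

```python
from importlib.metadata import PackageNotFoundError, version

def version_at_least(installed: str, minimum: str) -> bool:
    # Compare dotted versions numerically: "0.10.1" >= "0.4.0" holds here,
    # although it would fail as a plain string comparison.
    as_tuple = lambda v: tuple(int(part) for part in v.split(".")[:3])
    return as_tuple(installed) >= as_tuple(minimum)

def has_p2p_support(minimum: str = "0.4.0") -> bool:
    """True if checkpoint-engine is installed and new enough for the P2P path."""
    try:
        return version_at_least(version("checkpoint-engine"), minimum)
    except PackageNotFoundError:
        return False
```

Note this simple parser assumes plain `X.Y.Z` versions; pre-release tags would need a real version parser.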
 
 ### Benchmark
 1. benchmark setup
 - model: Qwen/Qwen3-30B-A3B-Base
 - trainer: fsdp world_size=2 (on Ascend 910C, which has 64GB of HBM, we set world_size=4)
 - rollout: num_rollout=30 (only receives weights, without CUDA IPC to vllm/sglang)
 ```bash
-python3 tests/checkpoint_engine/test_nixl_checkpoint_engine.py
-python3 tests/checkpoint_engine/test_nccl_checkpoint_engine.py
-python3 tests/checkpoint_engine/test_hccl_checkpoint_engine.py
-python3 tests/checkpoint_engine/test_kimi_checkpoint_engine.py
+pytest tests/checkpoint_engine/test_correctness_on_gpu.py
+pytest tests/checkpoint_engine/test_correctness_on_npu.py
+pytest tests/checkpoint_engine/test_special_server_adapter.py
 ```
 
 2. benchmark result

verl/checkpoint_engine/kimi_checkpoint_engine.py

Lines changed: 2 additions & 2 deletions
@@ -27,7 +27,7 @@
 from checkpoint_engine.ps import H2DBucket, ParameterMeta, ParameterServer, _gen_h2d_buckets, _to_named_tensor
 
 from verl.checkpoint_engine.base import CheckpointEngine, CheckpointEngineRegistry
-from verl.utils.device import get_device_name, get_nccl_backend, get_torch_device
+from verl.utils.device import get_nccl_backend, get_torch_device
 from verl.utils.net_utils import get_free_port
 
 logger = logging.getLogger(__name__)
@@ -331,7 +331,7 @@ def offload_cpu(name: str, tensor: torch.Tensor) -> tuple[str, torch.Tensor]:
         start_time = time.time()
         named_tensors = {}
         for named_tensors_gpu in ckpt_get_named_tensor_buckets(
-            weights, self.bucket_size, self.train_world_size, self.rank, self.rollout_dtype
+            weights, self.bucket_size, self.trainer_world_size, self.rank, self.rollout_dtype
         ):
             with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
                 futures = [
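The `ckpt_get_named_tensor_buckets` call in the hunk above groups weights into size-bounded buckets before they are offloaded and transferred. The greedy grouping pattern can be sketched as follows; this is a simplified, hypothetical stand-in (the real helper also partitions by world size, rank, and target dtype):

```python
import torch

def bucket_named_tensors(named: dict, bucket_size: int):
    """Yield lists of (name, tensor) whose combined byte size stays within bucket_size."""
    bucket, used = [], 0
    for name, tensor in named.items():
        nbytes = tensor.numel() * tensor.element_size()
        # Greedy: start a new bucket once the next tensor would overflow this one.
        if bucket and used + nbytes > bucket_size:
            yield bucket
            bucket, used = [], 0
        bucket.append((name, tensor))
        used += nbytes
    if bucket:
        yield bucket
```

A tensor larger than `bucket_size` still gets a bucket of its own rather than being split; bucketing only amortizes per-transfer overhead, it does not shard individual tensors.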
