Background
ROLL supports two weight synchronization approaches between training engines and inference engines:
- Collocated (gloo + ccl broadcast)
- Separated (ccl broadcast)
Limitations of Legacy ccl Broadcast
- Single-GPU source constraint: Only trainer rank 0 participates in the broadcast, so the model weights must fit entirely on this single GPU. For large models (e.g., 70B+ with high TP), this creates memory pressure, because rank 0 must hold the complete weights before broadcasting.
- Transfer-trim synchronous cycle: Inference workers receive full weight tensors and then trim redundant partitions (e.g., keep only their TP slice). Each batch follows:
# e.g. Separated
[rank0] broadcast → [all infer workers] recv full tensor → trim → next batch
Receiving and trimming cannot be pipelined: every worker waits for the full tensor before it starts trimming.
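To make the two limitations concrete, the sketch below estimates the weight footprint on the broadcast source and mimics the receive-then-trim step on one inference worker. It is illustrative only: the 2 bytes per parameter (bf16), the even contiguous split, and the helper names `full_weight_bytes` and `trim_to_tp_slice` are assumptions, not ROLL code.

```python
def full_weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Bytes needed to hold every weight on a single device (bf16 assumed)."""
    return num_params * bytes_per_param


def trim_to_tp_slice(full_tensor: list, tp_rank: int, tp_size: int) -> list:
    """Keep only this worker's contiguous TP slice of a fully received tensor.

    Assumes the tensor length divides evenly by tp_size (hypothetical layout).
    """
    if len(full_tensor) % tp_size != 0:
        raise ValueError("tensor length must be divisible by tp_size")
    shard = len(full_tensor) // tp_size
    return full_tensor[tp_rank * shard:(tp_rank + 1) * shard]


# A 70B-parameter model at 2 bytes per parameter, expressed in GB (1e9 bytes).
source_gb = full_weight_bytes(70_000_000_000) / 1e9

# The worker receives all 8 elements but keeps only its own slice (TP=4),
# so most of the transferred data is dropped by the trim.
received = list(range(8))
kept = trim_to_tp_slice(received, tp_rank=1, tp_size=4)
discarded_fraction = 1 - len(kept) / len(received)
```

Under these assumptions the source rank needs 140 GB for weights alone, and with TP=4 each inference worker discards three quarters of what it received.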
Ray Direct Transport
We can use Ray Direct Transport (RDT) to address the issues above.
RDT enables pull-based P2P transfer without a collective group (for one-sided backends):
- ray.put(tensor, _tensor_transport="nixl"|"yr"): store a tensor reference (NIXL for NVIDIA GPU, YR for Ascend NPU)
- ray.get(ref): pull the tensor (zero-copy for NIXL/YR)
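The pull-based pattern can be illustrated without a cluster. The sketch below is a toy, in-process model: `ToyStore` is a hypothetical stand-in for the object store and is not part of Ray. It only shows the control flow, where the producer publishes references and the consumer decides what to pull and when.

```python
import itertools


class ToyStore:
    """In-process stand-in for an object store (hypothetical, not Ray API)."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count()

    def put(self, tensor):
        # Plays the role of ray.put(...): register the tensor and hand back
        # a small reference. No data moves at this point.
        ref = next(self._ids)
        self._objects[ref] = tensor
        return ref

    def get(self, ref):
        # Plays the role of ray.get(ref): the consumer pulls the data when
        # it is ready for it.
        return self._objects[ref]


store = ToyStore()

# Trainer side: publish each shard and keep only the references.
shards = {"layer0.weight": [0.1, 0.2], "layer1.weight": [0.3, 0.4]}
refs = {name: store.put(tensor) for name, tensor in shards.items()}

# Inference side: pull just the shard this worker needs, on its own schedule.
pulled = store.get(refs["layer1.weight"])
```

In this toy the pulled object is the very object that was stored, which loosely mirrors the zero-copy behavior described for NIXL/YR; a real deployment would pass the references between Ray actors instead of sharing a Python dictionary.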
Advantages
- No NCCL group: Avoids the complexity of init_custom_process_group (master_addr/port allocation, rank assignment, timeout handling).
- RDMA transport + NPU support: RDT provides one-sided RDMA for efficient GPU transfers and, through the YR backend (via ray-ascend), extends shard transfer support to Ascend NPU clusters.
Based on RDT, we can make two-stage improvements for weight synchronization:
- ... ray.put and ray.get so that workers can asynchronously obtain the latest parameters.
- ... gather -> broadcast to P2P data transmission between the train worker TP and infer worker TP.
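For the P2P transfer between train worker TP ranks and infer worker TP ranks, one possible way to decide who pulls what is to intersect the shard ranges on both sides. The sketch below is a minimal illustration under stated assumptions: a single dimension split evenly and contiguously on both sides, with hypothetical helper names `shard_range` and `pull_plan`. It is not the ROLL implementation.

```python
def shard_range(rank: int, tp_size: int, dim_size: int) -> tuple:
    """Half-open [start, end) range owned by `rank` under an even split."""
    if dim_size % tp_size != 0:
        raise ValueError("dim_size must be divisible by tp_size")
    shard = dim_size // tp_size
    return rank * shard, (rank + 1) * shard


def pull_plan(train_tp: int, infer_tp: int, dim_size: int) -> dict:
    """Map each infer TP rank to the (train_rank, start, end) pieces it pulls."""
    plan = {}
    for infer_rank in range(infer_tp):
        i_start, i_end = shard_range(infer_rank, infer_tp, dim_size)
        pieces = []
        for train_rank in range(train_tp):
            t_start, t_end = shard_range(train_rank, train_tp, dim_size)
            start, end = max(i_start, t_start), min(i_end, t_end)
            if start < end:  # the two shards overlap
                pieces.append((train_rank, start, end))
        plan[infer_rank] = pieces
    return plan


# Train TP=4, infer TP=2, over a dimension of size 8.
plan = pull_plan(train_tp=4, infer_tp=2, dim_size=8)
```

With these inputs, infer rank 0 pulls rows 0 to 4 from train ranks 0 and 1, and infer rank 1 pulls rows 4 to 8 from train ranks 2 and 3. No worker receives data it would later trim, and no single rank has to gather the full tensor first.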