[RFC] Integrate Ray Core RDT for Weight Synchronization #431

@dpj135

Background

ROLL currently supports two weight synchronization approaches between training engines and inference engines:

  • Collocated (gloo + ccl broadcast)
  • Separated (ccl broadcast)

Limitations of Legacy ccl Broadcast

  1. Single-GPU source constraint: Only trainer rank 0 participates in the broadcast, so the model weights must fit entirely on this single GPU. For large models (e.g., 70B+ with high TP), this creates memory pressure, because rank 0 must hold the complete weights before broadcasting.

  2. Transfer-trim synchronous cycle: Inference workers receive full weight tensors and then trim the redundant partitions (e.g., keep only their TP slice). Each batch follows:

# e.g. Separated
[rank0] broadcast → [all infer workers] recv full tensor → trim → next batch

Receiving and trimming cannot be pipelined; every worker waits for the full tensor before it starts trimming.
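To put rough numbers on the two limitations, here is a back-of-the-envelope sketch. It assumes bf16 weights (2 bytes per parameter) and treats every parameter as evenly TP-sharded, which overstates the savings for replicated tensors such as norms. The function name and the 70B / TP=8 figures are illustrative, not measurements from ROLL.

```python
def broadcast_cost(num_params: int, infer_tp: int, bytes_per_param: int = 2) -> dict:
    """Estimate per-worker traffic when each inference worker receives the
    full weights and then keeps only its own TP slice (hypothetical helper)."""
    # Bytes rank 0 must hold, and bytes every inference worker receives.
    full_bytes = num_params * bytes_per_param
    # Bytes a worker actually keeps after trimming to its TP slice.
    kept_bytes = full_bytes // infer_tp
    return {
        "full_gb": full_bytes / 1e9,
        "kept_gb": kept_bytes / 1e9,
        "discarded_fraction": 1 - 1 / infer_tp,
    }


cost = broadcast_cost(num_params=70_000_000_000, infer_tp=8)
print(cost)
```

Under these assumptions each inference worker receives about 140 GB over a full sync but keeps only 17.5 GB, so 87.5% of the received bytes are discarded during trimming.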

Ray Direct Transport

We can use Ray Direct Transport (RDT) to address the issues above.

RDT enables pull-based P2P transfers without a collective group (for one-sided backends):

  • ray.put(tensor, _tensor_transport="nixl"|"yr") - store tensor reference (NIXL for NVIDIA GPU, YR for Ascend NPU)
  • ray.get(ref) - pull tensor (zero-copy for NIXL/YR)

Advantages

  • No NCCL group: Avoid init_custom_process_group complexity (master_addr/port allocation, rank assignment, timeout handling)

  • RDMA transport + NPU support: RDT integration provides one-sided RDMA for efficient GPU transfers and, through the YR backend (via ray-ascend), extends shard transfer support to Ascend NPU clusters.

Based on RDT, we can improve weight synchronization in two stages:

  • First, replace the broadcast with ray.put and ray.get so that inference workers can pull the latest parameters asynchronously.
  • Second, further optimize the communication path for parameter synchronization by replacing gather -> broadcast with P2P data transmission between train worker TP ranks and infer worker TP ranks.
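The second stage needs a mapping from each inference TP rank to the train TP shards it has to pull. The sketch below shows one way to compute that mapping for a single sharded dimension. It assumes contiguous, even sharding on both sides; `pull_plan` and its arguments are hypothetical names, not part of ROLL or Ray.

```python
def pull_plan(dim_size: int, train_tp: int, infer_tp: int) -> dict:
    """For each inference TP rank, list the (train_rank, start, stop) pieces
    it must pull. start/stop are offsets inside the train rank's local shard."""
    if dim_size % train_tp or dim_size % infer_tp:
        raise ValueError("dim_size must be divisible by both TP sizes")
    train_len = dim_size // train_tp
    infer_len = dim_size // infer_tp
    plan = {}
    for infer_rank in range(infer_tp):
        # Global range owned by this inference rank.
        lo, hi = infer_rank * infer_len, (infer_rank + 1) * infer_len
        pieces = []
        for train_rank in range(train_tp):
            # Global range owned by this train rank.
            t_lo, t_hi = train_rank * train_len, (train_rank + 1) * train_len
            start, stop = max(lo, t_lo), min(hi, t_hi)
            if start < stop:  # the two ranges overlap
                pieces.append((train_rank, start - t_lo, stop - t_lo))
        plan[infer_rank] = pieces
    return plan


# Train TP=4 -> infer TP=2: each infer rank pulls two whole train shards.
print(pull_plan(dim_size=8, train_tp=4, infer_tp=2))
# Train TP=2 -> infer TP=4: each infer rank pulls half of one train shard.
print(pull_plan(dim_size=8, train_tp=2, infer_tp=4))
```

With a plan like this, each inference worker would fetch only the pieces it keeps, so neither the gather onto rank 0 nor the post-receive trim is needed. Real layouts (fused QKV, interleaved experts, padded vocab) would need per-parameter handling beyond this sketch.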
