[RFC] Integrate Ray Core RDT for Weight Synchronization

## Background
ROLL supports two weight synchronization approaches between training engines and inference engines:
- Collocated (gloo + ccl broadcast)
- Separated (ccl broadcast)

## Limitations of Legacy ccl Broadcast
1. Single-GPU source constraint: Only trainer rank 0 participates in the broadcast, so the model weights must fit entirely on this single GPU. For large models (e.g., 70B+ with high TP), this creates memory pressure: rank 0 must hold the complete weights before broadcasting.

2. Transfer-trim synchronous cycle: Inference workers receive full weight tensors and then trim the redundant partitions (e.g., keeping only their TP slice). Each batch follows this cycle:

```text
# e.g. Separated
[rank0] broadcast → [all infer workers] recv full tensor → trim → next batch
```

Receiving and trimming cannot be pipelined: every worker must wait for the full tensor before it can start trimming.
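
To make the cost of the receive-then-trim cycle concrete, here is a minimal sketch in plain Python. It assumes even, row-wise TP sharding and uses nested lists as a stand-in for tensors; `tp_slice` and `received_vs_kept` are hypothetical helper names and are not part of ROLL.

```python
from typing import List, Tuple


def tp_slice(full_rows: List[List[float]], tp_rank: int, tp_size: int) -> List[List[float]]:
    """Keep only the row partition owned by `tp_rank` (even row-wise sharding assumed)."""
    rows = len(full_rows)
    if rows % tp_size != 0:
        raise ValueError("row count must be divisible by tp_size")
    shard = rows // tp_size
    return full_rows[tp_rank * shard:(tp_rank + 1) * shard]


def received_vs_kept(rows: int, cols: int, tp_size: int) -> Tuple[int, int]:
    """Elements one inference worker receives vs. elements it keeps after trimming."""
    received = rows * cols              # the broadcast delivers the full tensor
    kept = (rows // tp_size) * cols     # only the local TP slice survives
    return received, kept


# A toy 8x4 weight is broadcast in full to 4 inference TP ranks.
full = [[float(r * 4 + c) for c in range(4)] for r in range(8)]
shards = [tp_slice(full, rank, 4) for rank in range(4)]
received, kept = received_vs_kept(rows=8, cols=4, tp_size=4)
```

In this toy case each worker receives 32 elements and keeps 8, so three quarters of the transferred data is discarded, and the slice can only be taken once the whole tensor has arrived.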

## Ray Direct Transport
We can use Ray Direct Transport (RDT) to address the issues above.

[Ray Direct Transport (RDT)](https://docs.ray.io/en/latest/ray-core/api/direct-transport.html) enables pull-based P2P transfers without a collective group (for one-sided backends):

- `ray.put(tensor, _tensor_transport="nixl"|"yr")` - store a tensor reference (NIXL for NVIDIA GPUs, YR for Ascend NPUs)
- `ray.get(ref)` - pull the tensor (zero-copy for NIXL/YR)
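
The pull-based pattern behind these two calls can be illustrated without Ray or any accelerator. The sketch below is an in-process stand-in only: `ToyStore` and `ToyRef` are hypothetical names, the code does not call or reproduce Ray's API, and it performs no RDMA. It shows only that the producer publishes small references and each consumer decides which objects to pull, with no group setup in between.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ToyRef:
    """Small handle that is passed around instead of the tensor itself."""
    ref_id: int
    owner: str


class ToyStore:
    """In-process stand-in for a pull-based object store (NOT Ray's API)."""

    def __init__(self) -> None:
        self._ids = itertools.count()
        self._objects: Dict[int, List[float]] = {}
        self.pulls: List[Tuple[str, int]] = []  # log of (consumer, ref_id)

    def put(self, owner: str, tensor: List[float]) -> ToyRef:
        # The data stays registered under its owner; only a handle is returned.
        ref = ToyRef(next(self._ids), owner)
        self._objects[ref.ref_id] = tensor
        return ref

    def get(self, consumer: str, ref: ToyRef) -> List[float]:
        # The consumer initiates the transfer, and only for the refs it needs.
        self.pulls.append((consumer, ref.ref_id))
        return self._objects[ref.ref_id]


store = ToyStore()
# The trainer publishes one reference per shard instead of broadcasting everything.
refs = [store.put("train-0", [0.0, 1.0]), store.put("train-0", [2.0, 3.0])]
# Each inference worker pulls only the shard it owns.
shard0 = store.get("infer-0", refs[0])
shard1 = store.get("infer-1", refs[1])
```

With a real RDT backend the `get` side would trigger a device-to-device transfer; the toy version only records who pulled what.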

## Advantages
- No NCCL group: Avoids the complexity of `init_custom_process_group` (master_addr/port allocation, rank assignment, timeout handling).

- RDMA transport + NPU support: RDT integration provides one-sided RDMA for efficient GPU transfers and, through the YR backend (via [ray-ascend](https://github.com/Ascend/ray-ascend)), extends shard transfer support to Ascend NPU clusters.

Based on RDT, we can improve weight synchronization in two stages:
- [ ] First, replace the broadcast with `ray.put` and `ray.get` so that workers can asynchronously obtain the latest parameters.
- [ ] Second, further optimize the communication path for parameter synchronization by changing `gather -> broadcast` to P2P data transmission between the **train worker TP** and **infer worker TP** ranks.
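
For the second stage, each inference TP rank needs to know which training TP ranks hold the index ranges it owns. The following is a rough sketch of such a transfer plan, assuming both sides shard the same dimension evenly and contiguously. `shard_range` and `p2p_plan` are hypothetical helpers, and real layouts (fused QKV, padded vocabularies, and so on) would need extra handling.

```python
from typing import List, Tuple


def shard_range(dim: int, tp_size: int, rank: int) -> Tuple[int, int]:
    """Half-open [start, stop) index range owned by `rank` under even sharding."""
    if dim % tp_size != 0:
        raise ValueError("dim must be divisible by tp_size")
    step = dim // tp_size
    return rank * step, (rank + 1) * step


def p2p_plan(dim: int, train_tp: int, infer_tp: int) -> List[Tuple[int, int, int, int]]:
    """List (train_rank, infer_rank, start, stop) for every overlapping index range."""
    plan = []
    for infer_rank in range(infer_tp):
        i_lo, i_hi = shard_range(dim, infer_tp, infer_rank)
        for train_rank in range(train_tp):
            t_lo, t_hi = shard_range(dim, train_tp, train_rank)
            lo, hi = max(i_lo, t_lo), min(i_hi, t_hi)
            if lo < hi:  # the two shards overlap on [lo, hi)
                plan.append((train_rank, infer_rank, lo, hi))
    return plan


# Training with TP=4 feeding inference with TP=2 along a dimension of size 8.
plan = p2p_plan(dim=8, train_tp=4, infer_tp=2)
# Training with TP=2 feeding inference with TP=4.
reverse_plan = p2p_plan(dim=8, train_tp=2, infer_tp=4)
```

Each tuple describes one direct transfer, so no rank has to gather the full tensor first, and an inference rank pulls only the ranges it keeps.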
