Motivation
For advanced super-pod hardware such as GB200 NVL72, NVLink spans multiple nodes. To take advantage of this cross-node NVLink, the placement groups created by RayResourcePool need to be placed on specific nodes. The same approach can also be used in clusters built on other scale-up technologies.
Proposed Design
Ray cluster
When starting a Ray worker node, attach labels to it that carry the topology information.
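As a rough illustration (not an existing verl or Ray convention), the topology label could be encoded as a custom resource when the worker joins the cluster; the resource name scaleup-domain-0 below is a made-up example:

```python
# Hypothetical labeling scheme: each worker node advertises its scale-up domain
# as a custom resource when it joins the cluster, e.g. on each worker node:
#   ray start --address=<head-ip>:6379 --resources='{"scaleup-domain-0": 1}'
# The labels can then be read back from the cluster state:
import ray

ray.init(address="auto")
for node in ray.nodes():
    domains = [r for r in node["Resources"] if r.startswith("scaleup-domain-")]
    print(node["NodeManagerAddress"], domains)
```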
verl ResourcePoolManager
After the array of placement groups is created, run an additional scheduler over it.
The scheduler reads the topology labels of the nodes and produces placement groups bound to specific node IDs, so that tasks are assigned to nodes that can take advantage of the high-performance scale-up fabric.
For example, in verl/tree/main/verl/single_controller/ray/base.py:
```python
def get_placement_groups(self, strategy="STRICT_PACK", name=None, device_name="cuda"):
    if self.pgs is not None:
        return self.pgs

    pg_name_prefix = (
        name if name else f"{self.name_prefix}verl_group_{'_'.join([str(count) for count in self._store])}:"
    )
    if device_name == "npu":
        device_name = "NPU"
    elif device_name == "cuda":
        device_name = "GPU"

    bundle = {"CPU": self.max_colocate_count}
    if self.use_gpu:
        bundle[device_name] = 1
        if self.accelerator_type is not None:
            bundle[self.accelerator_type] = 1e-4
    pg_scheme = [[bundle.copy() for _ in range(process_count)] for process_count in self._store]

    lifetime = "detached" if self.detached else None
    pgs = [
        placement_group(bundles=bundles, strategy=strategy, name=pg_name_prefix + str(idx), lifetime=lifetime)
        for idx, bundles in enumerate(pg_scheme)
    ]

    # node_schedule implements the topology-aware node selection. Before scheduling
    # the placement groups look like [{"GPU": 8}, {"GPU": 8}, {"GPU": 8}, {"GPU": 8}];
    # after node_schedule they become
    # [{"GPU": 8, "node:192.168.1.100": 1}, {"GPU": 8, "node:192.168.1.101": 1},
    #  {"GPU": 8, "node:192.168.1.102": 1}, {"GPU": 8, "node:192.168.1.103": 1}].
    # These four nodes (192.168.1.100-103) share the same scale-up domain, so the
    # model gets better network performance.
    pgs = node_schedule(pgs)

    ray.get([pg.ready() for pg in pgs])
    self.pgs = sort_placement_group_by_node_ip(pgs)
    return pgs
```
node_schedule adds a node label to each placement group so that every placement group is pinned to a specific node.
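For illustration, here is a hypothetical sketch of what node_schedule could look like. It assumes the worker nodes were started with a scaleup-domain-<id> custom resource as in the earlier example, and that the scheduling is applied to the bundle specs (pg_scheme) before placement_group() is called, since bundle resources cannot be changed after a placement group is created; it pins each placement group to a node by requesting a tiny amount of Ray's built-in node:<ip> resource.

```python
# Hypothetical sketch of node_schedule, assuming topology labels are exposed as
# "scaleup-domain-<id>" custom resources on each worker node.
from collections import defaultdict

import ray


def node_schedule(pg_scheme):
    """Pin each placement-group spec to a node, preferring one scale-up domain."""
    # Group alive nodes by their scale-up domain label.
    domains = defaultdict(list)
    for node in ray.nodes():
        if not node["Alive"]:
            continue
        for resource in node["Resources"]:
            if resource.startswith("scaleup-domain-"):
                domains[resource].append(node["NodeManagerAddress"])

    if not domains:
        return pg_scheme  # no topology labels found; leave the specs unchanged

    # Prefer the domain with the most nodes (e.g. one GB200 NVL72 rack).
    node_ips = max(domains.values(), key=len)

    # Ray exposes a per-node resource "node:<ip>"; requesting a small fraction of it
    # pins all bundles of one placement group to that specific node.
    for bundles, ip in zip(pg_scheme, node_ips):
        for bundle in bundles:
            bundle[f"node:{ip}"] = 0.001
    return pg_scheme
```

With labels like these in place, the four placement groups in the example above would land on 192.168.1.100-103, all inside the same scale-up domain.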