[Core] Add register_collective_backend API for customized collective backends by Evelynn-V · Pull Request #60701 · ray-project/ray

Evelynn-V · 2026-02-03T07:53:32Z

Description

This PR implements a dynamic registry system for Ray Collective communication backends, enabling users to register and integrate custom collective communication libraries (e.g., HCCL for Ascend NPU, XCCL for other vendors) without modifying Ray's core codebase.

Key accomplishments

Introduces BackendRegistry singleton class to maintain {backend_name: GroupClass} mappings
Adds public API ray.util.collective.backend_registry.register_collective_backend(name: str, group_cls: Type[BaseGroup]) for runtime backend registration
Implements abstract method BaseGroup.check_backend_availability() with concrete implementations for NCCL and GLOO backends
Refactors GroupManager.create_collective_group() to retrieve backend classes from registry instead of hardcoded if-else logic
Replaces types.Backend enum usage with capitalized string literals for backend names (e.g., "NCCL", "GLOO"), improving extensibility

Related issues

Related to #60603 (items No.1 and No.3 through No.6)
Related to PR 60693

Usage Examples

For usage instructions, please refer to util/collective/examples/gloo_allreduce_example.py

TODO

Test in the GPU environment using NCCL
Adapt to PR 60693

gemini-code-assist

Code Review

This pull request introduces a valuable refactoring to allow custom collective communication backends through a dynamic registry. The core concept is well-implemented and improves extensibility. However, I've identified a few issues that should be addressed. There is a critical bug that prevents the use of custom backends due to legacy code paths. Additionally, some backend-specific logic remains hardcoded, which undermines the goal of a truly generic system. I've also provided suggestions to improve code clarity, remove redundancy, and fix a minor issue in the singleton implementation. Addressing these points will make the new backend registry more robust and fully realize the goals of this PR.

gemini-code-assist · 2026-02-03T07:55:41Z

python/ray/util/collective/collective.py

@@ -188,10 +196,15 @@ def init_collective_group(
    """
    _check_inside_actor()
    backend = types.Backend(backend)


Using types.Backend(backend) will raise a ValueError for any custom backend not explicitly defined in types.Backend (i.e., not "NCCL" or "GLOO"). This breaks the core functionality of this PR, which is to allow custom backends. This line should be removed, as the backend argument is already a string.

This looks like a valid issue. It might be better to remove the types.Backend(backend) call and accept the backend as a string to ensure custom backends aren't blocked by the enum validation.
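
The failure mode in this thread can be reproduced with a small stand-in. The sketch below uses Python's standard `enum.Enum` in place of `types.Backend` (an assumption; the real class may be implemented differently) and a plain dict in place of the registry, to show why validating against a closed set rejects a custom name while a string lookup accepts it.

```python
from enum import Enum


class Backend(Enum):
    """Stand-in for a closed set of built-in backend names."""

    NCCL = "NCCL"
    GLOO = "GLOO"


class HCCLGroup:
    """Hypothetical custom backend class."""


# Plain dict standing in for the backend registry.
registry = {"HCCL": HCCLGroup}


def resolve_with_enum(name: str) -> str:
    # Enum lookup by value raises ValueError for any name outside the enum.
    return Backend(name).value


def resolve_with_registry(name: str) -> type:
    # String lookup only fails for names that were never registered.
    if name not in registry:
        raise ValueError(f"Backend '{name}' is not registered.")
    return registry[name]
```

Here `resolve_with_enum("HCCL")` raises `ValueError` even though `"HCCL"` is registered, which is the blocking behavior the comment describes.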

gemini-code-assist · 2026-02-03T07:55:41Z

python/ray/util/collective/collective.py

@@ -231,7 +244,11 @@ def create_collective_group(
        None
    """
    backend = types.Backend(backend)


Using types.Backend(backend) will raise a ValueError for any custom backend not explicitly defined in types.Backend (i.e., not "NCCL" or "GLOO"). This breaks the core functionality of this PR, which is to allow custom backends. This line should be removed, as the backend argument is already a string.

gemini-code-assist · 2026-02-03T07:55:41Z

python/ray/util/collective/backend_registry.py

+class BackendRegistry:
+    _instance = None
+    _map: Dict[str, Type[BaseGroup]] = {}
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super(BackendRegistry, cls).__new__(cls)
+        return cls._instance


The current singleton implementation with _map as a class attribute is not robust against subclassing. A subclass would share the same _map dictionary, which is likely unintended. To make this more robust, _map should be an instance attribute, initialized only once within __new__.

Suggested change

-class BackendRegistry:
-    _instance = None
-    _map: Dict[str, Type[BaseGroup]] = {}
-
-    def __new__(cls):
-        if cls._instance is None:
-            cls._instance = super(BackendRegistry, cls).__new__(cls)
-        return cls._instance
+class BackendRegistry:
+    _instance = None
+    _map: Dict[str, Type[BaseGroup]]
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super(BackendRegistry, cls).__new__(cls)
+            cls._instance._map = {}
+        return cls._instance

That’s a very solid point.
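
A small sketch of the instance-attribute variant, under stated assumptions: the class names are illustrative, and the subclass is given its own `_instance` slot so that it builds a separate singleton. With `_map` created inside `__new__`, the subclass's mapping is independent of the parent's.

```python
from typing import Dict


class Registry:
    _instance = None
    _map: Dict[str, type]  # annotation only, so no dict is stored on the class

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._map = {}  # created once, on the instance
        return cls._instance


class SubRegistry(Registry):
    _instance = None  # own slot, so the subclass gets its own singleton


base = Registry()
sub = SubRegistry()
base._map["GLOO"] = object  # does not leak into the subclass's mapping
```

If `_map` were a class-level `{}` instead, `base._map` and `sub._map` would be the same dictionary and the last line would affect both.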

gemini-code-assist · 2026-02-03T07:55:41Z

python/ray/util/collective/collective.py

+if _NCCL_AVAILABLE:
+    register_collective_backend("NCCL", NCCLGroup)
+if _TORCH_DISTRIBUTED_AVAILABLE:
+    register_collective_backend("GLOO", TorchGLOOGroup)


The conditional registration of backends is based on _NCCL_AVAILABLE and _TORCH_DISTRIBUTED_AVAILABLE flags. Due to the changes in this PR (guarded imports within group classes), these flags are no longer reliable here and will likely always be True. The registration should be unconditional, as the actual availability of the backend is checked dynamically when a collective group is created.

Suggested change

-if _NCCL_AVAILABLE:
-    register_collective_backend("NCCL", NCCLGroup)
-if _TORCH_DISTRIBUTED_AVAILABLE:
-    register_collective_backend("GLOO", TorchGLOOGroup)
+register_collective_backend("NCCL", NCCLGroup)
+register_collective_backend("GLOO", TorchGLOOGroup)
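
One way to picture unconditional registration with a deferred availability check is sketched below. The class names, the `required_module` attribute, and the `create_group` helper are hypothetical; the only library call is the standard `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found.

```python
import importlib.util


class MissingLibGroup:
    """Hypothetical group whose backing library is not installed."""

    required_module = "definitely_not_an_installed_module_xyz"

    @classmethod
    def check_backend_availability(cls) -> bool:
        # Probe for the library without importing it at registration time.
        return importlib.util.find_spec(cls.required_module) is not None


class StdlibGroup(MissingLibGroup):
    """Hypothetical group backed by a module that always exists."""

    required_module = "json"


# Registration is unconditional; nothing is checked here.
registry = {"MISSING": MissingLibGroup, "PRESENT": StdlibGroup}


def create_group(backend: str):
    backend_cls = registry[backend]
    # Availability is decided only when a group is actually requested.
    if not backend_cls.check_backend_availability():
        raise RuntimeError("Backend '{}' is not available.".format(backend))
    return backend_cls()
```

Calling `create_group("MISSING")` raises `RuntimeError`, while the registration step itself never fails on a machine that lacks the library.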

gemini-code-assist · 2026-02-03T07:55:42Z

python/ray/util/collective/collective.py

+    if backend_cls is None:
+        raise ValueError("Backend '{}' is not supported.".format(backend))


The _registry.get() method raises a ValueError if the backend is not registered, so backend_cls will never be None. This if backend_cls is None: check is unreachable and can be removed.

This check is redundant and can be removed.

gemini-code-assist · 2026-02-03T07:55:42Z

python/ray/util/collective/collective.py

+    if backend_cls is None:
+        raise ValueError("Backend '{}' is not supported.".format(backend))


The _registry.get() method raises a ValueError if the backend is not registered, so backend_cls will never be None. This if backend_cls is None: check is unreachable and can be removed.

Same as above.

gemini-code-assist · 2026-02-03T07:55:42Z

python/ray/util/collective/examples/gloo_allreduce_example.py

+            init_collective_group(
+                world_size=world_size,
+                rank=self.rank,
+                backend=Backend.GLOO,


To better showcase the new string-based backend registration, it's preferable to use a string literal for the backend name instead of Backend.GLOO. This makes it clear that the system is not tied to the old types.Backend enum-like class.

Suggested change

-                backend=Backend.GLOO,
+                backend="GLOO",

gemini-code-assist · 2026-02-03T07:55:42Z

python/ray/util/collective/examples/gloo_allreduce_example.py

+        actors=actors,
+        world_size=2,
+        ranks=[0, 1],
+        backend=Backend.GLOO,


To better showcase the new string-based backend registration, it's preferable to use a string literal for the backend name instead of Backend.GLOO. This makes it clear that the system is not tied to the old types.Backend enum-like class.

Suggested change

-        backend=Backend.GLOO,
+        backend="GLOO",

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

cursor · 2026-02-06T01:56:27Z

python/ray/util/collective/collective.py

    )


-def _check_backend_availability(backend: types.Backend):


Type mismatch in gloo_timeout environment variable

High Severity

The os.getenv("collective_gloo_timeout", 30000) call returns a string when the environment variable is set, but an integer is expected. This causes a TypeError when the code attempts division at line 112 (gloo_timeout / 1000.0). The value needs to be converted to an integer using int(os.getenv(...)).
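
A minimal sketch of the conversion this comment asks for. The helper name `read_timeout_ms` is hypothetical; the environment variable name and the 30000 default come from the comment above.

```python
import os


def read_timeout_ms(name: str = "collective_gloo_timeout", default: int = 30000) -> int:
    """Return the timeout in milliseconds as an int, whether or not the variable is set."""
    raw = os.getenv(name)  # str when set, None when unset
    if raw is None:
        return default
    return int(raw)  # raises ValueError for non-numeric text


os.environ["collective_gloo_timeout"] = "45000"
timeout_seconds = read_timeout_ms() / 1000.0  # division now works on an int
```

Without the `int(...)` call, the division would be attempted on the string `"45000"` and raise `TypeError`.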

cursor · 2026-02-06T01:56:27Z

python/ray/util/collective/collective.py


+    backend_cls = _group_mgr._registry.get(backend)
+    if not backend_cls.check_backend_availability():
+        raise RuntimeError("Backend '{}' is not available.".format(backend))


Redundant backend availability checks

Low Severity

Backend availability is checked in three places, so each group creation runs the same check twice. Both init_collective_group and create_collective_group call backend_cls.check_backend_availability() before calling _group_mgr.create_collective_group(), which performs the same check again. The checks in the public functions are redundant and add unnecessary overhead.

Additional Locations (2)

python/ray/util/collective/collective.py#L242-L245

python/ray/util/collective/collective.py#L98-L102

cursor · 2026-02-06T01:56:27Z

python/ray/util/collective/collective.py

-            _check_backend_availability(backend)
-            logger.debug("Creating NCCL group: '{}'...".format(group_name))
-            g = NCCLGroup(world_size, rank, group_name)
+            g = backend_cls(world_size, rank, group_name, gloo_timeout)


Hardcoded GLOO check causes crashes with custom backends

Medium Severity

The hardcoded string check if backend == "GLOO": at line 104 assumes any backend named "GLOO" follows TorchGLOOGroup's signature and needs rendezvous setup. If torch.distributed is unavailable, the built-in GLOO backend isn't registered, allowing users to register a custom backend named "GLOO". When such a backend is used, the code executes GLOO-specific rendezvous logic and passes gloo_timeout as a fourth constructor parameter, which will crash if the custom backend doesn't accept that parameter.
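
The crash described here can be illustrated in isolation. In this sketch `CustomGroup` and the dict-based registry are hypothetical; the branch on the name `"GLOO"` mirrors the hardcoded check the comment refers to, and the extra positional argument plays the role of `gloo_timeout`.

```python
class CustomGroup:
    """Hypothetical custom backend with the three-argument constructor."""

    def __init__(self, world_size, rank, group_name):
        self.world_size = world_size
        self.rank = rank
        self.group_name = group_name


# A custom class registered under the built-in name.
registry = {"GLOO": CustomGroup}

backend = "GLOO"
backend_cls = registry[backend]
error = None
try:
    if backend == "GLOO":
        # Name-based branch passes a fourth argument the class does not accept.
        group = backend_cls(2, 0, "default", 30000)
    else:
        group = backend_cls(2, 0, "default")
except TypeError as exc:
    error = exc
```

Branching on the backend name rather than on the registered class ties the constructor signature to a string, which is why the custom class fails here.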

cursor · 2026-02-06T01:56:27Z

python/ray/util/collective/collective_group/nccl_collective_group.py


 logger = logging.getLogger(__name__)

+global _LOG_NCCL_WARNING, _NCCL_AVAILABLE


Unnecessary global statement at module level

Low Severity

The global _LOG_NCCL_WARNING, _NCCL_AVAILABLE statement appears at module scope where it has no effect. The global keyword is only meaningful inside functions to indicate that a variable refers to a module-level name. At module level, variables are already in the global scope, making this statement a no-op that may confuse developers.

Signed-off-by: Evelynn-V <liwenlin0223l@gmail.com>

Evelynn-V requested a review from a team as a code owner February 3, 2026 07:53

gemini-code-assist bot reviewed Feb 3, 2026

cursor bot reviewed Feb 3, 2026

Evelynn-V changed the title ~~[WIP][Core] Add register_collective_backend API for customized collective backends~~ [Core] Add register_collective_backend API for customized collective backends Feb 3, 2026

ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Feb 3, 2026

cursor bot reviewed Feb 5, 2026

cursor bot reviewed Feb 6, 2026

Evelynn-V added 5 commits February 7, 2026 16:19

add register_collective_backend api for customized collective libs

629452e

Signed-off-by: Evelynn-V <liwenlin0223l@gmail.com>

Fix the review comments and add the NCCL test

1761527

Signed-off-by: Evelynn-V <liwenlin0223l@gmail.com>

Fix the construction of the CI Chinese documents

4c6c39b

Signed-off-by: Evelynn-V <liwenlin0223l@gmail.com>

change ci/lint/pydoclint-baseline.txt

5b5f82d

Signed-off-by: Evelynn-V <liwenlin0223l@gmail.com>

reset type.backend

2e368ee

Signed-off-by: Evelynn-V <liwenlin0223l@gmail.com>

Evelynn-V force-pushed the register branch from d5b2a07 to 2e368ee Compare February 7, 2026 08:21

KaisennHu replied in five gemini-code-assist review threads Feb 4, 2026