Refactor ProcessGroup hierarchy for better type safety in NCCL/Gloo FSDP training #1761

@mullerhai

Hi JavaCPP team,
First, thank you for the great work. We have successfully implemented FSDP distributed training with CUDA 13.1 / 13.2 on Ubuntu 26.04 using javacpp-pytorch.
We encountered a type hierarchy issue with ProcessGroup that affects type safety and code structure:
Current hierarchy
public class ProcessGroup extends CustomClassHolder
public class ProcessGroupNCCL extends Backend
public class ProcessGroupGloo extends Backend
public class Backend extends CustomClassHolder
Problem
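ProcessGroupNCCL and ProcessGroupGloo extend Backend, not ProcessGroup, so neither can be assigned to a ProcessGroup variable.
Code that needs a ProcessGroup therefore has to cast, which gives up compile-time type checking during distributed initialization, especially for FSDP.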
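To make the problem concrete, here is a minimal sketch that reproduces only the inheritance shape listed under "Current hierarchy". The nested classes are empty stand-ins, not the real org.bytedeco.pytorch bindings, and the helper names (isProcessGroup, castFails) are hypothetical. It assumes nothing about how the bindings convert between the two types in practice.

```java
// Stand-in classes that mirror only the inheritance shape described above.
// They are NOT the real JavaCPP PyTorch classes.
final class CurrentHierarchySketch {
    static class CustomClassHolder {}
    static class Backend extends CustomClassHolder {}
    static class ProcessGroup extends CustomClassHolder {}
    static class ProcessGroupNCCL extends Backend {}
    static class ProcessGroupGloo extends Backend {}

    // Hypothetical helper: true only if the object really is a ProcessGroup.
    static boolean isProcessGroup(Object o) {
        return o instanceof ProcessGroup;
    }

    // Hypothetical helper: true if a downcast to ProcessGroup throws at run time.
    static boolean castFails(Object o) {
        try {
            ProcessGroup pg = (ProcessGroup) o;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The direct assignment is rejected by the compiler (incompatible types):
        // ProcessGroup pg = new ProcessGroupNCCL();

        // Going through a common supertype compiles, but the check and the
        // cast are only evaluated at run time.
        CustomClassHolder nccl = new ProcessGroupNCCL();
        System.out.println("is a ProcessGroup: " + isProcessGroup(nccl));
        System.out.println("cast fails: " + castFails(nccl));
    }
}
```

With plain Java classes shaped like this, the compiler cannot help: the sibling relationship between ProcessGroup and Backend means any conversion is deferred to run time.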
Suggested refactor (much cleaner & type-safe)
class Backend extends CustomClassHolder
class ProcessGroup extends Backend
class ProcessGroupNCCL extends ProcessGroup
class ProcessGroupGloo extends ProcessGroup
This matches PyTorch's native design and allows proper usage:
ProcessGroup pg = new ProcessGroupNCCL();
without casting. This is critical for stable FSDP distributed training in Java.
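For comparison, the same kind of sketch with the suggested hierarchy. Again the nested classes are empty stand-ins rather than the real bindings, and the create helper with its "nccl" / "gloo" strings is a hypothetical illustration, not an existing API.

```java
// Stand-in classes that mirror only the suggested inheritance shape.
// They are NOT the real JavaCPP PyTorch classes.
final class ProposedHierarchySketch {
    static class CustomClassHolder {}
    static class Backend extends CustomClassHolder {}
    static class ProcessGroup extends Backend {}
    static class ProcessGroupNCCL extends ProcessGroup {}
    static class ProcessGroupGloo extends ProcessGroup {}

    // Hypothetical factory: the return type is ProcessGroup, so callers
    // never need to cast whichever backend is chosen.
    static ProcessGroup create(String backend) {
        switch (backend) {
            case "nccl":
                return new ProcessGroupNCCL();
            case "gloo":
                return new ProcessGroupGloo();
            default:
                throw new IllegalArgumentException("unknown backend: " + backend);
        }
    }

    public static void main(String[] args) {
        // Compiles without a cast because ProcessGroupNCCL is a ProcessGroup.
        ProcessGroup pg = new ProcessGroupNCCL();
        ProcessGroup fallback = create("gloo");

        // Both are still usable wherever a Backend is expected.
        Backend asBackend = pg;
        System.out.println(pg.getClass().getSimpleName());
        System.out.println(fallback.getClass().getSimpleName());
        System.out.println(asBackend instanceof ProcessGroup);
    }
}
```

This only shows what the change would look like from the Java side. Whether the bindings can adopt this shape depends on the C++ class hierarchy they are generated from, which the sketch does not model.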
Could you please adjust the class hierarchy?
It will greatly improve reliability for Java-side distributed / FSDP training.
Thank you very much!
