Hi JavaCPP team,
First, thank you for the great work. We have successfully implemented FSDP distributed training with CUDA 13.1 / 13.2 on Ubuntu 26.04 using javacpp-pytorch.
We encountered a type hierarchy issue with ProcessGroup that affects type safety and code structure:
Current hierarchy
public class ProcessGroup extends CustomClassHolder
public class ProcessGroupNCCL extends Backend
public class ProcessGroupGloo extends Backend
public class Backend extends CustomClassHolder
Problem
ProcessGroupNCCL and ProcessGroupGloo extend Backend, not ProcessGroup.
This forces unsafe casts wherever an API expects a ProcessGroup, and it breaks type safety in distributed initialization, especially for FSDP.
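To make the problem concrete, here is a minimal, self-contained sketch. It uses empty stand-in classes that copy only the inheritance relationships quoted above. These are not the real javacpp-pytorch bindings, and the `assignable` helper is a hypothetical name introduced for illustration.

```java
public class CurrentHierarchySketch {
    // Stand-ins mirroring the hierarchy quoted above (assumption: only the
    // extends-relationships matter here; the real classes wrap native objects).
    static class CustomClassHolder {}
    static class Backend extends CustomClassHolder {}
    static class ProcessGroup extends CustomClassHolder {}
    static class ProcessGroupNCCL extends Backend {}
    static class ProcessGroupGloo extends Backend {}

    // Hypothetical helper: true when a value of type 'from' can be assigned
    // to a variable of type 'to' without a cast.
    static boolean assignable(Class<?> to, Class<?> from) {
        return to.isAssignableFrom(from);
    }

    public static void main(String[] args) {
        // With these stand-ins, the following line would be rejected by javac
        // with "incompatible types":
        // ProcessGroup pg = new ProcessGroupNCCL();

        System.out.println(assignable(ProcessGroup.class, ProcessGroupNCCL.class)); // false
        System.out.println(assignable(Backend.class, ProcessGroupNCCL.class));      // true
    }
}
```

With this layout, ProcessGroup and ProcessGroupNCCL are sibling branches under CustomClassHolder, so neither can stand in for the other.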
Suggested refactor (much cleaner & type-safe)
class Backend extends CustomClassHolder
class ProcessGroup extends Backend
class ProcessGroupNCCL extends ProcessGroup
class ProcessGroupGloo extends ProcessGroup
As far as we understand, this matches PyTorch's native design and allows direct usage:
ProcessGroup pg = new ProcessGroupNCCL();
without casting. This is critical for stable FSDP distributed training in Java.
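As a rough illustration of the suggested layout, the sketch below again uses empty stand-in classes rather than the real bindings. The `describe` method is a hypothetical helper standing in for any API that takes a ProcessGroup, and the no-argument constructors are a simplification.

```java
public class ProposedHierarchySketch {
    // Stand-ins for the suggested hierarchy (assumption: empty classes,
    // not the real javacpp-pytorch bindings).
    static class CustomClassHolder {}
    static class Backend extends CustomClassHolder {}
    static class ProcessGroup extends Backend {}
    static class ProcessGroupNCCL extends ProcessGroup {}
    static class ProcessGroupGloo extends ProcessGroup {}

    // Hypothetical helper representing any method that expects a ProcessGroup.
    static String describe(ProcessGroup pg) {
        return pg.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        // Both concrete groups are accepted without a cast.
        ProcessGroup nccl = new ProcessGroupNCCL();
        ProcessGroup gloo = new ProcessGroupGloo();
        System.out.println(describe(nccl));
        System.out.println(describe(gloo));
    }
}
```

Under this layout, code that accepts a ProcessGroup can be handed either concrete type directly.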
Could you please adjust the class hierarchy?
It will greatly improve reliability for Java-side distributed / FSDP training.
Thank you very much!