Refactor ProcessGroup hierarchy for better type safety in NCCL/Gloo FSDP training

Hi JavaCPP team,
First, thank you for the great work. We have successfully implemented FSDP distributed training with CUDA 13.1 / 13.2 on Ubuntu 26.04 using javacpp-pytorch.
We encountered a type hierarchy issue with ProcessGroup that affects type safety and code structure:
Current hierarchy
public class ProcessGroup extends CustomClassHolder
public class ProcessGroupNCCL extends Backend
public class ProcessGroupGloo extends Backend
public class Backend extends CustomClassHolder
Problem
ProcessGroupNCCL and ProcessGroupGloo extend Backend, not ProcessGroup, so neither can be assigned to a ProcessGroup reference.
This forces unsafe casts in distributed initialization code, especially for FSDP, and gives up compile-time type checking.
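To make the problem concrete, here is a minimal sketch that reproduces only the inheritance declarations listed above. The nested classes are empty stand-ins written for illustration, not the real javacpp-pytorch bindings, and `usableAsProcessGroup` is a hypothetical helper.

```java
public class CurrentHierarchySketch {
    // Empty stand-ins that mirror only the declared inheritance (assumption:
    // nothing else about the real classes matters for this illustration).
    static class CustomClassHolder {}
    static class Backend extends CustomClassHolder {}
    static class ProcessGroup extends CustomClassHolder {}
    static class ProcessGroupNCCL extends Backend {}
    static class ProcessGroupGloo extends Backend {}

    // Hypothetical helper: true only if the object can stand in for a ProcessGroup.
    static boolean usableAsProcessGroup(Object candidate) {
        return candidate instanceof ProcessGroup;
    }

    public static void main(String[] args) {
        // ProcessGroup pg = new ProcessGroupNCCL(); // rejected by javac: incompatible types
        Backend nccl = new ProcessGroupNCCL();        // Backend is the most specific shared type
        Backend gloo = new ProcessGroupGloo();
        System.out.println(usableAsProcessGroup(nccl)); // prints "false"
        System.out.println(usableAsProcessGroup(gloo)); // prints "false"
    }
}
```

With these declarations, ProcessGroup and Backend are sibling types, so the compiler cannot relate an NCCL or Gloo instance to ProcessGroup at all.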
Suggested refactor (much cleaner & type-safe)
class Backend extends CustomClassHolder
class ProcessGroup extends Backend
class ProcessGroupNCCL extends ProcessGroup
class ProcessGroupGloo extends ProcessGroup
This matches PyTorch's native design and allows the following usage without casting:
ProcessGroup pg = new ProcessGroupNCCL();
This is critical for stable FSDP distributed training in Java.
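The suggested hierarchy can be sketched the same way. Again, the nested classes are illustrative stand-ins rather than the real bindings, and `backendName` and `describe` are hypothetical names added only to show that code written against ProcessGroup would accept either implementation.

```java
public class ProposedHierarchySketch {
    // Stand-ins that mirror the suggested inheritance chain only.
    static class CustomClassHolder {}
    static class Backend extends CustomClassHolder {}
    static class ProcessGroup extends Backend {
        // Hypothetical method, present only so the subclasses can be told apart.
        String backendName() { return "undefined"; }
    }
    static class ProcessGroupNCCL extends ProcessGroup {
        @Override String backendName() { return "nccl"; }
    }
    static class ProcessGroupGloo extends ProcessGroup {
        @Override String backendName() { return "gloo"; }
    }

    // Hypothetical helper that is written once against the common ProcessGroup type.
    static String describe(ProcessGroup pg) {
        return "process group backed by " + pg.backendName();
    }

    public static void main(String[] args) {
        ProcessGroup nccl = new ProcessGroupNCCL(); // compiles without a cast
        ProcessGroup gloo = new ProcessGroupGloo(); // compiles without a cast
        System.out.println(describe(nccl)); // prints "process group backed by nccl"
        System.out.println(describe(gloo)); // prints "process group backed by gloo"
    }
}
```

Under this layout, initialization code could accept a ProcessGroup parameter and stay independent of whether NCCL or Gloo is in use.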
Could you please adjust the class hierarchy?
It will greatly improve reliability for Java-side distributed / FSDP training.
Thank you very much!

Refactor ProcessGroup hierarchy for better type safety in NCCL/Gloo FSDP training #1761

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Refactor ProcessGroup hierarchy for better type safety in NCCL/Gloo FSDP training #1761

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions