Summary
monarch-rdma is currently tightly coupled to CUDA/NVIDIA-specific components, which blocks Monarch from running on non-CUDA accelerators (e.g., Ascend NPUs). Decoupling RDMA from CUDA is required to achieve hardware-neutral accelerator support.
Key blocking dependencies
- NVIDIA-specific infrastructure: reliance on GPUDirect RDMA, `nvidia_peermem`, and NVIDIA driver assumptions prevents RDMA use on non-NVIDIA systems.
- CUDA-bound RDMA offloading: `rdmaxcel-sys` implements critical RDMA operations (e.g., `send_wqe`, `db_ring`) as CUDA kernels, binding RDMA execution to the CUDA driver API.
- CUDA-only PyTorch integration: `pytorch_segment_scanner` depends on `torch.cuda.memory._snapshot()`, tying memory inspection and registration to the CUDA backend (a rough sketch of a backend-neutral alternative follows this list).
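To make the last point concrete, here is a minimal sketch of how the memory-scanning step could be hidden behind a backend-neutral trait, with the CUDA path compiled only when a feature flag is set. The names (`Segment`, `SegmentScanner`, `CudaSegmentScanner`, the `cuda` feature) are hypothetical and not existing Monarch APIs; this is only one possible shape for the abstraction.

```rust
use std::error::Error;

/// Hypothetical, backend-neutral description of an allocator segment that
/// RDMA registration cares about: an address range plus the owning device.
pub struct Segment {
    pub addr: usize,
    pub len: usize,
    pub device_index: usize,
}

/// Hypothetical trait: each accelerator backend reports the memory segments
/// that should be registered with the RDMA NIC.
pub trait SegmentScanner: Send + Sync {
    fn scan(&self) -> Result<Vec<Segment>, Box<dyn Error>>;
}

/// CUDA implementation kept behind a Cargo feature, so non-CUDA builds never
/// touch torch.cuda or the CUDA driver.
#[cfg(feature = "cuda")]
pub struct CudaSegmentScanner;

#[cfg(feature = "cuda")]
impl SegmentScanner for CudaSegmentScanner {
    fn scan(&self) -> Result<Vec<Segment>, Box<dyn Error>> {
        // Would wrap the existing torch.cuda.memory._snapshot()-based logic
        // from pytorch_segment_scanner.
        unimplemented!()
    }
}
```

An Ascend (or other accelerator) backend would then provide its own `SegmentScanner` implementation without pulling in any CUDA dependency.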
Goal / discussion
What would be the preferred approach to refactoring monarch-rdma into a hardware-neutral RDMA layer, for example by introducing an accelerator-agnostic RDMA abstraction and isolating CUDA-specific optimizations behind optional backends?
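As a starting point for discussion, here is a hedged sketch of what such an abstraction could look like for the operations that are currently CUDA kernels in `rdmaxcel-sys`. All names (`RdmaBackend`, `HostBackend`, `CudaBackend`, the `cuda` feature) are hypothetical, and the method signatures are illustrative rather than proposals for the actual interface.

```rust
use std::error::Error;

/// Hypothetical accelerator-agnostic interface for the pieces of monarch-rdma
/// that are currently CUDA kernels in rdmaxcel-sys (WQE posting, doorbell ring).
pub trait RdmaBackend: Send + Sync {
    /// Post a send work queue entry describing `len` bytes at `laddr`/`raddr`.
    fn post_send_wqe(&self, laddr: u64, raddr: u64, len: usize, rkey: u32)
        -> Result<(), Box<dyn Error>>;

    /// Ring the queue pair's doorbell to kick off posted work.
    fn ring_doorbell(&self) -> Result<(), Box<dyn Error>>;
}

/// Default, hardware-neutral backend: the host CPU drives the NIC through
/// plain verbs, with no accelerator involvement.
pub struct HostBackend;

impl RdmaBackend for HostBackend {
    fn post_send_wqe(&self, _laddr: u64, _raddr: u64, _len: usize, _rkey: u32)
        -> Result<(), Box<dyn Error>> {
        // Would call ibv_post_send (or an equivalent verbs wrapper) here.
        Ok(())
    }

    fn ring_doorbell(&self) -> Result<(), Box<dyn Error>> {
        // No-op for the host path: posting via verbs already rings the doorbell.
        Ok(())
    }
}

/// CUDA-optimized backend, compiled only when the (hypothetical) `cuda`
/// feature is enabled, so non-NVIDIA builds carry no CUDA dependency.
#[cfg(feature = "cuda")]
pub struct CudaBackend;

#[cfg(feature = "cuda")]
impl RdmaBackend for CudaBackend {
    fn post_send_wqe(&self, _laddr: u64, _raddr: u64, _len: usize, _rkey: u32)
        -> Result<(), Box<dyn Error>> {
        // Would delegate to rdmaxcel-sys' device-side send_wqe path.
        Ok(())
    }

    fn ring_doorbell(&self) -> Result<(), Box<dyn Error>> {
        // Would delegate to rdmaxcel-sys' db_ring path.
        Ok(())
    }
}
```

With something along these lines, callers would hold a `Box<dyn RdmaBackend>` selected at startup, the host path would work on any accelerator (including Ascend NPUs), and the CUDA kernels would remain available as an opt-in optimization rather than a hard dependency. Is this roughly the direction the maintainers would prefer, or is a different split envisioned?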