
[xla:gpu] Unify CUDA allocators under cuMemCreate allocator #41069

Open
ezhulenev wants to merge 2 commits into openxla:main from ezhulenev:unify-cuda-allocators

Conversation

@ezhulenev
Contributor

One allocator to rule them ALL 🔥

Consolidate multiple CUDA allocator implementations into a single CudaDeviceAllocator backed by the CUDA VMM API, and make BFCAllocator respect user-specified alignment. This allows a single BFC pool per GPU to serve both default (256-byte aligned) and collective (VMM-granularity aligned) memory spaces, eliminating wasted memory from separate pools.

Motivation

Previously, XLA maintained separate BFC allocator pools per GPU for default and collective memory, plus several independent allocator implementations (CudaVmmAllocator, CudaCollectiveAllocator, NcclAllocator, NvshmemAllocator). The collective pool had to be sized conservatively, and memory couldn't flow between pools. Additionally, BFCAllocator ignored the alignment parameter in AllocateRaw, making it impossible to share a single pool across memory spaces with different alignment requirements.

What changed

BFCAllocator alignment support (xla/tsl/framework/)

Done in #40979

Unified CUDA VMM allocator (xla/stream_executor/cuda/)

  • Consolidated CudaVmmAllocator, CudaCollectiveAllocator, NcclAllocator into CudaDeviceAllocator
  • CudaDeviceAllocator uses cuMemCreate/cuMemAddressReserve/cuMemMap with configurable options: peer access, RDMA, fabric export
  • Symmetric memory alignment (GetSymmetricMemoryAlignment) correctly returns the granularity reported by cuMemGetAllocationGranularity with CU_MEM_ALLOC_GRANULARITY_RECOMMENDED
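
The bullets above compress a multi-step driver-API sequence. As a rough sketch (illustrative only: the function name, option plumbing, and omitted error handling are assumptions, not the PR's actual CudaDeviceAllocator code), a cuMemCreate-backed allocation looks like this:

```cpp
// Sketch of a CUDA VMM allocation path. Requires the CUDA driver API;
// fabric handle export additionally requires CUDA 12.3+. All error
// checking on CUresult return values is elided for brevity.
#include <cuda.h>

// Allocates `size` bytes of device memory via the VMM API. `size` must
// already be rounded up to the allocation granularity.
CUdeviceptr VmmAllocate(int device_ordinal, size_t size,
                        bool enable_fabric_export) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device_ordinal;
  if (enable_fabric_export) {
    // Fabric handles allow the allocation to be exported across nodes.
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
  }

  // 1. Create the physical memory allocation.
  CUmemGenericAllocationHandle handle;
  cuMemCreate(&handle, size, &prop, /*flags=*/0);

  // 2. Reserve a virtual address range and map the physical memory into it.
  CUdeviceptr ptr;
  cuMemAddressReserve(&ptr, size, /*alignment=*/0, /*addr=*/0, /*flags=*/0);
  cuMemMap(ptr, size, /*offset=*/0, handle, /*flags=*/0);

  // 3. Grant the owning device read/write access to the mapping.
  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(ptr, size, &access, /*count=*/1);
  return ptr;
}
```

Peer access would be granted by appending additional CUmemAccessDesc entries for peer devices; deallocation reverses the sequence (cuMemUnmap, cuMemAddressFree, cuMemRelease).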

MultiDeviceAdapter and TfAllocatorAdapter (xla/stream_executor/integrations/)

  • TfAllocatorAdapter accepts min_alignment parameter, passes it to BFCAllocator::AllocateRaw
  • MultiDeviceAdapter::AllocatorInfo is a documented aggregate struct with per-entry min_alignment
  • AddMemorySpaceAlias allows routing multiple memory spaces to the same underlying allocator with different alignment
  • Moved MultiDeviceAdapter implementation from header to .cc file

Client integration (xla/pjrt/gpu/)

  • Single BFC allocator per GPU serves both kDefault (256-byte aligned via kXlaAllocatedBufferAlignBytes) and kCollective (VMM-granularity aligned) memory spaces
  • NVSHMEM remains a separate BFC when enabled (backed by nvshmem_malloc)

Collectives cleanup (xla/backends/gpu/collectives/)

  • Removed GpuCollectives::Allocate/Deallocate pure-virtual methods (now default to Unimplemented)
  • Deleted NcclCollectives and RcclCollectives allocator methods and SymmetricMemoryAlignment
  • Removed ResolveGpuCollectives and allocate_granularity from gpu_executable.cc

@ezhulenev ezhulenev force-pushed the unify-cuda-allocators branch 2 times, most recently from 52ffeaa to d1aad28 on April 16, 2026 at 22:33

if (enable_fabric_export) {
CUmemAllocationProp fabric_properties = properties;
fabric_properties.requestedHandleTypes =
Contributor
This is exactly what NCCL is doing though. Why not just call ncclMemAlloc? You get all these handle checks for free.

Contributor Author

@ezhulenev ezhulenev Apr 17, 2026

Some people build XLA without NCCL, and AFAIK it also adds to the linking hell in TensorFlow, so it's easier to duplicate the code here than to deal with that.

@ezhulenev ezhulenev force-pushed the unify-cuda-allocators branch 5 times, most recently from 2ee9f68 to bf0303c on April 17, 2026 at 04:55
@ezhulenev ezhulenev force-pushed the unify-cuda-allocators branch from bf0303c to f4f5597 on April 17, 2026 at 04:57