
[xla:gpu] Unify CUDA allocators under cuMemCreate allocator #41069

Open
ezhulenev wants to merge 2 commits into openxla:main from ezhulenev:unify-cuda-allocators

Conversation

@ezhulenev
Contributor

One allocator to rule them ALL 🔥

Consolidate multiple CUDA allocator implementations into a single CudaDeviceAllocator backed by the CUDA VMM API, and make BFCAllocator respect user-specified alignment. This allows a single BFC pool per GPU to serve both default (256-byte aligned) and collective (VMM-granularity aligned) memory spaces, eliminating wasted memory from separate pools.

Motivation

Previously, XLA maintained separate BFC allocator pools per GPU for default and collective memory, plus several independent allocator implementations (CudaVmmAllocator, CudaCollectiveAllocator, NcclAllocator, NvshmemAllocator). The collective pool had to be sized conservatively, and memory couldn't flow between pools. Additionally, BFCAllocator ignored the alignment parameter in AllocateRaw, making it impossible to share a single pool across memory spaces with different alignment requirements.

What changed

BFCAllocator alignment support (xla/tsl/framework/)

Done in #40979

Unified CUDA VMM allocator (xla/stream_executor/cuda/)

  • Consolidated CudaVmmAllocator, CudaCollectiveAllocator, NcclAllocator into CudaDeviceAllocator
  • CudaDeviceAllocator uses cuMemCreate/cuMemAddressReserve/cuMemMap with configurable options: peer access, RDMA, fabric export
  • Symmetric memory alignment (GetSymmetricMemoryAlignment) correctly returns the granularity reported by cuMemGetAllocationGranularity with CU_MEM_ALLOC_GRANULARITY_RECOMMENDED
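
The bullets above compress a multi-step driver-API sequence. As a rough sketch (illustrative only: the function name, option plumbing, and omitted error handling are assumptions, not the PR's actual CudaDeviceAllocator code), a cuMemCreate-backed allocation looks like this:

```cpp
// Sketch of a CUDA VMM allocation path. Requires the CUDA driver API;
// fabric handle export additionally requires CUDA 12.3+. All error
// checking on CUresult return values is elided for brevity.
#include <cuda.h>

// Allocates `size` bytes of device memory via the VMM API. `size` must
// already be rounded up to the allocation granularity.
CUdeviceptr VmmAllocate(int device_ordinal, size_t size,
                        bool enable_fabric_export) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device_ordinal;
  if (enable_fabric_export) {
    // Fabric handles allow the allocation to be exported across nodes.
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
  }

  // 1. Create the physical memory allocation.
  CUmemGenericAllocationHandle handle;
  cuMemCreate(&handle, size, &prop, /*flags=*/0);

  // 2. Reserve a virtual address range and map the physical memory into it.
  CUdeviceptr ptr;
  cuMemAddressReserve(&ptr, size, /*alignment=*/0, /*addr=*/0, /*flags=*/0);
  cuMemMap(ptr, size, /*offset=*/0, handle, /*flags=*/0);

  // 3. Grant the owning device read/write access to the mapping.
  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(ptr, size, &access, /*count=*/1);
  return ptr;
}
```

Peer access would be granted by appending additional CUmemAccessDesc entries for peer devices; deallocation reverses the sequence (cuMemUnmap, cuMemAddressFree, cuMemRelease).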

MultiDeviceAdapter and TfAllocatorAdapter (xla/stream_executor/integrations/)

  • TfAllocatorAdapter accepts min_alignment parameter, passes it to BFCAllocator::AllocateRaw
  • MultiDeviceAdapter::AllocatorInfo is a documented aggregate struct with per-entry min_alignment
  • AddMemorySpaceAlias allows routing multiple memory spaces to the same underlying allocator with different alignment
  • Moved MultiDeviceAdapter implementation from header to .cc file

Client integration (xla/pjrt/gpu/)

  • Single BFC allocator per GPU serves both kDefault (256-byte aligned via kXlaAllocatedBufferAlignBytes) and kCollective (VMM-granularity aligned) memory spaces
  • NVSHMEM remains a separate BFC when enabled (backed by nvshmem_malloc)

Collectives cleanup (xla/backends/gpu/collectives/)

  • Removed GpuCollectives::Allocate/Deallocate pure-virtual methods (now default to Unimplemented)
  • Deleted NcclCollectives and RcclCollectives allocator methods and SymmetricMemoryAlignment
  • Removed ResolveGpuCollectives and allocate_granularity from gpu_executable.cc

@ezhulenev ezhulenev force-pushed the unify-cuda-allocators branch 2 times, most recently from 52ffeaa to d1aad28 on April 16, 2026 at 22:33

if (enable_fabric_export) {
CUmemAllocationProp fabric_properties = properties;
fabric_properties.requestedHandleTypes =
Contributor
This is exactly what NCCL is doing though. Why not just call ncclMemAlloc? You get all these handle checks for free.

Contributor Author

@ezhulenev ezhulenev Apr 17, 2026

Some people build XLA without NCCL, and AFAIK it also adds to the linking hell in TensorFlow, so it's easier to duplicate the code here than to deal with that.

@ezhulenev ezhulenev force-pushed the unify-cuda-allocators branch 5 times, most recently from 2ee9f68 to bf0303c on April 17, 2026 at 04:55
@ezhulenev ezhulenev force-pushed the unify-cuda-allocators branch from bf0303c to f4f5597 on April 17, 2026 at 04:57