[xla:gpu] Unify CUDA allocators under cuMemCreate allocator #41069
Open
ezhulenev wants to merge 2 commits into openxla:main
Conversation
Tixxx reviewed Apr 16, 2026:
```cpp
if (enable_fabric_export) {
  CUmemAllocationProp fabric_properties = properties;
  fabric_properties.requestedHandleTypes =
```
**Tixxx** (Contributor): This is exactly what NCCL is doing, though. Why not just call `ncclMemAlloc`? You get all these handle checks for free.
**ezhulenev** (Author): Some people build XLA without NCCL, and AFAIK it also adds to linking hell in TensorFlow, so it's easier to duplicate the code here than to deal with that.
One allocator to rule them ALL 🔥
Consolidate multiple CUDA allocator implementations into a single `CudaDeviceAllocator` backed by the CUDA VMM API, and make `BFCAllocator` respect user-specified alignment. This allows a single BFC pool per GPU to serve both default (256-byte aligned) and collective (VMM-granularity aligned) memory spaces, eliminating the memory wasted by separate pools.

## Motivation
Previously, XLA maintained separate BFC allocator pools per GPU for default and collective memory, plus several independent allocator implementations (`CudaVmmAllocator`, `CudaCollectiveAllocator`, `NcclAllocator`, `NvshmemAllocator`). The collective pool had to be sized conservatively, and memory couldn't flow between pools. Additionally, `BFCAllocator` ignored the `alignment` parameter in `AllocateRaw`, making it impossible to share a single pool across memory spaces with different alignment requirements.

## What changed
### BFCAllocator alignment support (`xla/tsl/framework/`)

Done in #40979.
### Unified CUDA VMM allocator (`xla/stream_executor/cuda/`)

- Merged `CudaVmmAllocator`, `CudaCollectiveAllocator`, and `NcclAllocator` into `CudaDeviceAllocator`
- `CudaDeviceAllocator` uses `cuMemCreate`/`cuMemAddressReserve`/`cuMemMap` with configurable options: peer access, RDMA, fabric export
- Symmetric memory alignment (`GetSymmetricMemoryAlignment`) correctly returns `cuMemGetAllocationGranularity(RECOMMENDED)`
### MultiDeviceAdapter and TfAllocatorAdapter (`xla/stream_executor/integrations/`)

- `TfAllocatorAdapter` accepts a `min_alignment` parameter and passes it to `BFCAllocator::AllocateRaw`
- `MultiDeviceAdapter::AllocatorInfo` is a documented aggregate struct with per-entry `min_alignment`
- `AddMemorySpaceAlias` allows routing multiple memory spaces to the same underlying allocator with different alignment
- Moved the `MultiDeviceAdapter` implementation from the header to a `.cc` file
### Client integration (`xla/pjrt/gpu/`)

- A single BFC pool serves both `kDefault` (256-byte aligned via `kXlaAllocatedBufferAlignBytes`) and `kCollective` (VMM-granularity aligned) memory spaces
- … (`nvshmem_malloc`)
### Collectives cleanup (`xla/backends/gpu/collectives/`)

- `GpuCollectives::Allocate`/`Deallocate` are no longer pure-virtual methods (now default to `Unimplemented`)
- Removed the `NcclCollectives` and `RcclCollectives` allocator methods and `SymmetricMemoryAlignment`
- Removed `ResolveGpuCollectives` and `allocate_granularity` from `gpu_executable.cc`