Conversation

muraj
Contributor

@muraj muraj commented Sep 15, 2025

No description provided.

codecov bot commented Sep 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.15%. Comparing base (ade8156) to head (b3be2d2).

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #299    +/-   ##
========================================
  Coverage   27.15%   27.15%            
========================================
  Files         190      190            
  Lines       39174    39173     -1     
  Branches    14289    14180   -109     
========================================
  Hits        10638    10638            
+ Misses      27681    27119   -562     
- Partials      855     1416   +561     

@muraj muraj self-assigned this Sep 15, 2025
@muraj muraj force-pushed the cperry/cuda-mempool branch from 2e18266 to 912ec9f on September 19, 2025 20:29
@muraj muraj force-pushed the cperry/cuda-mempool branch 3 times, most recently from 6365b91 to 3a93c5a on October 8, 2025 06:53
@muraj muraj added the enhancement label on Oct 8, 2025
@muraj muraj force-pushed the cperry/cuda-mempool branch 2 times, most recently from 7105140 to 7511712 on October 9, 2025 22:03
@lightsighter
Contributor

One question and one comment.

  • Question: How does CUDA pick the value of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD? Is there a way to configure it via the command line or an environment variable? I suspect that @manopapad is going to have opinions on what the value is going to need to be set to in order to get the behavior that he wants in different circumstances.
  • Comment: I think you need to handle instance redistricting correctly in your counting scheme. When an instance is redistricted into one or more other instances, there can be left-over bytes between the new instances (e.g. padding), as well as extra space at the end if the new instance(s) didn't fully consume the original instance; that space then needs to be counted as "freed" memory for the allocator to reuse. Legion will definitely push on the instance redistricting pathway pretty hard at the moment if the allocation sizes allow it. I'm not sure how you're going to do partial frees back to CUDA... if you don't handle it, then the underlying allocation needs to be kept alive as long as any of the ancestor instances of the original instance are alive (including if those instances are further redistricted to other instances). We might end up creating another "pool" problem here if we go down this route, because redistricted instances will effectively be mini-pools unto themselves until all their ancestors are freed up and the memory can be passed back to the CUDA pool (which then might also pass it back to the driver if it is over the limit of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, which is really the point of all of this).
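A minimal sketch of the accounting being asked for here, using hypothetical names rather than Realm's actual data structures: any bytes of the original instance that the new instances do not cover (padding gaps and the unused tail) have to be handed back to the memory's free count.

#include <cstddef>
#include <vector>

// Hypothetical helper, not Realm code: extents are offsets/sizes of the new
// instances relative to the start of the original instance and are assumed
// not to overlap.
struct Extent { std::size_t offset; std::size_t size; };

std::size_t redistrict_leftover_bytes(std::size_t original_size,
                                      const std::vector<Extent> &new_instances) {
  std::size_t covered = 0;
  for (const Extent &e : new_instances)
    covered += e.size;
  // Padding gaps between the new instances plus any unused tail must be
  // counted as "freed" so the allocator can reuse them.
  return original_size - covered;
}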

@muraj muraj force-pushed the cperry/cuda-mempool branch 2 times, most recently from 97a0406 to c713b7e on October 10, 2025 19:52
}

MemoryImpl::AllocationResult
GPUDynamicFBMemory::allocate_storage_immediate(RegionInstanceImpl *inst,
Contributor

Don't you need to define the deferrable cases in a special allocate_storage_deferrable? IIUC immediate must be immediate, whereas deferrable may be deferred.

Contributor Author

No, not really. allocate_storage_deferrable just manages the precondition and defaults to allocate_storage_immediate unless overridden by the subclass. immediate here just means "account for it and allocate immediately" rather than "allocate when the precondition is triggered". Other memories use this differentiation in order to map allocations to their corresponding releases (see LocalManagedMemory::allocate_storage_deferrable), but I don't need to do that here since I only need to know, at the time of allocation, whether there are enough bytes available.

Don't blame me for the naming here; I didn't name these functions, and no, I don't plan to refactor the naming in this change. I could separate these out if we wanted to, but I don't see any reason to, and it would just cost more in locking and relocking the memory if we did. Here, the locking is mostly minimized (though I would eventually like to migrate the InstInfo over to the instance itself and avoid needing that lock; I'm just not sure where exactly to put it at the moment).
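As a rough illustration of the split being described (simplified placeholder types, not Realm's real classes or signatures): the deferrable entry point only manages the precondition and then falls through to the immediate path, which does the accounting and the actual allocation.

#include <cstddef>
#include <functional>
#include <utility>

enum class AllocResult { SUCCESS, FAILED, DEFERRED };

struct Instance { std::size_t bytes = 0; };

// Toy "event": either already triggered, or it stores work to run when it is.
struct Event {
  bool triggered = true;
  std::function<void()> deferred_work;
  void defer(std::function<void()> fn) { deferred_work = std::move(fn); }
};

struct SimpleMemory {
  std::size_t free_bytes = 1 << 20;

  // "Immediate" = account for the bytes and allocate right now.
  AllocResult allocate_storage_immediate(Instance *inst) {
    if (inst->bytes > free_bytes)
      return AllocResult::FAILED;
    free_bytes -= inst->bytes;
    return AllocResult::SUCCESS;
  }

  // "Deferrable" only manages the precondition; once it has triggered it
  // falls through to the immediate path.  A subclass overrides it only when
  // it needs to pair allocations with specific releases.
  AllocResult allocate_storage_deferrable(Instance *inst, Event &precondition) {
    if (!precondition.triggered) {
      precondition.defer([this, inst] { (void)allocate_storage_immediate(inst); });
      return AllocResult::DEFERRED;
    }
    return allocate_storage_immediate(inst);
  }
};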

@muraj
Contributor Author

muraj commented Oct 11, 2025

  • Question: How does CUDA pick the value of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD?

CUDA doesn't. By default it is set to zero, I believe, meaning that all freed allocations are immediately released to the OS.

Is there a way to configure it via the command line or an environment variable? I suspect that @manopapad is going to have opinions on what the value is going to need to be set to in order to get the behavior that he wants in different circumstances.

I've been talking to @manopapad about this offline. Now that the automatic trim on OOM has been added, setting it to the size of the memory may be fine, but it's up for discussion.
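For concreteness, here is roughly what that would look like with the driver API (a sketch only: the stream, sizes, and the choice of the default pool are placeholders, and error handling is reduced to a single check):

#include <cuda.h>
#include <cstddef>
#include <cstdio>

static void check(CUresult res, const char *what) {
  if (res != CUDA_SUCCESS)
    std::fprintf(stderr, "%s failed (%d)\n", what, (int)res);
}

void configure_pool(CUdevice dev, CUstream stream, size_t fb_size) {
  CUmemoryPool pool;
  check(cuDeviceGetDefaultMemPool(&pool, dev), "cuDeviceGetDefaultMemPool");

  // Keep up to fb_size bytes cached in the pool instead of handing freed
  // memory back to the OS at every synchronization point.
  cuuint64_t threshold = fb_size;
  check(cuMemPoolSetAttribute(pool, CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, &threshold),
        "cuMemPoolSetAttribute");

  // On OOM, trim everything the pool is caching back to the driver and retry.
  CUdeviceptr ptr;
  size_t bytes = 64 << 20;  // placeholder allocation size
  if (cuMemAllocFromPoolAsync(&ptr, bytes, pool, stream) != CUDA_SUCCESS) {
    check(cuMemPoolTrimTo(pool, 0), "cuMemPoolTrimTo");
    check(cuMemAllocFromPoolAsync(&ptr, bytes, pool, stream), "retry alloc");
  }
  check(cuMemFreeAsync(ptr, stream), "cuMemFreeAsync");
}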

  • Comment: I think you need to handle instance redistricting correctly in your counting scheme.

I haven't followed the redistricting logic yet. This PR is still in draft form, and I haven't even written tests yet. This is more of a proof of concept for @manopapad to verify if it works for his use cases. I will look into the redistricting logic soon.

I'm not sure how you're going to do partial frees back to CUDA...

I cannot do partial frees back to CUDA here. The only way I would be able to achieve that is to use the VMM APIs and basically rewrite all the chunking and remapping logic in Realm, and then always allocate the physical chunks (i.e. CUmemGenericAllocationHandles) in increments of some known granularity (minimum 2 MiB with CUDA today). Even CUDA thinks this is a bad idea, which is why cudaMallocAsync allocates in larger increments, sacrificing some internal fragmentation for performance.
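For reference, this is roughly what the VMM path's granularity constraint looks like (a sketch: device 0 is hard-coded, and the address reservation/mapping steps are omitted):

#include <cuda.h>
#include <cstdio>

// Sketch: query the minimum physical-chunk granularity the VMM path would
// force on us (typically 2 MiB today), then create one physical chunk.
// Mapping/remapping logic is omitted; errors are ignored for brevity.
int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);
  CUcontext ctx;
  cuCtxCreate(&ctx, 0, dev);

  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = 0;  // device ordinal 0

  size_t gran = 0;
  cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
  std::printf("minimum chunk granularity: %zu bytes\n", gran);

  // Every physical chunk must be a multiple of 'gran'; a 1 KiB instance
  // would still consume a whole chunk unless we sub-allocate it ourselves.
  CUmemGenericAllocationHandle handle;
  cuMemCreate(&handle, gran, &prop, 0);
  cuMemRelease(handle);
  cuCtxDestroy(ctx);
  return 0;
}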

if you don't handle it then the underlying allocation needs to be kept alive as long as any of the ancestor instances of the original instance are alive (including if those instances are further redistricted to other instances). We might end up creating another "pool" problem here if we go down this route because redistricted instances will effectively be mini-pools unto themselves until all their ancestors are freed up and the memory can be passed back to the CUDA pool (which then might also pass it back to the driver if it is over the limit of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, which is really the point of all of this).

Right, which is why I wouldn't recommend this redistricting idea at all and would instead focus our efforts on improving the performance of the allocators in Realm so you don't have to use redistricting in the first place. Redistricting, IMHO, is basically caching on top of Realm's caching on top of CUDA's caching, which defeats the whole point of what @manopapad's use case needs: to be able to properly share resources across an application that might not know about all the different levels of caching of said resources. So let CUDA handle all the caching, using a single shared handle (CUmemoryPool) that all parts of the application can use. Then the cached resources can be reused throughout the application and it will still be fast.
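A sketch of what that single shared handle could look like (the component functions are hypothetical stand-ins for different parts of the application; error checking is omitted):

#include <cuda.h>

// One explicitly created pool shared by every component that needs device
// memory, so all caching happens in a single place.
static CUmemoryPool make_shared_pool(int device_ordinal) {
  CUmemPoolProps props = {};
  props.allocType = CU_MEM_ALLOCATION_TYPE_PINNED;
  props.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  props.location.id = device_ordinal;
  CUmemoryPool pool;
  cuMemPoolCreate(&pool, &props);
  return pool;
}

// Both "components" allocate from the same pool, so memory freed by one is
// immediately reusable by the other without a trip back to the driver.
static void component_a(CUmemoryPool pool, CUstream s) {
  CUdeviceptr p;
  cuMemAllocFromPoolAsync(&p, 32 << 20, pool, s);
  cuMemFreeAsync(p, s);
}

static void component_b(CUmemoryPool pool, CUstream s) {
  CUdeviceptr p;
  cuMemAllocFromPoolAsync(&p, 32 << 20, pool, s);  // likely reuses A's bytes
  cuMemFreeAsync(p, s);
}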

muraj and others added 7 commits October 10, 2025 22:05
* Re-add cancel notification for poisoned allocation
* Fix size request for queued_allocation when allocation is queued due to fragmentation
* Fix deallocation notification
* Fix free_bytes accounting in release on failure
* Logging and documentation
Co-authored-by: Manolis Papadakis <[email protected]>
Signed-off-by: Cory Perry <[email protected]>
@muraj muraj force-pushed the cperry/cuda-mempool branch from 6982822 to dc486f4 on October 11, 2025 05:05
@lightsighter
Contributor

This is more of a proof of concept for @manopapad to verify if it works for his use cases. I will look into the redistricting logic soon.

@manopapad be careful running Legate workloads that are memory intensive (rely on the garbage collector) or use concrete pools in Legion without the redistricting support.

I cannot do partial frees back to CUDA here.

As would be expected.

Right, which is why I wouldn't recommend this redistricting idea at all and instead focus our efforts on improving the performance of the allocators in Realm so you don't have to use redistricting at all. Redistricting, IMHO, is basically caching on top of Realm's caching on top of CUDA's caching, which defeats the whole point of what @manopapad's use case needs -- to be able to properly share resources across an application that might not know about all the different levels of caching of said resources.

We actually use redistricting for more than just garbage collection of existing instances. We also use it for reserving memory in advance of knowing what the memory will be used for, and then later "reshaping" the memory into one or more instances once we do know. I don't think we can eliminate that case and still ensure that deferred deletions/allocations are topologically sorted the way that is necessary to avoid deadlock (at least not without being excessively pessimistic and serializing mapping/execution). Even this implementation relies on the topological ordering because it assumes that all deferred deletions are going to finish in finite time (a completely reasonable assumption). If a deferred allocation ultimately comes to depend on a deferred deletion that is itself (transitively) dependent on the task performing the deferred allocation, then the program will hang. That is true in the current Realm memory allocator as well as in this one.

@manopapad you can use the eager garbage collection priority in Legion to have instances eagerly freed back to Realm as soon as they become invalid so that Realm can free them back to CUDA. We'll still need support for redistricting though to handle concrete pools. The alternative is requiring all tasks that need any dynamic memory allocation to use unbounded pools and you know what the consequences of that are.

@manopapad
Contributor

We'll still need support for redistricting though to handle concrete pools. The alternative is requiring all tasks that need any dynamic memory allocation to use unbounded pools and you know what the consequences of that are.

What if we had StanfordLegion/legion#1918? Would that still allow a pre-sized temporary pool? If the allocation out of the task-local temporary pool is done through redistricting then I assume no.

@lightsighter
Contributor

What if we had StanfordLegion/legion#1918? Would that still allow a pre-sized temporary pool? If the allocation out of the task-local temporary pool is done through redistricting then I assume no.

It would help, but maybe not enough. Certainly the non-escaping pools don't use redistricting at all; we just make external instances on top of the pool instance and the right thing happens, no redistricting required. The escaping pools are another matter. If there are multiple things escaping then we have to use redistricting, so no help there. If there is only one instance escaping and the pool is perfectly sized for it, then maybe that could be made to work. I would need to change the implementation to handle that case, though, because I usually allocate pool instances with a field size of one byte and a 1-D index space of the number of bytes in the pool. In contrast, escaping future instances usually have an index space of a single point and a field size the same as the size of the future's type, so I usually need to redistrict those accordingly. I could try to special-case this but it feels brittle. As soon as two things escape then we're back to redistricting (e.g. anything without output regions most likely).
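To make the shape mismatch concrete, here is a rough sketch of the two layouts being contrasted (illustrative sizes and field IDs; this is not the runtime's internal pool code):

#include "legion.h"

using namespace Legion;

enum FieldIDs { FID_DATA = 0 };

// Rough sketch of the layout mismatch described above.
void make_regions(Context ctx, Runtime *rt, size_t pool_bytes) {
  // Pool-shaped region: a single 1-byte field over a 1-D space with one
  // point per byte in the pool.
  IndexSpace pool_is =
      rt->create_index_space(ctx, Rect<1>(0, (long long)pool_bytes - 1));
  FieldSpace pool_fs = rt->create_field_space(ctx);
  {
    FieldAllocator fa = rt->create_field_allocator(ctx, pool_fs);
    fa.allocate_field(1 /* one byte */, FID_DATA);
  }
  LogicalRegion pool_lr = rt->create_logical_region(ctx, pool_is, pool_fs);

  // Future-shaped region: a single point with a field the size of the value
  // (double used here purely as an example type).
  IndexSpace future_is = rt->create_index_space(ctx, Rect<1>(0, 0));
  FieldSpace future_fs = rt->create_field_space(ctx);
  {
    FieldAllocator fa = rt->create_field_allocator(ctx, future_fs);
    fa.allocate_field(sizeof(double), FID_DATA);
  }
  LogicalRegion future_lr = rt->create_logical_region(ctx, future_is, future_fs);

  // Turning an instance with the first layout into one with the second is
  // where redistricting (or a special case) would have to come in.
  (void)pool_lr;
  (void)future_lr;
}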

@muraj muraj force-pushed the cperry/cuda-mempool branch from 9aa5ddf to b7836e1 on October 13, 2025 19:30
@muraj muraj force-pushed the cperry/cuda-mempool branch from b7836e1 to 06c59a8 on October 13, 2025 22:15
@muraj muraj force-pushed the cperry/cuda-mempool branch from 6c8ed86 to b3be2d2 on October 14, 2025 19:00