Conversation

muraj
Contributor

@muraj muraj commented Sep 15, 2025

No description provided.

codecov bot commented Sep 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.15%. Comparing base (ade8156) to head (b3be2d2).

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #299    +/-   ##
========================================
  Coverage   27.15%   27.15%            
========================================
  Files         190      190            
  Lines       39174    39173     -1     
  Branches    14289    14180   -109     
========================================
  Hits        10638    10638            
+ Misses      27681    27119   -562     
- Partials      855     1416   +561     

@muraj muraj self-assigned this Sep 15, 2025
@muraj muraj force-pushed the cperry/cuda-mempool branch from 2e18266 to 912ec9f on September 19, 2025 20:29
@muraj muraj force-pushed the cperry/cuda-mempool branch 3 times, most recently from 6365b91 to 3a93c5a on October 8, 2025 06:53
@muraj muraj added the enhancement label on Oct 8, 2025
@muraj muraj force-pushed the cperry/cuda-mempool branch 2 times, most recently from 7105140 to 7511712 on October 9, 2025 22:03
@lightsighter
Contributor

One question and one comment.

  • Question: How does CUDA pick the value of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD? Is there a way to configure it via the command line or an environment variable? I suspect that @manopapad is going to have opinions on what the value is going to need to be set to in order to get the behavior that he wants in different circumstances.
  • Comment: I think you need to handle instance redistricting correctly in your counting scheme. When an instance is redistricted into one or more other instances, there can be left-over bytes between the new instances (e.g. padding), as well as extra space at the end if the new instance(s) didn't fully consume the original instance; that space then needs to be counted as "freed" memory for the allocator to reuse. Legion will definitely push on the instance redistricting pathway pretty hard at the moment if the allocation sizes allow it. I'm not sure how you're going to do partial frees back to CUDA... if you don't handle it, then the underlying allocation needs to be kept alive as long as any of the ancestor instances of the original instance are alive (including if those instances are further redistricted to other instances). We might end up creating another "pool" problem here if we go down this route, because redistricted instances will effectively be mini-pools unto themselves until all their ancestors are freed up and the memory can be passed back to the CUDA pool (which then might also pass it back to the driver if it is over the limit of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, which is really the point of all of this).
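A minimal sketch of the accounting being asked for here, using hypothetical names rather than Realm's actual data structures: any bytes of the original instance that the new instances do not cover (padding gaps and the unused tail) have to be handed back to the memory's free count.

#include <cstddef>
#include <vector>

// Hypothetical helper, not Realm code: extents are offsets/sizes of the new
// instances relative to the start of the original instance and are assumed
// not to overlap.
struct Extent { std::size_t offset; std::size_t size; };

std::size_t redistrict_leftover_bytes(std::size_t original_size,
                                      const std::vector<Extent> &new_instances) {
  std::size_t covered = 0;
  for (const Extent &e : new_instances)
    covered += e.size;
  // Padding gaps between the new instances plus any unused tail must be
  // counted as "freed" so the allocator can reuse them.
  return original_size - covered;
}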

@muraj muraj force-pushed the cperry/cuda-mempool branch 2 times, most recently from 97a0406 to c713b7e on October 10, 2025 19:52
}

MemoryImpl::AllocationResult
GPUDynamicFBMemory::allocate_storage_immediate(RegionInstanceImpl *inst,
Contributor

Don't you need to define the deferrable cases in a special allocate_storage_deferrable? IIUC immediate must be immediate, whereas deferrable may be deferred.

Contributor Author

No, not really. allocate_storage_deferrable just manages the precondition and defaults to allocate_storage_immediate unless overridden by the subclass. immediate here just means "account for it and allocate immediately" rather than "allocate when the precondition is triggered". Other memories use this differentiation in order to map allocations to their corresponding releases (see LocalManagedMemory::allocate_storage_deferrable), but I don't need to do that here since I only need to know, at the time of allocation, whether there are enough bytes available.

Don't blame me for the naming here; I didn't name these functions, and no, I don't plan to refactor the naming in this change. I could separate these out if we wanted to, but I don't see any reason to, and it would just cost more in locking and relocking the memory if we did. Here, the locking is mostly minimized (though I would eventually like to migrate the InstInfo over to the instance itself and avoid needing that lock; I'm just not sure where exactly to put it at the moment).
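As a rough illustration of the split being described (simplified placeholder types, not Realm's real classes or signatures): the deferrable entry point only manages the precondition and then falls through to the immediate path, which does the accounting and the actual allocation.

#include <cstddef>
#include <functional>
#include <utility>

enum class AllocResult { SUCCESS, FAILED, DEFERRED };

struct Instance { std::size_t bytes = 0; };

// Toy "event": either already triggered, or it stores work to run when it is.
struct Event {
  bool triggered = true;
  std::function<void()> deferred_work;
  void defer(std::function<void()> fn) { deferred_work = std::move(fn); }
};

struct SimpleMemory {
  std::size_t free_bytes = 1 << 20;

  // "Immediate" = account for the bytes and allocate right now.
  AllocResult allocate_storage_immediate(Instance *inst) {
    if (inst->bytes > free_bytes)
      return AllocResult::FAILED;
    free_bytes -= inst->bytes;
    return AllocResult::SUCCESS;
  }

  // "Deferrable" only manages the precondition; once it has triggered it
  // falls through to the immediate path.  A subclass overrides it only when
  // it needs to pair allocations with specific releases.
  AllocResult allocate_storage_deferrable(Instance *inst, Event &precondition) {
    if (!precondition.triggered) {
      precondition.defer([this, inst] { (void)allocate_storage_immediate(inst); });
      return AllocResult::DEFERRED;
    }
    return allocate_storage_immediate(inst);
  }
};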

@muraj
Contributor Author

muraj commented Oct 11, 2025

  • Question: How does CUDA pick the value of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD?

CUDA doesn't. By default it is set to zero, I believe, meaning that all freed allocations are immediately released to the OS.

Is there a way to configure it via the command line or an environment variable? I suspect that @manopapad is going to have opinions on what the value is going to need to be set to in order to get the behavior that he wants in different circumstances.

I've been talking to @manopapad about this offline. Now that the automatic trim on OOM has been added, setting it to the size of the memory may be fine, but it's up for discussion.
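For concreteness, here is roughly what that would look like with the driver API (a sketch only: the stream, sizes, and the choice of the default pool are placeholders, and error handling is reduced to a single check):

#include <cuda.h>
#include <cstddef>
#include <cstdio>

static void check(CUresult res, const char *what) {
  if (res != CUDA_SUCCESS)
    std::fprintf(stderr, "%s failed (%d)\n", what, (int)res);
}

void configure_pool(CUdevice dev, CUstream stream, size_t fb_size) {
  CUmemoryPool pool;
  check(cuDeviceGetDefaultMemPool(&pool, dev), "cuDeviceGetDefaultMemPool");

  // Keep up to fb_size bytes cached in the pool instead of handing freed
  // memory back to the OS at every synchronization point.
  cuuint64_t threshold = fb_size;
  check(cuMemPoolSetAttribute(pool, CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, &threshold),
        "cuMemPoolSetAttribute");

  // On OOM, trim everything the pool is caching back to the driver and retry.
  CUdeviceptr ptr;
  size_t bytes = 64 << 20;  // placeholder allocation size
  if (cuMemAllocFromPoolAsync(&ptr, bytes, pool, stream) != CUDA_SUCCESS) {
    check(cuMemPoolTrimTo(pool, 0), "cuMemPoolTrimTo");
    check(cuMemAllocFromPoolAsync(&ptr, bytes, pool, stream), "retry alloc");
  }
  check(cuMemFreeAsync(ptr, stream), "cuMemFreeAsync");
}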

  • Comment: I think you need to handle instance redistricting correctly in your counting scheme.

I haven't followed the redistricting logic yet. This PR is still in draft form, and I haven't even written tests yet. This is more of a proof of concept for @manopapad to verify if it works for his use cases. I will look into the redistricting logic soon.

I'm not sure how you're going to do partial frees back to CUDA...

I cannot do partial frees back to CUDA here. The only way I would be able to achieve that is to use the VMM APIs and basically rewrite all the chunking and remapping logic in Realm, and then always allocate the physical chunks (i.e. CUmemGenericAllocationHandles) in increments of some known granularity (minimum 2 MiB with CUDA today). Even CUDA thinks this is a bad idea, which is why cudaMallocAsync allocates in larger increments, sacrificing some internal fragmentation for performance.
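For reference, this is roughly what the VMM path's granularity constraint looks like (a sketch: device 0 is hard-coded, and the address reservation/mapping steps are omitted):

#include <cuda.h>
#include <cstdio>

// Sketch: query the minimum physical-chunk granularity the VMM path would
// force on us (typically 2 MiB today), then create one physical chunk.
// Mapping/remapping logic is omitted; errors are ignored for brevity.
int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);
  CUcontext ctx;
  cuCtxCreate(&ctx, 0, dev);

  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = 0;  // device ordinal 0

  size_t gran = 0;
  cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
  std::printf("minimum chunk granularity: %zu bytes\n", gran);

  // Every physical chunk must be a multiple of 'gran'; a 1 KiB instance
  // would still consume a whole chunk unless we sub-allocate it ourselves.
  CUmemGenericAllocationHandle handle;
  cuMemCreate(&handle, gran, &prop, 0);
  cuMemRelease(handle);
  cuCtxDestroy(ctx);
  return 0;
}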

if you don't handle it then the underlying allocation needs to be kept alive as long as any of the ancestor instances of the original instance are alive (including if those instances are further redistricted to other instances). We might end up creating another "pool" problem here if we go down this route because redistricted instances will effectively be mini-pools unto themselves until all their ancestors are freed up and the memory can be passed back to the CUDA pool (which then might also pass it back to the driver if it is over the limit of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, which is really the point of all of this).

Right, which is why I wouldn't recommend this redistricting idea at all and would instead focus our efforts on improving the performance of the allocators in Realm so you don't have to use redistricting in the first place. Redistricting, IMHO, is basically caching on top of Realm's caching on top of CUDA's caching, which defeats the whole point of what @manopapad's use case needs: to be able to properly share resources across an application that might not know about all the different levels of caching of said resources. So let CUDA handle all the caching, using a single shared handle (CUmemoryPool) that all parts of the application can use. Then the cached resources can be reused throughout the application and it will still be fast.
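A sketch of what that single shared handle could look like (the component functions are hypothetical stand-ins for different parts of the application; error checking is omitted):

#include <cuda.h>

// One explicitly created pool shared by every component that needs device
// memory, so all caching happens in a single place.
static CUmemoryPool make_shared_pool(int device_ordinal) {
  CUmemPoolProps props = {};
  props.allocType = CU_MEM_ALLOCATION_TYPE_PINNED;
  props.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  props.location.id = device_ordinal;
  CUmemoryPool pool;
  cuMemPoolCreate(&pool, &props);
  return pool;
}

// Both "components" allocate from the same pool, so memory freed by one is
// immediately reusable by the other without a trip back to the driver.
static void component_a(CUmemoryPool pool, CUstream s) {
  CUdeviceptr p;
  cuMemAllocFromPoolAsync(&p, 32 << 20, pool, s);
  cuMemFreeAsync(p, s);
}

static void component_b(CUmemoryPool pool, CUstream s) {
  CUdeviceptr p;
  cuMemAllocFromPoolAsync(&p, 32 << 20, pool, s);  // likely reuses A's bytes
  cuMemFreeAsync(p, s);
}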

muraj and others added 7 commits October 10, 2025 22:05
* Re-add cancel notification for poisoned allocation
* Fix size request for queued_allocation when allocation is queued due to fragmentation
* Fix deallocation notification
* Fix free_bytes accounting in release on failure
* Logging and documentation
Co-authored-by: Manolis Papadakis <[email protected]>
Signed-off-by: Cory Perry <[email protected]>
@muraj muraj force-pushed the cperry/cuda-mempool branch from 6982822 to dc486f4 on October 11, 2025 05:05
@lightsighter
Contributor

This is more of a proof of concept for @manopapad to verify if it works for his use cases. I will look into the redistricting logic soon.

@manopapad be careful running Legate workloads that are memory intensive (rely on the garbage collector) or use concrete pools in Legion without the redistricting support.

I cannot do partial frees back to CUDA here.

As would be expected.

Right, which is why I wouldn't recommend this redistricting idea at all and instead focus our efforts on improving the performance of the allocators in Realm so you don't have to use redistricting at all. Redistricting, IMHO, is basically caching on top of Realm's caching on top of CUDA's caching, which defeats the whole point of what @manopapad's use case needs -- to be able to properly share resources across an application that might not know about all the different levels of caching of said resources.

We actually use redistricting for more than just garbage collection of existing instances. We also use it for reserving memory in advance of knowing what the memory will be used for, and then later "reshaping" the memory into one or more instances once we do know. I don't think we can eliminate that case and still ensure that deferred deletions/allocations are topologically sorted the way that is necessary to avoid deadlock (at least not without being excessively pessimistic and serializing mapping/execution). Even this implementation relies on the topological ordering because it assumes that all deferred deletions are going to finish in finite time (a completely reasonable assumption). If a deferred allocation ultimately comes to depend on a deferred deletion that is itself (transitively) dependent on the task performing the deferred allocation, then the program will hang. That is true in the current Realm memory allocator as well as in this one.

@manopapad you can use the eager garbage collection priority in Legion to have instances eagerly freed back to Realm as soon as they become invalid so that Realm can free them back to CUDA. We'll still need support for redistricting though to handle concrete pools. The alternative is requiring all tasks that need any dynamic memory allocation to use unbounded pools and you know what the consequences of that are.

@manopapad
Contributor

We'll still need support for redistricting though to handle concrete pools. The alternative is requiring all tasks that need any dynamic memory allocation to use unbounded pools and you know what the consequences of that are.

What if we had StanfordLegion/legion#1918? Would that still allow a pre-sized temporary pool? If the allocation out of the task-local temporary pool is done through redistricting then I assume no.

@lightsighter
Contributor

What if we had StanfordLegion/legion#1918? Would that still allow a pre-sized temporary pool? If the allocation out of the task-local temporary pool is done through redistricting then I assume no.

It would help, but maybe not enough. Certainly the non-escaping pools don't use redistricting at all; we just make external instances on top of the pool instance and the right thing happens, no redistricting required. The escaping pools are another matter. If there are multiple things escaping then we have to use redistricting, so no help there. If there is only one instance escaping and the pool is perfectly sized for it, then maybe that could be made to work. I would need to change the implementation to handle that case, though, because I usually allocate pool instances with a field size of one byte and a 1-D index space of the number of bytes in the pool. In contrast, escaping future instances usually have an index space of a single point and a field size the same as the size of the future's type, so I usually need to redistrict those accordingly. I could try to special-case this but it feels brittle. As soon as two things escape then we're back to redistricting (e.g. anything without output regions most likely).
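To make the shape mismatch concrete, here is a rough sketch of the two layouts being contrasted (illustrative sizes and field IDs; this is not the runtime's internal pool code):

#include "legion.h"

using namespace Legion;

enum FieldIDs { FID_DATA = 0 };

// Rough sketch of the layout mismatch described above.
void make_regions(Context ctx, Runtime *rt, size_t pool_bytes) {
  // Pool-shaped region: a single 1-byte field over a 1-D space with one
  // point per byte in the pool.
  IndexSpace pool_is =
      rt->create_index_space(ctx, Rect<1>(0, (long long)pool_bytes - 1));
  FieldSpace pool_fs = rt->create_field_space(ctx);
  {
    FieldAllocator fa = rt->create_field_allocator(ctx, pool_fs);
    fa.allocate_field(1 /* one byte */, FID_DATA);
  }
  LogicalRegion pool_lr = rt->create_logical_region(ctx, pool_is, pool_fs);

  // Future-shaped region: a single point with a field the size of the value
  // (double used here purely as an example type).
  IndexSpace future_is = rt->create_index_space(ctx, Rect<1>(0, 0));
  FieldSpace future_fs = rt->create_field_space(ctx);
  {
    FieldAllocator fa = rt->create_field_allocator(ctx, future_fs);
    fa.allocate_field(sizeof(double), FID_DATA);
  }
  LogicalRegion future_lr = rt->create_logical_region(ctx, future_is, future_fs);

  // Turning an instance with the first layout into one with the second is
  // where redistricting (or a special case) would have to come in.
  (void)pool_lr;
  (void)future_lr;
}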

@muraj muraj force-pushed the cperry/cuda-mempool branch from 9aa5ddf to b7836e1 on October 13, 2025 19:30
@muraj muraj force-pushed the cperry/cuda-mempool branch from b7836e1 to 06c59a8 on October 13, 2025 22:15
@muraj muraj force-pushed the cperry/cuda-mempool branch from 6c8ed86 to b3be2d2 on October 14, 2025 19:00