cuda: Replace dynamicfb allocation method with cumempool #299
base: main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```diff
@@            Coverage Diff            @@
##             main     #299     +/-  ##
=========================================
  Coverage   27.15%   27.15%
=========================================
  Files         190      190
  Lines       39174    39173       -1
  Branches    14289    14180     -109
=========================================
  Hits        10638    10638
+ Misses      27681    27119     -562
- Partials      855     1416     +561
```

☔ View full report in Codecov by Sentry.
Force-pushed 2e18266 to 912ec9f
Force-pushed 6365b91 to 3a93c5a
Force-pushed 7105140 to 7511712
One question and one comment.
Force-pushed 97a0406 to c713b7e
Review comment on:

```cpp
MemoryImpl::AllocationResult
GPUDynamicFBMemory::allocate_storage_immediate(RegionInstanceImpl *inst,
```
Don't you need to define the deferrable cases in a special `allocate_storage_deferrable`? IIUC `immediate` must be immediate, whereas `deferrable` may be deferred.
No, not really. `allocate_storage_deferrable` just manages the precondition and defaults to `allocate_storage_immediate` unless overridden by the subclass. `immediate` here just means "account for it and allocate immediately" rather than "allocate when the precondition is triggered". Other memories use this differentiation in order to map allocations to corresponding releases (see `LocalManagedMemory::allocate_storage_deferrable`), but I don't need to do that here, since I only need to know, at the time of allocation, whether there are enough bytes available.

Don't blame me for the naming here; I didn't name these functions, and no, I don't plan to refactor the naming in this change. I could separate these out if we wanted to, but I don't see any reason to, and it would just cost more in locking and relocking the memory if we did so. Here, the locking is mostly minimized (though I would eventually like to migrate the `InstInfo` over to the instance itself and avoid needing that lock; I'm just not sure where exactly to put it at the moment).
CUDA doesn't. By default it is set to zero, I believe, meaning that all freed allocations are immediately released to the OS.
I've been talking to @manopapad about this offline. With the automatic trim when we hit OOM added, setting it to the size of the memory may be fine, but it's up for discussion.
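For context, here is a minimal sketch of what configuring the pool's release threshold looks like with the CUDA driver API. The function name and the idea of setting the threshold to the full memory size are illustrative of the discussion above, not this PR's actual code; error checking is omitted:

```cuda
#include <cuda.h>

// Hypothetical helper: raise the release threshold of the device's
// default memory pool (used by cuMemAllocAsync). The default threshold
// is 0, so freed allocations are returned to the OS immediately; a
// larger threshold lets the pool cache freed memory for reuse.
void configure_pool_release_threshold(CUdevice dev, cuuint64_t threshold)
{
  CUmemoryPool pool;
  cuDeviceGetDefaultMemPool(&pool, dev);

  // e.g. threshold == size of the memory: the pool only gives memory
  // back to the OS when explicitly trimmed (or on an OOM-driven trim).
  cuMemPoolSetAttribute(pool, CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, &threshold);
}

// An explicit trim releases cached-but-unused memory back down to a
// minimum number of bytes to keep (0 = release everything unused).
void trim_pool(CUmemoryPool pool) { cuMemPoolTrimTo(pool, 0); }
```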
I haven't followed the redistricting logic yet. This PR is still in draft form, and I haven't even written tests yet. This is more of a proof of concept for @manopapad to verify if it works for his use cases. I will look into the redistricting logic soon.
I cannot do partial frees back to CUDA here. The only way I would be able to achieve that is to use the VMM APIs and basically rewrite all the chunking and remapping logic in Realm, and then always allocate the physical chunks (i.e. `CUmemGenericAllocationHandle`s) in increments of some known granularity (minimum 2MiB with CUDA today). Even CUDA thinks this is a bad idea, which is why cudaMallocAsync allocates in larger increments, sacrificing some internal fragmentation for performance.
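To make the VMM alternative concrete, this is roughly what that path involves per chunk (a sketch only, assuming device 0 and no error handling): query the granularity, create a physical chunk in multiples of it, reserve a virtual range, map, and set access. None of this is in the PR; it illustrates why it amounts to rewriting the chunking/remapping logic:

```cuda
#include <cuda.h>

// Sketch of the CUDA VMM APIs referenced above.
void vmm_chunk_sketch()
{
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = 0;  // assumed device 0

  // Physical allocations must be multiples of this granularity
  // (at least 2 MiB on current hardware).
  size_t gran = 0;
  cuMemGetAllocationGranularity(&gran, &prop,
                                CU_MEM_ALLOC_GRANULARITY_MINIMUM);

  // Create one physical chunk...
  CUmemGenericAllocationHandle handle;
  cuMemCreate(&handle, gran, &prop, 0);

  // ...reserve a virtual address range and map the chunk into it...
  CUdeviceptr va;
  cuMemAddressReserve(&va, gran, 0 /*alignment*/, 0 /*fixed addr*/, 0);
  cuMemMap(va, gran, 0 /*offset*/, handle, 0);

  // ...and make it accessible from the device.
  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(va, gran, &access, 1);
}
```

Partial frees would then mean unmapping and releasing individual chunks (`cuMemUnmap`/`cuMemRelease`) while tracking which virtual sub-ranges are still live, which is exactly the bookkeeping Realm would have to duplicate.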
Right, which is why I wouldn't recommend this redistricting idea at all, and would instead focus our efforts on improving the performance of the allocators in Realm so you don't have to use redistricting at all. Redistricting, IMHO, is basically caching on top of Realm's caching on top of CUDA's caching, which defeats the whole point of what @manopapad's use case needs: to be able to properly share resources across an application that might not know about all the different levels of caching of said resources. So let CUDA handle all the caching, using a single shared handle (`CUmemoryPool`) that all parts of the application can use. Then the cached resources can be reused throughout the application and it'll still be fast.
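The shared-pool idea above can be sketched as follows. The structure (a single explicit pool that every component allocates from) is illustrative; the names `g_app_pool`, `create_shared_pool`, and `component_alloc` are hypothetical, not Realm or Legate API:

```cuda
#include <cuda.h>

// One explicit memory pool, created once, shared by the whole
// application. CUDA does all the caching of freed memory internally.
static CUmemoryPool g_app_pool;

void create_shared_pool(int device_ordinal)
{
  CUmemPoolProps props = {};
  props.allocType = CU_MEM_ALLOCATION_TYPE_PINNED;
  props.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  props.location.id = device_ordinal;
  cuMemPoolCreate(&g_app_pool, &props);
}

// Any component (Realm, Legate, user code) allocates from the same
// pool, so memory freed by one component can be reused by another
// without any extra layer of caching on top.
CUdeviceptr component_alloc(size_t bytes, CUstream stream)
{
  CUdeviceptr ptr = 0;
  cuMemAllocFromPoolAsync(&ptr, bytes, g_app_pool, stream);
  return ptr;
}
```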
…einserting to the pending allocs queue

* Re-add cancel notification for poisoned allocation
* Fix size request for queued_allocation when allocation is queued due to fragmentation
* Fix deallocation notification
* Fix free_bytes accounting in release on failure
* Logging and documentation
Co-authored-by: Manolis Papadakis <[email protected]> Signed-off-by: Cory Perry <[email protected]>
Force-pushed 6982822 to dc486f4
@manopapad be careful running Legate workloads that are memory intensive (rely on the garbage collector) or use concrete pools in Legion without the redistricting support.
As would be expected.
We actually use redistricting for more than just garbage collection of existing instances. We also use it for reserving memory in advance of knowing what the memory will be used for, and then later "reshaping" the memory into one or more instances once we do know. I don't think we can eliminate that case and still ensure that deferred deletions/allocations are topologically sorted the way that is necessary to avoid deadlock (at least not without being excessively pessimistic and serializing mapping/execution).

Even this implementation relies on the topological ordering, because it assumes that all deferred deletions are going to finish in finite time (a completely reasonable assumption). If a deferred allocation ultimately comes to depend on a deferred deletion when that deferred deletion is (transitively) dependent on the task performing the deferred allocation, then the program will hang. That is true in the current Realm memory allocator as well as in this one.

@manopapad you can use the eager garbage collection priority in Legion to have instances eagerly freed back to Realm as soon as they become invalid, so that Realm can free them back to CUDA. We'll still need support for redistricting, though, to handle concrete pools. The alternative is requiring all tasks that need any dynamic memory allocation to use unbounded pools, and you know what the consequences of that are.
What if we had StanfordLegion/legion#1918? Would that still allow a pre-sized temporary pool? If the allocation out of the task-local temporary pool is done through redistricting, then I assume no.
It would help, but maybe not enough. Certainly the non-escaping pools don't use redistricting at all; we just make external instances on top of the pool instance and the right thing happens, no redistricting required. The escaping pools are another matter. If there are multiple things escaping, then we have to use redistricting, so no help there.

If there is only one instance escaping and the pool is perfectly sized for it, then maybe that could be made to be ok. I would need to change the implementation to handle that case, though, because I usually allocate pool instances with a field size of one byte and a 1-D index space of the number of bytes in the pool. In contrast, escaping future instances usually have an index space of a single point and a field size the same as the size of the type for the future, so I usually need to redistrict those accordingly. I could try to special-case this, but it feels brittle. As soon as two things escape, then we're back to redistricting (e.g. anything without output regions, most likely).
Force-pushed 9aa5ddf to b7836e1
Force-pushed b7836e1 to 06c59a8
Force-pushed 6c8ed86 to b3be2d2