The generalization of the caching allocator in #216 makes it easier to make various improvements to it. #211 (comment) shows a measurement indicating that the mutex in the caching allocator is a bottleneck (my studies ~2 years ago pointed more to the mutex in CUDA, but things seem to have evolved). This PR is to discuss improvement ideas, with an (ordered) plan shown below:
- Generalize the caching allocator, done in [cudadev] Generalize caching allocator #216
- Improve the interaction of `ScopedContext` and the caching allocator by having the `ScopedContext` pass a `SharedEventPtr` to the caching allocator (evolution of the ideas in [RFC] Reduce calls to cudaEventRecord() via the caching allocators cmssw#412 and [RFC] Add make_device_unique() functions to ScopedContextBase cmssw#487); see the first sketch after this list
  - This will reduce the number of CUDA events in flight, remove the need to create and destroy them, and simplify the logic, especially on the host allocator side
- Replace the `multiset` with nested vectors (device, bin) for (much) faster lookup (from [cudadev] Generalize caching allocator #216 (comment)); see the second sketch after this list
- Make locking finer grained (e.g. per bin), also illustrated in the second sketch below
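
For the `SharedEventPtr` idea, a minimal sketch of what the interaction could look like is below. The `CachingAllocator::allocate()` signature, the `ScopedContext` interface, and `makeSharedEvent()` are assumptions made for illustration, not the actual API; the point is only that the context owns one shared event, passes it to every allocation, and records it once instead of the allocator recording one event per allocation.

```cpp
// Sketch only: SharedEventPtr, CachingAllocator::allocate() and the ScopedContext
// interface below are illustrative assumptions, not the real API.
#include <cuda_runtime.h>

#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A shared, reference-counted CUDA event (cudaEvent_t is a pointer to CUevent_st).
using SharedEventPtr = std::shared_ptr<CUevent_st>;

inline SharedEventPtr makeSharedEvent() {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  return SharedEventPtr(ev, [](cudaEvent_t e) { cudaEventDestroy(e); });
}

class CachingAllocator {
public:
  // The caller provides the event that marks when the block becomes reusable,
  // so the allocator no longer needs to create/record an event per allocation.
  void* allocate(std::size_t bytes, cudaStream_t stream, SharedEventPtr event) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, bytes);  // stand-in for the real cached-block lookup
    pending_.push_back({ptr, std::move(event)});
    return ptr;
  }

private:
  std::vector<std::pair<void*, SharedEventPtr>> pending_;
};

class ScopedContext {
public:
  explicit ScopedContext(cudaStream_t stream)
      : stream_(stream), event_(makeSharedEvent()) {}

  // Every allocation made through this context shares the same event.
  void* makeDeviceBuffer(CachingAllocator& allocator, std::size_t bytes) {
    return allocator.allocate(bytes, stream_, event_);
  }

  // Record the shared event once, after the context's work has been queued.
  ~ScopedContext() { cudaEventRecord(event_.get(), stream_); }

private:
  cudaStream_t stream_;
  SharedEventPtr event_;
};
```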
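
For the nested-vector layout and the per-bin locking, a rough sketch is below. It assumes power-of-two bins and a flat vector indexed by (device, bin); the names (`BinnedCache`, `CachedBlock`, `tryReuse`) and the bin geometry are hypothetical, but it shows how the ordered `multiset` search becomes a direct index computation and how each bin can carry its own mutex instead of sharing one global lock.

```cpp
// Sketch only: class/function names and the bin geometry are assumptions.
#include <cstddef>
#include <mutex>
#include <vector>

struct CachedBlock {
  void* ptr = nullptr;
  std::size_t bytes = 0;
};

class BinnedCache {
public:
  BinnedCache(int numDevices, int minBin, int maxBin)
      : minBin_(minBin),
        binsPerDevice_(maxBin - minBin + 1),
        bins_(static_cast<std::size_t>(numDevices) * binsPerDevice_) {}

  // (device, bin) indexes directly into a flat vector, replacing the ordered
  // multiset search with an O(1) lookup.
  bool tryReuse(int device, std::size_t bytes, CachedBlock& out) {
    Bin& bin = binFor(device, bytes);
    std::lock_guard<std::mutex> lock(bin.mutex);  // per-bin lock, not a global one
    if (bin.freeBlocks.empty()) {
      return false;
    }
    out = bin.freeBlocks.back();
    bin.freeBlocks.pop_back();
    return true;
  }

  void release(int device, CachedBlock block) {
    Bin& bin = binFor(device, block.bytes);
    std::lock_guard<std::mutex> lock(bin.mutex);
    bin.freeBlocks.push_back(block);
  }

private:
  struct Bin {
    std::mutex mutex;  // finer-grained locking: one mutex per (device, bin)
    std::vector<CachedBlock> freeBlocks;
  };

  // Round the request up to the enclosing power-of-two bin.
  Bin& binFor(int device, std::size_t bytes) {
    int bin = minBin_;
    while (bin < minBin_ + binsPerDevice_ - 1 && (std::size_t{1} << bin) < bytes) {
      ++bin;
    }
    return bins_[static_cast<std::size_t>(device) * binsPerDevice_ + (bin - minBin_)];
  }

  int minBin_;
  int binsPerDevice_;
  std::vector<Bin> bins_;  // sized once at construction, so the mutexes never move
};
```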