Description
Is your feature request related to a problem? Please describe.
When processing big data, we frequently encounter situations where we need to perform computations on a large number of dataframes. For example, we may need to gather/scatter or transform hundreds to thousands of data columns. For any of these operations, we must allocate many memory buffers for both intermediate and final output columns.
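For concreteness, here is a sketch of the pattern that hits this cost: materializing N output columns performs N independent allocations, each going through the memory resource's per-call path. The function name, column count, and sizes are invented for illustration; only `rmm::device_buffer` and `rmm::cuda_stream_view` are real rmm types.

```cpp
#include <cstddef>
#include <vector>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

// Each device_buffer constructor calls into the memory resource's
// do_allocate, paying the full per-allocation cost every time.
std::vector<rmm::device_buffer> make_columns(std::size_t num_columns,
                                             std::size_t bytes_per_column,
                                             rmm::cuda_stream_view stream)
{
  std::vector<rmm::device_buffer> columns;
  columns.reserve(num_columns);
  for (std::size_t i = 0; i < num_columns; ++i) {
    columns.emplace_back(bytes_per_column, stream);  // one allocation per column
  }
  return columns;
}
```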
Describe the solution you'd like
Allocating and deallocating each memory buffer individually always incurs overhead, not to mention the latency of preparing thread-local data for the allocation/deallocation operations. Most of this overhead comes from acquiring a shared mutex and (possibly) creating a CUDA event. For example:
rmm/cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp, lines 194 to 202 at commit 889050d:

```cpp
void* do_allocate(std::size_t size, cuda_stream_view stream) override
{
  RMM_LOG_TRACE("[A][stream %s][%zuB]", rmm::detail::format_stream(stream), size);
  if (size <= 0) { return nullptr; }
  lock_guard lock(mtx_);
  auto stream_event = get_event(stream);
  // ...
```
Describe alternatives you've considered
Implement a batch processing mechanism for allocation and deallocation across the memory resource classes:
- This can start from a very simple modification: instead of locking the mutex and (possibly) creating the CUDA event for each allocation/deallocation, as is done now, the batch alloc/dealloc functions would lock the mutex and (maybe) create the CUDA event just once, then alloc/dealloc a large number of buffers before releasing the mutex (see the sketch after this list).
- Incremental improvements can be added on top of this. For example, the `get_block` function can be reimplemented to process batch alloc/dealloc more efficiently.
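A minimal sketch of the simple variant described above, assuming a toy resource where plain `std::malloc`/`std::free` stand in for the real block lookup. All class and function names here are illustrative, not existing rmm API; the point is only that the lock (and any per-stream setup) is paid once per batch instead of once per buffer.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Toy stand-in for a stream-ordered memory resource (illustrative only).
class toy_batch_resource {
 public:
  // Per-buffer path: pays the lock/setup cost on every call.
  void* allocate(std::size_t size)
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return do_allocate_unlocked(size);
  }

  // Batch path: one lock acquisition amortized over the whole batch.
  std::vector<void*> allocate_batch(std::vector<std::size_t> const& sizes)
  {
    std::vector<void*> ptrs;
    ptrs.reserve(sizes.size());
    std::lock_guard<std::mutex> lock(mtx_);
    for (auto size : sizes) { ptrs.push_back(do_allocate_unlocked(size)); }
    return ptrs;
  }

  // Batch free: the same idea on the deallocation side.
  void deallocate_batch(std::vector<void*> const& ptrs)
  {
    std::lock_guard<std::mutex> lock(mtx_);
    for (auto* ptr : ptrs) { std::free(ptr); }
  }

 private:
  // Placeholder for the real (unlocked) block lookup, e.g. a get_block-style
  // routine in stream_ordered_memory_resource.
  void* do_allocate_unlocked(std::size_t size) { return std::malloc(size); }

  std::mutex mtx_;
};
```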
Additional context
Any better way of reducing the overhead of allocating/deallocating large numbers of buffers would be helpful.