I noticed cudarc doesn't have any wrappers for the CUDA memory pool
API (cuMemPoolCreate, cuMemAllocFromPoolAsync, etc.) beyond what's
already in the sys bindings.
From what I can tell, cudarc uses cuMemAllocAsync internally for
CudaSlice allocation, which goes through the default pool. But there
doesn't seem to be a way to create custom pools, allocate from a
specific pool, or do things like trimming unused memory.
I came across this while looking at #514 and the downstream candle PR
(huggingface/candle#3352). Both deal with memory needing to persist
across CUDA graph replays, and it seems like custom pools are how
CUDA expects you to handle that.
I was thinking this could be approached like this:
- Result-level wrappers in a new pub mod mem_pool for the core
  functions: create, destroy, trim_to, get/set attribute, and the
  device-level get_default_mem_pool/get_mem_pool/set_mem_pool.
- A safe-level CudaMemPool type with Drop, plus something like
  CudaStream::alloc_from_pool() and CudaContext::default_mem_pool().
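
To make the safe-level shape a bit more concrete, here is a rough sketch of the create/trim/destroy lifecycle I have in mind. It is only an illustration: MockDriver is a made-up stand-in that records calls instead of hitting the driver (so it runs without a GPU), and the CudaMemPool shown here, its fields, and its method signatures are assumptions, not anything that exists in cudarc today.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for the driver. A real implementation would call
/// cuMemPoolCreate / cuMemPoolTrimTo / cuMemPoolDestroy through the result layer.
#[derive(Default)]
pub struct MockDriver {
    pub log: RefCell<Vec<String>>,
}

impl MockDriver {
    fn call(&self, what: String) {
        self.log.borrow_mut().push(what);
    }
}

/// Hypothetical safe-level handle; in cudarc this would wrap a CUmemoryPool.
pub struct CudaMemPool {
    driver: Rc<MockDriver>,
    id: usize,
}

impl CudaMemPool {
    /// Stands in for a wrapper around cuMemPoolCreate.
    pub fn new(driver: Rc<MockDriver>, id: usize) -> Self {
        driver.call(format!("create pool {}", id));
        Self { driver, id }
    }

    /// Stands in for a wrapper around cuMemPoolTrimTo.
    pub fn trim_to(&self, min_bytes_to_keep: usize) {
        self.driver
            .call(format!("trim pool {} to {}", self.id, min_bytes_to_keep));
    }
}

impl Drop for CudaMemPool {
    /// Stands in for cuMemPoolDestroy, run automatically when the handle goes away.
    fn drop(&mut self) {
        self.driver.call(format!("destroy pool {}", self.id));
    }
}

/// Creates a pool, trims it, lets it drop, and returns the recorded calls.
pub fn pool_lifecycle_demo() -> Vec<String> {
    let driver = Rc::new(MockDriver::default());
    {
        let pool = CudaMemPool::new(Rc::clone(&driver), 0);
        pool.trim_to(0);
    } // `pool` is dropped here, which records the destroy call
    let calls = driver.log.borrow().clone();
    calls
}
```

The point is just that the pool handle owns destruction, the same way other safe-level types own their handles. The real type would hold the context and the raw pool handle rather than a mock.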
I'm not sure about a few things, though, and would appreciate guidance:
- Should pool-allocated CudaSlices track which pool they came from?
- The pool struct (CUmemPoolProps) changed between 11.x and 12.x. Do we target a specific version?
- I'm not sure how this should interact with CudaGraph capture.
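
On the first question, one option (purely a sketch, and only relevant if we decide the pool handle should outlive its slices) would be for each pool-allocated slice to hold an Arc to its pool. Pool, PooledSlice, and alloc_from_pool below are hypothetical names, and the AtomicBool flag only stands in for the destroy call so the example runs without a GPU.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical pool handle. Setting `destroyed` stands in for cuMemPoolDestroy.
pub struct Pool {
    destroyed: Arc<AtomicBool>,
}

impl Drop for Pool {
    fn drop(&mut self) {
        self.destroyed.store(true, Ordering::SeqCst);
    }
}

/// Hypothetical pool-allocated slice that remembers its pool via `Arc`.
pub struct PooledSlice {
    pub len: usize,
    _pool: Arc<Pool>,
}

/// Hypothetical allocation entry point; no device memory is touched here.
pub fn alloc_from_pool(pool: &Arc<Pool>, len: usize) -> PooledSlice {
    PooledSlice {
        len,
        _pool: Arc::clone(pool),
    }
}

/// Returns whether the pool was destroyed (while the slice is alive, after it is dropped).
pub fn slice_tracking_demo() -> (bool, bool) {
    let destroyed = Arc::new(AtomicBool::new(false));
    let pool = Arc::new(Pool {
        destroyed: Arc::clone(&destroyed),
    });
    let slice = alloc_from_pool(&pool, 1024);

    // The user drops their handle, but the slice still holds a reference.
    drop(pool);
    let while_slice_alive = destroyed.load(Ordering::SeqCst);

    // Dropping the last slice releases the last reference, so the pool is destroyed.
    drop(slice);
    let after_slice_dropped = destroyed.load(Ordering::SeqCst);

    (while_slice_alive, after_slice_dropped)
}
```

The cost is an extra Arc per slice, and I don't know whether the driver's own lifetime handling already makes this unnecessary, which is part of why I'm asking.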
Happy to start with the result-level wrappers if this seems like a
reasonable direction.