Add experimental cuda_async_pinned_memory_resource #2164
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Force-pushed 40dfa09 to e99ed6e:
Adds a new cuda_async_pinned_memory_resource that provides stream-ordered pinned (page-locked) host memory allocation using CUDA 13.0's cudaMemGetDefaultMemPool API with cudaMemAllocationTypePinned. This parallels the cuda_async_managed_memory_resource added in rapidsai#2056 and addresses part of rapidsai#2054.

Key features:
- Uses default pinned memory pool for stream-ordered allocation
- Accessible from both host and device
- Requires CUDA 13.0+ (matches managed version for consistency)

Implementation includes:
- C++ header and implementation in cuda_async_pinned_memory_resource.hpp
- Runtime capability check in runtime_capabilities.hpp
- C++ tests in cuda_async_pinned_mr_tests.cpp
- Python bindings in experimental module
- Python tests in test_cuda_async_pinned_memory_resource.py
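For context on the commit above, here is a minimal sketch of the stream-ordered pinned allocation flow at the raw CUDA level, assuming CUDA 13.0+ and the cudaMemGetDefaultMemPool API named in the commit message (error handling omitted; the helper name is illustrative only):

```cpp
#include <cuda_runtime_api.h>

// Sketch only: fetch the default pinned (page-locked host) pool and allocate
// from it in stream order. Error checking is omitted for brevity.
void pinned_pool_flow_sketch(cudaStream_t stream)
{
  cudaMemLocation location{};
  location.type = cudaMemLocationTypeHost;  // pinned *host* memory, not device
  location.id   = 0;

  cudaMemPool_t pool{};
  cudaMemGetDefaultMemPool(&pool, &location, cudaMemAllocationTypePinned);

  void* ptr = nullptr;
  cudaMallocFromPoolAsync(&ptr, 1024, pool, stream);  // stream-ordered allocation
  // ... use ptr on host, or on device once pool access has been granted ...
  cudaFreeAsync(ptr, stream);  // stream-ordered deallocation
}
```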
Force-pushed e99ed6e to e671b34:
Enables pinned memory pool support on CUDA 12.6+ using cudaMemPoolCreate for CUDA 12.6-12.x and cudaMemGetDefaultMemPool for CUDA 13.0+. Uses unique_ptr with a deleter for automatic pool cleanup. Updates version requirements: 12.6+ for pinned.
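The "unique_ptr with a deleter" cleanup the commit mentions presumably looks something like the following sketch (the deleter body is a guess, not the PR's actual code; a later review comment argues for dropping the smart pointer entirely):

```cpp
#include <cuda_runtime_api.h>

#include <memory>

// Sketch of a custom deleter for the pool handle. Only pools created with
// cudaMemPoolCreate (the CUDA 12.6-12.x path) are destroyed; the CUDA 13.0+
// default pool is not owned and must not be destroyed.
struct pool_deleter {
  bool owns_pool{false};
  void operator()(cudaMemPool_t* pool) const
  {
    if (pool != nullptr) {
      if (owns_pool && *pool != nullptr) { cudaMemPoolDestroy(*pool); }
      delete pool;
    }
  }
};

using owned_pool_handle = std::unique_ptr<cudaMemPool_t, pool_deleter>;
```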
nirandaperera left a comment:
I have some questions on the mem pool location type.
```cpp
// CUDA 12.6-12.x: Create a new pinned memory pool (needs cleanup)
cudaMemPoolProps pool_props{};
pool_props.allocType = cudaMemAllocationTypePinned;
pool_props.location.type = cudaMemLocationTypeDevice;
```
This is setting the location type to DEVICE. Is this correct?
In the CCCL pinned mem pool, it's marked as host/host_numa:
https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__memory_resource/pinned_memory_pool.h#L113-L154
I'm wondering what is meant by pinned device memory 🤔
Yeah, this is wrong; this allocates device memory.
Fixed in 837dd55.
```cpp
  }
};

TEST_F(AsyncPinnedMRTest, BasicAllocateDeallocate)
```
I feel like all the test cases could be parameterized/templated over both the sync and async allocation and deallocation operations; a rough sketch follows below.
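Something along these lines could work: a value-parameterized fixture covering both paths. This is a sketch only; the namespace and the allocate_async/deallocate_async signatures are assumptions about the new resource's interface, not taken from the PR.

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

#include <gtest/gtest.h>

#include <cstddef>

// Sketch: one fixture covering both the sync and the stream-ordered (async)
// allocation/deallocation paths, selected by the test parameter.
class AsyncPinnedMRParamTest : public ::testing::TestWithParam<bool> {};

TEST_P(AsyncPinnedMRParamTest, BasicAllocateDeallocate)
{
  rmm::mr::cuda_async_pinned_memory_resource mr;  // assumed namespace
  rmm::cuda_stream stream;
  constexpr std::size_t size = 1024;

  if (GetParam()) {  // stream-ordered (async) path
    void* ptr = mr.allocate_async(size, stream.view());  // assumed signature
    EXPECT_NE(ptr, nullptr);
    mr.deallocate_async(ptr, size, stream.view());       // assumed signature
    stream.synchronize();
  } else {  // synchronous path
    void* ptr = mr.allocate(size);
    EXPECT_NE(ptr, nullptr);
    mr.deallocate(ptr, size);
  }
}

INSTANTIATE_TEST_SUITE_P(SyncAndAsync, AsyncPinnedMRParamTest, ::testing::Bool());
```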
```cpp
  cudaMemPool_t pool_handle = mr.pool_handle();
  EXPECT_NE(pool_handle, nullptr);
}
```
Should we also add a device -> pinned host stream-ordered copy? Maybe using a device_vector and checking that the copy produces the same values; a rough sketch follows below.
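For instance, roughly like this (a sketch only; the resource's allocate/deallocate signatures are assumptions, and error handling is kept minimal via gtest assertions):

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

#include <cuda_runtime_api.h>
#include <gtest/gtest.h>

#include <cstddef>
#include <numeric>
#include <vector>

// Sketch: fill a device buffer, copy it stream-ordered into pinned host memory
// allocated from the new resource, and check the round-tripped values.
TEST(AsyncPinnedMRCopy, DeviceToPinnedHost)
{
  rmm::mr::cuda_async_pinned_memory_resource mr;  // assumed namespace
  rmm::cuda_stream stream;

  constexpr std::size_t n = 100;
  std::vector<int> source(n);
  std::iota(source.begin(), source.end(), 0);

  rmm::device_uvector<int> device_vec(n, stream.view());
  ASSERT_EQ(cudaSuccess,
            cudaMemcpyAsync(device_vec.data(), source.data(), n * sizeof(int),
                            cudaMemcpyHostToDevice, stream.value()));

  auto* pinned = static_cast<int*>(mr.allocate(n * sizeof(int), stream.view()));
  ASSERT_EQ(cudaSuccess,
            cudaMemcpyAsync(pinned, device_vec.data(), n * sizeof(int),
                            cudaMemcpyDeviceToHost, stream.value()));
  stream.synchronize();

  for (std::size_t i = 0; i < n; ++i) {
    EXPECT_EQ(pinned[i], source[i]);
  }
  mr.deallocate(pinned, n * sizeof(int), stream.view());
}
```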
```cpp
// CUDA 13.0+: Use the default pinned memory pool (no cleanup needed)
cudaMemLocation location{.type = cudaMemLocationTypeDevice,
                         .id = rmm::get_current_cuda_device().value()};
RMM_CUDA_TRY(
  cudaMemGetDefaultMemPool(pool_handle_.get(), &location, cudaMemAllocationTypePinned));
```
This provides a mempool that allocates on device.
If you want a mempool that allocates on host and is page-locked, you need to do:
```cpp
// Note, if we don't specify HostNuma (we might want to...) then .id is ignored
cudaMemLocation location{.type = cudaMemLocationTypeHost, .id = 0};
// Non-migratable memory allocated on host.
cudaMemGetDefaultMemPool(&handle, &location, cudaMemAllocationTypePinned);
cudaMemAccessDesc desc{};
desc.location.type = cudaMemLocationTypeDevice;
desc.location.id = rmm::get_current_cuda_device().value();
desc.flags = cudaMemAccessFlagsProtReadWrite;
cudaMemPoolSetAccess(handle, &desc, 1);
```
Note moreover that if you don't set the accessibility then the allocations from this resource are not device accessible.
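For what it's worth, the pre-13.0 branch (cudaMemPoolCreate) would presumably need the same treatment. A sketch combining a host-side location (host-NUMA, as in the CCCL pool linked earlier) with explicit device access; untested, error handling elided, helper name illustrative:

```cpp
#include <cuda_runtime_api.h>

// Sketch: create a pinned host pool on CUDA 12.6-12.x and grant the current
// device read/write access, mirroring the default-pool recipe above.
cudaMemPool_t create_pinned_host_pool_sketch(int device_id)
{
  cudaMemPoolProps props{};
  props.allocType     = cudaMemAllocationTypePinned;
  props.location.type = cudaMemLocationTypeHostNuma;  // host-side location
  props.location.id   = 0;                            // NUMA node 0

  cudaMemPool_t pool{};
  cudaMemPoolCreate(&pool, &props);

  // Without this, allocations from the pool are not device accessible.
  cudaMemAccessDesc desc{};
  desc.location.type = cudaMemLocationTypeDevice;
  desc.location.id   = device_id;
  desc.flags         = cudaMemAccessFlagsProtReadWrite;
  cudaMemPoolSetAccess(pool, &desc, 1);

  return pool;
}
```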
```cpp
  // Pinned memory should be accessible from host
  // Write from host
  EXPECT_NO_THROW({
    for (int i = 0; i < 100; ++i) {
      ptr[i] = i;
    }
  });

  // Verify we can read back
  EXPECT_EQ(ptr[0], 0);
  EXPECT_EQ(ptr[50], 50);
```
We need to test that memory is accessible from device too (via some kernel probably, or maybe DtoD memcpy?)
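Something like the following sketch would cover device access via a trivial kernel (this would need a .cu test file; the allocate/deallocate signatures and namespace are again assumptions, not the PR's actual interface):

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

#include <cuda_runtime_api.h>
#include <gtest/gtest.h>

// Trivial kernel writing through the pinned host pointer; this only works if
// the pool has been made device accessible (see cudaMemPoolSetAccess above).
__global__ void fill_kernel(int* data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] = i; }
}

TEST(AsyncPinnedMRDeviceAccess, KernelWrite)
{
  rmm::mr::cuda_async_pinned_memory_resource mr;  // assumed namespace
  rmm::cuda_stream stream;

  constexpr int n = 256;
  auto* ptr = static_cast<int*>(mr.allocate(n * sizeof(int), stream.view()));

  fill_kernel<<<1, n, 0, stream.value()>>>(ptr, n);
  ASSERT_EQ(cudaSuccess, cudaGetLastError());
  stream.synchronize();

  for (int i = 0; i < n; ++i) {
    EXPECT_EQ(ptr[i], i);
  }
  mr.deallocate(ptr, n * sizeof(int), stream.view());
}
```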
```cpp
  RMM_EXPECTS(rmm::detail::runtime_async_pinned_alloc::is_supported(),
              "cuda_async_pinned_memory_resource requires CUDA 12.6 or higher runtime");

  pool_handle_.reset(new cudaMemPool_t{});
```
As noted below, there is no need to manage this handle through a smart pointer; this class can do that itself.
```cpp
  }
};

std::unique_ptr<cudaMemPool_t, pool_deleter> pool_handle_;
```
Since this is an owning object, it seems unnecessary to also have a unique_ptr. Prefer to store a raw cudaMemPool_t handle and deal with cleanup in the dtor; a sketch of that shape follows below.
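A sketch of the suggested shape (illustrative class name, not the PR's actual code):

```cpp
#include <cuda_runtime_api.h>

// Sketch: the resource stores the pool handle directly and cleans up in its
// destructor. Only pools it created itself (the pre-13.0 cudaMemPoolCreate
// path) are destroyed; the CUDA 13.0+ default pool is not owned.
class pinned_pool_owner_sketch {
 public:
  ~pinned_pool_owner_sketch()
  {
    if (owns_pool_ && pool_handle_ != nullptr) { cudaMemPoolDestroy(pool_handle_); }
  }

 private:
  cudaMemPool_t pool_handle_{};  // raw handle, no unique_ptr needed
  bool owns_pool_{false};        // set true only when cudaMemPoolCreate was used
};
```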
Description
Contributes to #2054.
Adds a new cuda_async_pinned_memory_resource that provides stream-ordered pinned (page-locked) host memory allocation using CUDA 13.0's cudaMemGetDefaultMemPool API with cudaMemAllocationTypePinned. This parallels the cuda_async_managed_memory_resource added in #2056.

Key Features

Implementation
- cpp/include/rmm/mr/cuda_async_pinned_memory_resource.hpp
- runtime_async_pinned_alloc struct added to runtime_capabilities.hpp
- cpp/tests/mr/cuda_async_pinned_mr_tests.cpp with tests for allocation, host accessibility, and pool equality
- python/rmm/rmm/tests/test_cuda_async_pinned_memory_resource.py

Follow-up Tasks
- pinned_host_memory_resource

Checklist