Conversation

@bdice (Contributor) commented Nov 25, 2025

Description

Contributes to #2054.

Adds a new cuda_async_pinned_memory_resource that provides stream-ordered pinned (page-locked) host memory allocation using CUDA 13.0's cudaMemGetDefaultMemPool API with cudaMemAllocationTypePinned.

This parallels the cuda_async_managed_memory_resource added in #2056.

Key Features

  • Uses the default pinned memory pool for stream-ordered allocation/deallocation
  • Accessible from both host and device
  • Requires CUDA 13.0+ (matches managed version for API consistency)
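
Usage Example

A minimal usage sketch (a rough illustration only: it assumes the resource lives in the rmm::mr namespace and follows the usual rmm::mr::device_memory_resource allocate/deallocate interface; the header path is taken from this PR's description and details may differ in the final version):

#include <rmm/cuda_stream.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

int main()
{
  rmm::cuda_stream stream;
  rmm::mr::cuda_async_pinned_memory_resource mr;

  // Stream-ordered allocation of 1 MiB of pinned (page-locked) host memory.
  void* ptr = mr.allocate(1 << 20, stream);

  // The allocation is stream-ordered, so synchronize before touching it from the host.
  stream.synchronize();
  static_cast<char*>(ptr)[0] = 1;  // host-accessible

  mr.deallocate(ptr, 1 << 20, stream);
  stream.synchronize();
  return 0;
}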

Implementation

  • C++ Header: cpp/include/rmm/mr/cuda_async_pinned_memory_resource.hpp
  • Runtime Capability Check: Added runtime_async_pinned_alloc struct to runtime_capabilities.hpp
  • C++ Tests: cpp/tests/mr/cuda_async_pinned_mr_tests.cpp with tests for allocation, host accessibility, and pool equality
  • Python Bindings: Added to experimental module with proper type stubs
  • Python Tests: python/rmm/rmm/tests/test_cuda_async_pinned_memory_resource.py

Follow-up Tasks

  • Determine whether to provide docs on how to set release threshold or other pool properties
  • Consider adding more comprehensive benchmarks comparing against synchronous pinned_host_memory_resource

Checklist

  • I am familiar with the Contributing Guidelines
  • New or existing tests cover these changes
  • The documentation is up to date with these changes

@bdice bdice requested review from a team as code owners November 25, 2025 23:35
@bdice bdice requested review from harrism and rongou November 25, 2025 23:35
@bdice bdice marked this pull request as draft November 25, 2025 23:35
copy-pr-bot bot commented Nov 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@bdice bdice force-pushed the feature/cuda-async-pinned-memory-resource branch from 40dfa09 to e99ed6e on November 25, 2025 at 23:45
@bdice bdice marked this pull request as ready for review November 25, 2025 23:45
@bdice bdice force-pushed the feature/cuda-async-pinned-memory-resource branch from e99ed6e to e671b34 on November 25, 2025 at 23:47
Enables pinned memory pool support on CUDA 12.6+ using cudaMemPoolCreate
for CUDA 12.6-12.x and cudaMemGetDefaultMemPool for CUDA 13.0+. Uses
unique_ptr with a deleter for automatic pool cleanup.

Updates version requirements: 12.6+ for pinned.
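
For the CUDA 12.6-12.x path, pool creation with unique_ptr-based cleanup could look roughly like this (a sketch only: make_pinned_pool and pool_deleter are illustrative names, error checking is omitted, and the HostNuma location with NUMA node 0 follows CCCL's pinned_memory_pool, anticipating the location-type discussion in the review below):

#include <cuda_runtime_api.h>
#include <memory>

struct pool_deleter {
  void operator()(cudaMemPool_t* pool) const
  {
    if (pool != nullptr && *pool != nullptr) { cudaMemPoolDestroy(*pool); }
    delete pool;
  }
};

// Hypothetical helper: create a page-locked host memory pool (CUDA 12.6-12.x path).
std::unique_ptr<cudaMemPool_t, pool_deleter> make_pinned_pool(int device_id)
{
  cudaMemPoolProps props{};
  props.allocType     = cudaMemAllocationTypePinned;
  props.location.type = cudaMemLocationTypeHostNuma;  // host-resident, page-locked
  props.location.id   = 0;                            // NUMA node 0 (assumption)

  std::unique_ptr<cudaMemPool_t, pool_deleter> pool{new cudaMemPool_t{}};
  cudaMemPoolCreate(pool.get(), &props);

  // Allow the current device to read/write allocations from this pool.
  cudaMemAccessDesc desc{};
  desc.location.type = cudaMemLocationTypeDevice;
  desc.location.id   = device_id;
  desc.flags         = cudaMemAccessFlagsProtReadWrite;
  cudaMemPoolSetAccess(*pool, &desc, 1);

  return pool;
}
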
@bdice bdice added the feature request and non-breaking labels Nov 26, 2025
@bdice bdice self-assigned this Nov 26, 2025
@bdice bdice moved this to In Progress in RMM Project Board Nov 26, 2025
@nirandaperera (Contributor) left a comment

I have some questions on the mem pool location type.

// CUDA 12.6-12.x: Create a new pinned memory pool (needs cleanup)
cudaMemPoolProps pool_props{};
pool_props.allocType = cudaMemAllocationTypePinned;
pool_props.location.type = cudaMemLocationTypeDevice;

Contributor:
This sets the location type to DEVICE. Is this correct?
In CCCL's pinned memory pool, it's marked as host/host_numa:
https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__memory_resource/pinned_memory_pool.h#L113-L154

Contributor:
I'm wondering what it means by pinned device memory 🤔

Contributor:
Yeah, this is wrong; this allocates device memory.

@bdice (author):
Fixed in 837dd55.

}
};

TEST_F(AsyncPinnedMRTest, BasicAllocateDeallocate)

Contributor:
I feel like all the test cases could be parameterized/templated to cover both the synchronous and stream-ordered (async) allocate and deallocate operations.
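
For example, something along these lines (a rough sketch only; it assumes the AsyncPinnedMRTest fixture exposes an mr member, as the pool_handle() snippet below suggests):

// Hypothetical parameterization over synchronous vs. stream-ordered operations.
class AsyncPinnedMRTestP : public AsyncPinnedMRTest,
                           public ::testing::WithParamInterface<bool> {};

TEST_P(AsyncPinnedMRTestP, AllocateDeallocate)
{
  bool const use_stream = GetParam();
  rmm::cuda_stream stream;
  void* ptr = use_stream ? mr.allocate(256, stream) : mr.allocate(256);
  EXPECT_NE(ptr, nullptr);
  if (use_stream) {
    mr.deallocate(ptr, 256, stream);
    stream.synchronize();
  } else {
    mr.deallocate(ptr, 256);
  }
}

INSTANTIATE_TEST_SUITE_P(SyncAndAsync, AsyncPinnedMRTestP, ::testing::Bool());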

cudaMemPool_t pool_handle = mr.pool_handle();
EXPECT_NE(pool_handle, nullptr);
}

Contributor:
Should we also add a device -> pinned host stream-ordered copy? Maybe using a device_vector and checking that the copy produces the same data.
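
Something like this could work (a sketch, using rmm::device_uvector in place of device_vector and assuming the fixture's mr member; includes and error handling abbreviated):

TEST_F(AsyncPinnedMRTest, DeviceToPinnedCopy)
{
  rmm::cuda_stream stream;
  std::size_t const n = 100;
  auto* ptr = static_cast<int*>(mr.allocate(n * sizeof(int), stream));

  // Fill a device buffer, then do a stream-ordered copy into the pinned allocation.
  std::vector<int> host(n);
  std::iota(host.begin(), host.end(), 0);
  rmm::device_uvector<int> dvec(n, stream);
  RMM_CUDA_TRY(cudaMemcpyAsync(
    dvec.data(), host.data(), n * sizeof(int), cudaMemcpyHostToDevice, stream.value()));
  RMM_CUDA_TRY(cudaMemcpyAsync(
    ptr, dvec.data(), n * sizeof(int), cudaMemcpyDeviceToHost, stream.value()));
  stream.synchronize();

  for (std::size_t i = 0; i < n; ++i) {
    EXPECT_EQ(ptr[i], static_cast<int>(i));
  }
  mr.deallocate(ptr, n * sizeof(int), stream);
}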

@github-project-automation github-project-automation bot moved this from In Progress to Review in RMM Project Board Nov 26, 2025
Comment on lines 65 to 69
// CUDA 13.0+: Use the default pinned memory pool (no cleanup needed)
cudaMemLocation location{.type = cudaMemLocationTypeDevice,
.id = rmm::get_current_cuda_device().value()};
RMM_CUDA_TRY(
cudaMemGetDefaultMemPool(pool_handle_.get(), &location, cudaMemAllocationTypePinned));

Contributor:
This provides a mempool that allocates on device.

If you want a mempool that allocates on host and is page-locked, you need to do:

// Note, if we don't specify HostNuma (we might want to...) then .id is ignored
cudaMemLocation location{.type = cudaMemLocationTypeHost, .id = 0};
// Non-_migratable_ memory allocated on host.
cudaMemPool_t handle{};
cudaMemGetDefaultMemPool(&handle, &location, cudaMemAllocationTypePinned);
cudaMemAccessDesc desc{};

desc.location.type = cudaMemLocationTypeDevice;
desc.location.id = rmm::get_current_cuda_device().value();
desc.flags = cudaMemAccessFlagsProtReadWrite;
cudaMemPoolSetAccess(handle, &desc, 1);

Note moreover that if you don't set the accessibility then the allocations from this resource are not device accessible.

// CUDA 12.6-12.x: Create a new pinned memory pool (needs cleanup)
cudaMemPoolProps pool_props{};
pool_props.allocType = cudaMemAllocationTypePinned;
pool_props.location.type = cudaMemLocationTypeDevice;

Contributor:
Yeah, this is wrong; this allocates device memory.

Comment on lines +55 to +65
// Pinned memory should be accessible from host
// Write from host
EXPECT_NO_THROW({
for (int i = 0; i < 100; ++i) {
ptr[i] = i;
}
});

// Verify we can read back
EXPECT_EQ(ptr[0], 0);
EXPECT_EQ(ptr[50], 50);

Contributor:
We need to test that memory is accessible from device too (via some kernel probably, or maybe DtoD memcpy?)
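
A rough sketch of a kernel-based check (this would need the test to live in a .cu file; names are illustrative and it again assumes the fixture's mr member):

// Trivial kernel that writes into the pinned buffer to prove device accessibility.
__global__ void fill_kernel(int* data, int n)
{
  int const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] = i; }
}

TEST_F(AsyncPinnedMRTest, DeviceAccessible)
{
  rmm::cuda_stream stream;
  int const n = 100;
  auto* ptr = static_cast<int*>(mr.allocate(n * sizeof(int), stream));

  fill_kernel<<<1, 128, 0, stream.value()>>>(ptr, n);
  stream.synchronize();

  for (int i = 0; i < n; ++i) {
    EXPECT_EQ(ptr[i], i);
  }
  mr.deallocate(ptr, n * sizeof(int), stream);
}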

RMM_EXPECTS(rmm::detail::runtime_async_pinned_alloc::is_supported(),
"cuda_async_pinned_memory_resource requires CUDA 12.6 or higher runtime");

pool_handle_.reset(new cudaMemPool_t{});

Contributor:
As noted below, there's no need to manage this handle through a smart pointer; this class can do that itself.

}
};

std::unique_ptr<cudaMemPool_t, pool_deleter> pool_handle_;

Contributor:
Since this is an owning object, it seems unnecessary to also have a unique_ptr. Prefer storing a raw cudaMemPool_t handle and handling cleanup in the dtor.
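
i.e., roughly this shape (a hypothetical sketch of the ownership pattern, not the actual implementation; on the CUDA 13.0+ default-pool path nothing would need destroying):

#include <cuda_runtime_api.h>

// Sketch: store the raw handle and destroy it in the destructor only if we created it.
class pinned_pool_owner {
 public:
  pinned_pool_owner(cudaMemPool_t pool, bool owns) : pool_handle_{pool}, owns_pool_{owns} {}
  ~pinned_pool_owner()
  {
    if (owns_pool_ && pool_handle_ != nullptr) { cudaMemPoolDestroy(pool_handle_); }
  }
  pinned_pool_owner(pinned_pool_owner const&)            = delete;
  pinned_pool_owner& operator=(pinned_pool_owner const&) = delete;

  cudaMemPool_t pool_handle() const noexcept { return pool_handle_; }

 private:
  cudaMemPool_t pool_handle_{};  // raw handle, no unique_ptr
  bool owns_pool_{false};        // true only for the CUDA 12.6-12.x created-pool path
};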
