Conversation

@bdice (Contributor) commented Nov 25, 2025

Description

Contributes to #2054.

Adds a new cuda_async_pinned_memory_resource that provides stream-ordered pinned (page-locked) host memory allocation using CUDA 13.0's cudaMemGetDefaultMemPool API with cudaMemAllocationTypePinned.

This parallels the cuda_async_managed_memory_resource added in #2056.

Key Features

  • Uses the default pinned memory pool for stream-ordered allocation/deallocation
  • Accessible from both host and device
  • Requires CUDA 13.0+ (matches managed version for API consistency)
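
Usage Example

A minimal usage sketch (a rough illustration only: it assumes the resource lives in the rmm::mr namespace and follows the usual rmm::mr::device_memory_resource allocate/deallocate interface; the header path is taken from this PR's description and details may differ in the final version):

#include <rmm/cuda_stream.hpp>
#include <rmm/mr/cuda_async_pinned_memory_resource.hpp>

int main()
{
  rmm::cuda_stream stream;
  rmm::mr::cuda_async_pinned_memory_resource mr;

  // Stream-ordered allocation of 1 MiB of pinned (page-locked) host memory.
  void* ptr = mr.allocate(1 << 20, stream);

  // The allocation is stream-ordered, so synchronize before touching it from the host.
  stream.synchronize();
  static_cast<char*>(ptr)[0] = 1;  // host-accessible

  mr.deallocate(ptr, 1 << 20, stream);
  stream.synchronize();
  return 0;
}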

Implementation

  • C++ Header: cpp/include/rmm/mr/cuda_async_pinned_memory_resource.hpp
  • Runtime Capability Check: Added runtime_async_pinned_alloc struct to runtime_capabilities.hpp
  • C++ Tests: cpp/tests/mr/cuda_async_pinned_mr_tests.cpp with tests for allocation, host accessibility, and pool equality
  • Python Bindings: Added to experimental module with proper type stubs
  • Python Tests: python/rmm/rmm/tests/test_cuda_async_pinned_memory_resource.py

Follow-up Tasks

  • Determine whether to provide docs on how to set release threshold or other pool properties
  • Consider adding more comprehensive benchmarks comparing against synchronous pinned_host_memory_resource

Checklist

  • I am familiar with the Contributing Guidelines
  • New or existing tests cover these changes
  • The documentation is up to date with these changes

@bdice bdice requested review from a team as code owners November 25, 2025 23:35
@bdice bdice requested review from harrism and rongou November 25, 2025 23:35
@bdice bdice marked this pull request as draft November 25, 2025 23:35
copy-pr-bot bot commented Nov 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@bdice bdice force-pushed the feature/cuda-async-pinned-memory-resource branch from 40dfa09 to e99ed6e on November 25, 2025 at 23:45
@bdice bdice marked this pull request as ready for review November 25, 2025 23:45
@bdice bdice force-pushed the feature/cuda-async-pinned-memory-resource branch from e99ed6e to e671b34 on November 25, 2025 at 23:47
Enables pinned memory pool support on CUDA 12.6+ using cudaMemPoolCreate
for CUDA 12.6-12.x and cudaMemGetDefaultMemPool for CUDA 13.0+. Uses
unique_ptr with a deleter for automatic pool cleanup.

Updates version requirements: 12.6+ for pinned.
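
For the CUDA 12.6-12.x path, pool creation with unique_ptr-based cleanup could look roughly like this (a sketch only: make_pinned_pool and pool_deleter are illustrative names, error checking is omitted, and the HostNuma location with NUMA node 0 follows CCCL's pinned_memory_pool, anticipating the location-type discussion in the review below):

#include <cuda_runtime_api.h>
#include <memory>

struct pool_deleter {
  void operator()(cudaMemPool_t* pool) const
  {
    if (pool != nullptr && *pool != nullptr) { cudaMemPoolDestroy(*pool); }
    delete pool;
  }
};

// Hypothetical helper: create a page-locked host memory pool (CUDA 12.6-12.x path).
std::unique_ptr<cudaMemPool_t, pool_deleter> make_pinned_pool(int device_id)
{
  cudaMemPoolProps props{};
  props.allocType     = cudaMemAllocationTypePinned;
  props.location.type = cudaMemLocationTypeHostNuma;  // host-resident, page-locked
  props.location.id   = 0;                            // NUMA node 0 (assumption)

  std::unique_ptr<cudaMemPool_t, pool_deleter> pool{new cudaMemPool_t{}};
  cudaMemPoolCreate(pool.get(), &props);

  // Allow the current device to read/write allocations from this pool.
  cudaMemAccessDesc desc{};
  desc.location.type = cudaMemLocationTypeDevice;
  desc.location.id   = device_id;
  desc.flags         = cudaMemAccessFlagsProtReadWrite;
  cudaMemPoolSetAccess(*pool, &desc, 1);

  return pool;
}
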
@bdice bdice added the feature request and non-breaking labels Nov 26, 2025
@bdice bdice self-assigned this Nov 26, 2025
@bdice bdice moved this to In Progress in RMM Project Board Nov 26, 2025
@nirandaperera (Contributor) left a comment

I have some questions on the mem pool location type.

// CUDA 12.6-12.x: Create a new pinned memory pool (needs cleanup)
cudaMemPoolProps pool_props{};
pool_props.allocType = cudaMemAllocationTypePinned;
pool_props.location.type = cudaMemLocationTypeDevice;

Contributor:
This sets the location type to DEVICE. Is this correct?
In CCCL's pinned memory pool, it's marked as host/host_numa:
https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__memory_resource/pinned_memory_pool.h#L113-L154

Contributor:
I'm wondering what it means by pinned device memory 🤔

Contributor:
Yeah, this is wrong; this allocates device memory.

@bdice (author):
Fixed in 837dd55.

}
};

TEST_F(AsyncPinnedMRTest, BasicAllocateDeallocate)

Contributor:
I feel like all the test cases could be parameterized/templated to cover both the synchronous and stream-ordered (async) allocate and deallocate operations.
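
For example, something along these lines (a rough sketch only; it assumes the AsyncPinnedMRTest fixture exposes an mr member, as the pool_handle() snippet below suggests):

// Hypothetical parameterization over synchronous vs. stream-ordered operations.
class AsyncPinnedMRTestP : public AsyncPinnedMRTest,
                           public ::testing::WithParamInterface<bool> {};

TEST_P(AsyncPinnedMRTestP, AllocateDeallocate)
{
  bool const use_stream = GetParam();
  rmm::cuda_stream stream;
  void* ptr = use_stream ? mr.allocate(256, stream) : mr.allocate(256);
  EXPECT_NE(ptr, nullptr);
  if (use_stream) {
    mr.deallocate(ptr, 256, stream);
    stream.synchronize();
  } else {
    mr.deallocate(ptr, 256);
  }
}

INSTANTIATE_TEST_SUITE_P(SyncAndAsync, AsyncPinnedMRTestP, ::testing::Bool());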

cudaMemPool_t pool_handle = mr.pool_handle();
EXPECT_NE(pool_handle, nullptr);
}

Contributor:
Should we also add a device -> pinned host stream-ordered copy? Maybe using a device_vector and checking that the copy produces the same data.
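
Something like this could work (a sketch, using rmm::device_uvector in place of device_vector and assuming the fixture's mr member; includes and error handling abbreviated):

TEST_F(AsyncPinnedMRTest, DeviceToPinnedCopy)
{
  rmm::cuda_stream stream;
  std::size_t const n = 100;
  auto* ptr = static_cast<int*>(mr.allocate(n * sizeof(int), stream));

  // Fill a device buffer, then do a stream-ordered copy into the pinned allocation.
  std::vector<int> host(n);
  std::iota(host.begin(), host.end(), 0);
  rmm::device_uvector<int> dvec(n, stream);
  RMM_CUDA_TRY(cudaMemcpyAsync(
    dvec.data(), host.data(), n * sizeof(int), cudaMemcpyHostToDevice, stream.value()));
  RMM_CUDA_TRY(cudaMemcpyAsync(
    ptr, dvec.data(), n * sizeof(int), cudaMemcpyDeviceToHost, stream.value()));
  stream.synchronize();

  for (std::size_t i = 0; i < n; ++i) {
    EXPECT_EQ(ptr[i], static_cast<int>(i));
  }
  mr.deallocate(ptr, n * sizeof(int), stream);
}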

@github-project-automation github-project-automation bot moved this from In Progress to Review in RMM Project Board Nov 26, 2025
Comment on lines 65 to 69
// CUDA 13.0+: Use the default pinned memory pool (no cleanup needed)
cudaMemLocation location{.type = cudaMemLocationTypeDevice,
.id = rmm::get_current_cuda_device().value()};
RMM_CUDA_TRY(
cudaMemGetDefaultMemPool(pool_handle_.get(), &location, cudaMemAllocationTypePinned));

Contributor:
This provides a mempool that allocates on device.

If you want a mempool that allocates on host and is page-locked, you need to do:

// Note, if we don't specify HostNuma (we might want to...) then .id is ignored
cudaMemLocation location{.type = cudaMemLocationTypeHost, .id = 0};
// Non-_migratable_ memory allocated on host.
cudaMemPool_t handle{};
cudaMemGetDefaultMemPool(&handle, &location, cudaMemAllocationTypePinned);
cudaMemAccessDesc desc{};

desc.location.type = cudaMemLocationTypeDevice;
desc.location.id = rmm::get_current_cuda_device().value();
desc.flags = cudaMemAccessFlagsProtReadWrite;
cudaMemPoolSetAccess(handle, &desc, 1);

Note moreover that if you don't set the accessibility then the allocations from this resource are not device accessible.

// CUDA 12.6-12.x: Create a new pinned memory pool (needs cleanup)
cudaMemPoolProps pool_props{};
pool_props.allocType = cudaMemAllocationTypePinned;
pool_props.location.type = cudaMemLocationTypeDevice;

Contributor:
Yeah, this is wrong; this allocates device memory.

Comment on lines +55 to +65
// Pinned memory should be accessible from host
// Write from host
EXPECT_NO_THROW({
for (int i = 0; i < 100; ++i) {
ptr[i] = i;
}
});

// Verify we can read back
EXPECT_EQ(ptr[0], 0);
EXPECT_EQ(ptr[50], 50);

Contributor:
We need to test that memory is accessible from device too (via some kernel probably, or maybe DtoD memcpy?)
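
A rough sketch of a kernel-based check (this would need the test to live in a .cu file; names are illustrative and it again assumes the fixture's mr member):

// Trivial kernel that writes into the pinned buffer to prove device accessibility.
__global__ void fill_kernel(int* data, int n)
{
  int const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] = i; }
}

TEST_F(AsyncPinnedMRTest, DeviceAccessible)
{
  rmm::cuda_stream stream;
  int const n = 100;
  auto* ptr = static_cast<int*>(mr.allocate(n * sizeof(int), stream));

  fill_kernel<<<1, 128, 0, stream.value()>>>(ptr, n);
  stream.synchronize();

  for (int i = 0; i < n; ++i) {
    EXPECT_EQ(ptr[i], i);
  }
  mr.deallocate(ptr, n * sizeof(int), stream);
}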

RMM_EXPECTS(rmm::detail::runtime_async_pinned_alloc::is_supported(),
"cuda_async_pinned_memory_resource requires CUDA 12.6 or higher runtime");

pool_handle_.reset(new cudaMemPool_t{});

Contributor:
As noted below, there's no need to manage this handle through a smart pointer; this class can do that itself.

}
};

std::unique_ptr<cudaMemPool_t, pool_deleter> pool_handle_;

Contributor:
Since this is an owning object, it seems unnecessary to also have a unique_ptr. Prefer storing a raw cudaMemPool_t handle and handling cleanup in the dtor.
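
i.e., roughly this shape (a hypothetical sketch of the ownership pattern, not the actual implementation; on the CUDA 13.0+ default-pool path nothing would need destroying):

#include <cuda_runtime_api.h>

// Sketch: store the raw handle and destroy it in the destructor only if we created it.
class pinned_pool_owner {
 public:
  pinned_pool_owner(cudaMemPool_t pool, bool owns) : pool_handle_{pool}, owns_pool_{owns} {}
  ~pinned_pool_owner()
  {
    if (owns_pool_ && pool_handle_ != nullptr) { cudaMemPoolDestroy(pool_handle_); }
  }
  pinned_pool_owner(pinned_pool_owner const&)            = delete;
  pinned_pool_owner& operator=(pinned_pool_owner const&) = delete;

  cudaMemPool_t pool_handle() const noexcept { return pool_handle_; }

 private:
  cudaMemPool_t pool_handle_{};  // raw handle, no unique_ptr
  bool owns_pool_{false};        // true only for the CUDA 12.6-12.x created-pool path
};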
