feat(driver): allow selecting pinned host allocation flags #580

Open
jordan-wu-97 wants to merge 4 commits into chelsea0x3b:main from jordan-wu-97:jordan/pinned-host-alloc-flag

@jordan-wu-97

Summary

  • Add CudaContext::alloc_pinned_with_flags() so callers can select CUDA pinned host allocation flags.
  • Preserve the existing behavior of alloc_pinned() by delegating with CU_MEMHOSTALLOC_WRITECOMBINED.
  • Add a GPU-backed regression test that compares CPU read performance for default and write-combined pinned allocations.

Addresses #579.

Motivation

CU_MEMHOSTALLOC_WRITECOMBINED is useful for host-written buffers, but CPU reads from write-combined memory can be much slower than reads from default pinned host memory. Callers should be able to select flags according to their host access pattern.
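
The delegation described in the summary can be sketched roughly as follows. This is a minimal, self-contained illustration and not the crate's actual implementation: `MockContext`, `MockPinned`, and the `Vec`-backed allocation are hypothetical stand-ins so the example runs without a GPU. Only the value `0x04` for `CU_MEMHOSTALLOC_WRITECOMBINED` mirrors the CUDA driver header.

```rust
/// Value of `CU_MEMHOSTALLOC_WRITECOMBINED` in the CUDA driver API header.
pub const CU_MEMHOSTALLOC_WRITECOMBINED: u32 = 0x04;

/// Hypothetical stand-in for a pinned host slice; it records the flags used.
#[derive(Debug)]
pub struct MockPinned {
    pub data: Vec<u8>,
    pub flags: u32,
}

/// Hypothetical stand-in for `CudaContext`.
pub struct MockContext;

impl MockContext {
    /// Flag-selecting entry point: the caller chooses the allocation flags.
    pub fn alloc_pinned_with_flags(&self, len: usize, flags: u32) -> MockPinned {
        // A real implementation would call into the CUDA driver here.
        MockPinned { data: vec![0u8; len], flags }
    }

    /// Convenience wrapper that keeps the pre-existing write-combined default.
    pub fn alloc_pinned(&self, len: usize) -> MockPinned {
        self.alloc_pinned_with_flags(len, CU_MEMHOSTALLOC_WRITECOMBINED)
    }
}
```

Under this shape, a buffer that the host only writes and the device reads would keep using `alloc_pinned(len)`, while a buffer that the CPU reads back would go through `alloc_pinned_with_flags` with flags that leave the write-combined bit unset.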

Testing

  • cargo fmt --check
  • cargo clippy --no-default-features --features cuda-13020,no-std,cudnn,cublas,cublaslt,nvrtc,driver,curand,nccl,dynamic-loading,cufile,cupti,nvtx,cufft --all-targets -- -D warnings
  • NVIDIA B300 pod, debug test profile: cargo test --lib --no-default-features --features cuda-12080,std,driver,dynamic-loading test_default_pinned_host_reads_are_faster_than_write_combined -- --nocapture
    • default pinned host reads: 18.811 ms
    • write-combined host reads: 1.287 s
  • NVIDIA B300 pod, release test profile: cargo test --release --lib --no-default-features --features cuda-12080,std,driver,dynamic-loading test_default_pinned_host_reads_are_faster_than_write_combined -- --nocapture
    • default pinned host reads: 1.483 ms
    • write-combined host reads: 666.643 ms
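
For readers who want to reproduce this kind of measurement, a CPU read benchmark can be sketched as below. This is an assumption about the general approach, not the test added in this PR: `time_reads` and `slowdown` are hypothetical helpers, and the sketch operates on any byte slice rather than on real pinned allocations.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sums every byte of `buf` `passes` times and returns the elapsed time
/// together with the final checksum.
pub fn time_reads(buf: &[u8], passes: usize) -> (Duration, u64) {
    let start = Instant::now();
    let mut checksum = 0u64;
    for _ in 0..passes {
        // `black_box` discourages the optimizer from eliding the reads.
        for &byte in black_box(buf) {
            checksum = checksum.wrapping_add(byte as u64);
        }
    }
    (start.elapsed(), black_box(checksum))
}

/// How many times slower `other` is than `baseline`.
pub fn slowdown(baseline: Duration, other: Duration) -> f64 {
    other.as_secs_f64() / baseline.as_secs_f64()
}
```

Applying `slowdown` to the figures reported above gives roughly 68x for the debug profile (18.811 ms versus 1.287 s) and roughly 450x for the release profile (1.483 ms versus 666.643 ms).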

@jordan-wu-97 jordan-wu-97 marked this pull request as ready for review May 27, 2026 04:24
@jordan-wu-97 jordan-wu-97 requested a review from chelsea0x3b as a code owner May 27, 2026 04:24

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces alloc_pinned_with_flags to allow allocating page-locked host memory with custom flags, and refactors alloc_pinned to use it. It also adds tests to verify the performance differences between default and write-combined host memory. A critical security vulnerability was identified where integer overflow in the byte size calculation could lead to an undersized allocation and subsequent out-of-bounds memory access.

Comment thread src/driver/safe/core.rs Outdated
Comment on lines 1434 to 1439
      ) -> Result<PinnedHostSlice<T>, DriverError> {
          self.bind_to_thread()?;
    -     let ptr = result::malloc_host(
    -         len * std::mem::size_of::<T>(),
    -         sys::CU_MEMHOSTALLOC_WRITECOMBINED,
    -     )?;
    +     let ptr = result::malloc_host(len * std::mem::size_of::<T>(), flags)?;
          let ptr = ptr as *mut T;
          assert!(!ptr.is_null());
          assert!(len * std::mem::size_of::<T>() < isize::MAX as usize);

Contributor

security-critical critical

The multiplication len * std::mem::size_of::<T>() can overflow if len is extremely large. In release mode, this wraps around to a small value, leading to an undersized allocation via malloc_host. Since the returned PinnedHostSlice still reports the original large len, subsequent writes or reads to the slice will result in out-of-bounds memory access (heap buffer overflow), which is a critical security vulnerability.

Using checked_mul and validating that the total size is less than isize::MAX before allocation prevents this vulnerability.

    ) -> Result<PinnedHostSlice<T>, DriverError> {
        self.bind_to_thread()?;
        let num_bytes = len
            .checked_mul(std::mem::size_of::<T>())
            .filter(|&bytes| bytes < isize::MAX as usize)
            .ok_or(DriverError(sys::CUresult::CUDA_ERROR_INVALID_VALUE))?;
        let ptr = result::malloc_host(num_bytes, flags)?;
        let ptr = ptr as *mut T;
        assert!(!ptr.is_null());

Author

Fixed in 671d3d4. The byte count is now computed with checked_mul before calling malloc_host, then the existing isize::MAX slice-size invariant is asserted on that checked value. I kept this as panic-based validation rather than returning DriverError(CUDA_ERROR_INVALID_VALUE), since this is a Rust slice validity invariant and preserves the existing behavior. Added test_alloc_pinned_panics_on_size_overflow; it passes in the CUDA test pod.
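
The checked size computation discussed in this thread can be illustrated with a small standalone sketch. The helper names `checked_num_bytes` and `num_bytes_or_panic` are hypothetical and are not taken from commit 671d3d4; the sketch only shows the general technique of validating the byte count before any allocation happens, in both the `Option`-returning and the panic-based style.

```rust
use std::mem::size_of;

/// Byte count for `len` elements of `T`, or `None` if the product overflows
/// `usize` or is not strictly below `isize::MAX`.
pub fn checked_num_bytes<T>(len: usize) -> Option<usize> {
    len.checked_mul(size_of::<T>())
        .filter(|&bytes| bytes < isize::MAX as usize)
}

/// Panic-based variant, in the spirit of the choice described in the reply above.
pub fn num_bytes_or_panic<T>(len: usize) -> usize {
    checked_num_bytes::<T>(len).expect("pinned allocation size exceeds the slice size limit")
}
```

For a 4-byte element type, a `len` of `usize::MAX / 4 + 2` would wrap to just 4 bytes under wrapping multiplication, which is the undersized allocation the review comment warns about. `checked_num_bytes::<u32>` returns `None` for that length instead.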

Expose alloc_pinned_with_flags so callers can choose default pinned memory when CPU reads matter while retaining the existing write-combined default.
Measure default and write-combined pinned allocations on the CPU and guard against choosing read-hostile flags for read-heavy buffers.
@jordan-wu-97 jordan-wu-97 force-pushed the jordan/pinned-host-alloc-flag branch from ae2594d to 2afd12c Compare May 27, 2026 04:27
@jordan-wu-97 jordan-wu-97 changed the title feat: allow selecting pinned host allocation flags feat(driver): allow selecting pinned host allocation flags May 27, 2026
Compute pinned allocation bytes without wrapping before calling CUDA while preserving the existing panic-based slice-size invariant. Add a regression test for overflowing typed lengths.
Keep the configurable allocation API neutral while warning users of the CPU-read cost on the convenience method that selects write-combined memory.
Comment thread src/driver/safe/core.rs
Comment on lines +1410 to +1411
/// [CudaContext::alloc_pinned_with_flags()] with `0` for default page
/// locked host memory if CPU reads matter.

Owner

I don't see 0 as an option for flags. Where in the NVIDIA docs are you seeing this?
