feat(driver): allow selecting pinned host allocation flags#580
feat(driver): allow selecting pinned host allocation flags#580jordan-wu-97 wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces alloc_pinned_with_flags to allow allocating page-locked host memory with custom flags, and refactors alloc_pinned to use it. It also adds tests to verify the performance differences between default and write-combined host memory. A critical security vulnerability was identified where integer overflow in the byte size calculation could lead to an undersized allocation and subsequent out-of-bounds memory access.
| ) -> Result<PinnedHostSlice<T>, DriverError> { | ||
| self.bind_to_thread()?; | ||
| let ptr = result::malloc_host( | ||
| len * std::mem::size_of::<T>(), | ||
| sys::CU_MEMHOSTALLOC_WRITECOMBINED, | ||
| )?; | ||
| let ptr = result::malloc_host(len * std::mem::size_of::<T>(), flags)?; | ||
| let ptr = ptr as *mut T; | ||
| assert!(!ptr.is_null()); | ||
| assert!(len * std::mem::size_of::<T>() < isize::MAX as usize); |
There was a problem hiding this comment.
The multiplication len * std::mem::size_of::<T>() can overflow if len is extremely large. In release mode, this wraps around to a small value, leading to an undersized allocation via malloc_host. Since the returned PinnedHostSlice still reports the original large len, subsequent writes or reads to the slice will result in out-of-bounds memory access (heap buffer overflow), which is a critical security vulnerability.
Using checked_mul and validating that the total size is less than isize::MAX before allocation prevents this vulnerability.
) -> Result<PinnedHostSlice<T>, DriverError> {
self.bind_to_thread()?;
let num_bytes = len
.checked_mul(std::mem::size_of::<T>())
.filter(|&bytes| bytes < isize::MAX as usize)
.ok_or(DriverError(sys::CUresult::CUDA_ERROR_INVALID_VALUE))?;
let ptr = result::malloc_host(num_bytes, flags)?;
let ptr = ptr as *mut T;
assert!(!ptr.is_null());There was a problem hiding this comment.
Fixed in 671d3d4. The byte count is now computed with checked_mul before calling malloc_host, then the existing isize::MAX slice-size invariant is asserted on that checked value. I kept this as panic-based validation rather than returning DriverError(CUDA_ERROR_INVALID_VALUE), since this is a Rust slice validity invariant and preserves the existing behavior. Added test_alloc_pinned_panics_on_size_overflow; it passes in the CUDA test pod.
Expose alloc_pinned_with_flags so callers can choose default pinned memory when CPU reads matter while retaining the existing write-combined default.
Measure default and write-combined pinned allocations on the CPU and guard against choosing read-hostile flags for read-heavy buffers.
ae2594d to
2afd12c
Compare
Compute pinned allocation bytes without wrapping before calling CUDA while preserving the existing panic-based slice-size invariant. Add a regression test for overflowing typed lengths.
Keep the configurable allocation API neutral while warning users of the CPU-read cost on the convenience method that selects write-combined memory.
| /// [CudaContext::alloc_pinned_with_flags()] with `0` for default page | ||
| /// locked host memory if CPU reads matter. |
There was a problem hiding this comment.
I don't see 0 as an option for flags, where are you seeing this from the nvidia docs?
Summary
CudaContext::alloc_pinned_with_flags()so callers can select CUDA pinned host allocation flags.alloc_pinned()by delegating withCU_MEMHOSTALLOC_WRITECOMBINED.Addresses #579.
Motivation
CU_MEMHOSTALLOC_WRITECOMBINEDis useful for host-written buffers, but CPU reads from write-combined memory can be much slower than reads from default pinned host memory. Callers should be able to select flags according to their host access pattern.Testing
cargo fmt --checkcargo clippy --no-default-features --features cuda-13020,no-std,cudnn,cublas,cublaslt,nvrtc,driver,curand,nccl,dynamic-loading,cufile,cupti,nvtx,cufft --all-targets -- -D warningscargo test --lib --no-default-features --features cuda-12080,std,driver,dynamic-loading test_default_pinned_host_reads_are_faster_than_write_combined -- --nocapture18.811 ms1.287 scargo test --release --lib --no-default-features --features cuda-12080,std,driver,dynamic-loading test_default_pinned_host_reads_are_faster_than_write_combined -- --nocapture1.483 ms666.643 ms