Problem
CudaContext::alloc_pinned() currently creates PinnedHostSlice<T> allocations with
CU_MEMHOSTALLOC_WRITECOMBINED unconditionally:
impl CudaContext {
    /// Allocates page locked host memory with [sys::CU_MEMHOSTALLOC_WRITECOMBINED] flags.
    ///
    /// See [cuda docs](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9)
    ///
    /// # Safety
    /// 1. This is unsafe because the memory is unset after this call.
    pub unsafe fn alloc_pinned<T: DeviceRepr>(
        self: &Arc<Self>,
        len: usize,
    ) -> Result<PinnedHostSlice<T>, DriverError> {
        self.bind_to_thread()?;
        let ptr = result::malloc_host(
            len * std::mem::size_of::<T>(),
            sys::CU_MEMHOSTALLOC_WRITECOMBINED,
        )?;
        let ptr = ptr as *mut T;
        assert!(!ptr.is_null());
        assert!(len * std::mem::size_of::<T>() < isize::MAX as usize);
        assert!(ptr.is_aligned());
        let event = self.new_event(Some(sys::CUevent_flags::CU_EVENT_BLOCKING_SYNC))?;
        Ok(PinnedHostSlice { ptr, len, event })
    }
}
Write-combined pinned memory can be useful when host memory is primarily written by
the CPU and transferred to the device. However, it has poor CPU read performance.
CUDA's documentation notes that reading from write-combined host memory on the CPU
is prohibitively slow.
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9
Because pinned buffers can have different host access patterns, selecting
CU_MEMHOSTALLOC_WRITECOMBINED unconditionally makes PinnedHostSlice unsuitable
for cases where CPU reads matter.
Proposal
Provide an API that allows callers to select the flags used for pinned host
allocations while preserving existing behavior for current callers.
For example:
pub unsafe fn alloc_pinned_with_flags<T: DeviceRepr>(
    self: &Arc<Self>,
    len: usize,
    flags: u32,
) -> Result<PinnedHostSlice<T>, DriverError>;
The existing method could continue using write-combined memory for backwards compatibility:
pub unsafe fn alloc_pinned<T: DeviceRepr>(
    self: &Arc<Self>,
    len: usize,
) -> Result<PinnedHostSlice<T>, DriverError> {
    self.alloc_pinned_with_flags(len, sys::CU_MEMHOSTALLOC_WRITECOMBINED)
}
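To illustrate the delegation pattern without needing a GPU, here is a minimal, self-contained sketch. `MockContext` and `MockPinnedSlice` are hypothetical stand-ins (not cudarc types) that only record the requested flags; no driver call is made. The three constants mirror the `CU_MEMHOSTALLOC_*` values from CUDA's `cuda.h`. The real method would remain `unsafe` and return a `Result`, as in the signatures above.

```rust
// Flag values as defined for cuMemHostAlloc in CUDA's cuda.h.
pub const CU_MEMHOSTALLOC_PORTABLE: u32 = 0x01;
pub const CU_MEMHOSTALLOC_DEVICEMAP: u32 = 0x02;
pub const CU_MEMHOSTALLOC_WRITECOMBINED: u32 = 0x04;

/// Hypothetical stand-in for `PinnedHostSlice<T>`: records the request only.
#[derive(Debug, PartialEq)]
pub struct MockPinnedSlice {
    pub len: usize,
    pub flags: u32,
}

/// Hypothetical stand-in for `CudaContext`; performs no driver calls.
pub struct MockContext;

impl MockContext {
    /// Proposed entry point: the caller chooses the allocation flags.
    pub fn alloc_pinned_with_flags(&self, len: usize, flags: u32) -> MockPinnedSlice {
        MockPinnedSlice { len, flags }
    }

    /// Existing entry point: keeps today's write-combined default by delegating.
    pub fn alloc_pinned(&self, len: usize) -> MockPinnedSlice {
        self.alloc_pinned_with_flags(len, CU_MEMHOSTALLOC_WRITECOMBINED)
    }
}
```

With this shape, a caller that reads results back on the CPU could pass `0` (plain page-locked memory) or `CU_MEMHOSTALLOC_PORTABLE | CU_MEMHOSTALLOC_DEVICEMAP`, while existing callers of `alloc_pinned` see no change in behavior.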
Reference: the current alloc_pinned() implementation quoted under Problem is at cudarc/src/driver/safe/core.rs, lines 1405 to 1428 in 3e5d38b.