Allow configuring allocation flags for PinnedHostSlice #579

@jordan-wu-97

Problem

CudaContext::alloc_pinned() currently creates PinnedHostSlice<T> allocations with
CU_MEMHOSTALLOC_WRITECOMBINED unconditionally:

cudarc/src/driver/safe/core.rs

Lines 1405 to 1428 in 3e5d38b

impl CudaContext {
    /// Allocates page locked host memory with [sys::CU_MEMHOSTALLOC_WRITECOMBINED] flags.
    ///
    /// See [cuda docs](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9)
    ///
    /// # Safety
    /// 1. This is unsafe because the memory is unset after this call.
    pub unsafe fn alloc_pinned<T: DeviceRepr>(
        self: &Arc<Self>,
        len: usize,
    ) -> Result<PinnedHostSlice<T>, DriverError> {
        self.bind_to_thread()?;
        let ptr = result::malloc_host(
            len * std::mem::size_of::<T>(),
            sys::CU_MEMHOSTALLOC_WRITECOMBINED,
        )?;
        let ptr = ptr as *mut T;
        assert!(!ptr.is_null());
        assert!(len * std::mem::size_of::<T>() < isize::MAX as usize);
        assert!(ptr.is_aligned());
        let event = self.new_event(Some(sys::CUevent_flags::CU_EVENT_BLOCKING_SYNC))?;
        Ok(PinnedHostSlice { ptr, len, event })
    }
}

Write-combined pinned memory is useful when the CPU primarily writes the buffer
and the data is then transferred to the device. CPU reads from it perform poorly,
however: CUDA's documentation notes that reading write-combined host memory from
the CPU is prohibitively slow.

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9

Because pinned buffers can have different host access patterns, selecting
CU_MEMHOSTALLOC_WRITECOMBINED unconditionally makes PinnedHostSlice unsuitable
for cases where CPU reads matter.

Proposal

Provide an API that allows callers to select the flags used for pinned host
allocations while preserving existing behavior for current callers.

For example:

pub unsafe fn alloc_pinned_with_flags<T: DeviceRepr>(
    self: &Arc<Self>,
    len: usize,
    flags: u32,
) -> Result<PinnedHostSlice<T>, DriverError>;

The existing method could continue using write-combined memory for backwards compatibility:

pub unsafe fn alloc_pinned<T: DeviceRepr>(
    self: &Arc<Self>,
    len: usize,
) -> Result<PinnedHostSlice<T>, DriverError> {
    self.alloc_pinned_with_flags(len, sys::CU_MEMHOSTALLOC_WRITECOMBINED)
}
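
With such an API, a caller could derive the flags from how the host will access the buffer. The following is only a minimal sketch of that idea, not part of cudarc: `HostAccess` and `pinned_flags` are hypothetical names introduced for illustration, and the constants are redeclared locally (with the values cuda.h gives them) so the snippet compiles without the CUDA bindings.

```rust
// Flag bits accepted by cuMemHostAlloc, with the values defined in cuda.h.
// Redeclared here only so the sketch is self-contained; real code would use
// the constants exposed by cudarc's `sys` module instead.
pub const CU_MEMHOSTALLOC_PORTABLE: u32 = 0x01;
pub const CU_MEMHOSTALLOC_DEVICEMAP: u32 = 0x02;
pub const CU_MEMHOSTALLOC_WRITECOMBINED: u32 = 0x04;

/// Hypothetical description of how the CPU will touch a pinned buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum HostAccess {
    /// The CPU only writes the buffer (host -> device staging).
    WriteOnly,
    /// The CPU reads the buffer, e.g. results copied back from the device.
    ReadWrite,
}

/// Hypothetical helper: choose the flags to pass to the proposed
/// `alloc_pinned_with_flags`, based on the host access pattern.
pub fn pinned_flags(access: HostAccess) -> u32 {
    match access {
        // Write-combined memory suits buffers the CPU never reads.
        HostAccess::WriteOnly => CU_MEMHOSTALLOC_WRITECOMBINED,
        // No flags: ordinary cacheable pinned memory, so CPU reads stay fast.
        HostAccess::ReadWrite => 0,
    }
}
```

A readback buffer would then be requested as `ctx.alloc_pinned_with_flags::<f32>(len, pinned_flags(HostAccess::ReadWrite))`, assuming the proposed method exists, while upload-only staging buffers keep the current write-combined behavior.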
