Commit 21d0d1c

Add ManagedAllocator support

Warp's CUDA allocation controls did not expose an explicit way to opt into managed memory, and allocation provenance was not visible to access checks or user code.

Add ManagedAllocator, AllocationKind, and array.allocation_kind so managed arrays can be created intentionally and distinguished from default, mempool, custom, and externally wrapped allocations.

Managed allocations now use cudaMallocManaged outside CUDA graph capture and are rejected during capture. This keeps the initial API focused on explicit managed memory while deferring CUDA 13 managed-pool allocation support. The follow-up can address graph capture and free-ordering risks in a smaller change.

Update access validation, IPC rejection, docs, and tests for the new allocation kind. Default CUDA allocation behavior is unchanged.

Signed-off-by: Eric Shi <ershi@nvidia.com>

1 parent 2e8482b commit 21d0d1c

16 files changed

Lines changed: 1221 additions & 249 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -12,6 +12,10 @@
 - Extend AddressSanitizer support to JIT-compiled CPU kernels: when `warp-clang` is built with `--sanitize=address`, CPU
   kernels are automatically instrumented and share the host's single in-process ASan runtime, so out-of-bounds accesses
   into a `wp.array` are reported as `heap-buffer-overflow` ([GH-1387](https://github.com/NVIDIA/warp/issues/1387)).
+- Add `wp.ManagedAllocator()` for explicit CUDA managed-memory arrays. CPU kernels can use managed arrays as an
+  opt-in path to read and write CUDA data directly on systems where CUDA reports compatible managed-memory access,
+  while ordinary Warp CUDA arrays still need explicit CPU copies. Preallocated managed arrays work in CUDA graph
+  captures, but capture-time allocation is a current limitation ([GH-1523](https://github.com/NVIDIA/warp/issues/1523)).
 
 ### Removed

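The graph-capture note in the changelog entry above can be sketched roughly as follows. This is a minimal illustration, not code from the commit: the helper name `capture_with_preallocated` is hypothetical, the Warp module is passed in as the `wp` argument so the control flow can be exercised without a GPU, and the use of `wp.ScopedCapture` with its `graph` attribute is an assumption about Warp's existing capture API rather than something shown in this diff.

```python
def capture_with_preallocated(wp, device, kernel, n):
    """Allocate a managed array before capture, then record a launch that reuses it.

    `wp` is the Warp module (or a stand-in with the same attributes).
    Hypothetical helper for illustration only.
    """
    managed = wp.ManagedAllocator()

    # Allocate BEFORE capture begins: per the commit message, managed
    # allocations are rejected while CUDA graph capture is active.
    with wp.ScopedAllocator(device, managed):
        a = wp.zeros(n, dtype=wp.float32, device=device)

    # Inside the capture, only work that reuses the existing array is recorded.
    # ScopedCapture is assumed to expose the captured graph as `.graph`.
    with wp.ScopedCapture(device=device) as capture:
        wp.launch(kernel, dim=n, inputs=[a], device=device)

    return a, capture.graph
```

With the real library, the call would look like `capture_with_preallocated(wp, wp.get_device("cuda:0"), my_kernel, 1000)`, where `my_kernel` is a placeholder for any kernel taking a single `float32` array.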
design/hardware-coherent-memory-access.md

Lines changed: 338 additions & 76 deletions
Large diffs are not rendered by default.

design/pluggable-allocators.md

Lines changed: 1 addition & 1 deletion
@@ -231,7 +231,7 @@ Future solutions must provide enough allocation provenance for
 make the same conservative decisions they make for Warp-owned allocations. At a
 minimum, Warp needs to distinguish the owning device and memory class for
 allocations that participate in cross-device launch verification, including
-default CUDA device memory, CUDA memory pools, managed memory, pinned host
+CUDA malloc memory, CUDA memory pools, managed memory, pinned host
 memory, and allocator-defined external memory.
 
 Any future mechanism must remain backward compatible with simple custom

docs/api_reference/warp.rst

Lines changed: 2 additions & 0 deletions
@@ -384,7 +384,9 @@ CUDA Memory Management
     :nosignatures:
     :toctree: _generated
 
+    AllocationKind
     Allocator
+    ManagedAllocator
     ScopedAllocator
     ScopedMempool
     ScopedMempoolAccess

docs/deep_dive/allocators.rst

Lines changed: 121 additions & 0 deletions
@@ -311,6 +311,127 @@ For temporary allocator changes, use the :class:`ScopedAllocator` context manage
         a = wp.zeros(1000, dtype=wp.float32, device="cuda:0")
     # Original allocator is restored here
 
+.. _managed_memory_allocation_options:
+
+Managed Memory Allocator
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Managed memory is CUDA-managed storage that can be addressed from CPU and GPU
+code. CUDA Unified Memory manages page placement and migration, so pages may move
+between CPU and GPU memory as different processors touch them. Unlike pinned CPU
+memory, which remains host memory that a GPU may access through a host mapping,
+managed memory gives Warp arrays a different tradeoff from the other allocation
+options:
+
+.. list-table::
+    :header-rows: 1
+    :widths: 18 29 27 26
+
+    * - Allocation option
+      - Residency and migration
+      - CPU/GPU access
+      - Typical use
+    * - Default CUDA
+      - Device memory with no automatic CPU/GPU migration.
+      - CUDA kernels access it directly; CPU code uses explicit copies.
+      - General GPU arrays when CPU access is staged explicitly.
+    * - CUDA mempool
+      - Device memory from CUDA's stream-ordered pool, with no automatic CPU/GPU
+        migration.
+      - Same CPU/GPU access rules as default CUDA memory, with separate
+        memory-pool access controls for peer GPUs.
+      - Faster repeated CUDA allocations and graph-captured allocation when
+        supported.
+    * - Pinned CPU
+      - Host memory that does not migrate into device memory as an allocation.
+      - CPU code accesses it directly; CUDA devices with unified virtual
+        addressing can access it through a host mapping.
+      - Asynchronous CPU/GPU copies or zero-copy access to small host-resident
+        data.
+    * - CUDA managed
+      - CUDA Unified Memory whose pages may migrate between CPU and GPU memory.
+      - CPU and GPU access follow CUDA managed-memory support and synchronization
+        rules.
+      - Sharing data across CPU/GPU code when migration is preferable to manual
+        copies.
+
+:class:`ManagedAllocator` creates CUDA managed-memory arrays through Warp's
+allocator interface. Managed arrays keep their CUDA device metadata, but
+``wp.can_access()`` and checked launch validation use CUDA managed-memory access
+rules for them instead of peer-access or memory-pool-access rules.
+
+One major reason to choose this allocator is CPU/GPU shared work: on systems
+where CUDA reports compatible managed-memory access, CPU kernels can directly
+read and write managed CUDA arrays instead of maintaining a separate CPU copy.
+Standard Warp CUDA arrays remain non-managed and still require explicit copies
+before CPU code accesses them.
+
+The allocator object is not bound to one CUDA device and can be constructed
+before choosing a CUDA device. Warp invokes it under the target device's CUDA
+context, which must support CUDA managed memory, and records that context as
+the owner for each pointer:
+
+.. code:: python
+
+    managed = wp.ManagedAllocator()
+    device = wp.get_device("cuda:0")
+
+    with wp.ScopedAllocator(device, managed):
+        a = wp.zeros(1000, dtype=wp.float32, device=device)
+
+Constructing a :class:`ManagedAllocator` does not promise that pages initially
+reside in any device's physical memory, and it does not bypass the device's
+managed-memory capability check. The CUDA device used for each allocation
+identifies the owner context and array device metadata; CUDA Unified Memory
+manages physical placement and migration.
+
+Use :attr:`array.allocation_kind <warp.array.allocation_kind>` to inspect Warp's
+verified allocation provenance:
+
+.. code:: python
+
+    if a.allocation_kind is wp.AllocationKind.CUDA_MANAGED:
+        ...
+
+The allocation kind describes how Warp believes the storage was allocated. It
+does not describe the current physical residency of CUDA managed memory, and
+views report the allocation kind of their owner array.
+
+To use managed memory as a persistent allocator for all CUDA devices, install one
+allocator instance with :func:`set_cuda_allocator`:
+
+.. code:: python
+
+    managed = wp.ManagedAllocator()
+    wp.set_cuda_allocator(managed)
+
+If only some CUDA devices should use managed memory, install the same allocator
+with :func:`set_device_allocator` on those devices. A single allocator instance
+can serve multiple CUDA devices, but allocation fails clearly on any target
+device that does not report CUDA managed-memory support.
+
+Direct calls to ``ManagedAllocator.allocate()`` require an active CUDA context.
+Array factory functions such as :func:`zeros` and :func:`empty` pass the target
+device context automatically and perform the same managed-memory support check.
+
+Managed allocations currently have a CUDA graph-capture limitation in Warp:
+:class:`ManagedAllocator` does not allocate a new array while CUDA graph capture
+is active. If you need managed arrays with CUDA graphs, allocate them before
+capture begins and reuse the existing arrays inside the captured work. This is
+an implementation limitation, not a restriction on using pre-existing managed
+arrays in captured work.
+
+CPU access to managed arrays is hardware-dependent. Use :func:`can_access` to
+check a specific managed array before CPU code reads or writes it directly:
+
+.. code:: python
+
+    if wp.can_access("cpu", a):
+        wp.launch(cpu_kernel, dim=a.size, inputs=[a], device="cpu")
+    else:
+        a_cpu = a.to("cpu")
+        wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu")
+
 Writing a Custom Allocator
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

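The `can_access` fallback shown at the end of the new documentation section could be wrapped in a small helper along these lines. This is a sketch under stated assumptions, not part of the commit: the name `launch_on_cpu` is hypothetical, and the Warp module is passed in as `wp` so the branching can be tested with stand-in objects. It relies only on the calls that appear in the diff (`wp.can_access`, `array.to`, `wp.launch`).

```python
def launch_on_cpu(wp, kernel, a):
    """Launch `kernel` on the CPU over `a`, copying only when needed.

    `wp` is the Warp module (or a stand-in with the same attributes).
    Returns the array the kernel actually ran on. Hypothetical helper.
    """
    if wp.can_access("cpu", a):
        # CPU code may read and write the array in place, with no staging copy.
        target = a
    else:
        # Direct CPU access is not available: fall back to an explicit CPU copy.
        target = a.to("cpu")

    wp.launch(kernel, dim=target.size, inputs=[target], device="cpu")
    return target
```

Note that in the fallback branch the kernel writes to the CPU copy, not to the original array, so callers that need the results on the original device would have to copy them back themselves.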