@@ -311,6 +311,127 @@ For temporary allocator changes, use the :class:`ScopedAllocator` context manage
311311         a = wp.zeros(1000, dtype=wp.float32, device="cuda:0")
312312     # Original allocator is restored here
313313
314+ .. _managed_memory_allocation_options:
315+
316+ Managed Memory Allocator
317+ ~~~~~~~~~~~~~~~~~~~~~~~~
318+
319+ Managed memory is CUDA-managed storage that can be addressed from both CPU and
320+ GPU code. CUDA Unified Memory handles page placement and migration, so pages may
321+ move between CPU and GPU memory as different processors touch them. Pinned CPU
322+ memory, by contrast, always remains host memory that a GPU may access through a
323+ host mapping. The table below compares managed memory with the other allocation
324+ options:
325+
326+ .. list-table::
327+    :header-rows: 1
328+    :widths: 18 29 27 26
329+
330+    * - Allocation option
331+      - Residency and migration
332+      - CPU/GPU access
333+      - Typical use
334+    * - Default CUDA
335+      - Device memory with no automatic CPU/GPU migration.
336+      - CUDA kernels access it directly; CPU code uses explicit copies.
337+      - General GPU arrays when CPU access is staged explicitly.
338+    * - CUDA mempool
339+      - Device memory from CUDA's stream-ordered pool, with no automatic CPU/GPU
340+        migration.
341+      - Same CPU/GPU access rules as default CUDA memory, with separate
342+        memory-pool access controls for peer GPUs.
343+      - Faster repeated CUDA allocations and graph-captured allocation when
344+        supported.
345+    * - Pinned CPU
346+      - Host memory that does not migrate into device memory as an allocation.
347+      - CPU code accesses it directly; CUDA devices with unified virtual
348+        addressing can access it through a host mapping.
349+      - Asynchronous CPU/GPU copies or zero-copy access to small host-resident
350+        data.
351+    * - CUDA managed
352+      - CUDA Unified Memory whose pages may migrate between CPU and GPU memory.
353+      - CPU and GPU access follow CUDA managed-memory support and synchronization
354+        rules.
355+      - Sharing data across CPU/GPU code when migration is preferable to manual
356+        copies.
357+
358+ :class:`ManagedAllocator` creates CUDA managed-memory arrays through Warp's
359+ allocator interface. Managed arrays keep their CUDA device metadata, but
360+ ``wp.can_access()`` and checked launch validation use CUDA managed-memory access
361+ rules for them instead of peer-access or memory-pool-access rules.
362+
363+ One major reason to choose this allocator is CPU/GPU shared work: on systems
364+ where CUDA reports compatible managed-memory access, CPU kernels can directly
365+ read and write managed CUDA arrays instead of maintaining a separate CPU copy.
366+ Standard Warp CUDA arrays remain non-managed and still require explicit copies
367+ before CPU code accesses them.
368+
369+ The allocator object is not bound to one CUDA device and can be constructed
370+ before choosing a CUDA device. Warp invokes it under the target device's CUDA
371+ context, which must support CUDA managed memory, and records that context as
372+ the owner for each pointer:
373+
374+ .. code:: python
375+
376+     managed = wp.ManagedAllocator()
377+     device = wp.get_device("cuda:0")
378+
379+     with wp.ScopedAllocator(device, managed):
380+         a = wp.zeros(1000, dtype=wp.float32, device=device)
381+
382+ Constructing a :class:`ManagedAllocator` does not promise that pages initially
383+ reside in any device's physical memory, and it does not bypass the device's
384+ managed-memory capability check. The CUDA device used for each allocation
385+ identifies the owner context and array device metadata; CUDA Unified Memory
386+ manages physical placement and migration.
387+
388+ Use :attr:`array.allocation_kind <warp.array.allocation_kind>` to inspect Warp's
389+ verified allocation provenance:
390+
391+ .. code:: python
392+
393+     if a.allocation_kind is wp.AllocationKind.CUDA_MANAGED:
394+         ...
395+
396+ The allocation kind describes how Warp believes the storage was allocated. It
397+ does not describe the current physical residency of CUDA managed memory, and
398+ views report the allocation kind of their owner array.
399+
400+ To use managed memory as a persistent allocator for all CUDA devices, install one
401+ allocator instance with :func:`set_cuda_allocator`:
402+
403+ .. code:: python
404+
405+     managed = wp.ManagedAllocator()
406+     wp.set_cuda_allocator(managed)
407+
408+ If only some CUDA devices should use managed memory, install the same allocator
409+ with :func:`set_device_allocator` on those devices. A single allocator instance
410+ can serve multiple CUDA devices, but allocation fails clearly on any target
411+ device that does not report CUDA managed-memory support.
412+
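The per-device installation described above could be sketched as follows. This is a minimal sketch under stated assumptions: ``install_managed`` is a hypothetical helper name, ``set_device_allocator`` is assumed to take ``(device, allocator)`` in the same order as ``ScopedAllocator``, and an unsupported device is assumed to raise an ordinary Python exception when a managed allocation is attempted. The Warp module is passed in as a parameter only so that the sketch can be imported and exercised on a machine without a GPU.

```python
def install_managed(wp, devices):
    """Install one ManagedAllocator on every device that accepts it.

    Returns (allocator, accepted_devices, rejected_devices).
    """
    managed = wp.ManagedAllocator()  # not bound to any CUDA device yet
    accepted, rejected = [], []
    for device in devices:
        try:
            # Probe with a tiny allocation under a temporary allocator.
            with wp.ScopedAllocator(device, managed):
                wp.zeros(1, dtype=wp.float32, device=device)
        except Exception:  # assumption: unsupported devices raise here
            rejected.append(device)
            continue
        # Assumed signature: set_device_allocator(device, allocator).
        wp.set_device_allocator(device, managed)
        accepted.append(device)
    return managed, accepted, rejected
```

With Warp imported as ``wp``, a call such as ``install_managed(wp, wp.get_cuda_devices())`` would leave the rejected devices on their existing allocator.
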
413+ Direct calls to ``ManagedAllocator.allocate()`` require an active CUDA context.
414+ Array factory functions such as :func:`zeros` and :func:`empty` pass the target
415+ device context automatically and perform the same managed-memory support check.
416+
417+ Managed allocations currently have a CUDA graph-capture limitation in Warp:
418+ :class:`ManagedAllocator` does not allocate a new array while CUDA graph capture
419+ is active. If you need managed arrays with CUDA graphs, allocate them before
420+ capture begins and reuse the existing arrays inside the captured work. This is
421+ an implementation limitation, not a restriction on using pre-existing managed
422+ arrays in captured work.
423+
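The allocate-before-capture pattern described above might look like the sketch below. ``capture_with_managed`` is a hypothetical helper name, and the kernel is assumed to take a single ``float32`` array argument. As a testing convenience, the Warp module is passed in as a parameter so the sketch does not need a GPU at import time.

```python
def capture_with_managed(wp, kernel, n, device="cuda:0"):
    """Allocate a managed array first, then capture launches that reuse it."""
    managed = wp.ManagedAllocator()

    # 1. Allocate while no graph capture is active.
    with wp.ScopedAllocator(device, managed):
        a = wp.zeros(n, dtype=wp.float32, device=device)

    # 2. Capture only the work; no managed allocation happens in this scope.
    with wp.ScopedCapture(device=device) as capture:
        wp.launch(kernel, dim=n, inputs=[a], device=device)

    return a, capture.graph
```

The returned graph can then be replayed with ``wp.capture_launch(graph)``, and every replay operates on the same pre-allocated managed array.
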
424+ CPU access to managed arrays is hardware-dependent. Use :func:`can_access` to
425+ check a specific managed array before CPU code reads or writes it directly:
426+
427+ .. code:: python
428+
429+     if wp.can_access("cpu", a):
430+         wp.launch(cpu_kernel, dim=a.size, inputs=[a], device="cpu")
431+     else:
432+         a_cpu = a.to("cpu")
433+         wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu")
434+
314435 Writing a Custom Allocator
315436~~~~~~~~~~~~~~~~~~~~~~~~~~
316437