Atomics: configurable scope (for multi-device unified memory) #2619

@maleadt

Description

We should investigate whether our current atomics are functional when used on unified memory that is accessed from different devices (they probably aren't). In CUDA C, this requires the _system-suffixed atomic functions, e.g., atomicAdd_system, which change the synchronization scope. Quoting from https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity:

Atomic APIs with _system suffix (example: atomicAdd_system) are atomic at scope cuda::thread_scope_system if they meet particular conditions.

Atomic APIs without a suffix (example: atomicAdd) are atomic at scope cuda::thread_scope_device.

Atomic APIs with _block suffix (example: atomicAdd_block) are atomic at scope cuda::thread_scope_block.
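For concreteness, the CUDA C side looks roughly like this; a minimal sketch (the kernel and pointer names are made up) showing the three scope variants side by side:

```cuda
// Sketch only: the three scope-suffixed variants quoted above.
// `counter` would point at managed (unified) memory, e.g. from
// cudaMallocManaged; the _block/_system variants need sm_60+.
__global__ void bump(unsigned int *counter) {
    atomicAdd_block(counter, 1u);   // atomic w.r.t. threads in this block only
    atomicAdd(counter, 1u);         // atomic w.r.t. all threads on this GPU
    atomicAdd_system(counter, 1u);  // atomic w.r.t. other GPUs and CPU threads
}
```

This is what we'd want CUDA.jl's `@atomic` (or an extension of it) to be able to express via some scope argument.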

Note that system scope atomics have additional requirements. Quoting https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity:

An atomic operation is atomic at the scope it specifies if:

  • it specifies a scope other than thread_scope_system, or
  • the scope is thread_scope_system and:
    • it affects an object in system allocated memory and pageableMemoryAccess is 1 [0], or
    • it affects an object in managed memory and concurrentManagedAccess is 1, or
    • it affects an object in mapped memory and hostNativeAtomicSupported is 1, or
    • it is a load or store that affects a naturally-aligned object of sizes 1, 2, 4, 8, or 16 bytes on mapped memory [1], or
    • it affects an object in GPU memory, only GPU threads access it, and
      • p2pNativeAtomicSupported between each accessing GPU and the GPU where the object resides is 1, or
      • only GPU threads from a single GPU concurrently access it.

[0] If PageableMemoryAccessUsesHostPagetables is 0 then atomic operations to memory mapped file or hugetlbfs allocations are not atomic.
[1] If hostNativeAtomicSupported is 0, atomic load or store operations at system scope that affect a naturally-aligned 16-byte wide object in unified memory or mapped memory require system support. NVIDIA is not aware of any system that lacks this support and there is no CUDA API query available to detect such systems.
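Most of these preconditions map to device attributes that can be queried at runtime (the per-GPU-pair p2pNativeAtomicSupported goes through cudaDeviceGetP2PAttribute instead). A host-side sketch of what such a capability check could look like, untested here:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: query the attributes named in the conditions above, for device 0.
int main() {
    int pageable = 0, managed = 0, hostAtomics = 0, hostPageTables = 0;
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);
    cudaDeviceGetAttribute(&managed, cudaDevAttrConcurrentManagedAccess, 0);
    cudaDeviceGetAttribute(&hostAtomics, cudaDevAttrHostNativeAtomicSupported, 0);
    cudaDeviceGetAttribute(&hostPageTables,
                           cudaDevAttrPageableMemoryAccessUsesHostPageTables, 0);
    printf("pageableMemoryAccess=%d concurrentManagedAccess=%d "
           "hostNativeAtomicSupported=%d "
           "pageableMemoryAccessUsesHostPageTables=%d\n",
           pageable, managed, hostAtomics, hostPageTables);
    return 0;
}
```

CUDA.jl already wraps these attributes, so a scoped-atomics implementation could fall back to (or error on) unsupported configurations based on the same queries.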

So: lots of gotchas. Still, we should probably provide a way to alter the scope of an atomic operation. This requires:

  • figuring out exactly what additional configurability is needed
  • inspecting the PTX code generated by nvcc
  • identifying whether LLVM supports these through native atomics, NVVM intrinsics, or neither (in which case we'll need to use inline PTX assembly)

I won't have the time to look at this anytime soon, so if anybody wants to help out, gathering all that information and reporting here would be a good first step.

Metadata

Assignees
No one assigned

Labels
cuda kernels (Stuff about writing CUDA kernels), help wanted (Extra attention is needed)

Projects
No projects

Milestone
No milestone

Relationships
None yet

Development
No branches or pull requests