Atomics: configurable scope (for multi-device unified memory) #2619

@maleadt

Description

We should investigate whether our current atomics are functional when used on unified memory that is accessed from different devices (they probably aren't). In CUDA C, this requires the _system-suffixed atomic functions, e.g., atomicAdd_system, which change the synchronization scope. Quoting from https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity:

Atomic APIs with _system suffix (example: atomicAdd_system) are atomic at scope cuda::thread_scope_system if they meet particular conditions.

Atomic APIs without a suffix (example: atomicAdd) are atomic at scope cuda::thread_scope_device.

Atomic APIs with _block suffix (example: atomicAdd_block) are atomic at scope cuda::thread_scope_block.
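For concreteness, the CUDA C side looks roughly like this; a minimal sketch (the kernel and pointer names are made up) showing the three scope variants side by side:

```cuda
// Sketch only: the three scope-suffixed variants quoted above.
// `counter` would point at managed (unified) memory, e.g. from
// cudaMallocManaged; the _block/_system variants need sm_60+.
__global__ void bump(unsigned int *counter) {
    atomicAdd_block(counter, 1u);   // atomic w.r.t. threads in this block only
    atomicAdd(counter, 1u);         // atomic w.r.t. all threads on this GPU
    atomicAdd_system(counter, 1u);  // atomic w.r.t. other GPUs and CPU threads
}
```

This is what we'd want CUDA.jl's `@atomic` (or an extension of it) to be able to express via some scope argument.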

Note that system scope atomics have additional requirements. Quoting https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity:

An atomic operation is atomic at the scope it specifies if:

  • it specifies a scope other than thread_scope_system, or
  • the scope is thread_scope_system and:
    • it affects an object in system allocated memory and pageableMemoryAccess is 1 [0], or
    • it affects an object in managed memory and concurrentManagedAccess is 1, or
    • it affects an object in mapped memory and hostNativeAtomicSupported is 1, or
    • it is a load or store that affects a naturally-aligned object of sizes 1, 2, 4, 8, or 16 bytes on mapped memory [1], or
    • it affects an object in GPU memory, only GPU threads access it, and
      • p2pNativeAtomicSupported between each accessing GPU and the GPU where the object resides is 1, or
      • only GPU threads from a single GPU concurrently access it.

[0] If PageableMemoryAccessUsesHostPagetables is 0 then atomic operations to memory mapped file or hugetlbfs allocations are not atomic.
[1] If hostNativeAtomicSupported is 0, atomic load or store operations at system scope that affect a naturally-aligned 16-byte wide object in unified memory or mapped memory require system support. NVIDIA is not aware of any system that lacks this support and there is no CUDA API query available to detect such systems.
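Most of these preconditions map to device attributes that can be queried at runtime (the per-GPU-pair p2pNativeAtomicSupported goes through cudaDeviceGetP2PAttribute instead). A host-side sketch of what such a capability check could look like, untested here:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: query the attributes named in the conditions above, for device 0.
int main() {
    int pageable = 0, managed = 0, hostAtomics = 0, hostPageTables = 0;
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);
    cudaDeviceGetAttribute(&managed, cudaDevAttrConcurrentManagedAccess, 0);
    cudaDeviceGetAttribute(&hostAtomics, cudaDevAttrHostNativeAtomicSupported, 0);
    cudaDeviceGetAttribute(&hostPageTables,
                           cudaDevAttrPageableMemoryAccessUsesHostPageTables, 0);
    printf("pageableMemoryAccess=%d concurrentManagedAccess=%d "
           "hostNativeAtomicSupported=%d "
           "pageableMemoryAccessUsesHostPageTables=%d\n",
           pageable, managed, hostAtomics, hostPageTables);
    return 0;
}
```

CUDA.jl already wraps these attributes, so a scoped-atomics implementation could fall back to (or error on) unsupported configurations based on the same queries.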

So: lots of gotchas. Still, we should probably provide a way to alter the scope of an atomic operation. This requires:

  • figuring out exactly what additional configurability is needed
  • inspecting the PTX code generated by nvcc
  • identifying whether LLVM supports these through native atomics, NVVM intrinsics, or neither (in which case we'll need to use inline PTX assembly)

I won't have the time to look at this anytime soon, so if anybody wants to help out, gathering all that information and reporting here would be a good first step.

Metadata

Assignees
No one assigned

Labels
cuda kernels (Stuff about writing CUDA kernels), help wanted (Extra attention is needed)

Projects
No projects

Milestone
No milestone

Relationships
None yet

Development
No branches or pull requests