Skip to content

Conversation

@ikbuibui
Copy link
Contributor

Adds memory ordering tags.
Defines a trait to get the default memory orders for fences for each backend.
Adds ability for the user to optionally specify a memory ordering for mem fences.

@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch from d10deae to 7366b27 Compare November 27, 2025 10:29
@ikbuibui
Copy link
Contributor Author

ikbuibui commented Nov 27, 2025

I dont define a consume memory ordering at all, since from what I know it is not well defined (broken) and deprecated in C++26 anyways and everyone just implements it as acquire to be safe and correct.
https://en.cppreference.com/w/cpp/atomic/memory_order.html#Release-Consume_ordering

@ikbuibui
Copy link
Contributor Author

The implementation for HIP comes from here

@ikbuibui ikbuibui marked this pull request as draft November 27, 2025 10:54
@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch 5 times, most recently from a9d0f14 to 42b4208 Compare November 27, 2025 13:45
@ikbuibui
Copy link
Contributor Author

ikbuibui commented Nov 27, 2025

CUDA memory orders are introduced with this PR but these are available only from CUDA 12.8, so #2574 would be nice to test this

@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch 10 times, most recently from ff7bdd1 to 3610bdc Compare December 2, 2025 09:34
@ikbuibui ikbuibui marked this pull request as ready for review December 2, 2025 14:33
@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch from 3610bdc to 4557fb1 Compare December 4, 2025 12:15
@ikbuibui ikbuibui marked this pull request as draft December 4, 2025 12:20
@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch 3 times, most recently from 913aa75 to 10d4ea4 Compare December 4, 2025 14:41
@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch 3 times, most recently from eae0eab to 613aca5 Compare December 11, 2025 13:17
@ikbuibui
Copy link
Contributor Author

ikbuibui commented Dec 11, 2025

Inspected PTX and confirmed that this works for CUDA>=12.8.
With alpaka::mem_fence(acc, alpaka::mem_order::acq_rel, alpaka::memory_scope::Block{});. we get a fence.acq_rel.cta; instead of a fence.sc.cta; we have otherwise

@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch 5 times, most recently from acfdd08 to bf20f0d Compare December 12, 2025 16:55
@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch 2 times, most recently from fc4bc25 to e731eda Compare December 12, 2025 17:27
@fwyzard fwyzard added this to the 2.2.0 milestone Dec 15, 2025
@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch 6 times, most recently from 31c4744 to b32b0f5 Compare December 19, 2025 12:29
@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch from b32b0f5 to f519df8 Compare December 19, 2025 13:03
@ikbuibui ikbuibui mentioned this pull request Dec 19, 2025
2 tasks
Comment on lines +14 to +15
* The user requested memory order may be converted to a stronger memory order guarantee if the backend does
* not support the requested memory ordering
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* The user requested memory order may be converted to a stronger memory order guarantee if the backend does
* not support the requested memory ordering

Though in practice this is never the case ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhm, I see that it is done at the implementation level, not in the backend-specific tags ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is done at the implementation level as it depends on details of the backend.


#include <concepts>

#ifdef ALPAKA_ACC_GPU_HIP_ENABLED
Copy link
Contributor

@fwyzard fwyzard Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the HIP documentation, I think this may require ROCm 6.4.0 (based on clang 19).
ROCm 6.4.0 mentions __builtin_amdgcn_fence in the docs, while ROCm 6.3.3 does not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However looking at the HIP code, it was available much earlier, looks like it was simply undocumented.

{
};

struct AcqRel : MemoryOrderTag
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea, what if we spelled the full AcquireRelease and SequentialConsistency ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to it with no particular preference. I followed the STL naming since it was concise. If you prefer the full names let me know and I'll update them.

@fwyzard
Copy link
Contributor

fwyzard commented Dec 20, 2025

I'm wondering if using an enum for the memory spaces wouldn't be simpler than using tags and a concept ?

@fwyzard
Copy link
Contributor

fwyzard commented Dec 20, 2025

Now I'm wondering what happens if one tries to use an acquire-release or sequentially-consistent thread-level fence to order with a block-level fence ?

I don't know why one would do it, but in principle they could.

However this could silently fail in some backends where either thread-level or block-level fences are "skipped" ?

template<>
struct MemFence<MemFenceUniformCudaHipBuiltIn, memory_scope::Block>
template<alpaka::MemoryOrder TMemOrder>
[[maybe_unused]] static constexpr __device__ void cuda_ptx_fence_device([[maybe_unused]] TMemOrder order)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use cuda::atomic::atomic_thread_fence(...) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would have been ideal, but I was trying to avoid requiring libcu++

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't insist on using it, but if it's available, does the job, and does not create licensing or installation problems (all details to be discussed) maybe we should consider making use of it ?

@fwyzard
Copy link
Contributor

fwyzard commented Dec 20, 2025

@ikbuibui overall this looks very good !

Two questions:

  • what do you think is missing (since it is in draft mode) ?
  • I didn't check myself; do we have a reasonable set of tests for the fence operations ? if not, is it something you would be interested / willing to implement ?

@ikbuibui
Copy link
Contributor Author

I'm wondering if using an enum for the memory spaces wouldn't be simpler than using tags and a concept ?

Yes, that is also absolutely doable, and maybe simpler. I was comfortable doing it with tags so thats what I used :)

* what do you think is missing (since it is in draft mode) ? 
* I didn't check myself; do we have a reasonable set of tests for the fence operations ?   if not, is it something you would be interested / willing to implement ?

A combined answer to both was that I still want to look into the tests and haven't done so. In any case I'm not sure if there will be any reasonable way to test if the memory orderings are working as intended other than inspecting the generated code. If I don't come up with anything, I'll mark the PR as ready in the first week of next year.

Now I'm wondering what happens if one tries to use an acquire-release or sequentially-consistent thread-level fence to order with a block-level fence ?

However this could silently fail in some backends where either thread-level or block-level fences are "skipped" ?

This is a good question. I'll think about this a bit, but in any case this problem existed before this PR as well.

@ikbuibui ikbuibui force-pushed the memory_ordered_fence branch from 5e56706 to 8ca1266 Compare December 22, 2025 13:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants