
Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations.#2998

Closed
carlobertolli wants to merge 1 commit intoROCm:developfrom
carlobertolli:fix_16bit_atomics.rocm

Conversation

@carlobertolli

This patch was triggered by a failure observed in ROCm GPU ASAN support. It happens when we attempt an atomic operation on the last element of a tensor and the element address has the correct alignment. Normally, we would use a 4-byte read plus an atomic CAS loop on 4 bytes. However, in this case, we cannot backtrack 2 bytes before the last element, because that would mean a misaligned atomic, resulting in a GPU memory error. The solution is to read 2 bytes (change unsigned int to short) and then use an atomicCAS loop over the 2-byte data type of the last element in the tensor.
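The two strategies can be sketched on the host with std::atomic; this is illustrative only (the real kernel uses HIP atomicCAS on the device, and the function names here are not PyTorch's actual helpers). The sketch assumes a little-endian target and tolerates the reinterpret_cast as a demonstration shortcut:

```cpp
#include <atomic>
#include <cstdint>

// 4-byte emulation: align the address down to a 4-byte boundary, then
// read-modify-write only the 16 bits that belong to our element. For a
// 4-byte-aligned last element, this word extends 2 bytes past the tensor.
inline void add_u16_via_u32_cas(uint16_t* addr, uint16_t val) {
    auto* word = reinterpret_cast<std::atomic<uint32_t>*>(
        reinterpret_cast<uintptr_t>(addr) & ~uintptr_t{3});
    // Which half of the 32-bit word holds our element (little-endian).
    unsigned shift = (reinterpret_cast<uintptr_t>(addr) & 2) ? 16 : 0;
    uint32_t mask = uint32_t{0xFFFF} << shift;
    uint32_t old = word->load();
    uint32_t updated;
    do {
        uint16_t cur = static_cast<uint16_t>(old >> shift);
        uint16_t res = static_cast<uint16_t>(cur + val);
        updated = (old & ~mask) | (uint32_t{res} << shift);
        // compare_exchange_weak reloads `old` on failure, so we retry
        // with the freshly observed word.
    } while (!word->compare_exchange_weak(old, updated));
}

// 2-byte path (the fix): CAS directly on the 16-bit location, so the
// operation never touches the neighboring 2 bytes.
inline void add_u16_direct(uint16_t* addr, uint16_t val) {
    auto* elem = reinterpret_cast<std::atomic<uint16_t>*>(addr);
    uint16_t old = elem->load();
    while (!elem->compare_exchange_weak(
        old, static_cast<uint16_t>(old + val))) {
    }
}
```

The point of the fix is visible in the second function: the CAS operand width matches the element width, so no neighboring word is read or written.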

@rocm-repo-management-api

rocm-repo-management-api Bot commented Feb 24, 2026

Jenkins build for 2656a7b519637087580b41e95f38aae408b14727 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@carlobertolli
Author

Adding more info as I keep studying this problem. LLVM actually expands the 2-byte atomic to a 4-byte one during code generation, as that is the only legal hardware instruction available. Code generation aligns the input address down to a 4-byte boundary; in the PyTorch example, the address of element 14 (the last one) is already 4-byte aligned, so the address is unchanged, and the atomic cmpswp in the generated code will actually read and write 2 bytes past the end of the array.
There are two observations on why this works:

  • ASAN doesn't catch the atomic cmpswp out-of-bounds error because ASAN instrumentation happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen (ASAN instrumentation runs before code generation).
  • We do not see the GPU memory error because we are effectively reading the red zone and writing the same 2 bytes back into it.

So there's still an issue here after atomic legalization, but the test passes with and without ASAN. Without ASAN, it's likely that hipMalloc pads the allocation to a larger size, or that we are touching (without modifying) some other application data. As long as that data is not read-only, we won't see the issue.
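The arithmetic behind the out-of-bounds access above can be checked directly. This is a sketch of the scenario described in the comment (15 two-byte elements; the base address and helper name are assumptions for illustration), returning how many bytes the legalized 4-byte access overshoots the allocation:

```cpp
#include <cstddef>
#include <cstdint>

// For a 4-byte-aligned base, the last of n two-byte elements starts at
// offset (n - 1) * 2. When that offset is itself 4-byte aligned, the
// legalizer's align-down is a no-op and the 4-byte cmpswp spans 2 bytes
// past the end of the n * 2 byte allocation.
inline size_t cas_overshoot_bytes(uintptr_t base, size_t elem_size,
                                  size_t n_elems) {
    uintptr_t last = base + (n_elems - 1) * elem_size;  // last element
    uintptr_t aligned = last & ~uintptr_t{3};           // align down to 4
    uintptr_t access_end = aligned + 4;                 // 4-byte atomic span
    uintptr_t alloc_end = base + n_elems * elem_size;   // end of tensor
    return access_end > alloc_end ? access_end - alloc_end : 0;
}
```

With 15 elements (the element-14 case from the comment), the overshoot is 2 bytes; with an even element count, the last element sits in the upper half of an in-bounds word and there is no overshoot.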

This patch was triggered by a failure observed in ROCm GPU ASAN support.
It happens when we attempt an atomic operation on the last element
of a tensor and the element address has the correct alignment.
Normally, we would use a 4-byte read plus an atomic CAS loop on 4 bytes.
However, in this case, we cannot backtrack 2 bytes before the last element,
because that would mean a misaligned atomic, resulting in a GPU memory error.
The solution is to read 2 bytes (change unsigned int to short) and
then use an atomicCAS loop over the 2-byte data type of the last element in the tensor.

On AMD GPUs, LLVM expands the 2-byte atomic to a 4-byte one during code gen, as it is
the only legal hw instruction available. Code generation aligns the input
address down to a 4-byte boundary. In the PyTorch example, the address of
element 14 (the last one) is already 4-byte aligned, so the address is unchanged and
the atomic cmpswp in the generated code will actually read and write 2 bytes past
the end of the array.

There are two observations on why this works:

1. ASAN doesn't catch the atomic cmpswp out-of-bounds error because ASAN instrumentation
happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen
(ASAN instrumentation runs before code generation).
2. We do not see the GPU memory error because we are effectively reading the red zone and writing
the same 2 bytes back into it. So there's still an issue here after atomic legalization,
but the test passes with and without ASAN. Without ASAN, it's likely that hipMalloc pads the allocation
to a larger size, or that we are touching (without modifying) some other application data.
As long as that data is not read-only, we won't see the issue.
@carlobertolli
Author

I updated the PR description with the new findings. It does not seem to show in the web interface, but it shows correctly when using git log.

@rocm-repo-management-api

rocm-repo-management-api Bot commented Feb 24, 2026

Jenkins build for 967df8da8ee60383c2695f79187d6ec9e725c32d commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@carlobertolli
Author

This patch has a bug: it manipulates the read value as if it were 4 bytes when, in fact, we read only 2 bytes.
Fixing it currently shows ~10% performance regressions on index_add, which I am investigating.
In any case, if we are to do anything in AtomicFPOps, it will go to trunk directly.
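A minimal sketch of the bug described above (the function names are hypothetical, not the actual AtomicFPOps code): once the read is narrowed to 2 bytes, the value must no longer be shifted and masked as if it sat inside a 4-byte word.

```cpp
#include <cstdint>

// Leftover 4-byte logic applied to a 2-byte read: for an element in the
// upper half of its containing word, the shift-by-16 discards the entire
// 16-bit value, leaving 0.
inline uint16_t extract_as_if_4_bytes(uint16_t read, uintptr_t addr) {
    unsigned shift = (addr & 2) ? 16 : 0;
    return static_cast<uint16_t>(static_cast<uint32_t>(read) >> shift);
}

// Correct 2-byte logic: the 16-bit read already is the element, so no
// shift or mask is needed at all.
inline uint16_t extract_as_2_bytes(uint16_t read, uintptr_t /*addr*/) {
    return read;
}
```

This also suggests why the bug could hide: for elements in the lower half of their word (shift of 0), the wrong and right versions agree.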
