
Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations.#2998

Closed
carlobertolli wants to merge 1 commit intoROCm:developfrom
carlobertolli:fix_16bit_atomics.rocm

Conversation

@carlobertolli

This patch was triggered by a failure observed in ROCm GPU ASAN support. It happens when we attempt an atomic operation on the last element of a tensor and the element address has the correct alignment. Normally, we would use a 4-byte read plus an atomic CAS loop on 4 bytes. However, in this case, we cannot backtrack 2 bytes before the last element, because that would mean a misaligned atomic, resulting in a GPU memory error. The solution is to read 2 bytes (change unsigned int to short) and then use an atomicCAS loop over the 2-byte data type of the last element in the tensor.
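The two strategies can be sketched on the host with std::atomic; this is illustrative only (the real kernel uses HIP atomicCAS on the device, and the function names here are not PyTorch's actual helpers). The sketch assumes a little-endian target and tolerates the reinterpret_cast as a demonstration shortcut:

```cpp
#include <atomic>
#include <cstdint>

// 4-byte emulation: align the address down to a 4-byte boundary, then
// read-modify-write only the 16 bits that belong to our element. For a
// 4-byte-aligned last element, this word extends 2 bytes past the tensor.
inline void add_u16_via_u32_cas(uint16_t* addr, uint16_t val) {
    auto* word = reinterpret_cast<std::atomic<uint32_t>*>(
        reinterpret_cast<uintptr_t>(addr) & ~uintptr_t{3});
    // Which half of the 32-bit word holds our element (little-endian).
    unsigned shift = (reinterpret_cast<uintptr_t>(addr) & 2) ? 16 : 0;
    uint32_t mask = uint32_t{0xFFFF} << shift;
    uint32_t old = word->load();
    uint32_t updated;
    do {
        uint16_t cur = static_cast<uint16_t>(old >> shift);
        uint16_t res = static_cast<uint16_t>(cur + val);
        updated = (old & ~mask) | (uint32_t{res} << shift);
        // compare_exchange_weak reloads `old` on failure, so we retry
        // with the freshly observed word.
    } while (!word->compare_exchange_weak(old, updated));
}

// 2-byte path (the fix): CAS directly on the 16-bit location, so the
// operation never touches the neighboring 2 bytes.
inline void add_u16_direct(uint16_t* addr, uint16_t val) {
    auto* elem = reinterpret_cast<std::atomic<uint16_t>*>(addr);
    uint16_t old = elem->load();
    while (!elem->compare_exchange_weak(
        old, static_cast<uint16_t>(old + val))) {
    }
}
```

The point of the fix is visible in the second function: the CAS operand width matches the element width, so no neighboring word is read or written.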

@rocm-repo-management-api

rocm-repo-management-api Bot commented Feb 24, 2026

Jenkins build for 2656a7b519637087580b41e95f38aae408b14727 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@carlobertolli
Author

Adding more info as I keep studying this problem. LLVM actually expands the 2-byte atomic to a 4-byte one during code generation, as that is the only legal hardware instruction available. Code generation aligns the input address down to a 4-byte boundary; in the PyTorch example, the address of element 14 (the last one) is already 4-byte aligned, so the address is unchanged, and the atomic cmpswp in the generated code will actually read and write 2 bytes past the end of the array.
There are two observations on why this works:

  • ASAN doesn't catch the atomic cmpswp out-of-bounds error because ASAN instrumentation happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen (ASAN instrumentation runs before code generation).
  • We do not see the GPU memory error because we are effectively reading the red zone and writing the same 2 bytes back into it.

So there's still an issue here after atomic legalization, but the test passes with and without ASAN. Without ASAN, it's likely that hipMalloc pads the allocation to a larger size, or that we are touching (without modifying) some other application data. As long as that data is not read-only, we won't see the issue.
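The arithmetic behind the out-of-bounds access above can be checked directly. This is a sketch of the scenario described in the comment (15 two-byte elements; the base address and helper name are assumptions for illustration), returning how many bytes the legalized 4-byte access overshoots the allocation:

```cpp
#include <cstddef>
#include <cstdint>

// For a 4-byte-aligned base, the last of n two-byte elements starts at
// offset (n - 1) * 2. When that offset is itself 4-byte aligned, the
// legalizer's align-down is a no-op and the 4-byte cmpswp spans 2 bytes
// past the end of the n * 2 byte allocation.
inline size_t cas_overshoot_bytes(uintptr_t base, size_t elem_size,
                                  size_t n_elems) {
    uintptr_t last = base + (n_elems - 1) * elem_size;  // last element
    uintptr_t aligned = last & ~uintptr_t{3};           // align down to 4
    uintptr_t access_end = aligned + 4;                 // 4-byte atomic span
    uintptr_t alloc_end = base + n_elems * elem_size;   // end of tensor
    return access_end > alloc_end ? access_end - alloc_end : 0;
}
```

With 15 elements (the element-14 case from the comment), the overshoot is 2 bytes; with an even element count, the last element sits in the upper half of an in-bounds word and there is no overshoot.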

This patch was triggered by a failure observed in ROCm GPU ASAN support.
It happens when we attempt an atomic operation on the last element
of a tensor and the element address has the correct alignment.
Normally, we would use a 4-byte read plus an atomic CAS loop on 4 bytes.
However, in this case, we cannot backtrack 2 bytes before the last element,
because that would mean a misaligned atomic, resulting in a GPU memory error.
The solution is to read 2 bytes (change unsigned int to short) and
then use an atomicCAS loop over the 2-byte data type of the last element in the tensor.

On AMD GPUs, LLVM expands the 2-byte atomic to a 4-byte one during code gen, as it is
the only legal hw instruction available. Code generation aligns the input
address down to a 4-byte boundary. In the PyTorch example, the address of
element 14 (the last one) is already 4-byte aligned, so the address is unchanged and
the atomic cmpswp in the generated code will actually read and write 2 bytes past
the end of the array.

There are two observations on why this works:

1. ASAN doesn't catch the atomic cmpswp out-of-bounds error because ASAN instrumentation
happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen
(ASAN instrumentation runs before code generation).
2. We do not see the GPU memory error because we are effectively reading the red zone and writing
the same 2 bytes back into it. So there's still an issue here after atomic legalization,
but the test passes with and without ASAN. Without ASAN, it's likely that hipMalloc pads the allocation
to a larger size, or that we are touching (without modifying) some other application data.
As long as that data is not read-only, we won't see the issue.
@carlobertolli
Author

I updated the PR description with the new findings. It does not seem to show in the web interface, but it shows correctly when using git log.

@rocm-repo-management-api

rocm-repo-management-api Bot commented Feb 24, 2026

Jenkins build for 967df8da8ee60383c2695f79187d6ec9e725c32d commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@carlobertolli
Author

This patch has a bug: it manipulates the read value as if it were 4 bytes when, in fact, we read only 2 bytes.
Fixing it currently shows ~10% performance regressions on index_add, which I am investigating.
In any case, if we are to do anything in AtomicFPOps, it will go to trunk directly.
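A minimal sketch of the bug described above (the function names are hypothetical, not the actual AtomicFPOps code): once the read is narrowed to 2 bytes, the value must no longer be shifted and masked as if it sat inside a 4-byte word.

```cpp
#include <cstdint>

// Leftover 4-byte logic applied to a 2-byte read: for an element in the
// upper half of its containing word, the shift-by-16 discards the entire
// 16-bit value, leaving 0.
inline uint16_t extract_as_if_4_bytes(uint16_t read, uintptr_t addr) {
    unsigned shift = (addr & 2) ? 16 : 0;
    return static_cast<uint16_t>(static_cast<uint32_t>(read) >> shift);
}

// Correct 2-byte logic: the 16-bit read already is the element, so no
// shift or mask is needed at all.
inline uint16_t extract_as_2_bytes(uint16_t read, uintptr_t /*addr*/) {
    return read;
}
```

This also suggests why the bug could hide: for elements in the lower half of their word (shift of 0), the wrong and right versions agree.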
