Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations. #2998
carlobertolli wants to merge 1 commit into ROCm:develop from
Conversation
Jenkins build for 2656a7b519637087580b41e95f38aae408b14727 commit finished as FAILURE
Adding more info as I keep studying this problem. LLVM actually expands the 2-byte atomic to a 4-byte one during code generation, as that is the only legal hw instruction available. Code generation aligns the input address down to a 4-byte boundary: in the pytorch example, the address of element 14 (the last one) is already 4-byte aligned, so the result is the same and the atomic cmpswap in the generated code will actually read and write 2 bytes past the end of the array.
This patch was triggered by a failure observed in ROCm GPU asan support. It happens when we attempt an atomic operation on the last element of a tensor and the element address has the correct alignment. Normally, we would use a 4-byte read plus an atomic CAS loop over 4 bytes. However, in this case we cannot backtrack 2 bytes before the last element, because that would mean a mis-aligned atomic, resulting in a gpu memory error. The solution is to read 2 bytes (change unsigned int to short) and then use an atomicCAS loop over the 2-byte data type of the last element in the tensor.

On AMDGPUs, LLVM expands the 2-byte atomic to a 4-byte one during code generation, as that is the only legal hw instruction available. Code generation aligns the input address down to a 4-byte boundary. In the pytorch example, the address of element 14 (the last one) is already 4-byte aligned, so the result is the same and the atomic cmpswap in the generated code will actually read and write 2 bytes past the end of the array.

There are two observations on why this works:
1. ASAN doesn't catch the atomic cmpswap out-of-bound error because asan instrumentation happens before the 2-byte atomic is legalized to a 4-byte atomic (asan instrumentation runs before code generation).
2. We do not see the gpu memory error because we are effectively reading the red zone and rewriting the same 2 bytes into it.

So there is still an issue here after atomic legalization, but the test passes with and without asan. Without asan, it is likely that hipMalloc pads the allocation to a larger size, or that we are touching, without modifying, some other application data. As long as that data is not read-only, we won't see the issue.
2656a7b to 967df8d
I updated the PR description with the new findings. It does not seem to show in the web interface but it shows correctly when using git log.
Jenkins build for 967df8da8ee60383c2695f79187d6ec9e725c32d commit finished as FAILURE
This patch has a bug: it manipulates the read value as if it were 4 bytes, when in fact we read only 2 bytes.