Describe the bug
I encountered a situation where the synchronization behavior of loops with reductions appears to be different between HIP and CUDA backends.
Consider the following code snippet:
```cpp
template <typename loop_exec, typename reduce_exec, typename T, typename SizeT>
T raja_sum_reduce(T* data, SizeT N) noexcept
{
  RAJA::ReduceSum<reduce_exec, T> sum(T(0));
  RAJA::forall<loop_exec>(
      RAJA::RangeSegment(SizeT(0), N), [=] RAJA_DEVICE (SizeT i) {
        sum += data[i];
      });
  // For HIP it seems that I have to explicitly synchronize before calling sum.get()
  T val = static_cast<T>(sum.get());
  return val;
}
```
For HIP, I am calling this with:

```cpp
raja_sum_reduce< RAJA::hip_exec< 256 >, RAJA::hip_reduce >(data, N);
```

And for CUDA, I am calling this with:

```cpp
raja_sum_reduce< RAJA::cuda_exec< 256 >, RAJA::cuda_reduce >(data, N);
```
Observed Behavior
With the HIP backend, it seems that I have to call RAJA::synchronize() before calling sum.get(). However, with the CUDA backend everything works as expected.
I encountered this in the context of a larger application, but I haven't been able to reproduce it with a minimal standalone RAJA test.
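For reference, this is a sketch of the workaround I am currently using on the HIP side. I am assuming the policy-templated form `RAJA::synchronize< RAJA::hip_synchronize >()` is the intended spelling of the synchronize call; the helper name below is just for illustration:

```cpp
#include "RAJA/RAJA.hpp"

// Sketch of the HIP workaround: explicitly synchronize the device
// between the forall and the read of the reduction value.
template <typename T, typename SizeT>
T hip_sum_with_explicit_sync(T* data, SizeT N)
{
  RAJA::ReduceSum<RAJA::hip_reduce, T> sum(T(0));
  RAJA::forall<RAJA::hip_exec<256>>(
      RAJA::RangeSegment(SizeT(0), N), [=] RAJA_DEVICE (SizeT i) {
        sum += data[i];
      });
  // Without this explicit synchronization, sum.get() below appears to
  // return a stale/partial value on the HIP backend.
  RAJA::synchronize<RAJA::hip_synchronize>();
  return static_cast<T>(sum.get());
}
```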
Questions
- Does the call to sum.get() guarantee synchronization, or should the application explicitly synchronize even though the RAJA::forall execution policy is not asynchronous (async)?
- Is the behavior of reductions identical between the CUDA and HIP backends?
- I noticed that there are some additional execution policies for loops that have reductions, RAJA::cuda_exec_with_reduce< BLOCK_SIZE > and RAJA::hip_exec_with_reduce< BLOCK_SIZE > respectively.
  - Is the current guidance to use these policies with RAJA::forall instead?
  - How are these policies different from the normal RAJA::cuda_exec< BLOCK_SIZE > and RAJA::hip_exec< BLOCK_SIZE > execution policies?
Thank you very much for all your guidance and help.