Describe the bug
I encountered a situation where the synchronization behavior of loops with reductions appears to be different between HIP and CUDA backends.
Consider the following code snippet:
```cpp
template <typename loop_exec, typename reduce_exec, typename T, typename SizeT>
T raja_sum_reduce(T* data, SizeT N) noexcept
{
  RAJA::ReduceSum<reduce_exec, T> sum(T(0));
  RAJA::forall<loop_exec>(
      RAJA::RangeSegment(SizeT(0), N), [=] RAJA_DEVICE (SizeT i) {
        sum += data[i];
      });
  // For HIP it seems that I have to explicitly synchronize before calling sum.get()
  T val = static_cast<T>(sum.get());
  return val;
}
```
For HIP, I am calling this with:

```cpp
raja_sum_reduce< RAJA::hip_exec< 256 >, RAJA::hip_reduce >(data, N);
```

And for CUDA, I am calling this with:

```cpp
raja_sum_reduce< RAJA::cuda_exec< 256 >, RAJA::cuda_reduce >(data, N);
```
Observed Behavior
With the HIP backend, it seems that I have to call RAJA::synchronize() before calling sum.get(). However, with the CUDA backend everything works as expected.
I encountered this in the context of a larger application, but I haven't been able to reproduce it with a minimal standalone RAJA test.
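For reference, this is a sketch of the workaround I am currently using on the HIP side. I am assuming the policy-templated form `RAJA::synchronize< RAJA::hip_synchronize >()` is the intended spelling of the synchronize call; the helper name below is just for illustration:

```cpp
#include "RAJA/RAJA.hpp"

// Sketch of the HIP workaround: explicitly synchronize the device
// between the forall and the read of the reduction value.
template <typename T, typename SizeT>
T hip_sum_with_explicit_sync(T* data, SizeT N)
{
  RAJA::ReduceSum<RAJA::hip_reduce, T> sum(T(0));
  RAJA::forall<RAJA::hip_exec<256>>(
      RAJA::RangeSegment(SizeT(0), N), [=] RAJA_DEVICE (SizeT i) {
        sum += data[i];
      });
  // Without this explicit synchronization, sum.get() below appears to
  // return a stale/partial value on the HIP backend.
  RAJA::synchronize<RAJA::hip_synchronize>();
  return static_cast<T>(sum.get());
}
```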
Questions
- Does the call to sum.get() guarantee synchronization, or should the application explicitly synchronize even though the RAJA::forall execution policy is not asynchronous (async)?
- Is the behavior of reductions identical between the CUDA and HIP backends?
- I noticed that there are some additional execution policies for loops that have reductions, RAJA::cuda_exec_with_reduce< BLOCK_SIZE > and RAJA::hip_exec_with_reduce< BLOCK_SIZE > respectively.
  - Is the current guidance to use these policies with RAJA::forall instead?
  - How are these policies different from the normal RAJA::cuda_exec< BLOCK_SIZE > and RAJA::hip_exec< BLOCK_SIZE > execution policies?
Thank you very much for all your guidance and help.