Refactor call_device with C++17 Fold Expressions and Fix launch_global Declaration Order (#5480)

WangZW928 · web-flow · commit 9ed35ab0e584 · 2026-06-09T07:56:49.000-07:00
## Summary
This PR refactors the variadic device-side kernel helper `call_device`
using C++17 fold expressions and reorders its definition in
`Src/Base/AMReX_GpuLaunch.H`.

Specifically:
1. Moved the definition of `call_device` above `launch_global` so that
it is already declared when the `launch_global` template is parsed
(phase one of two-phase name lookup).
2. Replaced the recursive template overloads of `call_device` with a
single template that uses a C++17 unary right fold over the comma
operator (`(std::forward<Lambdas>(fs)(), ...);`).
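
To illustrate item 2, here is a minimal host-only sketch (plain C++17,
without the `AMREX_GPU_DEVICE` qualifier) that places the
recursive-overload style next to the fold-expression style. The names
`call_recursive`, `call_fold`, and `trace_order` are hypothetical and
exist only for this illustration; they are not part of AMReX.

```cpp
#include <string>
#include <utility>

// Recursive style: a base-case overload plus an overload that peels off
// one callable per instantiation.
template <class L>
void call_recursive (L&& f0) { f0(); }

template <class L, class... Lambdas>
void call_recursive (L&& f0, Lambdas&&... fs)
{
    f0();
    call_recursive(std::forward<Lambdas>(fs)...);
}

// Fold style: a single template. The unary right fold over the comma
// operator invokes each callable in left-to-right order.
template <class... Lambdas>
void call_fold (Lambdas&&... fs)
{
    (std::forward<Lambdas>(fs)(), ...);
}

// Hypothetical helper: records the order in which three callables run.
inline std::string trace_order (bool use_fold)
{
    std::string log;
    auto a = [&] { log += 'a'; };
    auto b = [&] { log += 'b'; };
    auto c = [&] { log += 'c'; };
    if (use_fold) { call_fold(a, b, c); }
    else          { call_recursive(a, b, c); }
    return log;
}
```

Both `trace_order(true)` and `trace_order(false)` should yield the same
call order, which is the behavior the refactor is meant to preserve.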

## Additional background
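Currently, `launch_global` calls `call_device(fs...)` before any
declaration of `call_device` is visible. Under **two-phase name
lookup**, ordinary unqualified lookup at the template definition
therefore finds nothing, and the call can only be resolved by
argument-dependent lookup at instantiation time.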

If a user passes multiple lambdas whose closure types belong to the
global or an anonymous namespace (which is common in isolated unit tests
or external mock environments), `amrex` is not among the namespaces
searched by Argument-Dependent Lookup (ADL), so the compiler cannot
resolve `call_device` and compilation fails.

```cpp
auto f1 = [=] __device__ () { ... };
auto f2 = [=] __device__ () { ... };
auto f3 = [=] __device__ () { ... };
AMREX_LAUNCH_KERNEL_NOBOUND(1, 1, 0, 0, f1, f2, f3); // Fails to compile!
```
ERROR: call to function "call_device" that is neither visible in the
template definition nor found by argument-dependent lookup
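
The lookup rule behind this error can be reproduced without CUDA. The
sketch below is plain C++17 with hypothetical names (`mylib`,
`run_rest`, `run_all`, `count_calls`); it mirrors the fixed ordering, in
which the helper is declared before the template that calls it. Moving
`run_rest` below `run_all` would be expected to produce a similar
diagnostic on conforming compilers, since the closure types are declared
outside `mylib`.

```cpp
#include <utility>

namespace mylib {

// Declared first, so ordinary unqualified lookup at the definition of
// run_all below already finds it; ADL is not needed.
template <class... Fs>
void run_rest (Fs&&... fs)
{
    (std::forward<Fs>(fs)(), ...);
}

// Mirrors the shape of launch_global: run the first callable, then the rest.
template <class F, class... Fs>
void run_all (F f0, Fs... fs)
{
    f0();
    run_rest(fs...);  // dependent call; the declaration above is visible here
}

} // namespace mylib

// The closure types created here are associated with the global
// namespace, not with mylib, so ADL alone would never search mylib.
inline int count_calls ()
{
    int n = 0;
    auto f1 = [&] { ++n; };
    auto f2 = [&] { ++n; };
    auto f3 = [&] { ++n; };
    mylib::run_all(f1, f2, f3);
    return n;
}
```

With this ordering, `count_calls()` compiles and runs all three
callables, regardless of which namespace the lambdas come from.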

This defect did not surface within AMReX before because almost all
internal production code passes a **single** lambda to
`AMREX_LAUNCH_KERNEL`, so the `call_device(fs...)` call is never
instantiated with a non-empty pack.

The new C++17 fold expression handles both an empty parameter pack (a
fold over the comma operator with no operands evaluates to `void()`)
and multiple lambdas, and it leaves the signature of `launch_global`
unchanged. The fix has been verified under CUDA 12.0+.
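
The empty-pack case can be checked in isolation. In this host-only
sketch, `call_each`, `launch_like`, and `single_callable_runs` are
hypothetical names chosen for illustration; only the shape of the
templates follows the diff below.

```cpp
#include <utility>

template <class... Fs>
void call_each (Fs&&... fs) noexcept
{
    // With an empty pack, this fold over the comma operator is void(),
    // so the body is a no-op rather than ill-formed.
    (std::forward<Fs>(fs)(), ...);
}

// Same signature shape as launch_global: one mandatory callable
// followed by an optional pack.
template <class F, class... Fs>
void launch_like (F f0, Fs... fs)
{
    f0();
    call_each(fs...);  // becomes call_each() when only f0 is passed
}

inline bool single_callable_runs ()
{
    bool ran = false;
    call_each();                       // empty pack: compiles, does nothing
    launch_like([&] { ran = true; });  // single callable, empty trailing pack
    return ran;
}
```

No separate zero-argument overload is required, which is why the single
variadic template can replace both of the earlier overloads.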

## Checklist

The proposed changes:
- [x] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] change answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [ ] include documentation in the code and/or rst files, if appropriate
diff --git a/Src/Base/AMReX_GpuLaunch.H b/Src/Base/AMReX_GpuLaunch.H
@@ -53,19 +53,20 @@ namespace amrex {
 //  Variadic lambda function wrappers for C++ CUDA/HIP Kernel calls.
 
 #if defined(AMREX_USE_CUDA) || defined(AMREX_USE_HIP)
-    template<class L, class... Lambdas>
-    AMREX_GPU_GLOBAL void launch_global (L f0, Lambdas... fs) { f0(); call_device(fs...); }
 
     /// \cond DOXYGEN_IGNORE
-    template<class L>
-    AMREX_GPU_DEVICE void call_device (L&& f0) noexcept { f0(); }
 
-    template<class L, class... Lambdas>
-    AMREX_GPU_DEVICE void call_device (L&& f0, Lambdas&&... fs) noexcept {
-        f0();
-        call_device(std::forward<Lambdas>(fs)...);
+    template <typename... Lambdas>
+    AMREX_GPU_DEVICE void call_device (Lambdas&&... fs) noexcept
+    {
+        (std::forward<Lambdas>(fs)(), ...);
     }
+
     /// \endcond
+
+    template<class L, class... Lambdas>
+    AMREX_GPU_GLOBAL void launch_global (L f0, Lambdas... fs) { f0(); call_device(fs...); }
+
 #endif
 
 // CPU variation