Skip to content

Latest commit

 

History

History
112 lines (89 loc) · 4.73 KB

File metadata and controls

112 lines (89 loc) · 4.73 KB

RAJA::Launch Basics

There are no exercise files to work through for this section. Instead, there is an example source file RAJA/examples/tut_launch_basic.cpp which contains complete code examples of the concepts described here.

Key RAJA features shown in the following examples are:

  • RAJA::launch method to create a run-time selectable host/device execution space.
  • RAJA::loop methods to express algorithms in terms of nested for loops.

In this example, we introduce the RAJA Launch framework and discuss hierarchical loop-based parallelism. Kernel execution details with RAJA Launch occur inside the lambda expression passed to the RAJA::launch method, which defines an execution space:

RAJA::launch<launch_policy>(RAJA::ExecPlace ,
RAJA::LaunchParams(RAJA::Teams(Nteams,Nteams),
                   RAJA::Threads(Nthreads,Nthreads)),
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

  /* Kernel code goes here */

});

The RAJA::launch method accepts a RAJA::LaunchPolicy template parameter that can be defined using up to two policies (host and device). For example, the following creates an execution space for a sequential and CUDA kernel dispatch:

using launch_policy = RAJA::LaunchPolicy
  <RAJA::seq_launch_t, RAJA::cuda_launch_t<false>>;

Whether a kernel executes on the host or device is determined by the first argument passed to the RAJA::launch method, which is a RAJA::ExecPlace enum value, either HOST or DEVICE. Similar to GPU thread and block programming models, RAJA Launch carries out computation in a predefined compute grid made up of threads which are then grouped into teams when executing on the device. The execution space is then enclosed by a host/device lambda which takes a RAJA::LaunchContext object, which may be used to control the flow within the kernel, for example by creating thread-team synchronization points.

Note

RAJA treats Teams(i,j,k) and Threads(i,j,k) as an (x,y,z) ordering. For users who prefer SYCL's (dim0, dim1, dim2) ordering, RAJA provides Teams::sycl_order(dim0, dim1, dim2) and Threads::sycl_order(dim0, dim1, dim2), which map to the RAJA (x,y,z) ordering. For example:

RAJA::LaunchParams(RAJA::Teams::sycl_order(g0, g1, g2),
                   RAJA::Threads::sycl_order(l0, l1, l2))

See also the example examples/launch-device-policy-aliases.cpp.

Inside the execution space, developers write a kernel using nested RAJA::loop methods. The manner in which each loop is executed is determined by a template parameter type, which indicates how the corresponding iterates are mapped to the Teams/Threads configuration defined by the RAJA::LaunchParams type passed as the second argument to the RAJA::launch method. Following the CUDA and HIP programming models, this defines an hierarchical structure in which outer loops are executed by thread-teams and inner loops are executed by threads in a team.

.. literalinclude:: ../../../../examples/tut_launch_basic.cpp
   :start-after: // _team_loops_start
   :end-before: // _team_loops_end
   :language: C++

The mapping between teams and threads to the underlying programming model depends on how the RAJA::loop template parameter types are defined. For example, we may define host and device mapping strategies as:

using teams_x = RAJA::LoopPolicy< RAJA::seq_exec,
                                  RAJA::cuda_block_x_direct >;
using thread_x = RAJA::LoopPolicy< RAJA::seq_exec,
                                   RAJA::cuda_block_x_direct >;

Here, the RAJA::LoopPolicy type holds both the host (CPU) and device (CUDA GPU) loop mapping strategies. On the host, both the team/thread strategies expand out to standard C-style loops for execution:

.. literalinclude:: ../../../../examples/tut_launch_basic.cpp
   :start-after: // _c_style_loops_start
   :end-before: // _c_style_loops_end
   :language: C++

On the device the teams_x/y policies will map loop iterations directly to CUDA (or HIP) thread blocks, while the thread_x/y policies will map loop iterations directly to threads in a CUDA (or HIP) thread block. The direct CUDA equivalent of the kernel body using the policy shown above is:

.. literalinclude:: ../../../../examples/tut_launch_basic.cpp
   :start-after: // _device_loop_start
   :end-before: // _device_loop_end
   :language: C++