ProgrammingGuide/Kokkos_PG_LegacyCodeIntegration.tex at master · kokkos/ProgrammingGuide · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
\chapter{Interoperability and Legacy Codes}

One goal of Kokkos is to support incremental adoption in legacy
applications. This facilitates a step by step conversion allowing
for continuous testing of functionality (and in certain bounds) of
performance. One feature of this is full interoperability with the
underlying backend programming models. This also allows for
target specific optimizations written directly in the backend model
in order to achieve maximal performance.

After reading this chapter, you should understand the following:
\begin{itemize}
\item Restriction on interoperability with raw OpenMP and Cuda.
\item How to handle external data structures.
\item How to incrementally convert legacy data structures.
\item How to call non-Kokkos third party libraries safely.
\end{itemize}
In all code examples in this chapter, we assume that all classes in
the \lstinline!Kokkos! namespace have been imported into the working
namespace.

\section{OpenMP, Pthreads and CUDA interoperability}

Since the implementation of Kokkos is achieved with a C++ library
it provides full interoperability with the underlying backend programming
models. In particular it allows for mixing of OpenMP, CUDA and Kokkos
code in the same compilation unit. This is true for both the parallel
execution layers of Kokkos as for the data layer.

It is important to recognize that this does not lift certain restrictions.
For example it is not valid to allocate a view inside of an OpenMP
parallel region, the same way as it is not valid to allocate a View
inside a \lstinline|parallel_for| kernel. Indeed there are things which
are slightly more cumbersome when mixing the models. Assigning
one view to another inside a Cuda kernel or an OpenMP parallel
region is only possible if the destination view is unmanaged.
During dispatch of kernels with \lstinline|parallel_for|  all Views referenced
in the functor or lambda are automatically switched into unmanaged
mode. This would not happen when simply entering an OpenMP
parallel region.

\subsection{Cuda interoperabiltiy}

The most important thing to know for Cuda interoperability is that the provided macro
\lstinline|KOKKOS_INLINE_FUNCTION| evaluates to
\lstinline|__host__ __device__ inline|. This means that calling a pure
\lstinline|__device__| function (for example Cuda intrinsics or device
functions of libraries) must be protected with the \lstinline|__CUDA_ARCH__|
pragma.

\begin{lstlisting}
__device__ SomeFunction(double* x) {
  ...
}

struct Functor {
  typedef Cuda execution_space;
  View<double*,Cuda> a;
  KOKKOS_INLINE_FUNCTION
  void operator(const int& i) {
    #ifdef __CUDA_ARCH__
    int block = blockIdx.x;
    int thread = threadIdx.x;
    if (thread == 0)
      SomeFunction(&a(block));
    #endif
  }
}
\end{lstlisting}

The \lstinline|RangePolicy| start a 1D grid of 1D thread blocks so that the index
\lstinline|i| is calculated as \lstinline|blockIdx.x * blockDim.x + threadIdx.x|.
For the \lstinline|TeamPolicy| the number of teams is the grid dimension, while
the number of threads per team is mapped to the Y-dimension of the Cuda
thread-block. The optional vector length is mapped to the X-dimension.
For example \lstinline|TeamPolicy<Cuda>(100,12,16)| would start a 1D grid of size
100 with block-dimensions (16,12,1) while \lstinline|TeamPolicy<Cuda>(100,96)| would
result in a grid size of 100 with block-dimensions of (1,96,1). The restrictions on the
vector length (power of two and smaller than 32 for the Cuda execution space)
guarantee that vector loops are performed by threads which are part of a single
warp.

\subsection{OpenMP}

One restriction on OpenMP interoperability is that it is not valid to increase
the number of threads via \lstinline|omp_set_num_threads()| after initializing
Kokkos. This restriction is caused due to Kokkos allocating internal per thread
book keeping data structures. It is however valid to ask
for the thread ID inside a Kokkos parallel kernel compiled for the OpenMP
execution space. It is also valid to use OpenMP constructs such as OpenMP
atomics inside a parallel kernel or functions called by it. However it is undefined
what happens when mixing OpenMP and Kokkos atomics, since those will not
necessarily map to the the same underlying mechanism.


\section{Legacy data structures}

There are two principal mechanism to facilitate interoperability with legacy data
structures: (i) Kokkos allocates data and raw pointers are extracted to create
legacy data structures and (ii) unmanaged views can be used to view externally
allocated data. In both cases it is mandatory to fix the Layout of the Kokkos view
to the actual layout used in the legacy data structure. Note that the user is responsible
for insuring proper access capabilities. For example a pointer obtained from a view
in the \lstinline|CudaSpace| may only be accessed from Cuda kernels, and a View
constructed from memory acquired through a call to \lstinline|new| will typically only
be accessible from Execution spaces which can access the \lstinline|HostSpace|.

\section{Raw allocations through Kokkos}

A simple way to add support for multiple memory spaces to a legacy app is to
use \lstinline|kokkos_malloc|, \lstinline|kokkos_free| and \lstinline|kokkos_realloc|.
The functions are templated on the memory space and thus allow targeted
placement of data structures:

\begin{lstlisting}
// Allocate an array of 100 doubles in the default memory space
double* a = (double*) kokkos_malloc<>(100*sizeof(double));

// Allocate an array of 150 int* in the Cuda UVM memory space
// This allocation is accessible from the host
int** 2d_array = (int**) kokkos_malloc<CudaUVMSpace>
                         (150*sizeof(int*));

// Fill the pointer array with pointers to data in the Cuda Space
// Since it is not the UVM space you can access 2d_array[i][j]
// only inside a Cuda Kernel
for(int i=0;i<150;i++)
  2d_array[i] = (int*) kokkos_malloc<CudaSpace>(200*sizeof(int));
\end{lstlisting}

A common usage scenario of this capability is to allocate all memory in the CudaUVMSpace
when compiling for GPUs. This allows all allocations to be accessible from legacy code sections
as well as from parallel kernels written with Kokkos.

\subsection{External memory management}

When memory is managed externally, for example because Kokkos is used
in a library which is given pointers to data allocations as input, it can be necessary
or convenient to wrap the data into Kokkos views. If the library anyway receives the
data to create a copy it is straight forward to allocate the internal data structure as a
view and copy the data in a parallel kernel element by element. Note that you might
need to first copy into a host view before copying to the actual destination memory
space:

\begin{lstlisting}
template<class ExecutionSpace>
void MyKokkosFunction(double* a, const double** b, int n, int m) {
  // Define the host execution space and the view types
  typedef HostSpace::execution_space host_space;
  typedef View<double*,ExecutionSpace> t_1d_device_view;
  typedef View<double**,ExecutionSpace> t_2d_device_view;

  // Allocate the view in the memory space of ExecutionSpace
  t_1d_device_view d_a("a",n);
  // Create a host copy of that view
  typename t_1d_device_view::HostMirror h_a = create_mirror_view(a);
  // Copy the data from the external allocation into the host view
  parallel_for(RangePolicy<host_space>(0,n),
    KOKKOS_LAMBDA (const int& i) {
    h_a(i) = a[i];
  });
  // Copy the data from the host view to the device view
  deep_copy(d_a,h_a);

  // Allocate a 2D view in the memory space of ExecutionSpace
  t_2d_device_view d_b("b",n,m);
  // Create a host copy of that view
  typename t_2d_device_view::HostMirror h_b = create_mirror_view(b);

  // Get the member_type of the team policy
  typedef TeamPolicy<host_space>::member_type t_team;
  // Run a 2D copy kernel using a TeamPolicy
  parallel_for(TeamPolicy<host_space>(n,m),
    KOKKOS_LAMBDA (const t_team& t) {
    const int i = t.team_rank();
    parallel_for(TeamThreadRange(t,m), [&] (const int& j) {
      h_b(i,j) = b[i][j];
    });
  });
  // Copy the data from the host to the device
  deep_copy(d_b,h_b);
}
\end{lstlisting}

Alternatively one can create a view which directly references the external allocation.
If that data is a multi dimensional view it is important to specify the Layout explicitly.
Furthermore all data must be part of the same allocation.

 \begin{lstlisting}
void MyKokkosFunction(int* a, const double* b, int n, int m) {
  // Define the host execution space and the view types
  typedef View<int*, DefaultHostExecutionSpace, Unmanaged> t_1d_view;
  typedef View<double**[3],LayoutRight, DefaultHostExecutionSpace,
               Unmanaged> t_3d_view;

  // Create a 1D view of the external allocation
  t_1d_view d_a(a,n);

  // Create a 3D view of the second external allocation
  // This assumes that the data had a row major layout
  // (i.e. the third index is stride 1)
  t_3d_view d_b(b,n,m);
}
\end{lstlisting}

\subsection{Views as the fundamental data owning structure}

An other option is to let Kokkos handle the basic allocations using Views,
and then constructing the legacy data structures around them. Again it is
important to fix the Layout of the Views to whatever the layout of the legacy
data was.

\begin{lstlisting}
// Allocate a 2D view with row major layout
View<double**,LayoutRight,HostSpace> a("A",n,m);

// Allocate an array of pioneers
double** a_old = new double*[n];

// Fill the array with pointers to the rows of a
for(int i=0; i<n; i++)
  a_old[i] = &a(i,0);
\end{lstlisting}

\subsection{std::vector}

One of the most common data objects in C++ codes is \lstinline|std::vector|.
Its semantics are unfortunately not compatible with Kokkos requirements
and it is thus not well supported. A major problem is that functors and lambdas
are passed as const objects to the parallel execution. This design choice was made
to (i) prevent a common cause of race conditions and (ii) allow the underlying
implementation more flexibility in how to share the functor and where to put it.
In particular this leaves the choice open for the implementation to give each thread
an individual copy of the functor or place it in read only cache structures.

The semantics of \lstinline|std::vector| would in this case prevent a kernel from modifying its
entries since a const \lstinline|std::vector| is read only. Furthermore creating multiple copies
of the functor would in deed replicate the vector data, since it has copy semantics.

Other issues with \lstinline|std::vector| are its unrestricted support for resize as well as push
functionality. In a threaded environment support for those capabilities would bring
massive performance penalties. In particular access to the \lstinline|std::vector| would require
locks in order to prevent one thread to deallocate the view while another accesses its content.
And last but not least \lstinline|std::vector| is not supported on GPUs and thus would prevent
portability.

Kokkos provides a drop in replacement for \lstinline|std::vector| with \lstinline|Kokkos::vector|.
Outside of parallel kernels its semantics are mostly the same as that of \lstinline|std::vector|, for
example assignments performs deep copies and resize and push functionality are provided.
One important difference is that it is valid to assign values to the elements of a const vector.

Inside of parallel sections \lstinline|Kokkos::vector| switches to view semantics. That means
in particular that assignments are shallow copies. Certain functions will also throw runtime
errors when called inside a parallel kernel. This includes resize and push.

\begin{lstlisting}
// Create a vector of 1000 double elements
Kokkos::vector<double> v(1000);
// Create another vector as a copy of v;
// This allocates another 1000 doubles
Kokkos::vector<double> x = v;

parallel_for(1000, KOKKOS_LAMBDA (const int& i) {
   // Create a view of x; m and x will reference the same data.
   Kokkos::vector<double> m = x;
   x[i] = 2*i+1;
   v[i] = m[i] - 1;
});

// Now x contains the first 1000 uneven numbers
// v contains the first 1000 even numbers
\end{lstlisting}

\section{Calling non-Kokkos libraries}

There are no restrictions on calling non-Kokkos libraries outside of parallel kernels.
However due to the polymorphic layouts of Kokkos views it is often required to test
layouts for compatibility with third party libraries. The usual Blas interface for example,
expects matrixes to be laid out in column major format (i.e. LayoutLeft in Kokkos).
Furthermore it is necessary to test that the library can access the memory space
of the view.

\begin{lstlisting}
template<class Scalar, class Device>
Scalar dot(View<const Scalar* , Device> a,
           View<const Scalar*, Device> b) {
// Check for Cuda memory and call cublas if true
#ifdef KOKKOS_HAVE_CUDA
  if(std::is_same<typename Device::memory_space,
                           CudaSpace>::value ||
     std::is_same<typename Device::memory_space,
                           CudaUVMSpace>::value) {
    return call_cublas_dot(a.ptr_on_device(), b.ptr_on_device(),
                           a.dimension_0() );
  }
#endif

// Call CBlas on the host otherwise
  if(std::is_same<typename Device::memory_space,HostSpace>::value) {
    return call_cblas_dot(a.ptr_on_device(), b.ptr_on_device(),
                          a.dimension_0() );
  }
}
\end{lstlisting}