As an example, the following snippet finds the maximum of `A`.
To perform reductions that are not sums, in addition to modifying the lambda body,
the final reduction variable needs to be cast to the appropriate type. In the above example,
`MaxA` is cast to `Kokkos::Max<Real>` to perform a max reduction.
The `parallelReduce` wrapper supports performing multiple reductions at the same time.
You can compute `SumA` and `MaxA` in one pass over the data:
```c++
parallelReduce(
    {NCells},
    KOKKOS_LAMBDA(int ICell, Real &AccumSum, Real &AccumMax) {
       AccumSum += A(ICell);
       AccumMax = Kokkos::max(AccumMax, A(ICell));
    },
    SumA, Kokkos::Max<Real>(MaxA));
```
There is no limit to how many reductions can be done at the same time.
It is usually beneficial for performance to group a small number (2-8) of simple
reductions together.
Similarly to `parallelFor`, `parallelReduce` supports labels and up to five dimensions.

## Hierarchical parallelism

A more flexible alternative to flat parallelism is hierarchical parallelism.
In general, Kokkos supports a three-level parallelism hierarchy of
- teams of a league
- threads of a team
- vector lanes of a thread

In Omega, we simplify this design to a two-level hierarchy (outer and inner) that naturally
corresponds to the model parallelism in the horizontal and vertical dimensions.
The outer level of parallelism in Omega corresponds to parallelism over Kokkos teams,
whereas the inner level may be a combination of parallelism over team threads and thread vector lanes.
The simplest example of a hierarchical parallel loop using Omega wrappers looks as follows.
```c++
parallelForOuter(
    {NCells},
    KOKKOS_LAMBDA(int ICell, const TeamMember &Team) {
       parallelForInner(Team, NVertLayers, INNER_LAMBDA(int KLayer) {
          // Inner body
       });
    });
```
Note the appearance of a variable named `Team` of type `TeamMember` in the outer lambda
argument list. This variable handles launching of inner loops and team synchronization.
The flexibility of hierarchical parallelism comes from:
- the inner loop range may depend on the outer index
- different patterns may be used at the outer and inner levels
- multiple parallel loops may be present at the inner level

The following outer iteration patterns are supported in Omega:
- `parallelForOuter`
- `parallelReduceOuter`

The following inner iteration patterns are supported in Omega:
- `parallelForInner`
- `parallelReduceInner`
- `parallelScanInner`

To provide even more flexibility, the outer loops support iterating over a multi-dimensional range.
Currently, the inner loops are limited to one dimension.

### INNER_LAMBDA

When using hierarchical parallelism, the outer loop lambda has to be annotated with `KOKKOS_LAMBDA` to
be device accessible. An inner lambda must be written so that it produces correct results with capture
by value (`[=]`). However, in some cases, better performance might be possible by using
capture by reference (`[&]`). For that reason, Omega provides the
`INNER_LAMBDA` macro that expands to the best-performing capture clause depending on the chosen
architecture.

### teamBarrier

When multiple parallel inner loops are present within one outer loop, synchronization within
a thread team might be necessary. Omega provides the `teamBarrier` function that
synchronizes the threads in a team, which can be called as follows.
```c++
teamBarrier(Team);
```
This function may only be called at the outer level and needs to be called by every thread in a team.
Failure to follow these rules might result in deadlocks.
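For instance, a sketch of two dependent inner loops separated by a barrier might look as follows. This is an illustrative fragment only: the array `Work` and the bodies of the inner loops are hypothetical, and the identifiers reuse those of the earlier hierarchical example.
```c++
parallelForOuter(
    {NCells},
    KOKKOS_LAMBDA(int ICell, const TeamMember &Team) {
       parallelForInner(Team, NVertLayers, INNER_LAMBDA(int KLayer) {
          // first inner loop writes, e.g., Work(ICell, KLayer)
       });
       // make the first loop's results visible to every thread in the team
       teamBarrier(Team);
       parallelForInner(Team, NVertLayers, INNER_LAMBDA(int KLayer) {
          // second inner loop may now safely read values written above
       });
    });
```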

### Kokkos::single

When using hierarchical parallelism, it might be necessary to restrict execution to just one thread of a team.
To do that, Kokkos provides the `single` function. To execute a statement once per team with a single thread do
```c++
Kokkos::single(PerTeam(Team), [=]() {
   // executed once per team (by one of the team threads)
});
```

### parallelForOuter

To start outer iterations over a multi-dimensional index range the `parallelForOuter` wrapper is available.
A call to `parallelForOuter` might look as follows.
```c++
parallelForOuter(
    {N1},
    KOKKOS_LAMBDA(int J1, const TeamMember &Team) {
       // launch inner loops here
    });
```
Labels and multi-dimensional iteration ranges of up to five dimensions are supported by `parallelForOuter`.

### parallelReduceOuter

To start outer reductions over a multi-dimensional index range the `parallelReduceOuter` wrapper is available.
A call to `parallelReduceOuter` might look as follows.
```c++
Real OuterSum;
parallelReduceOuter(
    {N1},
    KOKKOS_LAMBDA(int J1, const TeamMember &Team, Real &Accum) {
       Real InnerContribution;
       // compute InnerContribution
       Kokkos::single(PerTeam(Team), [&]() {
          Accum += InnerContribution;
       });
    }, OuterSum);
```
Note how in this example the addition of `InnerContribution` is done inside `Kokkos::single`
with just one thread from a team. This shows how to add one contribution per team to the total sum.
The reduction accumulator comes after the team member variable in the lambda argument list.
Labels and multi-dimensional iteration ranges of up to five dimensions are supported in `parallelReduceOuter`.
Different types of reductions (min, max) and multiple reducers are also supported.

### parallelForInner

To launch inner parallel iterations Omega provides the
`parallelForInner` wrapper. For example, the following code shows how to set every element of
a 3D array in parallel using hierarchical parallelism.
```c++
Array3DReal A("A", N1, N2, N3);
parallelForOuter(
    {N1, N2}, KOKKOS_LAMBDA(int J1, int J2, const TeamMember &Team) {
       parallelForInner(Team, N3, INNER_LAMBDA(Int J3) {
          A(J1, J2, J3) = J1 + J2 + J3;
       });
    });
```
Labels are not supported by `parallelForInner` and only a one-dimensional index range can be used.
The inner loop range can depend on the outer loop index. For example, to set the elements below the main
diagonal of a square matrix one can do:
```c++
Array2DReal M("M", N, N);
parallelForOuter(
    {N}, KOKKOS_LAMBDA(int J1, const TeamMember &Team) {
       parallelForInner(Team, J1, INNER_LAMBDA(Int J2) {
          M(J1, J2) = J1 + J2;
       });
    });
```

### parallelReduceInner

To launch inner parallel reductions Omega provides the `parallelReduceInner` wrapper.
For example, computing sums along the third dimension of a 3D array in parallel and storing them
in a 2D array might be done as follows.
```c++
Array3DReal A("A", N1, N2, N3);
Array2DReal B("B", N1, N2);
parallelForOuter(
    {N1, N2}, KOKKOS_LAMBDA(int J1, int J2, const TeamMember &Team) {
       Real SumD3;
       parallelReduceInner(Team, N3, INNER_LAMBDA(Int J3, Real &Accum) {
          Accum += A(J1, J2, J3);
       }, SumD3);
       B(J1, J2) = SumD3;
    });
```
Labels are not supported by `parallelReduceInner` and only a one-dimensional index range can be used.
Different types of reductions (min, max) and multiple reducers are supported.
For example, to additionally compute and store maxima along the third dimension of `A` do
```c++
Array2DReal C("C", N1, N2);
parallelForOuter(
    {N1, N2}, KOKKOS_LAMBDA(int J1, int J2, const TeamMember &Team) {
       Real SumD3, MaxD3;
       parallelReduceInner(Team, N3, INNER_LAMBDA(Int J3, Real &AccumSum, Real &AccumMax) {
          AccumSum += A(J1, J2, J3);
          AccumMax = Kokkos::max(AccumMax, A(J1, J2, J3));
       }, SumD3, Kokkos::Max<Real>(MaxD3));
       B(J1, J2) = SumD3;
       C(J1, J2) = MaxD3;
    });
```
Inner reductions are collective operations within a team. This means that the reduction results
are available to every thread of a team, without any need for synchronization.

### parallelScanInner

To launch inner parallel scans Omega provides the `parallelScanInner` wrapper.
For example, computing prefix sums (partial sums) along the third dimension of a 3D array might
be done as follows.
```c++
Array3DReal A("A", N1, N2, N3);
Array3DReal D("D", N1, N2, N3);
parallelForOuter(
    {N1, N2}, KOKKOS_LAMBDA(int J1, int J2, const TeamMember &Team) {
       parallelScanInner(Team, N3, INNER_LAMBDA(Int J3, Real &Accum, bool IsFinal) {
          Accum += A(J1, J2, J3);
          if (IsFinal) {
             D(J1, J2, J3) = Accum;
          }
       });
    });
```
This example computes partial sums up to and including `A(J1, J2, J3)` because the accumulator is updated
before the `if` statement. That is, it performs an inclusive scan. To compute an exclusive scan,
simply move the addition after the `if` statement.
```c++
Real FinalScanValue;
parallelScanInner(Team, N3, INNER_LAMBDA(Int J3, Real &Accum, bool IsFinal) {
   if (IsFinal) {
      D(J1, J2, J3) = Accum;
   }
   Accum += A(J1, J2, J3);
}, FinalScanValue);
```
Moreover, this example illustrates that the final scan value can be obtained by providing
an additional argument `FinalScanValue`. Labels are not supported by `parallelScanInner`
and only a one-dimensional index range can be used. In contrast to `parallelReduceInner`,
`parallelScanInner` supports only sum-based scans and only one scan variable.