[WIP] Esirkepov: Avoid Dynamic Loops & Indices #3210


Open: wants to merge 5 commits into development from optimize_esirkepov_indices

Conversation

@ax3l (Member) commented Jun 29, 2022

This is the complete version of #2796.

We want to:

  • avoid dynamic ranges in loops and
  • avoid any dynamic array access

to reduce the number of registers and operations needed (GPU) and to make it easier for compilers to match their vectorization patterns (later, CPU).
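
For illustration, here is a minimal sketch of the two access patterns (hypothetical function names, not WarpX code): a compile-time trip count lets the compiler fully unroll the loop and keep the array in registers, while a runtime range blocks unrolling and can push the array into local memory.

// Hypothetical sketch, not from this PR.
template <typename T, int N>
T sum_static (T const (&s)[N])
{
    T acc = T(0.0);
    // trip count known at compile time: fully unrollable, s[i] stays in registers
    for (int i = 0; i < N; ++i) { acc += s[i]; }
    return acc;
}

template <typename T>
T sum_dynamic (T const* s, int lo, int hi)
{
    T acc = T(0.0);
    // dynamic range and index: no unrolling, possible local-memory traffic
    for (int i = lo; i < hi; ++i) { acc += s[i]; }
    return acc;
}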

This PR updates Compute_shifted_shape_factor to reduce register usage dramatically. Since we only ever shift by at most one cell, we can do this with a nicely lined-up register move.

This update was developed in exchange with @psychocoderHPC from the PIConGPU team, who adopted part of our approach and shared the register-shifting trick in return.

As a follow-up to this PR, we will adopt part of #3168 to replace the cached array of the outermost loop with an on-the-fly computation.

To Do

  • Measure register & lmem (local memory) usage before/after this PR
  • Measure runtime before/after this PR

Refs

https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/

@ax3l ax3l requested review from RemiLehe and Thierry992 June 29, 2022 12:55
@ax3l ax3l changed the title to Esirkepov: Avoid Dynamic Loops & Indices Jun 29, 2022
@ax3l ax3l added the component: core (Core WarpX functionality) label Jun 29, 2022
@ax3l ax3l force-pushed the optimize_esirkepov_indices branch from 84740e1 to 2c93eb0 Compare June 29, 2022 17:03
sx[1] = T(1.0) - xint;
sx[2] = xint;

sx[0] = shift ? sx[1] : T(0.0);
Member:
Maybe I'm missing something, but this only seems correct if the old i_shift is -1 or 0. What happens if i_shift is +1?

@ax3l (Member, Author), Jun 29, 2022:

According to a quick test (and what I remember from what we implemented similarly in PIConGPU about a decade ago), it can only take the values 0 and -1 in these locations.
CI will tell.

Member:

Ok. If that is true, then the calling routine doEsirkepovDepositionShapeN can be simplified, since it contains statements like if (i_old > i_new) diu = 0; whose condition would then always be false.

@ax3l (Member, Author):

Interesting, it seems there is more to it here. We should really change those ranges to simplify this; I think there is a lot of zero-padding going on here.

Comment:

That you are not moving in the negative direction is only true if you compare the positions at the start and the end of the particle trajectory. If you use the lower of the two values as the base of your coordinate system (to calculate the shift), the particle trajectory always crosses into the positive side of a cell.
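
In code, the resulting pattern looks like the snippet above (a minimal sketch; fill_shifted_order1 is an illustrative name, not the PR's actual routine):

// Sketch only: with the lower of the old/new cell indices as the base,
// the shift is always 0 or +1, so the shifted element can be filled by a
// predicated select (a conditional register move) instead of a store
// through a runtime-computed index.
template <typename T>
void fill_shifted_order1 (T (&sx)[3], T xint, bool shift)
{
    sx[1] = T(1.0) - xint;
    sx[2] = xint;
    sx[0] = shift ? sx[1] : T(0.0); // every index is a compile-time constant
}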

@Thierry992 (Contributor):

Here is the performance of this new code.

With these modifications, we went from 180 registers/thread (original code) to 200 registers/thread. As the graph below shows, warp occupancy decreases (the green point shows the performance of the old code; the blue point refers to the new one):

[Figure: registers_thread]

These new operations seem to have a significant impact on the performance of the new shape-factor function (the picture below shows the live registers used at each line of the code):

[Figure: source_codes]

@psychocoderHPC commented Jul 6, 2022

@Thierry992 Your plot is not showing a decrease in occupancy. You should also post a screenshot of the plain values from the profiler above the plot you attached. The plot shows that the occupancy is already so low that it does not matter whether the kernel uses 129 or 255 registers; the occupancy will be the same. You should dump the register footprint at compile time with the nvcc option -Xcuda-ptxas=-v.

ptxas info    : Function properties for LONGKERNELNAME
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 96 registers, 36880 bytes smem, 512 bytes cmem[0], 40 bytes cmem[2]

The interesting parts there are the stack frame, spill stores, spill loads, and registers. Take care: if not all functions are inlined, the register footprint reported at compile time and what you see in the profiler can differ. In that case, take the register footprint from the profiler. Nevertheless, have a look at spilled registers; those live in global memory and slow down your kernel.

The second image, with the heatmap mapped to source lines, does not show the register hot spot with the largest register footprint. It is far from 200 registers and is therefore not interesting.

@RemiLehe RemiLehe self-assigned this Jul 11, 2022
@RemiLehe RemiLehe changed the title from Esirkepov: Avoid Dynamic Loops & Indices to [WIP] Esirkepov: Avoid Dynamic Loops & Indices Aug 22, 2022
RemiLehe and others added 3 commits September 14, 2022 10:24
Avoid dynamic array access in `Compute_shifted_shape_factor`
to reduce registers dramatically. Since we only shift by one
cell at most, we can do this in a nicely lined up register
move.
Asserts in debug mode & clean up of comments.
@ax3l ax3l force-pushed the optimize_esirkepov_indices branch 2 times, most recently from 08534bb to 482a2de Compare September 14, 2022 19:57
and add missing outer element that is always zero
(to be cropped off later).
@ax3l ax3l force-pushed the optimize_esirkepov_indices branch from 482a2de to ce86b13 Compare September 14, 2022 20:58
@ax3l (Member, Author) commented Sep 14, 2022

cmake -S . -B build_pm -DWarpX_COMPUTE=CUDA -DAMReX_CUDA_PTX_VERBOSE=ON
cmake --build build_pm 2>&1 | tee compile.log
grep Esirkepov compile.log

(plus cu++filt to demangle the kernel names)

Below is the compiler output for double-precision builds (default), for shape orders 3 down to 1.

Perlmutter register usage in development as of 22.09-17-g04b6f67caab8

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi3EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi3EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)3>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 180 registers, 768 bytes cmem[0], 32 bytes cmem[2]

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi2EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi2EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<256, amrex::ParallelFor<long, doEsirkepovDepositionShapeN<2>(GetParticlePosition const&, double const*, double const*, double const*, double const*, int const*, amrex::Array4<double> const&, amrex::Array4<double> const&, amrex::Array4<double> const&, long, double, double, std::array<double, 3ul> const&, std::array<double, 3ul>, amrex::Dim3, double, int, double*, long)::{lambda(long)#1}, void>(amrex::Gpu::KernelInfo const&, long, doEsirkepovDepositionShapeN<2>(GetParticlePosition const&, double const*, double const*, double const*, double const*, int const*, amrex::Array4<double> const&, amrex::Array4<double> const&, amrex::Array4<double> const&, long, double, double, std::array<double, 3ul> const&, std::array<double, 3ul>, amrex::Dim3, double, int, double*, long)::{lambda(long)#1}&&)::{lambda()#1}>(amrex::ParallelFor<long, doEsirkepovDepositionShapeN<2>(GetParticlePosition const&, double const*, double const*, double const*, double const*, int const*, amrex::Array4<double> const&, amrex::Array4<double> const&, amrex::Array4<double> const&, long, double, double, std::array<double, 3ul> const&, std::array<double, 3ul>, amrex::Dim3, double, int, double*, long)::{lambda(long)#1}, void>(amrex::Gpu::KernelInfo const&, long, doEsirkepovDepositionShapeN<2>(GetParticlePosition const&, double const*, double const*, double const*, double const*, int const*, amrex::Array4<double> const&, amrex::Array4<double> const&, amrex::Array4<double> const&, long, double, double, std::array<double, 3ul> const&, std::array<double, 3ul>, amrex::Dim3, double, int, double*, long)::{lambda(long)#1}&&)::{lambda()#1})
    240 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 126 registers, 768 bytes cmem[0], 40 bytes cmem[2]

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi1EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi1EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)1>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    192 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 106 registers, 768 bytes cmem[0], 24 bytes cmem[2]

shape 3 runtime (LWFA inputs_3d default, very small):

WarpXParticleContainer::DepositCurrent::CurrentDeposition     293    0.01309    0.01309    0.01309   0.40%

Perlmutter register usage as of the "Implement Upshift" commit

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi3EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi3EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)3>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    192 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 204 registers, 768 bytes cmem[0], 32 bytes cmem[2]

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi2EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi2EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)2>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    160 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 162 registers, 768 bytes cmem[0], 24 bytes cmem[2]

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi1EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi1EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)1>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 98 registers, 768 bytes cmem[0], 24 bytes cmem[2]

shape 3 runtime (LWFA inputs_3d default, very small):

WarpXParticleContainer::DepositCurrent::CurrentDeposition     293    0.01288    0.01288    0.01288   0.02%

@ax3l ax3l requested a review from hklion September 15, 2022 00:00
So far for x only. y and z are to do.
@ax3l ax3l force-pushed the optimize_esirkepov_indices branch from 93ebb05 to 2125ede Compare September 15, 2022 01:46
@psychocoderHPC:
@ax3l I checked the latest changes and did not instantly see where the high stack-frame usage and register footprint come from. I suggest inspecting the C++-annotated PTX code to find the places where local memory is used.

Just for comparison, here is the data for the PIConGPU Esirkepov kernel (64-bit precision, 3rd-order assignment shape, compiled for sm_70):

ptxas info    : Function properties for kernelComputeCurrent
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 80 registers, 25280 bytes smem, 480 bytes cmem[0], 56 bytes cmem[2]

@@ -586,65 +597,86 @@ void doEsirkepovDepositionShapeN (const GetParticlePosition& GetPosition,

#if defined(WARPX_DIM_3D)

-for (int k=dkl; k<=depos_order+2-dku; k++) {
-for (int j=djl; j<=depos_order+2-dju; j++) {
+for (int k=0; k<=depos_order+2; k++) {
@ax3l (Member, Author):
We should try adding #pragma unroll directives to these loops:
https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/
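
For example (a hedged sketch along the lines of the diff above, not the PR's final code; accumulate_shape is a made-up name):

// Sketch: with the static trip count from the diff above, #pragma unroll
// lets nvcc fully unroll the loop, so all indices into the private arrays
// are compile-time constants and the arrays can stay in registers.
template <int depos_order, typename T>
__device__ void accumulate_shape (T const (&sz)[depos_order+3], T (&acc)[depos_order+3])
{
    #pragma unroll
    for (int k = 0; k <= depos_order+2; ++k) {
        acc[k] += sz[k];
    }
}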

@@ -70,6 +70,45 @@ struct Compute_shape_factor
}
};

+template <int depos_order>
+struct Compute_shape_factor_uni
@ax3l (Member, Author):

uni as in unified
