[WIP] Esirkepov: Avoid Dynamic Loops & Indices #3210


Open: wants to merge 5 commits into development from optimize_esirkepov_indices

Conversation

@ax3l (Member) commented Jun 29, 2022

This is the complete version of #2796.

We want to:

  • avoid dynamic ranges in loops and
  • avoid any dynamic array access

to reduce the number of registers and operations needed (GPU) and to make it easier for compilers to match their vectorization patterns (later, CPU).
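
For illustration, here is a minimal sketch of the two access patterns (hypothetical function names, not WarpX code): a compile-time trip count lets the compiler fully unroll the loop and keep the array in registers, while a runtime range blocks unrolling and can push the array into local memory.

// Hypothetical sketch, not from this PR.
template <typename T, int N>
T sum_static (T const (&s)[N])
{
    T acc = T(0.0);
    // trip count known at compile time: fully unrollable, s[i] stays in registers
    for (int i = 0; i < N; ++i) { acc += s[i]; }
    return acc;
}

template <typename T>
T sum_dynamic (T const* s, int lo, int hi)
{
    T acc = T(0.0);
    // dynamic range and index: no unrolling, possible local-memory traffic
    for (int i = lo; i < hi; ++i) { acc += s[i]; }
    return acc;
}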

This PR updates Compute_shifted_shape_factor to reduce register usage dramatically. Since we only ever shift by at most one cell, we can do this with a nicely lined-up register move.

This update was developed in exchange with @psychocoderHPC from the PIConGPU team, who adopted part of our approach and shared the register-shifting trick in return.

As a follow-up to this PR, we will adopt part of #3168 to replace the cached array of the outermost loop with an on-the-fly computation.

To Do

  • Measure register & lmem (local memory) usage before/after this PR
  • Measure runtime before/after this PR

Refs

https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/

@ax3l ax3l requested review from RemiLehe and Thierry992 June 29, 2022 12:55
@ax3l ax3l changed the title to Esirkepov: Avoid Dynamic Loops & Indices Jun 29, 2022
@ax3l ax3l added the component: core (Core WarpX functionality) label Jun 29, 2022
@ax3l ax3l force-pushed the optimize_esirkepov_indices branch from 84740e1 to 2c93eb0 Compare June 29, 2022 17:03
sx[1] = T(1.0) - xint;
sx[2] = xint;

sx[0] = shift ? sx[1] : T(0.0);
Member:
Maybe I'm missing something, but this only seems correct if the old i_shift is -1 or 0. What happens if i_shift is +1?

@ax3l (Member, Author), Jun 29, 2022:

According to a quick test (and what I remember from what we implemented similarly in PIConGPU about a decade ago), it can only take the values 0 and -1 in these locations.
CI will tell.

Member:

Ok. If that is true, then the calling routine doEsirkepovDepositionShapeN can be simplified, since it contains statements like if (i_old > i_new) diu = 0; whose condition would then always be false.

@ax3l (Member, Author):

Interesting, it seems there is more to it here. We should really change those ranges to simplify this; I think there is a lot of zero-padding going on here.

Comment:

That you are not moving in the negative direction is only true if you compare the positions at the start and the end of the particle trajectory. If you use the lower of the two values as the base of your coordinate system (to calculate the shift), the particle trajectory always crosses into the positive side of a cell.
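
In code, the resulting pattern looks like the snippet above (a minimal sketch; fill_shifted_order1 is an illustrative name, not the PR's actual routine):

// Sketch only: with the lower of the old/new cell indices as the base,
// the shift is always 0 or +1, so the shifted element can be filled by a
// predicated select (a conditional register move) instead of a store
// through a runtime-computed index.
template <typename T>
void fill_shifted_order1 (T (&sx)[3], T xint, bool shift)
{
    sx[1] = T(1.0) - xint;
    sx[2] = xint;
    sx[0] = shift ? sx[1] : T(0.0); // every index is a compile-time constant
}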

@Thierry992 (Contributor):

Here is the performance of this new code.

With these modifications, we went from 180 registers/thread (original code) to 200 registers/thread. As the graph below shows, warp occupancy decreases (the green point shows the performance of the old code; the blue point refers to the new one):

[Figure: registers_thread]

These new operations seem to have a significant impact on the performance of the new shape-factor function (the picture below shows the live registers used at each line of the code):

[Figure: source_codes]

@psychocoderHPC commented Jul 6, 2022

@Thierry992 Your plot is not showing a decrease in occupancy. You should also post a screenshot of the plain values from the profiler above the plot you attached. The plot shows that the occupancy is already so low that it does not matter whether the kernel uses 129 or 255 registers; the occupancy will be the same. You should dump the register footprint at compile time with the nvcc option -Xcuda-ptxas=-v.

ptxas info    : Function properties for LONGKERNELNAME
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 96 registers, 36880 bytes smem, 512 bytes cmem[0], 40 bytes cmem[2]

The interesting parts there are the stack frame, spill stores, spill loads, and registers. Take care: if not all functions are inlined, the register footprint reported at compile time and what you see in the profiler can differ. In that case, take the register footprint from the profiler. Nevertheless, have a look at spilled registers; those live in global memory and slow down your kernel.

The second image, with the heatmap mapped to source lines, does not show the register hot spot with the largest register footprint. It is far from 200 registers and is therefore not interesting.

@RemiLehe RemiLehe self-assigned this Jul 11, 2022
@RemiLehe RemiLehe changed the title from Esirkepov: Avoid Dynamic Loops & Indices to [WIP] Esirkepov: Avoid Dynamic Loops & Indices Aug 22, 2022
RemiLehe and others added 3 commits September 14, 2022 10:24
Avoid dynamic array access in `Compute_shifted_shape_factor`
to reduce registers dramatically. Since we only shift by one
cell at most, we can do this in a nicely lined up register
move.
Asserts in debug mode & clean up of comments.
@ax3l ax3l force-pushed the optimize_esirkepov_indices branch 2 times, most recently from 08534bb to 482a2de Compare September 14, 2022 19:57
and add missing outer element that is always zero
(to be cropped off later).
@ax3l ax3l force-pushed the optimize_esirkepov_indices branch from 482a2de to ce86b13 Compare September 14, 2022 20:58
@ax3l (Member, Author) commented Sep 14, 2022

cmake -S . -B build_pm -DWarpX_COMPUTE=CUDA -DAMReX_CUDA_PTX_VERBOSE=ON
cmake --build build_pm 2>&1 | tee compile.log
grep Esirkepov compile.log

(plus cu++filt to demangle the kernel names)

Below is the compiler output for double-precision builds (default), for shape orders 3 down to 1.

Perlmutter register usage in development as of 22.09-17-g04b6f67caab8

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi3EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi3EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)3>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 180 registers, 768 bytes cmem[0], 32 bytes cmem[2]

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi2EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi2EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<256, amrex::ParallelFor<long, doEsirkepovDepositionShapeN<2>(GetParticlePosition const&, double const*, double const*, double const*, double const*, int const*, amrex::Array4<double> const&, amrex::Array4<double> const&, amrex::Array4<double> const&, long, double, double, std::array<double, 3ul> const&, std::array<double, 3ul>, amrex::Dim3, double, int, double*, long)::{lambda(long)#1}, void>(amrex::Gpu::KernelInfo const&, long, doEsirkepovDepositionShapeN<2>(GetParticlePosition const&, double const*, double const*, double const*, double const*, int const*, amrex::Array4<double> const&, amrex::Array4<double> const&, amrex::Array4<double> const&, long, double, double, std::array<double, 3ul> const&, std::array<double, 3ul>, amrex::Dim3, double, int, double*, long)::{lambda(long)#1}&&)::{lambda()#1}>(amrex::ParallelFor<long, doEsirkepovDepositionShapeN<2>(GetParticlePosition const&, double const*, double const*, double const*, double const*, int const*, amrex::Array4<double> const&, amrex::Array4<double> const&, amrex::Array4<double> const&, long, double, double, std::array<double, 3ul> const&, std::array<double, 3ul>, amrex::Dim3, double, int, double*, long)::{lambda(long)#1}, void>(amrex::Gpu::KernelInfo const&, long, doEsirkepovDepositionShapeN<2>(GetParticlePosition const&, double const*, double const*, double const*, double const*, int const*, amrex::Array4<double> const&, amrex::Array4<double> const&, amrex::Array4<double> const&, long, double, double, std::array<double, 3ul> const&, std::array<double, 3ul>, amrex::Dim3, double, int, double*, long)::{lambda(long)#1}&&)::{lambda()#1})
    240 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 126 registers, 768 bytes cmem[0], 40 bytes cmem[2]

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi1EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi1EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)1>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    192 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 106 registers, 768 bytes cmem[0], 24 bytes cmem[2]

shape 3 runtime (LWFA inputs_3d default, very small):

WarpXParticleContainer::DepositCurrent::CurrentDeposition     293    0.01309    0.01309    0.01309   0.40%

Perlmutter register usage as of the "Implement Upshift" commit

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi3EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi3EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)3>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    192 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 204 registers, 768 bytes cmem[0], 32 bytes cmem[2]

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi2EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi2EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)2>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    160 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 162 registers, 768 bytes cmem[0], 24 bytes cmem[2]

ptxas info    : Compiling entry function '_ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi1EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_' for 'sm_80'
ptxas info    : Function properties for _ZN5amrex13launch_globalILi256EZNS_11ParallelForIlZ27doEsirkepovDepositionShapeNILi1EEvRK19GetParticlePositionPKdS7_S7_S7_PKiRKNS_6Array4IdEESD_SD_lddRKSt5arrayIdLm3EESF_NS_4Dim3EdiPdlEUllE_vEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSM_EUlvE_EEvSM_
aka void amrex::launch_global<(int)256, std::enable_if<amrex::MaybeDeviceRunnable<T2, void>::value, void>::type amrex::ParallelFor<long, void doEsirkepovDepositionShapeN<(int)1>(const GetParticlePosition &, const double *, const double *, const double *, const double *, const int *, const amrex::Array4<double> &, const amrex::Array4<double> &, const amrex::Array4<double> &, long, double, double, const std::array<double, (unsigned long)3> &, std::array<double, (unsigned long)3>, amrex::Dim3, double, int, double *, long)::[lambda(long) (instance 1)], void>(const amrex::Gpu::KernelInfo &, T1, T2 &&)::[lambda() (instance 1)]>(T2)
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 98 registers, 768 bytes cmem[0], 24 bytes cmem[2]

shape 3 runtime (LWFA inputs_3d default, very small):

WarpXParticleContainer::DepositCurrent::CurrentDeposition     293    0.01288    0.01288    0.01288   0.02%

@ax3l ax3l requested a review from hklion September 15, 2022 00:00
So far for x only. y and z are to do.
@ax3l ax3l force-pushed the optimize_esirkepov_indices branch from 93ebb05 to 2125ede Compare September 15, 2022 01:46
@psychocoderHPC:
@ax3l I checked the latest changes and did not instantly see where the high stack-frame usage and register footprint come from. I suggest inspecting the C++-annotated PTX code to find the places where local memory is used.

Just for comparison, here is the data for the PIConGPU Esirkepov kernel (64-bit precision, 3rd-order assignment shape, compiled for sm_70):

ptxas info    : Function properties for kernelComputeCurrent
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 80 registers, 25280 bytes smem, 480 bytes cmem[0], 56 bytes cmem[2]

@@ -586,65 +597,86 @@ void doEsirkepovDepositionShapeN (const GetParticlePosition& GetPosition,

#if defined(WARPX_DIM_3D)

-for (int k=dkl; k<=depos_order+2-dku; k++) {
-for (int j=djl; j<=depos_order+2-dju; j++) {
+for (int k=0; k<=depos_order+2; k++) {
@ax3l (Member, Author):
We should try adding #pragma unroll directives to these loops:
https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/
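
For example (a hedged sketch along the lines of the diff above, not the PR's final code; accumulate_shape is a made-up name):

// Sketch: with the static trip count from the diff above, #pragma unroll
// lets nvcc fully unroll the loop, so all indices into the private arrays
// are compile-time constants and the arrays can stay in registers.
template <int depos_order, typename T>
__device__ void accumulate_shape (T const (&sz)[depos_order+3], T (&acc)[depos_order+3])
{
    #pragma unroll
    for (int k = 0; k <= depos_order+2; ++k) {
        acc[k] += sz[k];
    }
}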

@@ -70,6 +70,45 @@ struct Compute_shape_factor
}
};

+template <int depos_order>
+struct Compute_shape_factor_uni
@ax3l (Member, Author):

uni as in unified
