[WIP] Esirkepov: Avoid Dynamic Loops & Indices #3210
base: development
Conversation
Force-pushed from 84740e1 to 2c93eb0.
Source/Particles/ShapeFactors.H (outdated diff):
```cpp
sx[1] = T(1.0) - xint;
sx[2] = xint;

sx[0] = shift ? sx[1] : T(0.0);
```
Maybe I'm missing something, but this only seems correct if the old `i_shift` is -1 or 0. What happens if `i_shift` is +1?
According to a quick test (and to what I remember of what we implemented similarly about a decade ago in PIConGPU), it can only take the values 0 and -1 in these locations. CI will tell.
Ok. If that is true, then the calling routine `doEsirkepovDepositionShapeN` can be simplified, since it has statements like `if (i_old > i_new) diu = 0;` whose condition would then always be false.
Interesting, it seems there is more to it here. We should really change those ranges to simplify this; I think there is a lot of zero-padding going on.
That you are not moving in the negative direction is only true if you check the positions of the start and the end of the particle trajectory. You must use the lower value as the base of your coordinate system (to calculate the shift); then the particle trajectory always crosses on the positive side of a cell.
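A minimal sketch of that convention (hypothetical names, not the actual WarpX code): anchoring the local coordinate system at the lower of the two cell indices makes the shift non-negative by construction.

```cpp
#include <algorithm>

// Sketch: pick the lower cell index as the base of the local coordinate
// system, so the relative shift of either trajectory endpoint is 0 or +1,
// never negative.
inline void relative_shifts (int i_old, int i_new,
                             int& shift_old, int& shift_new)
{
    const int i_base = std::min(i_old, i_new);
    shift_old = i_old - i_base; // 0 or 1: the particle crosses at most one cell
    shift_new = i_new - i_base; // 0 or 1, non-negative by construction
}
```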
Here are the performance numbers for this new code. With these modifications we went from 180 registers/thread (original code) to 200. As this graph shows, we get a decrease in performance in terms of warp occupancy (the green point shows the performance of the old code, the blue point refers to the new one): [plot: warp occupancy vs. registers/thread] It seems that these new operations have a significant impact on the performance of the new shape factor function (this picture shows the live registers used on each line of the code): [screenshot: live registers per source line]
@Thierry992 Your plot is not showing a decrease in occupancy; you should also post a screenshot of the plain value from the profiler above the plot you attached. The plot shows that the occupancy is already so low that it does not matter whether the kernel uses 129 or 255 registers: the occupancy will be the same. You should dump the register footprint during compile with nvcc's verbose ptxas output (in WarpX this is enabled via the `-DAMReX_CUDA_PTX_VERBOSE=ON` CMake option used below).

The interesting parts there are stack frames, spill stores, spill loads, and registers. Take care: if not all functions are inlined, the register footprint reported during compile and what you see in the profiler can differ; in that case, take the register footprint from the profiler. Nevertheless, have a look at the spill registers: those registers are located in global memory and slow your kernel down. The second image, with the heatmap mapped to source lines, does not show the register hot spot with the largest register footprint. It is far from 200 registers and is therefore not interesting.
Avoid dynamic array access in `Compute_shifted_shape_factor` to reduce registers dramatically. Since we only shift by one cell at most, we can do this in a nicely lined-up register move.
Asserts in debug mode & cleanup of comments.
Force-pushed from 08534bb to 482a2de.
Add missing outer element that is always zero (to be cropped off later).
Force-pushed from 482a2de to ce86b13.
```
cmake -S . -B build_pm -DWarpX_COMPUTE=CUDA -DAMReX_CUDA_PTX_VERBOSE=ON
cmake --build build_pm 2>&1 | tee compile.log
grep Esirkepov compile.log
```

Below are the compilations for double precision builds (default), shape orders 3 to 1. [table: Perlmutter register usage in the Esirkepov deposition kernels]
So far this covers x only; y and z are still to do.
Force-pushed from 93ebb05 to 2125ede.
@ax3l I checked the latest changes and did not see instantly where the high stack frame usage and register footprint are coming from. I suggest inspecting the C++-annotated assembly.

Only for comparison, here is the data for the PIConGPU Esirkepov implementation, 64-bit precision, 3rd-order assignment shape. [table: PIConGPU register and stack-frame footprint]
```diff
@@ -586,65 +597,86 @@ void doEsirkepovDepositionShapeN (const GetParticlePosition& GetPosition,

 #if defined(WARPX_DIM_3D)

-                for (int k=dkl; k<=depos_order+2-dku; k++) {
-                    for (int j=djl; j<=depos_order+2-dju; j++) {
+                for (int k=0; k<=depos_order+2; k++) {
```
We should try to throw in `#pragma unroll` directives for the loops: https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/
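A minimal sketch of that suggestion (hypothetical device function, not the WarpX kernel): with compile-time trip counts, `#pragma unroll` lets nvcc fully unroll the loops, so the small per-thread arrays indexed inside them can stay in registers instead of spilling to local memory, the effect described in the linked blog post.

```cuda
// Sketch only: the trip counts depend solely on the template parameter
// depos_order, so the compiler can fully unroll both loops; after unrolling,
// sx[j] and sy[k] are accessed with constant indices and stay in registers.
template <int depos_order, typename T>
__device__ void deposit_sketch (T const (&sx)[depos_order + 3],
                                T const (&sy)[depos_order + 3],
                                T wq, T* j_buff, int stride)
{
    #pragma unroll
    for (int k = 0; k <= depos_order + 2; ++k) {
        #pragma unroll
        for (int j = 0; j <= depos_order + 2; ++j) {
            // constant indices after unrolling -> no local-memory traffic
            j_buff[k*stride + j] += wq * sy[k] * sx[j];
        }
    }
}
```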
```diff
@@ -70,6 +70,45 @@ struct Compute_shape_factor
     }
 };

+template <int depos_order>
+struct Compute_shape_factor_uni
```
`uni` as in unified.
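For context, a speculative skeleton of what such a unified, fully compile-time functor could look like; the names and signature are inferred from the hunk above and are not guaranteed to match the actual ShapeFactors.H.

```cpp
#include <AMReX_GpuQualifiers.H>

// Speculative skeleton: one templated functor per deposition order keeps
// every array size and loop bound a compile-time constant.
template <int depos_order>
struct Compute_shape_factor_uni;

// Hypothetical order-1 body: fills sx[] with fixed indices only, including
// the always-zero outer element mentioned in the commit message above
// ("to be cropped off later").
template <>
struct Compute_shape_factor_uni<1>
{
    template <typename T>
    AMREX_GPU_HOST_DEVICE
    int operator() (T* const sx, T xmid) const
    {
        const auto j = static_cast<int>(xmid); // assumes xmid >= 0
        const T xint = xmid - T(j);
        sx[0] = T(0.0);          // always-zero outer element
        sx[1] = T(1.0) - xint;
        sx[2] = xint;
        sx[3] = T(0.0);          // depos_order + 3 slots in this scheme
        return j;
    }
};
```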
This is the complete version of #2796.

We want to avoid dynamic loops and dynamic array indices in the Esirkepov current deposition, to reduce the needed registers & operations (GPU) and to match more easily the vectorization patterns of compilers (later, CPU).
This PR updates `Compute_shifted_shape_factor` to reduce registers dramatically. Since we only shift by one cell at most, we can do this in a nicely lined-up register move. This update was developed in exchange with @psychocoderHPC from the PIConGPU team, who took part of our approach and shared the register-shifting trick in return.
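To illustrate the trick, here is a simplified order-1 sketch with assumed names (the real `Compute_shifted_shape_factor` also covers higher orders): every slot gets an unconditional assignment with a ternary select, so the one-cell shift compiles to plain register moves instead of a store through a runtime index like `sx[i + i_shift]`.

```cpp
#include <AMReX_GpuQualifiers.H>

// Sketch: shift the linear (order-1) shape factors down by one slot when
// `shift` is set, zero-padding the vacated element. No dynamic indexing,
// so sx can live entirely in registers.
template <typename T>
AMREX_GPU_HOST_DEVICE
void shifted_shape_factor_sketch (T* const sx, T xint, bool shift)
{
    const T s_lo = T(1.0) - xint;
    const T s_hi = xint;

    sx[0] = shift ? s_lo   : T(0.0); // matches the hunk discussed above
    sx[1] = shift ? s_hi   : s_lo;
    sx[2] = shift ? T(0.0) : s_hi;
}
```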
As a follow-up to this PR, we will take a part of #3168 to replace the cached array of the outermost loop with an on-the-fly computation.
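A generic sketch of that follow-up idea (illustrative names, not code from #3168): recomputing the outer-loop weight in the iteration that uses it removes one per-thread array that might otherwise spill to local memory.

```cpp
// Sketch: instead of filling a cached array of outer-loop weights up front,
// recompute each weight on the fly; it then lives in a single register.
template <int depos_order, typename T, typename WeightFn, typename UseFn>
void outer_loop_on_the_fly (WeightFn compute_weight, UseFn use)
{
    for (int k = 0; k <= depos_order + 2; ++k) {
        const T w_k = compute_weight(k); // recomputed per iteration, no wk[] array
        use(k, w_k);                     // inner deposition work goes here
    }
}
```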
To Do
Refs
https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/