Open
Description
RedistributeCPU is generally very slow, even apart from optimization bugs like #4892: it is usually one of the top three functions in CPU profiles of WarpX and ImpactX. It is a sorting function.
I think we should investigate the following optimizations:
Generally
- Can this function use better memory access patterns?
- Can this function benefit from vectorization?
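To make the memory-access question concrete, here is a minimal sketch of one possible direction, not a description of what AMReX does today: since the function is a sort, a stable two-pass counting sort by destination tile reads the keys linearly and yields contiguous output ranges per tile. The function name `counting_sort_permutation` and the `dest` and `offsets` arrays are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not AMReX API.
// dest[i] is the destination tile of particle i (0 <= dest[i] < ntiles).
// Returns a permutation perm such that perm[k] is the index of the particle
// that ends up in slot k when particles are grouped by tile (stable order).
// On return, offsets[t]..offsets[t+1] is the slot range owned by tile t.
std::vector<std::size_t> counting_sort_permutation(const std::vector<int>& dest,
                                                   int ntiles,
                                                   std::vector<std::size_t>& offsets)
{
    // Pass 1: histogram of destinations (one linear read of dest).
    offsets.assign(static_cast<std::size_t>(ntiles) + 1, 0);
    for (int d : dest) {
        assert(d >= 0 && d < ntiles);
        ++offsets[static_cast<std::size_t>(d) + 1];
    }
    // Exclusive prefix sum turns counts into start offsets.
    for (std::size_t t = 0; t < static_cast<std::size_t>(ntiles); ++t) {
        offsets[t + 1] += offsets[t];
    }
    // Pass 2: scatter particle indices into their tile's range.
    std::vector<std::size_t> cursor(offsets.begin(), offsets.end() - 1);
    std::vector<std::size_t> perm(dest.size());
    for (std::size_t i = 0; i < dest.size(); ++i) {
        perm[cursor[static_cast<std::size_t>(dest[i])]++] = i;
    }
    return perm;
}
```

With the permutation in hand, each particle component (in an SoA layout) could be gathered in its own loop, which is the kind of loop a compiler may be able to vectorize. Whether this beats the current implementation would need profiling.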
Single Node, Single Thread
- This should be a no-op; see "RedistributeCPU with 1 Core 1 Thread 1 Box" (#4892).
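As a rough illustration of what such a fast path could look like, the sketch below checks whether any work is needed before entering the general code path. It is a 1D toy under stated assumptions: `Box1D` and `needs_redistribute` are hypothetical names, not AMReX types or functions, and the sketch assumes that with one rank and one tile the only remaining reason to do work is a particle that has left the box.

```cpp
#include <vector>

// Hypothetical 1D box with half-open extent [lo, hi); not an AMReX type.
struct Box1D {
    double lo;
    double hi;
};

// Hypothetical check, not AMReX API.
// Returns false only when there is a single rank, a single tile, and every
// particle is still inside the box, so the sort can be skipped entirely.
bool needs_redistribute(const std::vector<double>& x, Box1D box,
                        int nranks, int ntiles)
{
    if (nranks > 1 || ntiles > 1) {
        return true;  // conservatively fall back to the general path
    }
    for (double xi : x) {
        if (xi < box.lo || xi >= box.hi) {
            return true;  // a particle left the box (boundary handling needed)
        }
    }
    return false;
}
```

The check itself is a single linear read over the positions, so it should be much cheaper than a full sort when nothing has moved out of the box.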
Single Node, Multiple threads
- Can this function use a special pass for a single MPI rank?
- Can this function benefit, for a single MPI rank, from redistributing only between OpenMP tiles on
Tiling in General
Is (spatially distributed) tiling on CPUs really the best way to use OpenMP threads in AMReX? Has spatially overlapping parallelization been tried? For example, all particles stay in the same spatial box and are tiled only by index; the deposition and gather buffers then have the same size as the box and are aggregated after deposition, etc.
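A minimal sketch of the index-tiled idea, to make the question concrete. This is a 1D nearest-grid-point toy, not AMReX or WarpX code: `deposit_index_tiled`, `cell`, and `weight` are hypothetical names. Particles are split across threads by index range rather than by position, each thread deposits into a private buffer the size of the whole box, and the buffers are summed afterwards. The OpenMP parts are guarded so the code also compiles and runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical sketch, not AMReX API.
// cell[i] is the grid cell of particle i (0 <= cell[i] < ncells),
// weight[i] is the amount it deposits. Returns the deposited field.
std::vector<double> deposit_index_tiled(const std::vector<int>& cell,
                                        const std::vector<double>& weight,
                                        int ncells)
{
    assert(cell.size() == weight.size());
    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif
    // One private, box-sized buffer per thread: no particle sorting needed,
    // at the cost of nthreads times the memory and a reduction at the end.
    std::vector<std::vector<double>> local(
        nthreads, std::vector<double>(ncells, 0.0));
    const long np = static_cast<long>(cell.size());

#ifdef _OPENMP
#pragma omp parallel
#endif
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        std::vector<double>& buf = local[static_cast<std::size_t>(tid)];
        // Static schedule gives each thread a contiguous index range.
#ifdef _OPENMP
#pragma omp for schedule(static)
#endif
        for (long i = 0; i < np; ++i) {
            const std::size_t k = static_cast<std::size_t>(i);
            buf[static_cast<std::size_t>(cell[k])] += weight[k];
        }
    }

    // Aggregate the per-thread buffers into the final field.
    std::vector<double> rho(static_cast<std::size_t>(ncells), 0.0);
    for (const std::vector<double>& buf : local) {
        for (std::size_t c = 0; c < rho.size(); ++c) {
            rho[c] += buf[c];
        }
    }
    return rho;
}
```

The trade-off this sketch exposes is the one the question is about: no redistribution between tiles is needed, but memory use and reduction cost grow with the thread count, and particles within an index range are not spatially local unless they are sorted some other way.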