[WIP] OpenMP + load balance debugging #1605


Open
wants to merge 1 commit into base: development

Conversation

@mrowan137 mrowan137 commented Jan 5, 2021

A bug with OpenMP multithreading and load balancing is revealed by this commit: d9d9721
This branch is for locating and fixing the bug, and providing a test to ensure the currently broken case is monitored in the future.

The crash does not appear consistently and may take multiple runs to appear (I've needed ~15 runs before, though it is often fewer), which may make it tricky to catch in a regression test. The attached input (similar to the regression test) can show the bug when running on Summit CPUs:

export OMP_NUM_THREADS=7
jsrun -r 2 -a 1 -c 7 -l CPU-CPU -d packed -b rs <executable> <inputs> > output.txt

inputs_bug
Backtrace.1.0

@mrowan137 mrowan137 requested a review from ax3l January 5, 2021 07:18
@mrowan137 mrowan137 changed the title from OpenMP + load balance debugging to [WIP] OpenMP + load balance debugging Jan 5, 2021
@ax3l ax3l requested review from atmyers and WeiqunZhang January 5, 2021 17:33
@ax3l ax3l added the backend: openmp, bug, bug: affects latest release, component: load balancing, and help wanted labels Jan 5, 2021
ax3l commented Jan 5, 2021

@mrowan137 thank you - can you please post the backtrace file in debug mode, too?

In CI, this test does not seem to crash - it just lacks the reduced_diags_loadbalancecosts_heuristic_omp.json file.

Comment on lines +1569 to +1571
numprocs = 2
useOMP = 1
numthreads = 4
@ax3l ax3l Jan 27, 2021

Careful in public CI services: We only have two vCores available - so using 8 will likely trigger a watchdog daemon.
Does this also crash locally? I just ran it locally and could not crash it (in Debug mode).

The crash does not appear consistently and may take multiple runs to appear
... maybe need to try more often ...

The backtrace in the PR description points to the destructor

38: ./main3d.gnu.DEBUG.TPROF.MTMPI.OMP.ex() [0x1002dd44]
    WarpX::~WarpX() at /ccs/home/mrowan/code/warpx_directory_mrowan137/WarpX/./Source/WarpX.cpp:318

which is problematic but also not 100% the issue we saw during load balancing itself on CPU, I think.

Side note: working with OLCF & NERSC to get some beefier CI machines for such tests integrated...

ax3l commented Jan 27, 2021

@mrowan137 (Member, Author) commented

@atmyers @ax3l [just to chime in on @ax3l's suggestion: I can confirm from my own testing that these are indeed the lines where, uncommenting step by step, we first transition from no hang to hang]

atmyers commented Jan 27, 2021

It looks like TmpParticles always has 3 components for x, y, z, even in 2D / RZ. So I don't think there is an out-of-bounds memory access. However, looking at that code, all the map entries need to already be created before we do xpold = tmp_particle_data[lev][index][TmpIdx::xold ].dataPtr() + a_offset; in a threaded region. If they aren't, and those lines are modifying the structure of the map inside an omp parallel region, then that is definitely a race condition.

If you change it to xpold = tmp_particle_data[lev].at(index)[TmpIdx::xold ].dataPtr() + a_offset; do you get an out-of-range error instead of a hang?
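
A minimal, stand-alone sketch of the hazard described above (not WarpX code; the map name and loop bounds are invented for illustration): std::map::operator[] inserts a default-constructed element when the key is missing, so calling it concurrently inside an omp parallel region changes the map's structure and is a data race, whereas creating all entries beforehand and using .at() (which never inserts and throws std::out_of_range instead) turns a missing entry into a visible error rather than a hang.

// Sketch only: operator[] vs. at() under OpenMP; compile with -fopenmp.
#include <cstdio>
#include <map>
#include <stdexcept>

int main ()
{
    std::map<int, double> cost_like_map;    // hypothetical stand-in for the per-level map

    // Safe: create every entry on a single thread, before the parallel region ...
    for (int box = 0; box < 8; ++box) { cost_like_map[box] = 0.0; }

    // ... then only look up existing entries inside the parallel region.
#pragma omp parallel for
    for (int box = 0; box < 8; ++box) {
        double& v = cost_like_map.at(box);  // at() never inserts; map structure stays fixed
        v += 1.0;                           // each iteration writes a distinct entry
    }

    // Unsafe (do NOT do this): operator[] on a missing key inside the parallel
    // region would insert nodes concurrently, i.e. a race on the map itself.
    // #pragma omp parallel for
    // for (int box = 8; box < 16; ++box) { cost_like_map[box] += 1.0; }

    try {
        cost_like_map.at(99);               // missing key: loud out_of_range, not a silent race
    } catch (const std::out_of_range&) {
        std::printf("at() reported the missing entry\n");
    }
    return 0;
}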

atmyers commented Jan 27, 2021

I think this should fix the issue: #1658

@mrowan137 (Member, Author) commented

I think this should fix the issue: #1658

Awesome, thanks @atmyers! This fixes the hang in my testing.

As for a test to monitor this in the future, @ax3l, how might we do this with the constraint of only two vCores? I think we need more than 1 MPI rank (for load balancing) and at least 2 threads per rank (for OpenMP) to catch this. Furthermore, since the test does not hang consistently, wouldn't we need to rerun it several times?

ax3l commented Feb 1, 2021

Discussed today: hm, this might be a good candidate for Cori tests...

We could write a new ini file that runs tests on a Cori node, but we cannot run them yet on a per-PR basis...

@RemiLehe (Member) commented

@ax3l Should this be merged or closed?
