[WIP] OpenMP + load balance debugging #1605
base: development
87818a4 to e9c7190
@mrowan137 thank you - can you please post the backtrace file in debug mode, too? In CI, this test does not seem to crash - it just lacks the |
numprocs = 2
useOMP = 1
numthreads = 4
Careful in public CI services: we only have two vCores available, so using 8 threads (2 ranks × 4 threads) will likely trigger a watchdog daemon.
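The core budget mentioned here can be checked mechanically before a test entry is sent to CI. A minimal sketch, assuming the `numprocs` and `numthreads` values come from the test entry; the helper names `cores_requested` and `is_oversubscribed` are hypothetical and not part of WarpX's test tooling:

```python
import os


def cores_requested(numprocs: int, numthreads: int) -> int:
    """Hardware threads a hybrid MPI + OpenMP run wants at the same time."""
    return numprocs * numthreads


def is_oversubscribed(numprocs: int, numthreads: int, available=None) -> bool:
    """True if the run asks for more threads than the machine offers.

    `available` defaults to what the OS reports; pass it explicitly to
    model a CI runner (for example, 2 vCores).
    """
    if available is None:
        available = os.cpu_count() or 1
    return cores_requested(numprocs, numthreads) > available


# The configuration discussed above: 2 ranks x 4 threads on a 2-vCore runner.
print(cores_requested(2, 4), is_oversubscribed(2, 4, available=2))
```

With two vCores, the smallest hybrid configuration (2 ranks with 2 threads each) already requests four threads, so any MPI + OpenMP test is oversubscribed on such a runner.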
Does this also crash locally? I just ran it locally and could not crash it (in Debug mode).
The crash does not appear consistently and may take multiple runs to appear
... maybe need to try more often ...
The backtrace in the PR description points to the destructor:
38: ./main3d.gnu.DEBUG.TPROF.MTMPI.OMP.ex() [0x1002dd44]
WarpX::~WarpX() at /ccs/home/mrowan/code/warpx_directory_mrowan137/WarpX/./Source/WarpX.cpp:318
which is problematic but also not 100% the issue we saw during load balancing itself on CPU, I think.
Side note: working with OLCF & NERSC to get some beefier CI machines for such tests integrated...
@atmyers With the commit found by @mrowan137 above pointing to d9d9721 (#1036), is it possible that the issue is rooted in https://github.com/ECP-WarpX/WarpX/blob/75931e8a5527f77515657260883f1ad9767210fa/Source/Particles/Pusher/CopyParticleAttribs.H#L60-L62 ?
It looks like If you change it to |
I think this should fix the issue: #1658
Awesome, thanks @atmyers! This fixes the hang in my testing. As for a test to monitor this in the future, @ax3l how might we do this with the constraint of only two vCores? I think we need more than 1 MPI rank (for load balancing) and at least 2 threads per rank (for OpenMP) to catch this. Furthermore, the test does not hang consistently; wouldn't we need to rerun it several times?
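One way to monitor an intermittent hang is to rerun the same command several times under a timeout and stop at the first failure. A minimal sketch, not WarpX's regression tooling: the function name `run_repeatedly`, the run count, and the timeout are assumptions, and the demo command is a stand-in for the real MPI launch line.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs, timeout_s, env=None):
    """Run `cmd` up to `runs` times and report the first failure.

    Returns ("hang", n) if attempt n exceeded `timeout_s`, ("crash", n) if
    it exited with a non-zero code, or ("ok", runs) if every attempt passed.
    """
    for attempt in range(1, runs + 1):
        try:
            # subprocess.run kills the direct child when the timeout expires;
            # an MPI launcher is assumed to take its ranks down with it.
            result = subprocess.run(cmd, timeout=timeout_s, env=env)
        except subprocess.TimeoutExpired:
            return ("hang", attempt)
        if result.returncode != 0:
            return ("crash", attempt)
    return ("ok", runs)


# Stand-in command so the sketch runs anywhere; a real check would pass the
# MPI launch line and set OMP_NUM_THREADS through `env`.
demo_cmd = [sys.executable, "-c", "pass"]
print(run_repeatedly(demo_cmd, runs=3, timeout_s=30))
```

The timeout has to be well above the normal runtime of the test, otherwise a slow but healthy run on a shared machine is reported as a hang.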
03b66cb to 4408639
Discussed today: hm, this might be a good candidate for Cori tests... We could write a new ini file that runs tests on a Cori node, but we cannot run them yet on a per-PR basis...
@ax3l Should this be merged or closed?
A bug is revealed with OpenMP multithreading and load balancing with this commit: d9d9721
This branch is for locating and fixing the bug, and providing a test to ensure the currently broken case is monitored in the future.
The crash does not appear consistently and may take multiple runs to appear (I've needed ~15 before, but it is often fewer), which may make it tricky to catch in a regression test. The attached input (similar to the regression test) can show the bug when running on Summit CPUs:
inputs_bug
Backtrace.1.0