[WIP] OpenMP + load balance debugging #1605


Open
wants to merge 1 commit into base: development

Conversation

@mrowan137 mrowan137 commented Jan 5, 2021

A bug with OpenMP multithreading and load balancing is revealed by this commit: d9d9721
This branch is for locating and fixing the bug, and providing a test to ensure the currently broken case is monitored in the future.

The crash does not appear consistently and may take multiple runs to appear (I've needed ~15 runs before, though it is often fewer), which may make it tricky to catch in a regression test. The attached input (similar to the regression test) can show the bug when running on Summit CPUs:

export OMP_NUM_THREADS=7
jsrun -r 2 -a 1 -c 7 -l CPU-CPU -d packed -b rs <executable> <inputs> > output.txt

inputs_bug
Backtrace.1.0

@mrowan137 mrowan137 requested a review from ax3l January 5, 2021 07:18
@mrowan137 mrowan137 changed the title from OpenMP + load balance debugging to [WIP] OpenMP + load balance debugging Jan 5, 2021
@ax3l ax3l requested review from atmyers and WeiqunZhang January 5, 2021 17:33
@ax3l ax3l added the backend: openmp, bug, bug: affects latest release, component: load balancing, and help wanted labels Jan 5, 2021
ax3l commented Jan 5, 2021

@mrowan137 thank you - can you please post the backtrace file in debug mode, too?

In CI, this test does not seem to crash - it just lacks the reduced_diags_loadbalancecosts_heuristic_omp.json file.

Comment on lines +1569 to +1571
numprocs = 2
useOMP = 1
numthreads = 4
@ax3l ax3l Jan 27, 2021

Careful in public CI services: We only have two vCores available - so using 8 will likely trigger a watchdog daemon.
Does this also crash locally? I just ran it locally and could not crash it (in Debug mode).

The crash does not appear consistently and may take multiple runs to appear
... maybe need to try more often ...

The backtrace in the PR description points to the destructor

38: ./main3d.gnu.DEBUG.TPROF.MTMPI.OMP.ex() [0x1002dd44]
    WarpX::~WarpX() at /ccs/home/mrowan/code/warpx_directory_mrowan137/WarpX/./Source/WarpX.cpp:318

which is problematic but also not 100% the issue we saw during load balancing itself on CPU, I think.

Side note: working with OLCF & NERSC to get some beefier CI machines for such tests integrated...

ax3l commented Jan 27, 2021

@mrowan137 (Member, Author) commented

@atmyers @ax3l [just to chime in on @ax3l's suggestion: I can confirm from my own testing that these are indeed the lines where, uncommenting step by step, we first transition from no hang to hang]

atmyers commented Jan 27, 2021

It looks like TmpParticles always has 3 components for x, y, z, even in 2D / RZ. So I don't think there is an out-of-bounds memory access. However, looking at that code, all the map entries need to already be created before we do xpold = tmp_particle_data[lev][index][TmpIdx::xold ].dataPtr() + a_offset; in a threaded region. If they aren't, and those lines are modifying the structure of the map inside an omp parallel region, then that is definitely a race condition.

If you change it to xpold = tmp_particle_data[lev].at(index)[TmpIdx::xold ].dataPtr() + a_offset; do you get an out-of-range error instead of a hang?
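
A minimal, stand-alone sketch of the hazard described above (not WarpX code; the map name and loop bounds are invented for illustration): std::map::operator[] inserts a default-constructed element when the key is missing, so calling it concurrently inside an omp parallel region changes the map's structure and is a data race, whereas creating all entries beforehand and using .at() (which never inserts and throws std::out_of_range instead) turns a missing entry into a visible error rather than a hang.

// Sketch only: operator[] vs. at() under OpenMP; compile with -fopenmp.
#include <cstdio>
#include <map>
#include <stdexcept>

int main ()
{
    std::map<int, double> cost_like_map;    // hypothetical stand-in for the per-level map

    // Safe: create every entry on a single thread, before the parallel region ...
    for (int box = 0; box < 8; ++box) { cost_like_map[box] = 0.0; }

    // ... then only look up existing entries inside the parallel region.
#pragma omp parallel for
    for (int box = 0; box < 8; ++box) {
        double& v = cost_like_map.at(box);  // at() never inserts; map structure stays fixed
        v += 1.0;                           // each iteration writes a distinct entry
    }

    // Unsafe (do NOT do this): operator[] on a missing key inside the parallel
    // region would insert nodes concurrently, i.e. a race on the map itself.
    // #pragma omp parallel for
    // for (int box = 8; box < 16; ++box) { cost_like_map[box] += 1.0; }

    try {
        cost_like_map.at(99);               // missing key: loud out_of_range, not a silent race
    } catch (const std::out_of_range&) {
        std::printf("at() reported the missing entry\n");
    }
    return 0;
}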

atmyers commented Jan 27, 2021

I think this should fix the issue: #1658

@mrowan137 (Member, Author) commented

I think this should fix the issue: #1658

Awesome, thanks @atmyers! This fixes the hang in my testing.

As for a test to monitor this in the future, @ax3l, how might we do this with the constraint of only two vCores? I think we need more than 1 MPI rank (for load balancing) and at least 2 threads per rank (for OpenMP) to catch this. Furthermore, since the test does not hang consistently, wouldn't we need to rerun it several times?

ax3l commented Feb 1, 2021

Discussed today: hm, this might be a good candidate for Cori tests...

We could write a new ini file that runs tests on a Cori node, but we cannot run them yet on a per-PR basis...

@RemiLehe (Member) commented

@ax3l Should this be merged or closed?
