Fix RAW (read after write) race hazard in theta-l_kokkos limiter #7858
base: master
Conversation
Also remove (slow) profiling calls
Testing on Aurora with
amametjanov left a comment:
It'd be good to remove the two printf's from the GPU kernel.
```cpp
Kokkos::printf("WARNING:CAAR: dp3d too small. k=%d, dp3d(k)=%f, dp0=%f \n",
               k+1,dp_as_real(k),dp0_as_real(k));
}
#endif
```
Also, I'd like to disable these prints inside GPU kernels and print them only in debug mode, i.e.:
```diff
+#ifndef NDEBUG
 #ifndef HOMMEXX_BFB_TESTING
 if(diff_as_real(k) < 0){
   Kokkos::printf("WARNING:CAAR: dp3d too small. k=%d, dp3d(k)=%f, dp0=%f \n",
                  k+1,dp_as_real(k),dp0_as_real(k));
 }
+#endif
 #endif
```
Is that okay?
I think removing them from kernels is a good idea. But we do want to give the user some warning about this: these warnings are very helpful when debugging instability issues (still a relatively common problem in F2010-SCREAMv1-variety setups). I think @mt5555 should also be consulted before completely removing these warnings.
One idea might be to capture/set a flag inside the kernel; then outside the kernel, if it's true (i.e., a condition was triggered), we print a statement. The specifics (levels and values) are helpful, but we can live with a simple warning (just saying the threshold was triggered) without the specifics if we must.
A || reduce is definitely a reasonable strategy. It is already implemented in the CaarFunctor, where we want to print an extensive picture of the state before crashing, if something is amiss in the functor execution.
Edit: I would also be against making the code print in debug mode only. For relatively large cases, people are not likely to run the code in debug mode, as it would be remarkably slower (Kokkos is MUCH slower in debug mode), making debugging painful.
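For concreteness, here is a minimal sketch of the flag-then-print pattern discussed in the two comments above, using a Kokkos logical-or (||) reduction. All names (dp3d, n_levels, check_dp3d) and the threshold are illustrative stand-ins, not the actual HOMMEXX fields, and this uses a dispatch-level reduction rather than the team-level one the CaarFunctor would use:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int n_levels = 72;                       // illustrative extent
    Kokkos::View<double*> dp3d("dp3d", n_levels);  // stand-in for the real field
    Kokkos::deep_copy(dp3d, 1.0);                  // pretend data

    // || reduce: each iterate contributes "did I trip the threshold?",
    // combined with logical-or. No printf inside the device kernel.
    int any_bad = 0;
    Kokkos::parallel_reduce("check_dp3d", n_levels,
      KOKKOS_LAMBDA(const int k, int& bad) {
        if (dp3d(k) < 0.0) bad = 1;
      }, Kokkos::LOr<int>(any_bad));

    // The warning is printed on the host, after the kernel completes
    // (parallel_reduce into a host scalar is blocking).
    if (any_bad)
      std::printf("WARNING:CAAR: dp3d threshold triggered in at least one level\n");
  }
  Kokkos::finalize();
  return 0;
}
```

This keeps the warning in release builds (addressing the objection to debug-only prints), at the cost of losing the per-level specifics unless extra state is carried through the reduction.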
```cpp
  vtheta_dp(ilev) *= dp(ilev);
});
} //end of min_diff < 0
kv.team_barrier();
```
Okay to disable this print too, i.e. print only in debug mode?
```diff
 for (int ivec=0; ivec<VECTOR_SIZE; ++ivec) {
   if ( (vtheta_dp(ilev)[ivec] - m_vtheta_thresh*dp(ilev)[ivec]) < 0) {
+#ifndef NDEBUG
 #ifndef HOMMEXX_BFB_TESTING
     Kokkos::printf("WARNING:CAAR: k=%d,theta(k)=%f<%f=th_thresh, applying limiter \n",
                    ilev*VECTOR_SIZE+ivec+1,vtheta_dp(ilev)[ivec]/dp(ilev)[ivec],m_vtheta_thresh);
+#endif
 #endif
     vtheta_dp(ilev)[ivec]=m_vtheta_thresh*dp(ilev)[ivec];
   }
```
```cpp
#endif
result = result<=diff_as_real(k) ? result : diff_as_real(k);
}, reducer);
kv.team_barrier();
```
This barrier is needed to finish updating min_diff, which is used below.
As I said in the previous comment, I think barriers inside a TeamThreadRange are conceptually wrong. They work in our case b/c our team sizes are <=16 and always a power of 2, and NP*NP=16. But if this were to change in the future, I think these barriers may cause deadlocks. Barriers should be outside TeamThreadRange loops.
We can help fix this if you want. We can see if Oksana has a commit buried somewhere that addresses these barriers.
I second @bartgol: these should not be here. I'm curious whether parallel_reduce(team) implies a sync over team threads, though, similar to how parallel_reduce(exec_space) implies a fence over the execution space.
Some background: we started adding these due to GPU race conditions on various target machines, knowing they weren't quite right for arbitrary threading configurations. The commit I linked does the right thing by keeping the syncs but rewriting loops so syncs occur at the top level.
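For illustration, a minimal runnable sketch of the "syncs at the top level" restructuring described above; this is generic Kokkos with invented names (a, b, n_pts, two_phase), not the HOMME code. The barrier sits between two TeamThreadRange loops at team scope, so every team thread reaches it exactly once regardless of team size:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    const int n_teams = 4, n_pts = 16;  // n_pts plays the role of NP*NP
    Kokkos::View<double**> a("a", n_teams, n_pts);
    Kokkos::View<double**> b("b", n_teams, n_pts);

    Kokkos::parallel_for("two_phase", policy_t(n_teams, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_t& team) {
        const int ie = team.league_rank();
        // Phase 1: write a.
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n_pts),
          [&](const int i) { a(ie, i) = i; });
        // Team-scope barrier: NOT inside the TeamThreadRange lambda, where a
        // thread owning fewer iterates than its neighbors could deadlock.
        team.team_barrier();
        // Phase 2: read a, possibly written by a different thread.
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n_pts),
          [&](const int i) { b(ie, i) = a(ie, (i + 1) % n_pts); });
      });
  }
  Kokkos::finalize();
  return 0;
}
```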
```cpp
kv.team_barrier();
```
Moving this barrier out of the if (min_diff<0) branch to avoid possible stalls waiting for threads in the else branch.
This is a tricky issue. It's definitely wrong to have a team_barrier in a conditional like this, since it can lead to deadlock. But if NUM_LEV != NUM_PHYSICAL_LEV, I think there can be a race condition in the parallel_for-parallel_reduce sequence that this team_barrier is separating. On the GPU, NUM_LEV == NUM_PHYSICAL_LEV, and on the CPU we don't actually have threading here, so in practice this point is moot.
Nonetheless, I had a branch going a long time ago that fixed this type of issue in a few spots in HOMME, but I ended up getting distracted by other machine issues and never got the branch in. It might be worth consulting this commit to see how to keep this team_barrier. (The ComposeTransport change is no longer needed, but the three others are still relevant, although again only in threading/vectorization configurations we don't run in practice.)
The changes in this PR are fine if keeping them is preferred.
Andrew, I think we should get your commit in (at least, the part that fixes the team barrier). While I agree that in practice it doesn't make a difference now, we don't know if our threading schemes will change 5 years from now, and catching this kind of bug would take time. Better to put in the correct code now and be safe later.
Btw, I thought @oksanaguba had fixed this as well in one of her branches a while ago, no?
> I think we should get your commit in
Ok. For clarity, I'm not on this project this year, so someone else would need to bring in the changes.
```cpp
  vtheta_dp(ilev) *= dp(ilev);
});
} //end of min_diff < 0
kv.team_barrier();
```
Adding kv.team_barrier() here to finish updating vtheta_dp, which is read below.
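Combining the two changes, the corrected limiter section has roughly the following shape (a sketch with invented names and extents, not the actual theta-l_kokkos code): the conditional rescale stays inside the branch, while the barrier ordering the write against the later read is unconditional, so teams whose min_diff is non-negative still reach it.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    const int nelem = 4, nlev = 8;  // invented extents
    Kokkos::View<double**> vtheta("vtheta", nelem, nlev);
    Kokkos::View<double*>  min_diff("min_diff", nelem);
    Kokkos::deep_copy(vtheta, 1.0);
    Kokkos::deep_copy(min_diff, -1.0);  // pretend the limiter fires everywhere

    Kokkos::parallel_for("limiter_shape", policy_t(nelem, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_t& team) {
        const int ie = team.league_rank();
        if (min_diff(ie) < 0) {  // limiter branch: may or may not execute
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev),
            [&](const int k) { vtheta(ie, k) *= 2.0; });
        }
        // Unconditional team barrier: every thread reaches it whether or not
        // the branch ran, and it orders the writes above before the reads
        // below (the RAW hazard this PR fixes).
        team.team_barrier();
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev),
          [&](const int k) { vtheta(ie, k) += 0.5 * vtheta(ie, (k + 1) % nlev); });
      });
  }
  Kokkos::finalize();
  return 0;
}
```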
ambrad left a comment:
This is an improvement and fixes an important problem, so I see no reason to make further changes in this PR despite the good ideas for more changes. Those can go in a separate PR. Just my opinion: don't necessarily take this approval as definitive.
```cpp
diff(ilev) = (dp(ilev) - m_dp3d_thresh*dp0(ilev))*spheremp;
});

kv.team_barrier();
```
For the record, I think this team barrier is also wrong, as it sits inside a TeamThreadRange. While in practice it's ok, since our team sizes always divide NP*NP, if someone somehow launched with team_size, say, 10, I think this may cause a deadlock.
I do believe @oksanaguba had run into these barriers at some point, and fixed them. Perhaps the fixes never made it to master. Oksana, do you recall anything like that? Do you know which branch may have all fixes for this stuff? We can of course fix them here, but I want to make sure we don't leave any issue behind, while we're at it...
Luca, I agree about the barriers. The commit I linked handles that issue, as well.
@amametjanov if it's ok with you, I can extract the relevant part from Andrew's commit, and push to this branch.
yes, please feel free to update as needed!
Also remove (slow) profiling calls.
[BFB]
The hangs are not reproducible with plain-vanilla --compset F2010-SCREAMv1 --res ne256pg2_ne256pg2: need to trigger the caar loop dp3d limiter. If triggered, the hang can occur within ~1 model-month: need to run for at least 1 model-month.