Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 0 additions & 3 deletions components/homme/src/theta-l_kokkos/cxx/CaarFunctorImpl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -343,8 +343,6 @@ struct CaarFunctorImpl {

set_rk_stage_data(data);

profiling_resume();

GPTLstart("caar compute");
int nerr;
Kokkos::parallel_reduce("caar loop pre-boundary exchange", m_policy_pre, *this, nerr);
Expand All @@ -367,7 +365,6 @@ struct CaarFunctorImpl {

limiter.run(data.np1);

profiling_pause();
}

KOKKOS_INLINE_FUNCTION
Expand Down
8 changes: 2 additions & 6 deletions components/homme/src/theta-l_kokkos/cxx/LimiterFunctor.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -99,15 +99,11 @@ struct LimiterFunctor {

void run (const int& tl)
{
profiling_resume();

GPTLstart("caar limiter");
m_np1 = tl;
Kokkos::parallel_for("caar loop dp3d limiter", m_policy_dp3d_lim, *this);
Kokkos::fence();
GPTLstop("caar limiter");

profiling_pause();
}

KOKKOS_INLINE_FUNCTION
Expand Down Expand Up @@ -147,6 +143,7 @@ struct LimiterFunctor {
#endif
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, i'd like to disable these prints inside GPU kernels and print them only in debug-mode: i.e.

+#ifndef NDEBUG
 #ifndef HOMMEXX_BFB_TESTING
         if(diff_as_real(k) < 0){
           Kokkos::printf("WARNING:CAAR: dp3d too small. k=%d, dp3d(k)=%f, dp0=%f \n",
                          k+1,dp_as_real(k),dp0_as_real(k));
         }
+#endif
 #endif  

Is that okay?

Copy link
Contributor

@mahf708 mahf708 Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing from kernels is a good idea. But we do want to give user some warning about this --- these warnings are very helpful when debugging instability issues (which is still a relatively common problem in F2010-SCREAMv1 variety setups). I think @mt5555 should also be consulted before completely removing these warnings as well.

One idea might be to capture/set a flag inside the kernel; then outside the kernel, if it's true (i.e., a condition is triggered), we print a statement. The specifics (levels and values) are helpful, but we can live with a simple warning (just saying the threshold was triggered) without the specifics if we must

Copy link
Contributor

@bartgol bartgol Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A || reduce is definitely a reasonable strategy. It is already implemented in the CaarFunctor, where we want to print an extensive picture of the state before crashing, if something is amiss in the functor execution.

Edit: I would also be against making the code print in debug mode only. For relatively large cases, people are not so likely to run the code in debug mode, as it would be remarkably slower (Kokkos is MUCH slower in debug mode), making debuggin painful.

result = result<=diff_as_real(k) ? result : diff_as_real(k);
}, reducer);
kv.team_barrier();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This barrier is needed to finish updating min_diff, which is used below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said in the previous comment, I think barriers inside a TeamThreadRange are conceptually wrong. They work in our case b/c our team sizes are <=16 and always a power of 2, and NP*NP=16. But if this were to change in the future, I think these barriers may cause deadlocks. Barriers should be outside TeamThreadRange loops.

We can help fixing this if you want. We can see if Oksana has some commit buried somewhere that addresses these barriers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second @bartgol, these should not be here. I'm curious if parallel_reduce(team) implies a sync over team threads though, similar to how parallel_reduce(exec_space) implies a fence over execution space.

Copy link
Member

@ambrad ambrad Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some background: we started adding these due to GPU race conditions on various target machines, knowing they weren't quite right for arbitrary threading configurations. The commit I linked does the right thing by keeping the syncs but rewriting loops so syncs occur at the top level.


auto vtheta_dp = Homme::subview(m_state.m_vtheta_dp,kv.ie,m_np1,igp,jgp);

Expand All @@ -168,8 +165,6 @@ struct LimiterFunctor {
});
}

kv.team_barrier();

Comment on lines -171 to -172
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this barrier out of if (min_diff<0) { branch to avoid possible stalls waiting for threads in the else branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tricky issue. It's definitely wrong to have a team_barrier in a conditional like this, since it can lead to deadlock. But if NUM_LEV != NUM_PHYSICAL_LEV, I think there can be a race condition in the parallel_for-parallel_reduce sequence that this team_barrier is separating. On the GPU, NUM_LEV == NUM_PHYSICAL_LEV, and on the CPU, we don't actually have threading here in practice, so this point is not relevant in practice.

Nonetheless, I had a branch going a long time ago that fixed this type of issue in a few spots in HOMME, but I ended up getting distracted by other machine issues and never got this branch in. If might be worth consulting this commit to see how to keep this team_barrier. (The ComposeTransport change is no longer needed, but the three others are still relevant, although again only in threading/vectorization configurations we don't run in practice.)

The changes in this PR are fine if keeping them is preferred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Andrew, I think we should get your commit in (at least, the part that fixes the team barrier). While I agree that in practice it doesn't make a difference now, we don't know if our threading schemes will change 5 yy from now, and catching this kind of bugs would take time. Better put in the correct code now, and be safe later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I thought @oksanaguba had fixed this as well in one of her branches a while ago, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should get your commit in

Ok. For clarity, I'm not on this project this year, so someone else would need to bring in the changes.

// This loop must be done over physical levels, unless we implement
// masks, like it has been done in the E3SM/scream project
Real mass_new = 0.0;
Expand All @@ -194,6 +189,7 @@ struct LimiterFunctor {
vtheta_dp(ilev) *= dp(ilev);
});
} //end of min_diff < 0
kv.team_barrier();
Copy link
Member Author

@amametjanov amametjanov Nov 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay to disable this print too: i.e. print only in debug-mode ?

         for (int ivec=0; ivec<VECTOR_SIZE; ++ivec) {
           if ( (vtheta_dp(ilev)[ivec] - m_vtheta_thresh*dp(ilev)[ivec]) < 0) {
+#ifndef NDEBUG
 #ifndef HOMMEXX_BFB_TESTING
             Kokkos::printf("WARNING:CAAR: k=%d,theta(k)=%f<%f=th_thresh, applying limiter \n",
                            ilev*VECTOR_SIZE+ivec+1,vtheta_dp(ilev)[ivec]/dp(ilev)[ivec],m_vtheta_thresh);
+#endif
 #endif
              vtheta_dp(ilev)[ivec]=m_vtheta_thresh*dp(ilev)[ivec];
           }

Copy link
Member Author

@amametjanov amametjanov Nov 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding kv.team_barrier() here to finish updating vtheta_dp, which is read below.


Kokkos::parallel_for(Kokkos::ThreadVectorRange(kv.team,NUM_LEV),
[&](const int ilev) {
Expand Down
Loading