-
Notifications
You must be signed in to change notification settings - Fork 340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial OpenACC port of atm_bdy_adjust_dynamics_relaxzone_tend #1269
base: develop
Are you sure you want to change the base?
Initial OpenACC port of atm_bdy_adjust_dynamics_relaxzone_tend #1269
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides a more formal commit message, here's some things I've noticed so far.
call mpas_pool_get_config(config, 'config_relax_zone_divdamp_coef', divdamp_coef_ptr) | ||
|
||
divdamp_coef = divdamp_coef_ptr | ||
vertexDegree = vertexDegree_ptr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add a comment here, just a few words about avoiding non-array Fortran pointers for OpenACC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thread bump/reminder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Done.
!$acc parallel default(present) | ||
!$acc loop gang worker private(cell1, cell2, vertex1, vertex2, r_dc, r_dv, & | ||
!$acc iCell, iVertex, invArea, iEdge_div, iEdge_vort, edge_sign, & | ||
!$acc laplacian_filter_coef, divergence1, divergence2, vorticity1, vorticity2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I'd only expect the arrays (divergence1
, divergence2
, vorticity1
, and vorticity2
) to be declared private. The rest should automatically be private at the worker level and passed on to their (vector) threads. Are you being explicit here, or do you need to also add the scalar-values to the private clause for correctness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I briefly didn't remember that about scalars when I wasn't getting bit identical results. Will make these changes and try again.
do iEdge = edgeStart, edgeEnd | ||
|
||
if ( (bdyMaskEdge(iEdge) > 1) .and. (bdyMaskEdge(iEdge) <= nRelaxZone) ) then ! relaxation zone | ||
|
||
laplacian_filter_coef = dcEdge(iEdge)**2 * (real(bdyMaskEdge(iEdge)) - 1.)/ & | ||
laplacian_filter_coef = dcEdge(iEdge)*dcEdge(iEdge)* (real(bdyMaskEdge(iEdge)) - 1.)/ & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pure curiosity about this change: could you explain your motivation a little? Does this have any effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying this out because I though it may be one of the obstacles to bit identical results - but probably not. Will remove this change. Although, I would also add that we could just be using the multiplication operator here instead of exponent. But that's for another time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I think this may be better as a multiplication than exponent but that goes against keeping to a "light touch" in this phase.
!$acc end parallel | ||
|
||
!$acc parallel default(present) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can both of these loops be in the same parallel region and run simultaneously? They appear to be orthogonal to each other. That is, neither loop references the array(s) the other loop changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jim. Will try that out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fusing these two parallel regions does work, but I'm wondering if we should perhaps leave this for the optimization stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not knowledgable enough to weigh in on that. But if it succeeded and BFB tests passed I'd put it in now.
I'm fine with it either way, I'll leave it to you, Dylan, and Michael to decide.
do iCell = cellSolveStart, cellSolveEnd ! threaded over cells | ||
|
||
if ( (bdyMaskCell(iCell) > 1) .and. (bdyMaskCell(iCell) <= nRelaxZone) ) then ! relaxation zone | ||
|
||
laplacian_filter_coef = (real(bdyMaskCell(iCell)) - 1.)/real(nRelaxZone)/(10.*dt*meshScalingRegionalCell(iCell)) | ||
! | ||
!$acc loop seq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that local scalars are automatically privatized, it seems this loop doesn't have to execute sequentially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there's a summation going on in the innermost loop for tend_rt
and tend_rho
, and ignoring the order of summation, there would still be an incorrect summation if do i=1,nEdgesOnCell(iCell)
was parallelized.
Now we could re-write the loop to use the acc reduction(+: tend_rt, tend_rho)
clause, but we decided to keep the code changes minimal in this round of porting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed that! Thanks for double checking.
|
||
! Third (and last), the horizontal filter for ru | ||
|
||
!$acc parallel default(present) | ||
!$acc loop gang worker private(cell1, cell2, vertex1, vertex2, r_dc, r_dv, & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Local scalar vars are automatically privatized. Additionally, there's a performance hit when explicitly declaring them private, as it causes them to be placed in global memory. See this nvidia forum item
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yeah I'll remove the scalars from private. And did not know this about global memory - thanks!
!$acc parallel default(present) | ||
!$acc loop gang worker private(cell1, cell2, vertex1, vertex2, r_dc, r_dv, & | ||
!$acc iCell, iVertex, invArea, iEdge_div, iEdge_vort, edge_sign, & | ||
!$acc laplacian_filter_coef, divergence1, divergence2, vorticity1, vorticity2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does laplacian_filter_coef
have to be private in this loop? It's not declared private in the other parallel loops in this subroutine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can be removed from private too.
2e993c5
to
5c7d198
Compare
This looks good to me now. Just waiting for any changes from Michael's review and the final commit(s) |
Initial OpenACC port of atm_bdy_adjust_dynamics_relaxzone_tend