Initial OpenACC port of atm_bdy_adjust_dynamics_relaxzone_tend #1269

Open

wants to merge 3 commits into develop from atmosphere/port_atm_bdy_adjust_dynamics_relaxzone_tend

Conversation

abishekg7 (Collaborator, author):

Initial OpenACC port of atm_bdy_adjust_dynamics_relaxzone_tend

@mgduda mgduda requested review from mgduda, gdicker1 and jim-p-w January 17, 2025 22:33
@mgduda mgduda added the Atmosphere and OpenACC (Work related to OpenACC acceleration of code) labels Jan 17, 2025
gdicker1 (Collaborator) left a comment:

Besides a more formal commit message, here are some things I've noticed so far.

call mpas_pool_get_config(config, 'config_relax_zone_divdamp_coef', divdamp_coef_ptr)

divdamp_coef = divdamp_coef_ptr
vertexDegree = vertexDegree_ptr
gdicker1 (Collaborator):

I think it would be good to add a comment here, just a few words about avoiding non-array Fortran pointers for OpenACC.
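
For illustration, a sketch of the kind of comment being suggested; the wording here is an assumption, not text from the diff:

! Copy values retrieved from the pool into local non-pointer variables:
! scalar Fortran pointers are awkward to manage in OpenACC data regions,
! so only these plain copies are referenced in the device code below.
call mpas_pool_get_config(config, 'config_relax_zone_divdamp_coef', divdamp_coef_ptr)
divdamp_coef = divdamp_coef_ptr
vertexDegree = vertexDegree_ptr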

abishekg7 (Collaborator, author):

Will do.

gdicker1 (Collaborator):

Thread bump/reminder.

abishekg7 (Collaborator, author):

Thanks! Done.

!$acc parallel default(present)
!$acc loop gang worker private(cell1, cell2, vertex1, vertex2, r_dc, r_dv, &
!$acc iCell, iVertex, invArea, iEdge_div, iEdge_vort, edge_sign, &
!$acc laplacian_filter_coef, divergence1, divergence2, vorticity1, vorticity2)
gdicker1 (Collaborator):

Here I'd only expect the arrays (divergence1, divergence2, vorticity1, and vorticity2) to be declared private. The rest should automatically be private at the worker level and passed on to their (vector) threads. Are you being explicit here, or do you need to also add the scalar values to the private clause for correctness?

abishekg7 (Collaborator, author):

True. I'd briefly forgotten that about scalars while I was chasing bit-identical results. Will make these changes and try again.
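
For reference, a minimal sketch of the trimmed directive, assuming divergence1, divergence2, vorticity1, and vorticity2 are the only local arrays and the scalars are left to the implicit worker-level privatization described above:

!$acc parallel default(present)
!$acc loop gang worker private(divergence1, divergence2, vorticity1, vorticity2)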

do iEdge = edgeStart, edgeEnd

if ( (bdyMaskEdge(iEdge) > 1) .and. (bdyMaskEdge(iEdge) <= nRelaxZone) ) then ! relaxation zone

-           laplacian_filter_coef = dcEdge(iEdge)**2 * (real(bdyMaskEdge(iEdge)) - 1.)/ &
+           laplacian_filter_coef = dcEdge(iEdge)*dcEdge(iEdge)* (real(bdyMaskEdge(iEdge)) - 1.)/ &
gdicker1 (Collaborator):

Pure curiosity about this change: could you explain your motivation a little? Does this have any effect?

abishekg7 (Collaborator, author):

I was trying this out because I thought it might be one of the obstacles to bit-identical results, but probably not. Will remove this change. Although, I would also add that we could just use the multiplication operator here instead of the exponent. But that's for another time.

gdicker1 (Collaborator):

Agreed, I think this may be better as a multiplication than exponent but that goes against keeping to a "light touch" in this phase.

Comment on lines +6577 to +6862
!$acc end parallel

!$acc parallel default(present)
jim-p-w (Contributor):

Can both of these loops be in the same parallel region and run simultaneously? They appear to be orthogonal to each other. That is, neither loop references the array(s) the other loop changes.

abishekg7 (Collaborator, author):

Thanks Jim. Will try that out.

abishekg7 (Collaborator, author):

Fusing these two parallel regions does work, but I'm wondering if we should perhaps leave this for the optimization stage.

jim-p-w (Contributor):

I am not knowledgeable enough to weigh in on that, but if it succeeded and the BFB tests passed, I'd put it in now. I'm fine with it either way; I'll leave it to you, Dylan, and Michael to decide.
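
Not part of the minimal changes kept in this PR, but a rough sketch of the fused form; the loop variables, bounds, and bodies shown are placeholders for the two independent loops discussed here:

!$acc parallel default(present)
!$acc loop gang worker
do iCell = cellSolveStart, cellSolveEnd
   ! ... body of the first loop ...
end do
!$acc loop gang worker
do iEdge = edgeStart, edgeEnd
   ! ... body of the second loop ...
end do
!$acc end parallel

Since OpenACC places no barrier between two loop constructs in the same parallel region, gangs can move on to the second loop as soon as their share of the first is done, and one kernel launch is saved; the trade-off is a larger region that is harder to tune, which is why deferring the fusion to the optimization stage is also a reasonable call.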

do iCell = cellSolveStart, cellSolveEnd ! threaded over cells

if ( (bdyMaskCell(iCell) > 1) .and. (bdyMaskCell(iCell) <= nRelaxZone) ) then ! relaxation zone

laplacian_filter_coef = (real(bdyMaskCell(iCell)) - 1.)/real(nRelaxZone)/(10.*dt*meshScalingRegionalCell(iCell))
!
!$acc loop seq
jim-p-w (Contributor):

Given that local scalars are automatically privatized, it seems this loop doesn't have to execute sequentially.

abishekg7 (Collaborator, author):

So there's a summation going on in the innermost loop for tend_rt and tend_rho, and even setting aside the order of summation, the sums would still be incorrect if do i=1,nEdgesOnCell(iCell) were parallelized.

Now we could rewrite the loop to use an acc reduction(+: tend_rt, tend_rho) clause, but we decided to keep the code changes minimal in this round of porting.

jim-p-w (Contributor):

I missed that! Thanks for double checking.
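
For illustration only of the reduction rewrite mentioned above, a rough sketch, treating tend_rt and tend_rho as per-cell scalar accumulators and using hypothetical edge_term_rt / edge_term_rho stand-ins for the actual per-edge expressions:

!$acc loop vector reduction(+: tend_rt, tend_rho)
do i = 1, nEdgesOnCell(iCell)
   ! the reduction clause makes the parallel sum well defined,
   ! though the order of the floating-point additions changes
   tend_rt  = tend_rt  + edge_term_rt(i)
   tend_rho = tend_rho + edge_term_rho(i)
end do

Because the additions are reordered, this variant would not be expected to stay bit-for-bit with the sequential loop, which is consistent with keeping !$acc loop seq in this round.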


! Third (and last), the horizontal filter for ru

!$acc parallel default(present)
!$acc loop gang worker private(cell1, cell2, vertex1, vertex2, r_dc, r_dv, &
jim-p-w (Contributor):

Local scalar vars are automatically privatized. Additionally, there's a performance hit when explicitly declaring them private, as it causes them to be placed in global memory. See this nvidia forum item

abishekg7 (Collaborator, author):

Thanks, yeah, I'll remove the scalars from the private clause. And I did not know that about global memory, thanks!

!$acc parallel default(present)
!$acc loop gang worker private(cell1, cell2, vertex1, vertex2, r_dc, r_dv, &
!$acc iCell, iVertex, invArea, iEdge_div, iEdge_vort, edge_sign, &
!$acc laplacian_filter_coef, divergence1, divergence2, vorticity1, vorticity2)
jim-p-w (Contributor):

Why does laplacian_filter_coef have to be private in this loop? It's not declared private in the other parallel loops in this subroutine.

abishekg7 (Collaborator, author):

Yes, it can be removed from private too.

@abishekg7 abishekg7 force-pushed the atmosphere/port_atm_bdy_adjust_dynamics_relaxzone_tend branch from 2e993c5 to 5c7d198 Compare January 31, 2025 22:54
gdicker1 (Collaborator) commented Feb 3, 2025:

This looks good to me now. Just waiting for any changes from Michael's review and the final commit(s).
